# Double Deep Q-Network: DQN with stability upgrades

The Deep Q-Network sometimes suffers from various problems. This post presents these problems together with a solution for them.

Henrik Bartsch

## Classification

In a previous post, the principle of the Deep Q-Network was introduced. A closer investigation of the network shows that the model suffers from overestimation, which makes the training unstable.

Overestimation describes the principle that expected rewards are predicted to be too high.

Overestimation reduces the quality of the training process and should be reduced or avoided.
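The effect can be made tangible with a small experiment. The following is only a minimal sketch under assumed conditions (four actions with a true value of 0 each, Gaussian noise with standard deviation 1, a fixed seed); the names `max_of_noisy_estimates`, `true_q` and `mean_max` are hypothetical and not part of the original implementation. It shows that taking the maximum over noisy estimates is biased upwards on average, even though every single estimate is unbiased.

```python
import random
import statistics


def max_of_noisy_estimates(true_values, noise_std, rng):
    """Return the maximum over noisy (but unbiased) estimates of the action values."""
    return max(q + rng.gauss(0.0, noise_std) for q in true_values)


rng = random.Random(0)          # fixed seed, chosen only for repeatability
true_q = [0.0, 0.0, 0.0, 0.0]   # assumption: all actions are equally good, true max is 0

# Repeat the "estimate, then take the max" step many times and average the result.
samples = [max_of_noisy_estimates(true_q, 1.0, rng) for _ in range(10_000)]
mean_max = statistics.mean(samples)

print(f"true max: {max(true_q):.2f}, average estimated max: {mean_max:.2f}")
```

The average estimated maximum ends up clearly above the true maximum of 0, which is the kind of bias the Q-target inherits from its $\max$ operator.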

Another explanation for problems in DQN is the fact that the so-called Q-Target is not a constant value.

The Q-target denotes the term $y_i = r_i + \gamma \max_{a'_i} Q(s'_i, a'_i)$ in the loss $L = \frac{1}{N} \sum_{i = 0}^{N - 1} (Q(s_i, a_i) - y_i)^2$.
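As a rough illustration of the Q-target $y_i$ and the loss $L$, here is a minimal sketch in plain Python. The function names `q_targets` and `mse_loss` and all numbers are made up for this example; the Q-values are given as plain lists instead of network predictions, and terminal states are ignored, exactly as in the formula.

```python
def q_targets(rewards, next_q_values, gamma):
    """Compute y_i = r_i + gamma * max_a' Q(s'_i, a') for every sample i."""
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_values)]


def mse_loss(predictions, targets):
    """Compute L = 1/N * sum_i (Q(s_i, a_i) - y_i)^2."""
    n = len(predictions)
    return sum((q - y) ** 2 for q, y in zip(predictions, targets)) / n


# Hypothetical batch with N = 2 samples and two possible actions.
rewards = [1.0, 0.0]
next_q = [[0.5, 2.0], [1.0, -1.0]]   # Q(s'_i, a') for every action a'
q_sa = [2.0, 1.0]                    # Q(s_i, a_i) for the actions actually taken

y = q_targets(rewards, next_q, gamma=0.9)   # approximately [2.8, 0.9]
loss = mse_loss(q_sa, y)                    # approximately 0.325
print(y, loss)
```

Because $y_i$ is computed from the same Q-function that is being trained, the target moves with every update, which is the non-constant behaviour described above.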

By approximating a non-constant value, the approximation is fundamentally more unstable. A target network reduces the problem here. 1 2

Starting in 2010, research in the field of artificial neural networks succeeded in improving this algorithm. The result is referred to as Double Deep Q-Network. In contrast to Deep Q-Networks, it does not use frozen target networks, with the aim of further reducing the overestimation mentioned above.

In the following, this algorithm can also be referred to as D2QN.

## Versions of the algorithm

The following information about the different types of the algorithm was taken from the source 3.

The basic principle of the algorithm is that a combination of two different networks is used to reduce the overestimation. The two networks are trained with each other, which removes the bias from the network updates.

The paper on this can be viewed here.

### Hasselt, 2010

![Double Q-Learning: Extracted from 4](https://miro.medium.com/max/640/1*NvvRn59pz-D1iSkBWpuIxA.png)

The original 2010 algorithm involves two different networks, which are selected in an $\epsilon$-greedy scheme (with $\epsilon = 0.5$, i.e. each network with a probability of $50 \%$). At each time step, one of the two networks is selected at random, and the update is then fitted using the mean squared error of the difference in the predictions.
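The idea can be sketched in tabular form. Note the assumptions: two lookup tables (`q_a`, `q_b`) stand in for the two networks, a learning rate `alpha` replaces the fit via mean squared error, and the function name `double_q_update` is hypothetical. The sketch follows the usual formulation of Double Q-learning, in which the table being updated selects the best next action and the other table evaluates it.

```python
import random


def double_q_update(q_a, q_b, s, a, r, s_next, alpha, gamma, rng):
    """Perform one Double Q-learning step; return True if table A was updated."""
    # A fair coin flip decides which of the two tables receives this sample.
    if rng.random() < 0.5:
        update, other = q_a, q_b
    else:
        update, other = q_b, q_a
    # The updated table selects the action, the other table evaluates it.
    n_actions = len(update[s_next])
    best = max(range(n_actions), key=lambda i: update[s_next][i])
    target = r + gamma * other[s_next][best]
    update[s][a] += alpha * (target - update[s][a])
    return update is q_a


# Hypothetical tables: state -> list of action values.
q_a = {0: [0.0, 0.0], 1: [1.0, 3.0]}
q_b = {0: [0.0, 0.0], 1: [2.0, 0.5]}
rng = random.Random(0)  # fixed seed, chosen only for repeatability

updated_a = double_q_update(q_a, q_b, s=0, a=1, r=1.0, s_next=1,
                            alpha=0.5, gamma=0.9, rng=rng)
print("updated table A:", updated_a, q_a[0], q_b[0])
```

Only one of the two tables changes per call, which is exactly why each table sees only about half of the collected samples.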

The problem with this implementation (compared to the newer alternatives) is that, in theory, only $50 \%$ of the generated information goes to each network; the other $50 \%$ is not used to update that network. This significantly reduces the sample efficiency of the algorithm. Furthermore, the overestimation can be reduced even further.

Sample efficiency describes how efficiently an algorithm can learn from the given information; an efficient algorithm will get by with significantly fewer episodes (samples).

### Hasselt, 2015

![Double Q-learning: Extracted from 4](https://miro.medium.com/max/640/1*4B46Bc9EDUdwrnqhAUp7hQ.png)

The 2015 algorithm represents a major milestone in the development of the DDQN algorithm. Here, a primary network and a target network are introduced for the first time. The two networks are only partially independent of each other; both are typically initialized with identical weights at the beginning.

The primary network is responsible for selecting the actions depending on the current state.

The target network prevents the overestimation. It represents an “older state” of the network and prevents the rapid forgetting of information that has already been learned by the primary network. The target network thus “evaluates” the selected action.
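This division of roles can be sketched as follows, following the common Double DQN formulation $y_i = r_i + \gamma Q_{\theta'}(s'_i, \arg\max_{a'} Q_{\theta}(s'_i, a'))$. The function name `double_dqn_targets` and the numbers are assumptions made for this example; the predictions of both networks are passed in as plain lists, and terminal states are ignored.

```python
def double_dqn_targets(rewards, next_q_primary, next_q_target, gamma):
    """Select the next action with the primary network, evaluate it with the target network."""
    targets = []
    for r, q_p, q_t in zip(rewards, next_q_primary, next_q_target):
        # Selection: index of the best action according to the primary network.
        best_action = max(range(len(q_p)), key=lambda a: q_p[a])
        # Evaluation: value of that action according to the target network.
        targets.append(r + gamma * q_t[best_action])
    return targets


# Hypothetical predictions for one next state with three actions.
rewards = [1.0]
next_q_primary = [[0.2, 1.5, 0.7]]   # primary network prefers action 1
next_q_target = [[0.4, 0.9, 2.0]]    # target network rates action 1 with 0.9

targets = double_dqn_targets(rewards, next_q_primary, next_q_target, gamma=0.5)
print(targets)   # approximately [1.45]
```

A plain maximum over the target network's predictions would have used the value 2.0 here; decoupling selection and evaluation yields the smaller value 0.9 instead.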

An important change here is how the networks are updated. The target Q-value for the primary network is determined by the prediction of the target network and is fitted via mean squared error, while the target network receives a soft update:

$\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$

Here, the constant $\tau$ is classically restricted to $\tau \in (0, 1)$. In the limiting case $\tau = 1$, the two networks become identical; for $\tau = 0$, the target network receives no update and would therefore no longer learn.

The original paper on this algorithm by Hasselt can be found here.
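The soft update rule $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$ can be written in a few lines. This is only a sketch: the weights are assumed to be flat lists of floats, and the name `soft_update` is hypothetical. In a real network, the same rule would be applied to every weight tensor.

```python
def soft_update(primary_weights, target_weights, tau):
    """Return new target weights: tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(primary_weights, target_weights)]


primary = [1.0, -2.0]
target = [0.0, 0.0]

blended = soft_update(primary, target, tau=0.25)   # [0.25, -0.5]
copied = soft_update(primary, target, tau=1.0)     # equals the primary weights
frozen = soft_update(primary, target, tau=0.0)     # target weights stay unchanged
print(blended, copied, frozen)
```

The two boundary calls show the limiting cases: with `tau=1.0` the target becomes a copy of the primary network, with `tau=0.0` it never changes.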

### Fujimoto et al., 2018

A small improvement followed in 2018. The overestimation can be reduced further by using the minimum prediction of the two networks to calculate the target Q-value. Otherwise, there is no difference compared to the 2015 version.
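One plausible reading of this rule for discrete actions is sketched below: take the minimum of both predictions per action and then the maximum over the actions. This reading, the function name `clipped_targets` and the numbers are assumptions made for this example; terminal states are ignored.

```python
def clipped_targets(rewards, next_q_primary, next_q_target, gamma):
    """Compute y_i = r_i + gamma * max_a' min(Q_primary(s'_i, a'), Q_target(s'_i, a'))."""
    targets = []
    for r, q_p, q_t in zip(rewards, next_q_primary, next_q_target):
        # Pessimistic estimate: per action, keep the smaller of the two predictions.
        q_min = [min(p, t) for p, t in zip(q_p, q_t)]
        targets.append(r + gamma * max(q_min))
    return targets


# Hypothetical predictions for one next state with three actions.
rewards = [1.0]
next_q_primary = [[0.2, 1.5, 0.7]]
next_q_target = [[0.4, 0.9, 2.0]]

clipped = clipped_targets(rewards, next_q_primary, next_q_target, gamma=0.5)
print(clipped)   # approximately [1.45], based on the minimum values [0.2, 0.9, 0.7]
```

An action only receives a high value if both networks agree on it, so an outlier in a single network (such as the 2.0 above) no longer inflates the target.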

## Implementation

There is nothing new among the imports; they are identical to those in the Deep Q-Network implementation. The ExperienceMemory is identical as well.

In the DDQNAgent class you might quickly notice a few changes. There is now a clear separation between primary_network and target_network, which must be initialized identically. Additionally, the soft-update parameter $\tau \in [0, 1]$ is defined here. Otherwise, nothing has changed.

In the training function, the difference from the Deep Q-Network is immediately visible in the calculation of the Q-target. This time, the Q-target is not calculated from primary_network or target_network alone, but from the component-wise minimum of both predictions. Subsequently, primary_network is fitted on the Q-target and target_network is updated via a soft update.
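The order of these steps can be sketched without any deep learning library. The following is not the implementation from this post: a linear model with one weight vector per action stands in for each network, a per-sample gradient step replaces the fit, terminal states are ignored, and the names `q_values` and `train_step` are hypothetical.

```python
def q_values(weights, state):
    """Linear stand-in for a network: one weight vector per action."""
    return [sum(w_i * s_i for w_i, s_i in zip(w, state)) for w in weights]


def train_step(primary, target, batch, gamma, lr, tau):
    """One update on a batch of (state, action, reward, next_state) tuples."""
    for state, action, reward, next_state in batch:
        # 1) Q-target: component-wise minimum of both predictions, then max over actions.
        q_min = [min(p, t) for p, t in zip(q_values(primary, next_state),
                                           q_values(target, next_state))]
        y = reward + gamma * max(q_min)
        # 2) Gradient step on (Q(s, a) - y)^2 for the action taken
        #    (the constant factor 2 is absorbed into the learning rate).
        error = q_values(primary, state)[action] - y
        primary[action] = [w - lr * error * s for w, s in zip(primary[action], state)]
    # 3) Soft update: move the target weights a fraction tau towards the primary weights.
    for a in range(len(target)):
        target[a] = [tau * w + (1.0 - tau) * w_t
                     for w, w_t in zip(primary[a], target[a])]


# Two actions, two state features; both "networks" start with identical weights.
primary = [[0.0, 0.0], [0.0, 0.0]]
target = [row[:] for row in primary]
batch = [([1.0, 0.0], 0, 1.0, [0.0, 1.0])]

train_step(primary, target, batch, gamma=0.9, lr=0.5, tau=0.1)
print(primary, target)
```

After this single step, the primary weights for action 0 have moved towards the target value, while the target weights have followed only by the fraction `tau`.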

The training and evaluation loops are based on the standard from the Deep Q-Network’s implementation.

As a result, you get, for example, the following diagram:

Note: Training performance can sometimes differ greatly between devices and seeds. Reproducibility of such results can generally not be guaranteed.

It can be clearly seen that the D2QN trains significantly better than the DQN in the beginning. Later, however, the progress flattens out, which is probably due to a poor choice of hyperparameters.

For comparison, two networks of similar structure were used; both networks (dqn_network and primary_network) had the same number of neurons. This is not necessarily sensible in practical use and was done here only for comparability.

## Changes

[23.01.2023] Introduction of interactive plots, reference to non-reproducibility of results