## Classification

In a previous post, the principle of the Deep Q-Network was introduced. Closer investigation of the network shows that the model suffers from *overestimation*, which makes the training unstable.

Overestimation describes the phenomenon that expected rewards are predicted too high.

Overestimation reduces the quality of the training process and should be reduced or avoided.

Another explanation for problems in DQN is the fact that the so-called *Q-Target* is not a constant value.

As the Q-target we denote the term $y_i = r_i + \gamma \max_{a'_i} Q(s'_i, a'_i)$ in the loss $L = \frac{1}{N} \sum_{i = 0}^{N - 1} (Q(s_i, a_i) - y_i)^2$.

By approximating a non-constant value, the approximation is fundamentally more unstable. A target network reduces the problem here. ^{1} ^{2}
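To make the loss above concrete, here is a minimal NumPy sketch of the Q-target and MSE computation. The arrays stand in for network predictions, and all names are illustrative, not taken from the post's implementation:

```python
import numpy as np

def q_targets(rewards, next_q_values, gamma=0.99, dones=None):
    """y_i = r_i + gamma * max_a' Q(s'_i, a') (zeroed for terminal states)."""
    max_next = next_q_values.max(axis=1)
    if dones is not None:
        max_next = max_next * (1.0 - dones)
    return rewards + gamma * max_next

def mse_loss(q_taken, targets):
    """L = 1/N * sum_i (Q(s_i, a_i) - y_i)^2"""
    return np.mean((q_taken - targets) ** 2)
```

In a real agent, `next_q_values` would come from the (target) network's forward pass on the successor states.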

Research in the field of *artificial neural networks* then succeeded in improving this algorithm, starting in 2010. The result was referred to as *Double Deep Q-Network*. In contrast to Deep Q-Networks, it does not use *Frozen Target-Networks* to further reduce the overestimation mentioned above.

In the following, this algorithm is also referred to as D2QN.

## Versions of the algorithm

The following information about the different types of the algorithm was taken from the source ^{3}.

The basic principle of the algorithm is that a combination of two different networks is used to reduce the overestimation. The two networks are trained against each other, which removes the *bias* from the network updates.

The paper on this can be viewed here.

### Hasselt, 2010

![Double Q-Learning: Extracted from ^{4}](https://miro.medium.com/max/640/1*NvvRn59pz-D1iSkBWpuIxA.png)

The original 2010 algorithm involves two different networks, from which actions are selected in an $\epsilon$-greedy scheme (with $\epsilon = 0.5$). At each time step, a random network is selected and subsequently updated using the *mean squared error* of the difference in the predictions.
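The update rule can be sketched for the tabular case as follows. This is an illustrative sketch, not the post's code: one table selects the best next action, the other evaluates it, and a fair coin decides which table is updated:

```python
import random
import numpy as np

def double_q_update(qa, qb, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One double Q-learning step on two Q-tables (hypothetical helper)."""
    if random.random() < 0.5:
        best = int(np.argmax(qa[s_next]))      # Q^A selects the action ...
        target = r + gamma * qb[s_next][best]  # ... Q^B evaluates it
        qa[s][a] += alpha * (target - qa[s][a])
    else:
        best = int(np.argmax(qb[s_next]))      # and vice versa
        target = r + gamma * qa[s_next][best]
        qb[s][a] += alpha * (target - qb[s][a])
```

Decoupling selection and evaluation in this way is exactly what removes the maximization bias.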

The problem with this implementation (compared to the newer alternatives) is that, in theory, only $50\,\%$ of the generated information goes to each network; the other $50\,\%$ is not used for updates of that network. This significantly reduces the *Sample Efficiency* of the algorithm. Furthermore, the problem of *Overestimation* can be reduced even further.

Sample Efficiency describes how efficiently an algorithm learns from the given information; an efficient algorithm gets by with significantly fewer episodes (samples).

### Hasselt, 2015

![Double Q-learning: Extracted from ^{4}](https://miro.medium.com/max/640/1*4B46Bc9EDUdwrnqhAUp7hQ.png)

The 2015 algorithm represents a major milestone in the development of the *DDQN* algorithm. Here, a *primary* and a *target* network are introduced for the first time. These two networks are only partially independent; classically, both are initialized with identical weights.

The Primary-Network is the network responsible for selecting actions depending on the current state.

The Target-Network is the network that prevents the overestimation. It represents an “older state” of the network and prevents the rapid forgetting of information that the primary network has already learned. The target network thus “evaluates” the selected action.

An important change here is the update of the networks. The *Target-Q-Value* for the primary network is determined by the prediction of the target network and is fitted via Mean Squared
Error, while we perform a *soft update* on the target network:

$\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$

The constant $\tau$ is classically restricted to $\tau \in (0, 1)$. In the limiting case $\tau = 1$, the two networks become identical; for $\tau = 0$, the target network receives *no* update and would therefore no longer learn.
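The soft update is a per-weight convex combination and can be sketched in a few lines (NumPy arrays stand in for the networks' weight tensors; names are illustrative):

```python
import numpy as np

def soft_update(target_weights, primary_weights, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', applied per weight array."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(primary_weights, target_weights)]
```

With a small $\tau$, the target network trails the primary network slowly, which is what keeps the Q-target quasi-constant between updates.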

The original paper on this algorithm by Hasselt can be found here.

### Fujimoto et al., 2018

Clipped Double Q-learning: extracted from ^{4}.

A small improvement was released in 2018. The overestimation can be reduced further by using the component-wise minimum of the two networks' predictions to calculate the target Q-value. Otherwise, there is no difference compared to the 2015 version.
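The clipped target can be sketched as follows (a hedged NumPy sketch with hypothetical names; the prediction arrays stand in for the two networks' outputs):

```python
import numpy as np

def clipped_q_targets(rewards, next_q_a, next_q_b, gamma=0.99):
    """Target Q-value from the component-wise minimum of two predictions."""
    min_q = np.minimum(next_q_a, next_q_b)   # elementwise minimum
    return rewards + gamma * min_q.max(axis=1)
```

Taking the minimum makes the target a pessimistic estimate, which counteracts the upward bias of the max operator.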

## Implementation

Among the imports there is nothing new; they are identical to the implementation of the Deep Q-Network. The `ExperienceMemory` is also identical.

In the `DDQNAgent` class, you will quickly notice a few changes. There is a clear separation between `primary_network` and `target_network`, which must be initialized identically. Additionally, the *soft-update parameter* $\tau \in [0, 1]$ is defined here. Otherwise, nothing changes.

In the training function you find the main change compared to the Deep Q-Network, namely in the calculation of the Q-target. This time, neither `primary_network` nor `target_network` alone is used to calculate the Q-target, but the component-wise minimum of their predictions. Subsequently, the `primary_network` is fitted on the Q-target and the `target_network` is updated via a soft update.
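The training step just described might be sketched like this. The predict/fit callables stand in for the actual networks, and all names are illustrative rather than taken from the post's code:

```python
import numpy as np

def ddqn_train_step(primary_predict, target_predict, fit_primary,
                    batch, gamma=0.99):
    """One hypothetical D2QN training step on a sampled batch."""
    states, actions, rewards, next_states, dones = batch
    # component-wise minimum of both networks' next-state predictions
    q_next = np.minimum(primary_predict(next_states),
                        target_predict(next_states))
    targets = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    # fit only the taken actions onto the Q-targets
    q_fit = primary_predict(states)
    q_fit[np.arange(len(actions)), actions] = targets
    fit_primary(states, q_fit)
    return targets
```

After this step, the target network would receive its soft update as shown in the soft-update formula above.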

The training and evaluation loops are based on the standard from the *Deep Q-Network*’s implementation.

As a result, you get, for example, the following diagram:

Note: Training performance can sometimes differ greatly depending on the device and the corresponding seeds. Reproducibility of such results can generally not be guaranteed.

It can be clearly seen that the D2QN trains significantly better than the DQN in the beginning. Later, however, the success flattens out, which is probably due to a poor choice of hyperparameters.

For comparison, two networks of similar structure were used; both networks (`dqn_network` and `primary_network`) had the same number of neurons. This does not have to be expedient in practical use, but serves here only for comparability.

## Changes

[23.01.2023] Introduction of interactive plots, reference to non-reproducibility of results