Grafik von JJ Ying –

Double Deep Q-Network: DQN with stability upgrades

The Deep Q network sometimes suffers from various problems. The problems are presented here and a solution for these problems is presented.

Henrik Bartsch

Henrik Bartsch


In a previous post, the principle of the Deep Q-Network was introduced. With more exact investigation of the network that the model suffers from overestimation, which makes the training unstable.

Overestimation describes the principle that expected rewards are predicted to be too high.

Overestimation reduces the quality of the training process and should be reduced or avoided.

Another explanation for problems in DQN is the fact that the so-called Q-Target is not a constant value.

As the Q-target we use the term yi=ri+γmaxaiQ(si,ai)y_i = r_i + \gamma max_{a'_i} Q(s'_i, a'_i) in L=1Ni=0N1(Q(si,ai)yi)2L = \frac{1}{N} \sum_{i = 0}^{N - 1} (Q(s_i, a_i) - y_i)^2 denotes.

By approximating a non-constant value, the approximation is fundamentally more unstable. A target network reduces the problem here. 1 2

Research in the field of artificial neural networks then achieved success in improving this algorithm starting in 2010. The result was called Double Deep Q-Network. referred to as Double Deep Q-Network. In contrast to Deep Q-Networks, it does not use Frozen Target-Networks to further reduce the overestimation mentioned above.

In the following, this algorithm can also be referred to as D2QN.

Versions of the algorithm

The following information about the different types of the algorithm was taken from the source 3.

The basic principle of the algorithm is that a combinnation of two different networks is used to reduce the overestimation. This is done by training the two networks are trained with each other, and thus get the bias out of the network updates.

The paper on this can be viewed here.

Hasselt, 2010

[Double Q-Learning: Extracted from 4(*NvvRn59pz-D1iSkBWpuIxA.png)

The original 2010 algorithm involves two different networks, which are selected in a ϵ\epsilon-greedy scheme (with ϵ=0.5\epsilon = 0.5). At each time step, a random network is selected, and the update is subsequently fitted using the mean squared error of the difference in the prediction.

The problem with this implementation (compared to the newer alternatives) is that, in theory, only 50%50 \% of the generated information goes to each network, the other 50 % are not used for updates of the other network. This significantly reduces the Sample Efficiency of the algorithm. Furthermore the problem of Overestimation can be can be improved even further.

Sample Efficieny describes how efficiently an algorithm can learn from the given information; an efficient algorithm will get by with significantly fewer episodes (samples).

Hasselt, 2015

[Double Q-learning: Extracted from 4(*4B46Bc9EDUdwrnqhAUp7hQ.png)

The 2015 algorithm represents a major milestone in the development of the DDQN algorithm. Here, a primary and target network is introduced for the first time. These two networks are defined here only as partially independent, both are initialized at the beginning also classically with the identical weights.

The Primary-Network represents the network, which is responsible for the selection of the actions depending on the current state.

The Target-Network represents the network, which prevents the overestimation. It represents an “older state” of the network and prevents the rapid forgetting of information, which has already been learned by the primary network. The target network thus “evaluates” the selected action.

An important change here is the update of the networks. The Target-Q-Value for the primary network is determined by the prediction of the target network and is fitted via Mean Squared Error, while we perform a soft update on the target network:

θτθ+(1τ)θ\theta' \leftarrow \tau \theta + (1 - \tau) \theta'

Hierbei gilt für die Konstante τ\tau klassischerweise die folgende Beschränkung: τ(0,1)\tau \in (0, 1). Für den Grenzfall τ=1\tau = 1 erhält man zwei identische Netzwerke, für τ=0\tau = 0 erhält man kein Update auf dem Target-Network. Hierdurch würde es nicht mehr lernen.

Das Original-Paper zu diesem Algorithmus von Hasselt ist hier zu finden.

Fujimoto et al., 2018

Clipped Double Q-learning: extracted from 4.

A small improvement was subsequently still released in 2018. One can still reduce the overestimation by using the minimum prediction from the two networks to to calculate the target-Q-value. Otherwise, no difference exists here compared to the 2015 release.


Among the imports there is nothing new, they are identical to the implementation of Deep Q-Network. Also the ExperienceMemory is identical.
import datetime
import math
import gym

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from collections import deque
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, InputLayer
from tensorflow.python.keras.optimizer_v2.adam import Adam
class ExperienceMemory:
    def __init__(self, memory_length: int=10000, batch_size: int=64, nonunique_experience: bool=True):
        self.memory_length = memory_length
        self.batch_size = batch_size
        self.nonunique_experience = nonunique_experience

        # ** Initialize replay memory D to capacity N **
        self.cstate_memory = deque([], maxlen=self.memory_length)
        self.action_memory = deque([], maxlen=self.memory_length)
        self.reward_memory = deque([], maxlen=self.memory_length)
        self.done_memory = deque([], maxlen=self.memory_length)
        self.pstate_memory = deque([], maxlen=self.memory_length)

    def record(self, cstate, action, reward, done, pstate):
        # Convert states into processable objects
        cstate = tf.expand_dims(tf.convert_to_tensor(cstate), axis=0)
        pstate = tf.expand_dims(tf.convert_to_tensor(pstate), axis=0)

        # Save data

    def return_experience(self):
        # Retrieve experience
        batch_indices = np.random.choice(len(self.cstate_memory), size=self.batch_size, replace=self.nonunique_experience)
        batch_cstate = np.array(list(self.cstate_memory))[batch_indices, :]
        batch_action = np.take(list(self.action_memory), batch_indices)
        batch_reward = np.take(list(self.reward_memory), batch_indices)
        batch_done = np.take(list(self.done_memory), batch_indices)
        batch_pstate = np.array(list(self.pstate_memory))[batch_indices, :]

        # Convert experience into respective tensorflow tensors
        cstates = tf.squeeze(tf.convert_to_tensor(batch_cstate))
        actions = tf.convert_to_tensor(batch_action)
        rewards = tf.convert_to_tensor(batch_reward, dtype=tf.float32)
        dones = tf.convert_to_tensor(batch_done, dtype=tf.int64)
        pstates = tf.squeeze(tf.convert_to_tensor(batch_pstate))

        return (cstates, actions, rewards, dones, pstates)

    def flush_memory(self):
class DDQNAgent:
  def __init__(self, observation_size, action_size):
    self.observation_size = observation_size
    self.action_size = action_size

    # Network Parameters
    self.alpha = math.pow(10, -3)
    self.gamma = 0.97
    self.epsilon = 0.79
    self.min_epsilon = 0.05
    self.epsilon_decay = 0.85
    self.tau = 0.005

    # Experience Replay Parameters
    self.memory_length = 26000
    self.batch_size = 64
    self.nonunique_experience = True
    # ** Initialize replay memory D to capacity N **
    self.experience_memory = ExperienceMemory(self.memory_length, self.batch_size, self.nonunique_experience)

    # ** Initialize action-value function Q with random weights **
    self.primary_model = Sequential([
            InputLayer(input_shape=self.observation_size, name="Input_Layer"),
            Dense(units=128, activation='relu', name='Hidden_Layer_1'),
            Dense(units=self.action_size, activation='linear', name='Output_Layer')], name="Primary-Q-Network")
    self.primary_model.compile(loss="mse", optimizer=Adam(learning_rate=self.alpha))

    self.target_model = Sequential([
            InputLayer(input_shape=self.observation_size, name="Input_Layer"),
            Dense(units=128, activation='relu', name='Hidden_Layer_1'),
            Dense(units=self.action_size, activation='linear', name='Output_Layer')], name="Target-Q-Network")

    # Define metrics for tensorboard
    self.score_log_path = 'logs/' +"%Y%m%d-%H%M%S")
    self.score_writer = tf.summary.create_file_writer(self.score_log_path)

  def compute_action(self, observation, evaluation: bool=False):
    if (np.random.uniform(0, 1) < self.epsilon and evaluation == False):
      return np.random.choice(range(self.action_size)) # ** With probability epsilon select a random action a_t **
      observation = np.expand_dims(observation, axis=0)
      Q_values = self.primary_model([observation])
      return np.argmax(Q_values) # ** Otherwise select a_t = argmax(Q) **

  def update_epsilon_parameter(self):
    self.epsilon = max(self.epsilon * math.exp(-self.epsilon_decay), self.min_epsilon)

  def store_step(self, cstate, action, reward, done, pstate):
    self.experience_memory.record(cstate, action, reward, done, pstate)

In the DDQNAgent class you might quickly notice a few changes. Here is a clear separation between primary_network and target_network, which must be initialized identically. must be initialized. Additionally the soft-update-parameter τ[0,1]\tau \in [0, 1] was defined here. Otherwise no change is found here.
  def train(self): # **For each update step**
    # ** Sample random minibatch of transitions from D**
    b_cstate, b_action, b_reward, b_done, b_pstate = self.experience_memory.return_experience()

    # ** Set y_j **
    prediction_primary_model = self.primary_model(b_pstate)
    prediction_target_model = self.target_model(b_pstate)
    q_value_network = np.zeros((self.batch_size,))

    b_done = b_done.numpy()
    max_primary_pred = np.max(prediction_primary_model, axis=1)
    max_target_pred = np.max(prediction_target_model, axis=1)
    min_pred = np.min(np.transpose(np.squeeze([max_primary_pred, max_target_pred])))

    q_value_network = (1 - b_done) * self.gamma * min_pred
    target_q_p = np.add(b_reward, q_value_network)

    # **Perform gradient descent step on Q**
    self.primary_model.train_on_batch(b_cstate, target_q_p)

    # Perform soft updates
    update_target(self.target_model.variables, self.primary_model.variables, self.tau)

With the training function one finds directly the change to the Deep Q-Network, which is found in the calculation of the Q-Target. Here this time not only primary_network or target_network is used to calculate the Q-target, but in each case component-wise the minimum of the predictions. Subsequently, the primary_network is fitted on the Q-target and the target_network is updated via a soft update.
def update_target(target_weights, weights, tau: float):
    for (target_weight, primary_weight) in zip(target_weights, weights):
        target_weight.assign(primary_weight * tau + target_weight * (1 - tau))
def training_loop(env, agent: DDQNAgent, max_frames_episode: int):
  current_obs, _ = env.reset()

  episode_reward = 0
  for j in range(max_frames_episode):
    action = agent.compute_action(current_obs)

    next_obs, reward, done, _, _ = env.step(action)
    next_obs = np.array(next_obs)
    agent.store_step(current_obs, action, reward, done, next_obs)

    current_obs = next_obs
    episode_reward += reward

    if done:

  return episode_reward, agent

def evaluation_loop(env, agent: DDQNAgent, max_frames_episode: int):
  current_obs, _ = env.reset()
  evaluation_reward = 0
  for j in range(max_frames_episode):
    # ** Execute action a_t in emulator and observe reward r_t and image x_{t+1}
    action = agent.compute_action(current_obs, evaluation=True)

    next_obs, reward, done, _, _ = env.step(action)
    next_obs = np.array(next_obs)

    # **Storing all information about the last episode in the memory buffer**
    agent.store_step(current_obs, action, reward, done, next_obs)

    # Reseting environment information so that next episode can handle the previous information
    current_obs = next_obs
    evaluation_reward += reward

    if done:
  return evaluation_reward, agent

The training and evaluation loops are based on the standard from the Deep Q-Network’s implementation.
n_episodes, max_frames_episode, avg_length, evaluation_interval = 1000, 500, 50, 10
episodic_reward, avg_reward, evaluation_rewards = [], [], []

env = gym.make("CartPole-v1")
n_actions = env.action_space.n
observation_shape = env.observation_space.shape[0]


agent = DDQNAgent(observation_shape, n_actions)
eval_reward, episode_reward = 0, 0

for i in range(n_episodes):
    # **Observing state of the environment**
    episode_reward, agent = training_loop(env=env, agent=agent, max_frames_episode=max_frames_episode)
    print("Training Score in Episode {}: {}".format(i, episode_reward))

    if (i % evaluation_interval == 0):
      # Evaluation loop
      eval_reward, agent = evaluation_loop(env=env, agent=agent, max_frames_episode=max_frames_episode)
      print("Evaluation Score in Episode {}: {}".format(i, eval_reward))

    with agent.score_writer.as_default():
      tf.summary.scalar('Episodic Score', episode_reward, step=i)
      if (i % evaluation_interval == 0):
        tf.summary.scalar('Evaluation Score', eval_reward, step=i)

    avg_reward.append(avg_n(episodic_reward, n=avg_length))

As a result, you get, for example, the following diagram:

Note: Training performance can sometimes differ greatly from device to device and corresponding seeds. General reproducibility of such results can generally not be guaranteed.

It can be clearly seen that the D2QN trains significantly better than the DQN in the beginning. Later, however, the success flattens out, which is probably due to a poor setting of the hyperparameters. of the hyperparameters.

For comparison, two networks of similar structure were used; both networks (dqn_network and primary_network) had the same number of neurons. This does not have to be This does not have to be purposeful in practical use, but is only valid here for comparability.


[23.01.2023] Introduction of interactive plots, reference to non-reproducibility of results






  4. 2 3