Graphics from Deepmind –

Deep Recurrent Q Network - DQN with a look into the past

Deep Q-Networks sometimes need information from different time steps to converge quickly. The Deep Recurrent Q-Network represents one possibility for this.

Henrik Bartsch

Henrik Bartsch


A recent post explained the approach and implementation of Deep Q-Networks. However, DQN is just one of the basic algorithms in Reinforcement Learning, research in the field shows a number of possible improvements. One of them is Deep Recurrent Q-Network, which uses Recurrent Layers, to process more information at the same time and to better handle the interactions of different information.


Deep Q-Networks have a number of practical limitations. In the following, we will present the most important limitation, which could be solved by Deep Recurrent Q-Networks:

Suppose an algorithm is to learn the game Pong. Pong( is a competitive two-player arcade game in which two discs are used to interact with and move a ball through the playing field. The goal of each player is to move the ball in such a way that the other player cannot intercept the ball and leave the playing edge at the opponent; in this case, the player scores a point. One possible modeling here is to always pass the current observation (i.e., the state of the screen as an RGB array) to the agent. Furthermore, let’s assume we are in a fixed point in time and see the ball as in the picture below.

Pong Example - Graphic by Feelfarbig Magazine -

Now a question arises: how will the ball move forward depending on the current state? The answer is quickly clear - there is no deterministic answer to this question. To answer this question, an observer needs at least two time steps (and information about the width of the time step) to calculate a trajectory and velocity. So the idea was to define a Deep Q-Network as a functional model. A functional model here is a model that can have multiple inputs and outputs, so it does not necessarily map a classical graph, as is the case with the classical sequential model for example.

Tensorflow users who have already worked with Recurrent Models certainly know that this approach works, but is inefficient. Recurrent Models are better suited to solve such problems than their feed-forward counterparts because of their special structure.

Recurrent models receive at input (if defined) information from several time steps before the current time step. Internally in the Recurrent Layers the interaction between the individual time steps is modeled, in order to achieve better results with problems such as Sequence Forecasting or Time Forecasting.

As a frequently used example for Recurrent Layer one can see the Long Short-Term Memory-Layer, which was implemented in Keras and already presented 1997 by Sepp Hochreiter []. For recent publications dealing with LSTM, see for example at 1, 2.

The approach above was implemented even more difficult in 3. Here Pong was implemented, but with the peculiarity that every now and then frames were passed to the agent empty (i.e., completely without information). An algorithm like the Deep Q-Network, which is designed for simple state-to-state transitions, cannot make a meaningful decision here. A Deep Recurrent Q-Network, which accepts, for example, five time steps as input, can still perform meaningfully here because it has all the necessary information and can potentially interpolate the position of the ball. The authors also show that the Recurrent model prevails over the Feed-Forward model.


In this version, an implementation of a Deep Recurrent Neural Network, which includes LSTM layers.

Alternative layers for this type of task include the Gated Recurrent Unit (GRU) or the Simple Recurrent Neural Network layer (SimpleRNN).

The imports are identical to the implementation of a DQN:
import datetime
import math
import gym

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, InputLayer, LSTM
from tensorflow.python.keras.optimizer_v2.adam import Adam
from tensorflow.python.keras.losses import mean_squared_error
from tensorflow.python.keras.metrics import Mean
from tensorflow.python.keras.models import load_model, save_model

from collections import deque

An implementation of an outsourced Experience Replay is straight-forward:
class ExperienceMemory:
    def __init__(self, memory_length: int=10000, batch_size: int=64, nonunique_experience: bool=True):
        self.memory_length = memory_length
        self.batch_size = batch_size
        self.nonunique_experience = nonunique_experience

        # ** Initialize replay memory D to capacity N **
        self.cstate_memory = deque([], maxlen=self.memory_length)
        self.action_memory = deque([], maxlen=self.memory_length)
        self.reward_memory = deque([], maxlen=self.memory_length)
        self.done_memory = deque([], maxlen=self.memory_length)
        self.pstate_memory = deque([], maxlen=self.memory_length)

    def record(self, cstate, action, reward, done, pstate):
        # Convert states into processable objects
        cstate = tf.expand_dims(tf.convert_to_tensor(cstate), axis=0)
        pstate = tf.expand_dims(tf.convert_to_tensor(pstate), axis=0)

        # Save data

    def return_experience(self):
        # Retrieve experience
        batch_indices = np.random.choice(len(self.cstate_memory), size=self.batch_size, replace=self.nonunique_experience)
        batch_cstate = np.array(list(self.cstate_memory))[batch_indices, :]
        batch_action = np.take(list(self.action_memory), batch_indices)
        batch_reward = np.take(list(self.reward_memory), batch_indices)
        batch_done = np.take(list(self.done_memory), batch_indices)
        batch_pstate = np.array(list(self.pstate_memory))[batch_indices, :]

        # Convert experience into respective tensorflow tensors
        cstates = tf.squeeze(tf.convert_to_tensor(batch_cstate))
        actions = tf.convert_to_tensor(batch_action)
        rewards = tf.convert_to_tensor(batch_reward, dtype=tf.float32)
        dones = tf.convert_to_tensor(batch_done, dtype=tf.int64)
        pstates = tf.squeeze(tf.convert_to_tensor(batch_pstate))

        return (cstates, actions, rewards, dones, pstates)

    def flush_memory(self):

Note: The Experience in this memory is converted to Tensorflow tensors on output to be able to execute the training later as @tf.function. This kind of functionality is a conversion of the operation into a graph, so that contained operations can be executed faster.

The more detailed functionality of @tf.functions can be read here.

The actual agent can be implemented in the following way:
class DRQNAgent:
    def __init__(self, observation_size, action_shape, num_rounds):
        self.observation_size = observation_size
        self.action_shape = action_shape
        self.num_rounds = num_rounds

        # Network parameters
        self.alpha = 0.005
        self.gamma = 0.95
        self.epsilon = 0.8
        self.min_epsilon = 0.05
        self.epsilon_decay = 0.005

        self.batch_size = 64
        self.nonunique_experience = True
        self.memory_length = 10000

        # Internal memory for last states
        self.states = np.zeros((1, self.num_rounds, self.observation_size))

        # ** Initialize replay memory D to capacity N **
        self.experience_memory = ExperienceMemory(memory_length=self.memory_length, batch_size=self.batch_size, nonunique_experience=self.nonunique_experience)

        # ** Initialize action-value function Q with random weights **
        self.model = Sequential(name="DRQN")
        self.model.add(InputLayer(input_shape=(self.num_rounds, self.observation_size), name="Input_Layer"))

        self.r_layer = LSTM(units=24, activation='tanh', name='LSTM_Layer', input_shape=(self.observation_size, self.num_rounds))

        self.model.add(Dense(units=self.action_shape, activation='linear', name='Output_Layer'))

        self.optimizer = Adam(learning_rate=self.alpha)
        self.loss = mean_squared_error
        self.model.compile(loss=self.loss, optimizer=self.optimizer)

        # Define metrics for tensorboard
        self.score_log_path = 'logs/' +"%Y%m%d-%H%M%S")
        self.score_writer = tf.summary.create_file_writer(self.score_log_path)

    def compute_action(self, state, evaluation: bool=False):
        if (np.random.uniform(0, 1) < self.epsilon and evaluation == False):
            return np.random.choice(range(self.action_shape)) # ** With probability epsilon select a random action a_t **
            state = np.reshape(state, [1, self.num_rounds, self.observation_size])
            Q_values = self.model(state)[0]
            return np.argmax(Q_values) # ** Otherwise select a_t = argmax(Q) **

    def update_internal_state(self, pstate):
        self.states = np.roll(self.states, -1, axis=1)
        self.states[:, -1] = pstate

    def clear_internal_state(self):
        self.states = np.zeros((1, self.num_rounds, self.observation_size))

    # Updated Epsilon-Greedy-Update scheme
    def update_epsilon_parameter(self):
        self.epsilon = max(self.epsilon * math.exp(-self.epsilon_decay), self.min_epsilon)

    def store_step(self, cstate, action, reward, done, pstate):
        self.experience_memory.record(cstate, action, reward, done, pstate)

    def train(self): # **For each update step**
        # ** Sample random minibatch of transitions from D**
        b_cstate, b_action, b_reward, b_done, b_pstate = self.experience_memory.return_experience()

        # ** Set y_j **
        prediction_model_p = self.model(b_pstate)
        q_value_network = np.zeros((self.batch_size,))

        for i in range(self.batch_size):
            q_value_network[i] = float(1 - b_done[i]) * self.gamma * np.max(prediction_model_p[i])

        target_q_p = np.add(b_reward, q_value_network)

        # **Perform gradient descent step on Q**
        self.model.train_on_batch(b_cstate, target_q_p) # Performs better
        #, target_q_p, verbose=0, epochs=5)

Here are some changes in contrast to the DQNAgent, which have not been addressed yet:

  1. the form of the input has changed. Here the inputs are not passed in the dimensions [self.num_rounds, self.observation_size] (as easily assumed), but in the form [1, self.num_rounds, self.observation_size]. This has to do with the inputs to the layers.

  2. it is necessary to use some kind of internal memory to cache information from previous frames. In the internal memory is stored after each frame; here the oldest frame is removed.

  3. the internal memory is initialized to 00 at the beginning; the idea is not to provide any information to the algorithm in the first step. There might be better alternatives, but this seems to be the most promising one so far.
def training_loop(env, agent: DRQNAgent, max_frames_episode: int):
  current_obs = env.reset()
  episode_reward = 0

  for j in range(max_frames_episode):
    action = agent.compute_action(agent.states)

    next_obs, reward, done, _ = env.step(action)
    next_obs = np.array(next_obs)

    previous_states = agent.states.copy()
    agent.store_step(previous_states, action, reward, done, agent.states.copy())

    current_obs = next_obs
    episode_reward += reward

    if done:

  return episode_reward, agent

def evaluation_loop(env, agent: DRQNAgent, max_frames_episode: int):
  current_obs = env.reset()
  evaluation_reward = 0

  for j in range(max_frames_episode):
    # ** Execute action a_t in emulator and observe reward r_t and image x_{t+1}
    action = agent.compute_action(agent.states, evaluation=True)

    next_obs, reward, done, _ = env.step(action)
    next_obs = np.array(next_obs)

    # **Storing all information about the last episode in the memory buffer**
    previous_states = agent.states.copy()
    agent.store_step(previous_states, action, reward, done, agent.states.copy())

    # Reseting environment information so that next episode can handle the previous information
    current_obs = next_obs
    evaluation_reward += reward

    if done:

  return evaluation_reward, agent

For the training_loop and evaluation_loop it is important to note that not only the current information is stored in the experience memory, but the complete internal memory of the agent. This is necessary to have a meaningful input for the Deep Recurrent Q-Network during the training phase.
if __name__ == "__main__":
    env = gym.make("CartPole-v1")
    action_shape = env.action_space.n
    observation_shape = env.observation_space.shape[0]
    target_timesteps = 10

    n_episodes, max_frames_episode, avg_length, evaluation_interval = 1000, 500, 50, 10
    episodic_reward, avg_reward, evaluation_rewards = [], [], []

    # Seeding

    agent = DRQNAgent(observation_shape, action_shape, target_timesteps)

    for i in range(n_episodes):
        # **Observing state of the environment**
        episode_reward, agent = training_loop(env=env, agent=agent, max_frames_episode=max_frames_episode)

        print("Training Score in Episode {}: {}".format(i, episode_reward))

        if (i % evaluation_interval == 0):
        # Evaluation loop
            eval_reward, agent = evaluation_loop(env=env, agent=agent, max_frames_episode=max_frames_episode)
            print("Evaluation Score in Episode {}: {}".format(i, eval_reward))

        with agent.score_writer.as_default():
            tf.summary.scalar('Episodic Score', episode_reward, step=i)
            if (i % evaluation_interval == 0):
                tf.summary.scalar('Evaluation Score', eval_reward, step=i)

        avg_reward.append(avg_n(episodic_reward, n=avg_length))

There are no more functional changes in this part. As a small formal change, the number of time steps must be passed here.

The performance can be read off accordingly from the diagrams below. The first diagram corresponds to the time step size 22, the second diagram 55 and the last diagram corresponds to a time step size of 1010.

Finally, here is a comparison of the different time step sizes:

Note: Training performance can sometimes differ greatly from device to device and corresponding seeds. General reproducibility of such results can generally not be guaranteed.

Further information

Recurrent Layers can be initialized with the option ‘stateful=True’. This gives the corresponding layer its own internal memory, which does not need to be implemented. Furthermore, the network is able to process inputs of any length; only the most current time step must be entered into the model as input.

However, this also results in a number of problems and limitations that must be taken into account. Because the network does not know when an episode ends, all internal memories must be reset manually after each episode. Furthermore, the corresponding internal values must also be stored in an experience memory for each step; alternatively, a frame would be taken “out of context”. However, the internal values should also change over the training - so one feeds the network in parts with outdated information.

Both ways of implementation have their kind of advantages and disadvantages, however the version described in this post works relatively well.