Deep Recurrent Q Network - DQN with a look into the past
Deep Q-Networks sometimes need information from several time steps to converge quickly. The Deep Recurrent Q-Network is one way to provide it.
Henrik Bartsch
Classification
A recent post explained the approach and implementation of Deep Q-Networks. However, the DQN is only one of the basic algorithms in Reinforcement Learning, and research in the field has produced a number of possible improvements. One of them is the Deep Recurrent Q-Network, which uses recurrent layers to process information from several time steps at once and to better capture the interactions between them.
Basics
Deep Q-Networks have a number of practical limitations. In the following, we present the most important one, which can be addressed by Deep Recurrent Q-Networks:
Suppose an algorithm is supposed to learn the game Pong (https://en.wikipedia.org/wiki/Pong), a competitive two-player arcade game in which each player controls a paddle and uses it to hit a ball across the playing field. The goal of each player is to play the ball in such a way that the opponent cannot intercept it before it leaves the playing field on the opponent's side; in that case the player scores a point. One possible modeling is to always pass only the current observation (i.e., the state of the screen as an RGB array) to the agent. Furthermore, let us assume we are at a fixed point in time and see the ball as in the picture below.
Now a question arises: how will the ball move next, given only the current state? The answer quickly becomes clear: there is no deterministic answer to this question. To answer it, an observer needs at least two time steps (and information about the width of the time step) in order to calculate a trajectory and a velocity. One idea would therefore be to define the Deep Q-Network as a functional model that receives several observations at once. A functional model here is a model that can have multiple inputs and outputs, so its layer graph is not necessarily a simple chain, as it is in the classical sequential model.
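As a minimal sketch of this idea (the layer sizes, input names and action count are purely illustrative and not taken from the original DQN post), such a multi-input feed-forward model could be built with the Keras functional API:

```python
import tensorflow as tf
from tensorflow import keras

obs_size = 8      # assumed size of a flattened observation
num_actions = 4   # assumed number of actions

# Two separate inputs: the current frame and the previous frame.
frame_t = keras.Input(shape=(obs_size,), name="frame_t")
frame_t_minus_1 = keras.Input(shape=(obs_size,), name="frame_t_minus_1")

# The frames are merged and processed by ordinary feed-forward layers.
x = keras.layers.Concatenate()([frame_t, frame_t_minus_1])
x = keras.layers.Dense(64, activation="relu")(x)
q_values = keras.layers.Dense(num_actions, activation="linear")(x)

model = keras.Model(inputs=[frame_t, frame_t_minus_1], outputs=q_values)
```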
TensorFlow users who have already worked with recurrent models will know that this approach works, but it is inefficient. Recurrent models are better suited to such problems than their feed-forward counterparts because of their special structure.
Recurrent models receive as input (if defined accordingly) information from several time steps preceding the current one. Inside the recurrent layers, the interaction between the individual time steps is modeled in order to achieve better results on problems such as sequence forecasting or time series forecasting.
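A small, self-contained illustration of this input format (all numbers are arbitrary): a recurrent layer consumes an entire sequence of time steps in one call.

```python
import tensorflow as tf
from tensorflow import keras

lstm = keras.layers.LSTM(32)

# One sample consisting of 5 time steps with 8 features each: shape (batch, time, features).
sequence = tf.random.normal(shape=(1, 5, 8))
output = lstm(sequence)

print(output.shape)  # (1, 32): one summary vector per sequence
```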
A more difficult variant of this setup was used in [3]. There, Pong was implemented with the peculiarity that frames were occasionally passed to the agent empty (i.e., completely without information). An algorithm like the Deep Q-Network, which is designed for simple state-to-state transitions, cannot make a meaningful decision in this case. A Deep Recurrent Q-Network that accepts, for example, five time steps as input can still act meaningfully, because it has all the necessary information and can potentially interpolate the position of the ball. The authors also show that the recurrent model prevails over the feed-forward model.
Implementation
This post presents an implementation of a Deep Recurrent Q-Network that uses LSTM layers. Alternative layers for this type of task include the Gated Recurrent Unit (GRU) or the Simple Recurrent Neural Network layer (SimpleRNN).
The imports are identical to the implementation of a DQN:
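Since the original import block is not reproduced here, the following is only a plausible set of imports for the code sketches in this post; the environment library (gym) in particular is an assumption carried over from the DQN setting.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf
from tensorflow import keras

import gym  # assumed environment library, as in the earlier DQN post
```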
An implementation of an outsourced Experience Replay is straightforward:
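A possible sketch of such an outsourced experience memory; the class and method names are illustrative, and the stored states are the agent's complete internal memories of shape [num_rounds, observation_size], as explained further below.

```python
class ExperienceMemory:
    """Fixed-size replay memory that returns its samples as TensorFlow tensors."""

    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # state / next_state: the agent's internal memory, shape [num_rounds, observation_size]
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        # Convert to tensors so that the training step can later run as a @tf.function graph.
        return (tf.convert_to_tensor(np.array(states, dtype=np.float32)),
                tf.convert_to_tensor(np.array(actions, dtype=np.int32)),
                tf.convert_to_tensor(np.array(rewards, dtype=np.float32)),
                tf.convert_to_tensor(np.array(next_states, dtype=np.float32)),
                tf.convert_to_tensor(np.array(dones, dtype=np.float32)))

    def __len__(self):
        return len(self.memory)
```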
Note: The experiences in this memory are converted to TensorFlow tensors on output so that the training can later be executed as a @tf.function. This functionality converts the operation into a graph, so that the contained operations can be executed faster. More details on @tf.function can be found in the TensorFlow documentation.
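As a minimal, generic illustration of this mechanism (unrelated to the agent itself):

```python
import tensorflow as tf

@tf.function
def mean_squared_difference(a, b):
    # Traced into a static graph on the first call; later calls reuse that graph.
    return tf.reduce_mean(tf.square(a - b))

print(mean_squared_difference(tf.constant([1.0, 2.0]), tf.constant([0.5, 1.5])))
```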
The actual agent can be implemented in the following way:
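Since the original code is not reproduced here, the following is a condensed sketch of how such an agent might look; the hyperparameters, the missing target network, and names such as DRQNAgent, act and train_step are assumptions, not the post's original implementation.

```python
class DRQNAgent:
    """Sketch of a Deep Recurrent Q-Network agent (simplified, no target network)."""

    def __init__(self, observation_size, num_actions, num_rounds=5,
                 gamma=0.99, learning_rate=1e-3, epsilon=1.0, epsilon_decay=0.995):
        self.observation_size = observation_size
        self.num_actions = num_actions
        self.num_rounds = num_rounds      # number of time steps fed into the recurrent layer
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay

        # Internal memory: the last num_rounds observations, initialized with zeros.
        self.internal_memory = np.zeros((num_rounds, observation_size), dtype=np.float32)

        self.model = self._build_model()
        self.optimizer = keras.optimizers.Adam(learning_rate)
        self.loss_fn = keras.losses.MeanSquaredError()

    def _build_model(self):
        return keras.Sequential([
            keras.layers.Input(shape=(self.num_rounds, self.observation_size)),
            keras.layers.LSTM(64),
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(self.num_actions, activation="linear"),
        ])

    def reset_internal_memory(self):
        self.internal_memory = np.zeros_like(self.internal_memory)

    def update_internal_memory(self, observation):
        # Drop the oldest frame and append the newest one.
        self.internal_memory = np.roll(self.internal_memory, shift=-1, axis=0)
        self.internal_memory[-1] = observation

    def act(self, observation):
        self.update_internal_memory(observation)
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.num_actions)
        # Note the input shape [1, self.num_rounds, self.observation_size]: a batch of one sequence.
        q_values = self.model(self.internal_memory[np.newaxis, ...])
        return int(tf.argmax(q_values[0]))

    @tf.function
    def train_step(self, states, actions, rewards, next_states, dones):
        # states / next_states: [batch, num_rounds, observation_size]
        next_q = tf.reduce_max(self.model(next_states), axis=1)
        targets = rewards + self.gamma * next_q * (1.0 - dones)
        with tf.GradientTape() as tape:
            q_values = self.model(states)
            chosen_q = tf.reduce_sum(q_values * tf.one_hot(actions, self.num_actions), axis=1)
            loss = self.loss_fn(targets, chosen_q)
        gradients = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
        return loss
```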
Compared to the DQNAgent, there are a few changes that have not been addressed yet:

- The form of the input has changed. The inputs are not passed with the dimensions [self.num_rounds, self.observation_size] (as one might assume), but in the form [1, self.num_rounds, self.observation_size]. This has to do with the input shape the recurrent layers expect (see the small sketch after this list).
- It is necessary to use some kind of internal memory to cache information from previous frames. This internal memory is updated after each frame; in the process, the oldest frame is removed.
- The internal memory is initialized with zeros at the beginning; the idea is to give the algorithm no information in the first step. There might be better alternatives, but so far this seems to be the most promising one.
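The following minimal sketch (with arbitrary numbers) illustrates both points at once: the rolling update of the internal memory and the extra leading batch dimension expected by the model.

```python
import numpy as np

memory = np.zeros((5, 3), dtype=np.float32)        # num_rounds = 5, observation_size = 3
new_frame = np.array([1.0, 2.0, 3.0], dtype=np.float32)

memory = np.roll(memory, shift=-1, axis=0)         # the oldest frame falls out
memory[-1] = new_frame                             # the newest frame enters at the end

model_input = memory[np.newaxis, ...]              # shape (1, 5, 3): [1, num_rounds, observation_size]
```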
For the training_loop and evaluation_loop it is important to note that not only the current observation is stored in the experience memory, but the complete internal memory of the agent. This is necessary to have a meaningful input for the Deep Recurrent Q-Network during the training phase. There are no further functional changes in this part; as a small formal change, the number of time steps must now be passed here as well.
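Since the original loop is not reproduced here, the following is only a rough sketch of how such a training loop could store the complete internal memory; it builds on the hypothetical ExperienceMemory and DRQNAgent sketches above and assumes a gym/gymnasium-style environment API.

```python
def training_loop(env, agent, memory, episodes=500, batch_size=64):
    for episode in range(episodes):
        observation, _ = env.reset()          # gymnasium-style reset: returns (observation, info)
        agent.reset_internal_memory()
        done = False
        while not done:
            action = agent.act(observation)             # rolls `observation` into the internal memory
            state = agent.internal_memory.copy()        # complete memory used for this decision
            observation, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # The next state is the internal memory as it will look at the next decision.
            next_state = np.roll(state, shift=-1, axis=0)
            next_state[-1] = observation
            memory.store(state, action, reward, next_state, done)

            if len(memory) >= batch_size:
                agent.train_step(*memory.sample(batch_size))
        agent.epsilon *= agent.epsilon_decay
```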
The performance can be read off from the diagrams below. The first diagram corresponds to a time step size of 2, the second to a time step size of 5, and the last to a time step size of 10.
Finally, here is a comparison of the different time step sizes:
Note: Training performance can sometimes differ greatly from device to device and between seeds. General reproducibility of such results can therefore not be guaranteed.
Further information
Recurrent layers can be initialized with the option stateful=True. This gives the corresponding layer its own internal memory, which then does not need to be implemented by hand. Furthermore, the network is then able to process sequences of any length; only the most recent time step has to be fed into the model as input.
However, this also results in a number of problems and limitations that must be taken into account. Because the network does not know when an episode ends, all internal memories must be reset manually after each episode. Furthermore, the corresponding internal values must also be stored in the experience memory for each step; otherwise a frame would be taken "out of context" during training. These internal values, however, also change over the course of training, so the network is partly fed with outdated information.
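A minimal sketch of this alternative (layer sizes and shapes are arbitrary): the model is fed one time step at a time, and the stateful layer has to be reset by hand at the end of each episode.

```python
import tensorflow as tf
from tensorflow import keras

lstm_layer = keras.layers.LSTM(64, stateful=True)    # keeps its hidden state between calls

model = keras.Sequential([
    keras.layers.Input(shape=(1, 8), batch_size=1),  # one time step of 8 features, fixed batch size 1
    lstm_layer,
    keras.layers.Dense(4, activation="linear"),
])

# ... run an episode, feeding one observation per call to model(...) ...

# The layer does not know when the episode ends, so its memory is cleared manually:
lstm_layer.reset_states()
```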
Both ways of implementation have their own advantages and disadvantages; however, the version described in this post works relatively well.