 # Deep Q Network - Deep Learning and Reinforcement Learning in a simple way

Artificial neural networks have achieved success in many fields. Here I present the first algorithm for training neural networks for reinforcement learning: the Deep Q-Network. Henrik Bartsch

## Classification

For many years, researchers have been working on solving reinforcement learning tasks. One of the first algorithms to emerge from research in the field of Deep Reinforcement Learning was the Deep Q-Network (DQN for short). 1 Later, this algorithm was also used by Deepmind to play on Atari games and produce comparable or better performance than humans. 2 3

## Mathematical basics

### Concept of the Deep Q-Network

For an introduction to the terminology of “reinforcement learning” problems, this web page can be recommended.

The idea of the Deep Q-Network is to learn and choose the action at each step of the environment with which the maximum return is associated. The return is defined as

$R = \sum_{(t = 0)}^\infty \gamma^t r_t.$

The return is sometimes referred to as expected return or discounted reward.

Here $\gamma \in (0, 1)$ represents the discount factor and $r_t$ the reward by the environment in time step $t$. The discount factor is a quantity that represents how long and strong rewards have an effect on the return.

For $\gamma \rightarrow 1$ rewards always stay in the return and do not decay, for $\gamma \rightarrow 0$ rewards relalistically have an effect on the return only for a few (or extremely: one) time step.

Due to the fact that the neural network tries to approximate a return, one can speak here of a function approximator instead of the neural network. This is because the neural network tries to learn the best possible rewards for all possible actions depending on the observation. So one can call the neural network a

$Q: S_1 \times \cdots \times S_n \rightarrow A_1 \times \cdots \times A_m$

where $S_i, i=1, \cdots, n$ are the sets of different observations (for example, different angles for robots) and $A_j, j=1, \cdots, m$ are the sets of different actions (for example, move arm 1 or 2).

### Update formula In this formula it can be seen that one tries to minimize the loss function $L_i(\theta_i)$ with the network parameters $\theta_i$ of the Deep Q-Network. This is attempted via Mean Squared Error. Classically, this is done not only in the current step, but also via an Experience Replay $D$. 4 5 6

The Experience Replay contains old and new runs of episodes. It is intended to prevent the network from going through Catastrophic Forgetting; that is, the network forgets any progress it has made. There are a number of other functionalities implemented in later algorithms that also try to prevent catastrophic forgetting.

For this purpose, information about old runs is sampled (usually via a discrete uniform distribution of the indices), whereby in general no episodes but steps are extracted. An alternative to this can be found in Possible improvements.

Sampling from Experience Replay has the advantage of decoupling information between several different episodes, usually resulting in better training behaviors. In addition, one also keeps more information per training episode (especially early episodes in training when there are many environments) due to more steps considered in the environment, which increases the quality of the computed gradient. This can be detrimental if the current best step depends on the previous steps; however, Deep Q-Networks are generally not very good in such environments due to poor information merging between multiple steps. Here, one usually uses n-step agents such as the Deep Recurrent Q-Network.

Furthermore, Reward Clipping is used, which transforms $r_t$ to $r_t' \in [0, 1]$ using a linear transformation. This avoids two problems:

1. Neural networks require many learning steps to build weights to generate large values with acceptable error tolerance. If the reward is reduced, this speeds up training.

2. Environments with different sized rewards (for example: at the beginning $\approx 1$, later $\approx 30$) are more unstable. Reward clipping prevents this problem from occurring.

## Implementation using Tensorflow

### Problem

In this implementation, the agent will try to solve OpenAI’s CartPole problem. A categorical policy is used here. For more agent training environments, see here.

### Pseudocode ### Code

It should be noted that this implementation was tested with a Tensorflow version of 2.9.1. For lower versions, the imports may need to be changed from import tensorflow.python.keras.[...] to import tensorflow.keras.[...].

First, the experience memory is implemented.

The experience memory represents a simple container for memories of past transitions. With return_experience we additionally adjust the dimensionality of the output, so that the model accepts this later without problems.

Now we write our class DQNAgent. This contains the functions __init__, compute_action, update_epsilon_parameter, store_step, get_batch and train:

The comments in this function are relatively self-explanatory. The constructor requires the dimensionality of the input and output values, which are later used to initialize the first and last layers of the neural network. The hyperparameters are derived from experience and various hyperparameter optimizations; changing the model may require a different configuration of the hyperparameters. Furthermore, the Experience Replay is initialized and an I/O object for Tensorboard is created, which visualizes the training process. For Tensorflow Tensorboard is one of the best packages to display metrics during the training process.

Many blogs also refer to Seaborn or Matplotlib in this regard. Matplotlib always proved to be a simple and robust tool for visualizations of any kind, Seaborn advertises to be the more modern successor of many statistical visualization tools.

Deque is a dynamic array, which has a certain maximum length. If entries are added after reaching the maximum length, the last entries are replaced by the new entries.

We continue with the function for selecting the action under the current observation. Here an Epsilon Greedy Scheme is used, which selects with probability $\epsilon \in [0, 1]$ a random action, with probability $1 - \epsilon$ the action is selected by the network. This scheme serves the dilemma between exploration vs exploitation, which is authoritative in reinforcement learning. During training, $\epsilon$ should gradually decrease without falling below a threshold $\epsilon_{min}$. This is guaranteed by the update_epsilon_parameter function here. The function compute_action has the additional argument evaluation: bool=False here, which controls the activity of the epsilon greedy scheme.

Next comes the function that interacts with the Experience Replay. The method store_step gets a tuple of all information per time step (and the observation of the following step) and stores it in the Experience Replay. In this case, a direct interface could also be implemented, but this represents a more general interface. This can be modified internally if needed, without having to adapt surrounding code further (as for example later with the training loop).

Here comes the heart of the actual algorithm. In this method, we update the network based on some of the information we have gathered during the training. This function is the implementation of the update step in the pseudo code. Below, the line self.model.fit(b_cstate, target_q_p, verbose=0) is commented out, because mostly the method self.model.train_on_batch(b_cstate, target_q_p) turned out to be a better training method. It is possible that multiple epochs will also give better results on this, currently the network only trains on one episode.

The two functions for training and validating the learning progress follow. The function training_loop is only meant to be repeated over and over again to provide more information to the agent as it “moves” through the environment.

Note: In this implementation there is no update step after each time step, but after each episode.

The function evaluation_loop basically does nothing else, except that this one does not involve exploration.

Here you will find the initialization of all necessary parameters for the training loop and the training loop itself. Seeds were used here so that the training is reproducible, this should be removed in an enterprise environment to keep the possible results flexible. Furthermore you can see here the write operations of Tensorboard. Depending on the system, it may be necessary to initialize the writer. Mostly, however, this should work without this initialization. A visualization of the training performance follows:

Here you can see that within 1000 episodes ($\approx 7$ minutes CPU time) the performance of the model has improved compared to the beginning. There is more potential regarding the performance, with more adapted architectures and even better adapted hyperparameters.

Note: Training performance can sometimes differ greatly from device to device and corresponding seeds. General reproducibility of such results cannot generally be guaranteed.

### Implementation notes

1. tensorflow provides the functionality model.predict(observation) instead of model(observation). See here for this. Problematic here is that the predict function has a significantly larger runtime. So model(observation) should be used, unless there are special reasons why this should not be done.

## Alternatives of this algorithm

As described in the source 2, this network in its basic form has problems approximating the Q-target $y_i$. The reason is that this is not constant and so the network may exhibit instabilities in the learning process.

The paper provides a frozen target network as an extension. This is always updated every $n \in \mathbb{N}$ episode and used to compute the Q-target. The paper shows that this change leads to significantly better results than in the basic form.

Such an update of the network parameters of the Frozen Network every $n$ episodes is also called a Hard Update. Furthermore, a comparison between the average performance of the two algorithms:

Note: Training performance can sometimes differ greatly from device to device and corresponding seeds. General reproducibility of such results can generally not be guaranteed.

An improvement in training performance would be expected here. This cannot be seen here. Despite this fact, a reduction in variance can be seen.

## Implementation using Pytorch

After the detailed implementation using Tensorflow, reference is made to this website. There you can find a relatively good implementation using Pytorch, which technically should be able to reproduce the results.

## Possible improvements

One way to improve the training process is to use Importance Experience Replay. Importance Experience Replay improves the process by discarding the discrete uniform distribution used to sample the information from the Experience Replay. Instead, a probability is computed for each time step in the Experience Replay, which depends on the deviation between predicted and actual return. This means that potentially important time steps for the agent are repeated more often, instead of time steps that the agent already approximates well.

## Outlook

While the Deep Q-Network is no longer the most current or state-of-the-art algorithm for solving reinforcement learning problems, it provides the basis for many advanced algorithms such as D3QN or actor-critic architectures (where a DQN “scores” the observation-action pair). Accordingly, understanding this algorithm is central to advanced understanding of advanced reinforcement learning algorithms.

## Footnotes 