The texts in this article were partly generated by artificial intelligence and corrected and revised by us. The following services were used for the generation:

## Background

*Reinforcement learning* has attracted a lot of attention. Models such as AlphaGo and AlphaStar have achieved great success and have greatly increased interest in such methods across
various fields. Today I present one of the most fundamental algorithms that *reinforcement learning* produced in its early days: *Q-learning*.

Q-learning is sometimes called *Q-table*, in reference to the matrix that underlies the method. In the following, the term *Q-learning* will be used.

## Mathematical basics

Q-learning implementations often consist of a simple matrix, initialized with zeros, with one dimension for the number of possible observations and the other for the number of possible actions. Subsequently, using the update formula

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a' \in A} Q(s_{t + 1}, a') - Q(s_t, a_t) \right]$

the values of the corresponding observation-action pairs are adjusted iteratively. This equation is also referred to as the *Bellman equation*.

In the equation above, $s_t$ stands for the observation at time $t \in \mathbb{N}$, $a_t \in A$ for the chosen action at time $t$, and $r_t$ for the reward granted at time $t$ after
applying action $a_t$. The space of all possible actions is denoted by $A$. ^{1} ^{2} ^{3}
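A minimal sketch of what such a table and the update formula could look like in code; the table sizes and hyperparameter values are illustrative assumptions:

```python
import numpy as np

n_observations, n_actions = 500, 6                 # illustrative sizes
q_table = np.zeros((n_observations, n_actions))    # one row per observation, one column per action

def update(q_table, s_t, a_t, r_t, s_next, alpha=0.1, gamma=0.99):
    """Apply the update formula above to a single (s_t, a_t, r_t, s_{t+1}) transition."""
    td_target = r_t + gamma * np.max(q_table[s_next])          # r_t + gamma * max_a' Q(s_{t+1}, a')
    q_table[s_t, a_t] += alpha * (td_target - q_table[s_t, a_t])
```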

In addition, Q-learning uses *epsilon-greedy* exploration. A *random sample* is drawn from the range $[0, 1]$; if the drawn sample is smaller than epsilon, a random action is selected,
otherwise the action with the highest Q-value for the current observation is selected. ^{4}
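A brief sketch of this selection rule; the epsilon value and random generator are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_table, s_t, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:                 # drawn sample in [0, 1)
        return int(rng.integers(q_table.shape[1]))   # random action
    return int(np.argmax(q_table[s_t]))        # action with the highest Q-value
```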

In contrast to many other reinforcement learning algorithms based on artificial neural networks, Q-learning can be shown to converge in principle with guarantees. However, this only holds under very technical conditions, which are mostly ignored in practice. More information can be found in this source.

In its basic version, Q-learning can only handle discrete state and action spaces. Corresponding workarounds will be presented in the course of this post.

## The parameters

### The learning rate

As with other reinforcement learning algorithms, the parameter $\alpha$ describes the learning rate, which should lie in the range $\alpha \in [0, 1]$.

This parameter describes the weighting with which new information is incorporated into the algorithm. Higher values of $\alpha$ weight new information (significantly) more heavily relative to old information than (very) small values do.

In the case of $\alpha = 0$, the agent learns nothing, since the update formula then leaves the Q-table unchanged.

In fully deterministic environments, a learning rate of $\alpha = 1$ is optimal. In stochastic environments, this value should be lower.
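Substituting the two extreme values into the update formula shows the effect directly: with $\alpha = 0$ the old value is kept unchanged, with $\alpha = 1$ it is completely replaced by the new estimate.

$\alpha = 0\!: \; Q(s_t, a_t) \leftarrow Q(s_t, a_t), \qquad \alpha = 1\!: \; Q(s_t, a_t) \leftarrow r_t + \gamma \max_{a' \in A} Q(s_{t + 1}, a')$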

Although convergence is only guaranteed under technical conditions on the learning rate (for example a suitably decaying schedule), a constant, fixed value is usually used in practice. Even without a convergence guarantee, the algorithm often converges and behaves robustly during the learning phase. ^{5}

### Gamma

The value $\gamma$, the *discount factor*, determines which quantity is maximized. With large gamma values, the algorithm tries to maximize long-term rewards, while a small gamma
tends to lead to short-term optimization.

If gamma is chosen greater than or equal to one, the algorithm may diverge and the state values increasingly approach *infinity*. Therefore, gamma should be chosen smaller than
one. ^{1}
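For illustration, consider a constant reward $r > 0$ in every time step: the discounted sum of future rewards only remains finite for $\gamma < 1$.

$\sum_{k = 0}^{\infty} \gamma^k \, r = \frac{r}{1 - \gamma} \quad \text{for } \gamma < 1, \qquad \sum_{k = 0}^{\infty} \gamma^k \, r = \infty \quad \text{for } \gamma \geq 1$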

### Initial conditions

Due to the iterative nature of the algorithm, an initial condition must be provided for the first iteration. This condition forms the basis from which the problem is solved.

High initial values can make the algorithm more likely to try new combinations, since observation-action combinations that have already been tried quickly receive lower values through the update formula,
and untried actions are accordingly selected with higher probability. ^{1}

Author’s note: In practice, it is common to simply initialize a zero matrix, as this is not very costly. The results are usually good.
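A brief sketch of the two variants; the table sizes and the optimistic starting value are illustrative assumptions:

```python
import numpy as np

n_observations, n_actions = 500, 6   # illustrative sizes

# Common default: start all Q-values at zero.
q_table = np.zeros((n_observations, n_actions))

# Optimistic initialization: high starting values make untried actions look
# attractive until the update formula pulls visited pairs down to realistic values.
q_table_optimistic = np.full((n_observations, n_actions), 10.0)
```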

## Q-learning on continuous spaces

Q-learning fundamentally has the limitation that very large or infinite numbers of observations and actions either require large amounts of memory or are not directly supported. To make Q-learning work on continuous spaces, there are therefore essentially two approaches.

### Discretization of the space

To get around the problem of a continuous space with infinitely many points, the space can relatively easily be approximated by a finite set of points.

This can also be used for very large spaces (with very many data points). In this case, several data points are combined into one data point.
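A minimal sketch of such a discretization, assuming a one-dimensional continuous observation in the range $[-1, 1]$ and an arbitrarily chosen number of bins:

```python
import numpy as np

n_bins = 20
bins = np.linspace(-1.0, 1.0, n_bins + 1)[1:-1]   # inner bin edges over the assumed value range

def discretize(continuous_obs: float) -> int:
    """Map a continuous observation to the index of its bin."""
    return int(np.digitize(continuous_obs, bins))  # index in [0, n_bins - 1]

# Example: nearby continuous values fall into the same discrete state.
print(discretize(0.03), discretize(0.05))
```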

### Function approximation on the space

In principle, the problem can also be solved by using a suitable neural network for regression to transform the continuous input into a discrete one or the (discrete) output into a continuous
output. ^{1}
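As a rough sketch of this idea, assuming PyTorch; the architecture and sizes are illustrative and not the implementation discussed in this post:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a continuous observation vector to one Q-value per discrete action."""

    def __init__(self, obs_dim: int = 4, n_actions: int = 6, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)   # shape: (batch, n_actions)
```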

Author’s note: Even though both methods address the problem in principle, they can lead to inefficient learning processes, caused among other things by the curse of dimensionality.

## Implementation

In the following, we implement a simple version of *Q-learning* that is intended to solve the *Taxi* problem. We start with the necessary imports and an *experience buffer*.
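A minimal sketch of what the imports and a simple experience buffer could look like; the use of `gymnasium` and the buffer design are assumptions:

```python
import random
from collections import deque, namedtuple

import numpy as np
import gymnasium as gym   # provides the Taxi environment

# A single transition: observation, action, reward, next observation, done flag
Transition = namedtuple("Transition", ["obs", "action", "reward", "next_obs", "done"])

class ExperienceBuffer:
    """Simple FIFO buffer that keeps the most recent transitions."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```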

Next, the algorithm for *Q-learning* is implemented here. This version contains a linear $\epsilon$-decay with a relatively small decay value.
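A sketch of how this part could look, building on the imports above; all hyperparameter values are illustrative assumptions:

```python
class QLearningAgent:
    """Tabular Q-learning with epsilon-greedy exploration and linear epsilon decay."""

    def __init__(self, n_observations, n_actions,
                 alpha=0.1, gamma=0.99,
                 epsilon=1.0, epsilon_min=0.05, epsilon_decay=1e-4):
        self.q_table = np.zeros((n_observations, n_actions))
        self.alpha, self.gamma = alpha, gamma
        self.epsilon, self.epsilon_min, self.epsilon_decay = epsilon, epsilon_min, epsilon_decay
        self.rng = np.random.default_rng()

    def act(self, obs):
        # Epsilon-greedy action selection
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.q_table.shape[1]))
        return int(np.argmax(self.q_table[obs]))

    def learn(self, obs, action, reward, next_obs, done):
        # Bellman update for a single transition
        target = reward if done else reward + self.gamma * np.max(self.q_table[next_obs])
        self.q_table[obs, action] += self.alpha * (target - self.q_table[obs, action])

    def decay_epsilon(self):
        # Linear decay with a fixed minimum
        self.epsilon = max(self.epsilon - self.epsilon_decay, self.epsilon_min)
```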

The last thing implemented here is the training loop that carries out the training.
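A sketch of the training loop, building on the previous snippets; the episode count and the `Taxi-v3` environment name are assumptions:

```python
env = gym.make("Taxi-v3")
agent = QLearningAgent(env.observation_space.n, env.action_space.n)
buffer = ExperienceBuffer()

n_episodes = 5_000
episode_rewards = []

for episode in range(n_episodes):
    obs, _ = env.reset()
    done, total_reward = False, 0.0

    while not done:
        action = agent.act(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        buffer.push(obs, action, reward, next_obs, done)   # record the transition
        agent.learn(obs, action, reward, next_obs, done)   # online Q-update

        obs = next_obs
        total_reward += reward

    agent.decay_epsilon()
    episode_rewards.append(total_reward)
```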

The results of an implementation without $\epsilon$ decay are visualized below:

### Epsilon Decay

Furthermore, it is common to control the $\epsilon$ value over the course of training via a decay, similar to the *Deep Q-Network*. Here it is equally important to set a minimum value $\epsilon_{min}$
for $\epsilon$ so that some level of exploration is always maintained.
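A brief sketch of two common schedules; the decay values and the minimum are illustrative assumptions:

```python
def linear_decay(epsilon, decay=1e-4, epsilon_min=0.05):
    """Subtract a fixed amount per episode, down to a minimum."""
    return max(epsilon - decay, epsilon_min)

def exponential_decay(epsilon, factor=0.999, epsilon_min=0.05):
    """Multiply by a constant factor per episode, down to a minimum."""
    return max(epsilon * factor, epsilon_min)
```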

For example, the decay can be exponential or linear. Results for this are visualized below:

Typically, results are better with an $\epsilon$ decay, due to more efficient exploration at the beginning of training and more efficient exploitation in later episodes. That is not the case here, but the decay should classically be implemented nonetheless.