Grafik von Sigmund –

Q-Learning: The Basics of Reinforcement Learning

Q-Learning was one of the first practical reinforcement learning algorithms. This post introduces this algorithm.

Henrik Bartsch

Henrik Bartsch

The texts in this article were partly generated by artificial intelligence and corrected and revised by us. The following services were used for the generation:


Reinforcement learning has attracted a lot of attention. Models such as AlphaGo and AlphaStar have achieved great success and have greatly increased the interest in corresponding models in various fields. Today I present one of the most fundamental algorithms that reinforcement learning produced in its early days: Q-learning.

Q-learning is sometimes called Q-table, in reference to the matrix that underlies the method. In the following, the term Q-learning will be used.

Mathematical basics

Q-learning implementations are often a simple matrix initialized in one dimension with the number of possible input observations and in the other dimension with zeros for the number of possible actions. Subsequently, by the update formula

Q(st,at)Q(st,at)+α[rt+γmaxaAQ(st+1,a)Q(st,at)]Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a' \in A} Q(s_{t + 1}, a') - Q(s_t, a_t) \right]

the state values of corresponding observations are adjusted iteratively. This equation is also referred to as the Bellmann equation.

In the equation above, sts_t stands for the observation at time tNt \in \mathbb{N}, atAa_t \in A for the chosen action at time tt, and rtr_t for the reward granted at time tt after applying action ata_t. The space of all possible actions is denoted by AA. 1 2 3

In addition, Q-Learning also uses Epsilon-Greedy for exploration. This is a random sampling in the range [0,1][0, 1]. If the drawn sample is smaller than Epsilon, a random action is selected, otherwise an action is selected after evaluation. 4

In contrast to many other reinforcement learning algorithms based on artificial neural networks, it could be shown that Q-learning can in principle show guaranteed convergence. However, this is possible under very technical conditions, which are mostly ignored in reality. More information in this source.

In its basic version, Q-Learning can only handle discrete state and action spaces. Corresponding workarounds will be presented in the course of the post.

The parameters

The learning rate

As with the rest of the reinforcement learning algorithms, the parameter α\alpha also describes the learning rate, which should be in the range α[0,1]\alpha \in [0,1].

This parameter describes the weighting with which new information is processed in the algorithm. Higher values of alpha weight new information (significantly) more than old information than would be the case with (very) small values.

In the case of α=0\alpha = 0, the agent learns nothing, since changes made by the update formula cannot be incorporated into the Q-table.

In fully deterministic environments, a learning rate of α=1\alpha = 1 is optimal. In stochastic environments, this value should be lower.

Despite the convergence guarantee under technical conditions, a constant, given value is usually used in reality. Even without a convergence guarantee, the algorithm often converges and behaves robustly in the learning phase. 5


The value γ\gamma as the discounting factor determines an appropriate target size to be maximized. With large gamma values, the algorithm tries to maximize long-term rewards, while a small gamma tends to lead to short-term optimizations.

If gamma is chosen greater than or equal to one, divergence in the algorithm may occur and the state values will increasingly approach infinity. Therefore, Gamma should be chosen smaller than one. 1

Initial conditions

Due to the iterative nature of the algorithm, an initial condition must be entered in the first iteration of the algorithm. This condition forms the basis for solving the problem.

High initial condition values may make the algorithm more likely to try new combinations, since tried observation-action combinations quickly have lower values due to the updating formula, and accordingly new actions are selected with higher probability. 1

Author’s note: In practice, it is common to simply initialize a zero matrix, as this is not very costly. The results are usually good.

Q-learning on continuous spaces

Q-learning fundamentally has the limitation that very large dimensions or infinite points for the number of observations and actions require large amounts of memory or are not directly supported. Therefore, to make Q-learning work on continuous spaces, there are basically two different possibilities.

Discretization of the space

To get around the problem of a discrete space with infinitely many points, the space can be assumed to be a set of finite points relatively easily.

This can also be used for very large rooms (with very many data points). In this case, several data points are combined into one data point.

Function approximation on the space

In principle, the problem can also be solved by using an adapted neural network for regression to transform the continuous input into a discrete input or the (discrete) output into a continuous output. 1

Author’s note: Even though both methods contribute to the solution of the problem in their basic principles, they partly lead to inefficient learning processes - caused, among other things, by the curse of dimensionality.


In the following we implement a simple version of Q-Learning, which should solve the Taxi problem. We start with the necessary imports and em Experience Buffer.
import numpy as np
import gym
class TrajectoryExperienceMemory:
    def __init__(self):
        self.cstate_memory, self.action_memory, self.reward_memory, self.pstate_memory = [], [], [], []

    def record(self, cstate, action, reward, pstate):

    def flush_memory(self):
        self.cstate_memory, self.action_memory, self.reward_memory, self.pstate_memory = [], [], [], []

    def return_experience(self):
        batch_cstates = np.array(self.cstate_memory, dtype=np.int64)
        batch_actions = np.array(self.action_memory, dtype=np.int64)
        batch_rewards = np.array(self.reward_memory, dtype=np.float64)
        batch_pstates = np.array(self.pstate_memory, dtype=np.int64)

        return (batch_cstates, batch_actions, batch_rewards, batch_pstates)

Next, the algorithm for Q-learning is implemented here. This version contains a linear ϵ\epsilon-decay with a relatively small decay value.
class QAgent:
    def __init__(self, observations: int, actions: int) -> None:
        self.observations = observations
        self.actions = actions

        # Learning Parameters
        self.alpha, self.gamma = 0.8, 0.99

        # Exploration Parameters
        self.epsilon, self.epsilon_decay = 1, 0.0001

        # Initialize Q-Matrix
        self.q = np.zeros((self.observations, self.actions))

        # Initialize Experience Replay
        self.memory = TrajectoryExperienceMemory()

    def sample_action(self, observation: int):
        if (np.random.random() < self.epsilon):
            self.epsilon -= self.epsilon_decay
            return np.random.randint(0, self.actions - 1)
            return np.argmax(self.q[observation])
    def store_step(self, cstate: int, action: int, reward: float, pstate: int):
        self.memory.record(cstate, action, reward, pstate)

    def update(self):
        cstates, actions, rewards, pstates = self.memory.return_experience()

        for i in range(len(cstates)):
            # Perform Q-Update according to Bellman Equation
            self.q[cstates[i], actions[i]] = self.q[cstates[i], actions[i]] + self.alpha * (rewards[i] + self.gamma * np.max(self.q[pstates[i]]) - self.q[cstates[i], actions[i]]) 

The last thing that is implemented here is the Training Loop to implement the training.
env = gym.make("FrozenLake-v1")
agent = QAgent(16, 4)

episodes = 10000
episodic_rewards = []

for episode in range(episodes):
    cstate, _ = env.reset()
    done, episodic_reward = False, 0

    while (done == False):
        cstate = int(cstate)
        action = agent.sample_action(cstate)
        pstate, reward, done, _, _ = env.step(action)
        episodic_reward += reward

        agent.store_step(cstate, action, reward, pstate)

        cstate = pstate

    print(f"Episodic Reward in Episode {episode}: {episodic_reward}")

The results of an implementation without ϵ\epsilon decay are visualized below:

Epsilon Decay

Furthermore, it is common to control the ϵ\epsilon value via training by a decay, similar to the Deep Q-Network. Here it is equally important to set a minimum value of ϵmin\epsilon_{min} for ϵ\epsilon so that some level of exploration is always maintained.

For example, a decay can be achieved via an exponential or linear decay. Results on this have been visualized below:

Typically, these results are better with a ϵ\epsilon decay, due to more efficient exploration at the beginning of the training and more efficient exploitation in later episodes of the training. This is not the case here, but should be classically implemented with.



  1. 2 3 4