Graphics from Shubham Dhage –

Introduction to Reinforcement Learning

This post is intended to introduce the reader to the basic topic of reinforcement learning. The introduction clarifies important terms and basic concepts that are used in Reinforcement Learning.

Henrik Bartsch


Machine Learning has become relatively popular in recent years, especially in connection with topics such as Big Data. An important research area of Machine Learning in this field is Supervised Learning, which is very good at making predictions on existing data sets. Research in this area is relatively advanced, but its methods are not always applicable.

Supervised Learning is less applicable, or not applicable at all, to some real-world problems. Examples include the control of machines or control loops and the interaction with complex systems. What these problems have in common is that they are better solved by using feedback from the environment for training and task completion. Furthermore, optimal solutions are not necessarily known, because what humans do is not always optimal.

Although humans usually perform well interacting with such systems, there is usually potential for improvement. Finding this potential for improvement is an interesting and complex area of research.

This is where Reinforcement Learning comes in: its training algorithms are specifically designed to use feedback. From this feedback, combinations of actions can be derived that (depending on the current observations) provide good performance.

In the following, action combinations are called action trajectories.

Like Supervised Learning, Reinforcement Learning uses artificial neural networks. In contrast to Supervised Learning, however, only a few Reinforcement Learning algorithms are known that can accomplish their tasks without artificial neural networks, and these usually come with significant additional limitations. 1

The idea of Reinforcement Learning.

In Reinforcement Learning one thing is particularly important: feedback through interaction with an environment that abstracts the problem. The degree of abstraction changes the learning behavior and the necessary runtime the model needs to deliver a “good performance”.

The degree of abstraction of the problem is a primary modeling decision: it can improve or reduce the real-world performance of such a machine learning model.

What constitutes “good performance” in this problem is not directly defined here - the definition of this is up to the user. Often statistics or runtime metrics are used to make a final selection.

The feedback

The feedback in a reinforcement learning agent is usually defined via a reward function. This function does not necessarily have to be known externally - i.e., from the user’s point of view - but should be known when defining and modeling the environment.
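As a minimal sketch of this idea, the following Python snippet models a toy environment whose reward function is defined inside the environment, together with a loop in which a policy interacts with it. The environment `LineWorld`, the helper `run_episode` and the reward values are hypothetical and chosen only for illustration; they are not taken from a specific library.

```python
import random


class LineWorld:
    """Hypothetical toy environment: positions 0..size-1, goal at the right end."""

    def __init__(self, size=5):
        self.size = size
        self.position = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        return self.position

    def reward(self, position):
        """Reward function, defined as part of the environment model (assumed values)."""
        return 1.0 if position == self.size - 1 else -0.1

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (observation, reward, done)."""
        self.position = min(max(self.position + action, 0), self.size - 1)
        done = self.position == self.size - 1
        return self.position, self.reward(self.position), done


def run_episode(env, policy, max_steps=100):
    """Let a policy interact with the environment and collect the rewards."""
    observation = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


def random_policy(observation):
    """Untrained agent: ignores the observation and picks a random direction."""
    return random.choice([-1, 1])
```

Calling `run_episode(LineWorld(), random_policy)` returns the list of rewards the agent received. The agent only ever sees these numbers, not the reward function itself.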

The goal of the agent is to maximize, through repeated interaction, either the reward at each step or, alternatively, a discounted reward, also called the return.

The return is a reward metric that adds the rewards of future steps to the current reward. Since future steps play a role here, returns are usually computed after the interaction with the environment rather than during it - for example, whenever the neural network is to be updated. 2 3
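One common way to compute a discounted return after an episode has finished is to walk backwards through the collected rewards. The sketch below assumes the usual definition G_t = r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + ...; the function name `discounted_returns` and the default value of the discount factor `gamma` are illustrative assumptions.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return for every step of a finished episode.

    Works backwards, using G_t = r_t + gamma * G_(t+1).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with reward 1.0 each and a discount factor of 0.5
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

A discount factor below 1 weights near-term rewards more heavily than distant ones; with `gamma=1.0` the return is simply the sum of all remaining rewards.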

The so-called Reward Hypothesis plays a large role here:

All goals that can be achieved by the agent can be described as maximizing the return. 4

Exploration vs. Exploitation

A central algorithmic problem is the trade-off between Exploration and Exploitation. It describes how much the agent should concentrate on finding new information and trajectories (Exploration) and how much it should rely on existing information and reinforce it through training (Exploitation).

Exploring the environment, and finding “good” action trajectories for the current time and information, requires an exploration component. This is usually a random component that either determines which action is selected or adds a (small) disturbance to a probability distribution over the actions.

Exploitation poses no additional difficulty, because it is obtained from Exploration by removing or suppressing the random component. 3
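One widely used exploration component of this kind is the epsilon-greedy rule. It is shown here only as an illustrative sketch; the function name `epsilon_greedy` and the parameter `epsilon` are assumptions, and the post does not prescribe this particular strategy. With probability `epsilon` a random action is chosen (Exploration), otherwise the action with the highest estimated value is chosen (Exploitation).

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Select an action index from a list of estimated action values.

    With probability epsilon a random action is returned (exploration);
    otherwise the action with the highest estimate is returned (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


estimates = [0.1, 0.9, 0.3]

# epsilon = 0 suppresses the random component: pure exploitation
print(epsilon_greedy(estimates, epsilon=0.0))  # 1

# epsilon = 0.2: about one in five choices is random
print(epsilon_greedy(estimates, epsilon=0.2))
```

Setting `epsilon` to zero removes the random component, which corresponds to pure Exploitation as described above.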

Use case of Reinforcement Learning.

There are three different situations in which Reinforcement Learning is an incredibly powerful tool:

  1. a model abstraction of the environment is known or definable, but no analytical (exact) solution is known,
  2. only a simulation of the environment is available; interaction with the problem directly is too expensive or dangerous, or
  3. the only available way to gather information about the system is to interact with it - for example, in the case of unknown systems.

Advantages of Reinforcement Learning 5

  1. Reinforcement Learning can be used to solve very complex problems that cannot be solved by conventional techniques.
  2. Reinforcement Learning is the preferred method for achieving long-term results, which, depending on the environment, can be very difficult to achieve.
  3. Due to continuous interaction with the environment, these models can correct incorrect behavior that occurred or was learned during earlier parts of the training process.
  4. Reinforcement Learning models are intended to achieve the optimal behavior of a computer model within a specific environment, which means maximizing its performance. In theory, they can produce the perfect model to solve a particular problem.

Disadvantages of Reinforcement Learning 5

  1. Some Reinforcement Learning algorithms require precise hyperparameters to converge. Finding out the best hyperparameters is usually expensive, but sometimes a best-guess can do the trick.

  2. Some Reinforcement Learning algorithms tend to perform a bit worse in simpler environments, especially the more complex algorithms.

  3. Reinforcement Learning needs a lot of data and a lot of computation, which is why a model to sample data from is required.

  4. Reinforcement Learning is not always directly applicable to the real world due to safety or cost concerns.

An example of such a scenario could be autonomous driving that is trained in the real world. In early training stages, the model could endanger passengers or its surroundings. A class of algorithms that tries to work around this problem is known as Safe Reinforcement Learning.

  5. Reinforcement Learning assumes that the world can be described as a Markov process, which it usually cannot.

Being a Markov process means that the current observation depends only on the last action and the last observation. Although most problems are not Markov processes, Reinforcement Learning still yields good results in practice.
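Written as a formula, this property could look as follows. The notation is an assumption made for illustration: o_t denotes the observation and a_t the action at step t. The equation states that knowing the complete history does not change the prediction of the next observation compared to knowing only the most recent observation and action.

```latex
P(o_{t+1} \mid o_t, a_t) = P(o_{t+1} \mid o_1, a_1, \dots, o_t, a_t)
```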

Application examples of Reinforcement Learning.

Through intensive research, several practical application areas for Reinforcement Learning have already been identified. Examples include: 6

  1. interaction with humans in video or board games (see AlphaGo or AlphaStar)
  2. robotics (see Cheetah by Boston Dynamics or Dactyl by OpenAI)
  3. production environments, such as factories.

There are quite a number of potential application examples in which Reinforcement Learning could create major breakthroughs in the next few years:

  1. autonomous driving
  2. text processing (see for example ChatGPT by OpenAI)

and many other exciting and complex topics.




