Graphic by Conny Schneider –

On-Policy and Off-Policy: What is the difference?

This article introduces the reader to the basics of reinforcement learning and clarifies important terms and concepts used in the field.

Henrik Bartsch

The texts in this article were partly composed with the help of artificial intelligence and corrected and revised by us. The following services were used for the generation:


Reinforcement learning is a complex topic with complex algorithms. These algorithms have a large number of parameters and features that significantly determine their performance and efficiency. Today we are dealing with one relatively large area: on-policy learning vs. off-policy learning. Reinforcement learning algorithms are divided into these two learning classes, and we explain both in this post.

The basic idea of reinforcement learning

Reinforcement learning is an algorithmic approach to having one (or more) agents perform actions in an environment to optimize a predefined target variable. Typically, it attempts to maximize the reward or discounted reward, which can be defined as an objective function by various environmental variables. Using the objective variable, a reinforcement learning algorithm can now incorporate a penalty or reward into the underlying mathematical model via an update formula to (hopefully) achieve better performance in the next iteration of the environment than before. And here we come to the heart of the matter: the update formula.

Within the update formula, several distinctions can be made. For example, parameters such as the learning rate $\alpha$ or the discounting factor $\gamma$ are known to influence the learning process. Here, however, we are concerned with a different decision: Does the update formula follow the trajectory of observations and actions, or does it try to optimize the mathematical model in some other way? 1 2 3
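As a small illustration of how the discounting factor and the learning rate enter the picture, the following Python sketch computes a discounted reward for a fixed list of rewards and moves a value estimate towards it. The function names `discounted_return` and `nudge` and the sample numbers are assumptions made for this example; they do not come from any particular library.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a list of rewards."""
    total = 0.0
    # Walk backwards so every step needs only one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def nudge(estimate, target, alpha=0.1):
    """Move an estimate a fraction alpha of the way towards a target value."""
    return estimate + alpha * (target - estimate)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]  # hypothetical rewards from one short episode
    g = discounted_return(rewards, gamma=0.5)
    print("discounted reward:", g)
    print("updated estimate:", nudge(0.0, g, alpha=0.1))
```

A small discounting factor makes later rewards count less, while a small learning rate makes each individual update more cautious.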

On-Policy vs Off-Policy

The distinction between the two types of learning, on-policy and off-policy, arises from the “exploration vs. exploitation” dilemma. Ultimately, it is important to explore the space of all combinations of observations and actions with a sufficiently large sample without getting stuck in a local optimum. The distinction between on-policy and off-policy here refers to how the agent uses its experience to determine its strategy (policy). Basically, we can distinguish the two learning classes as follows:

  1. In on-policy learning, we have only one model (and thus only one resulting strategy) that simultaneously learns and explores the environment.

  2. In off-policy learning, we theoretically have two separate models. One model is responsible for exploring the environment (Behavior-Policy) and another for learning the corresponding state-action values (Target-Policy or Update-Policy).

In practice, the two models are not necessarily completely separate. In this case, it makes sense to update the behavioral policy by adopting the values of the target policy or by performing an appropriate soft update.
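The two ways of synchronizing the models mentioned above could be sketched as follows for tabular values stored in dictionaries. This is a minimal sketch under assumptions: the names `hard_update`, `soft_update` and the mixing factor `tau` are chosen for this example, and the dictionary representation is only one possible way to store the values.

```python
def hard_update(behavior_q, target_q):
    """Adopt the target values completely (in place)."""
    behavior_q.clear()
    behavior_q.update(target_q)


def soft_update(behavior_q, target_q, tau=0.1):
    """Move every behavior value a fraction tau towards the target value (in place)."""
    for key, target_value in target_q.items():
        old_value = behavior_q.get(key, 0.0)  # unseen entries start at 0.0 (assumption)
        behavior_q[key] = (1.0 - tau) * old_value + tau * target_value


if __name__ == "__main__":
    behavior = {("s0", "left"): 0.0}
    target = {("s0", "left"): 1.0}
    soft_update(behavior, target, tau=0.1)
    print(behavior)  # the behavior value has moved 10% of the way to the target
    hard_update(behavior, target)
    print(behavior)  # now identical to the target values
```

With `tau=1.0` the soft update degenerates into the hard update; smaller values let the behavior policy change more gradually.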

The advantage of on-policy learning is that it is consistent with the agent’s experience and makes no assumptions about the optimal strategy. The disadvantage is that it can be detrimental to convergence in some circumstances because it requires more exploration to find good or optimal solutions. This is partly due to the fact that bad actions may be repeated more often before being stopped by training.

The advantage of off-policy learning is that it can converge faster because it directly targets the optimal strategy and requires less exploration of the environment. The disadvantage of off-policy learning is that it can be inconsistent with the agent’s experience and can have high variance. 1 2

SARSA - On-Policy

To illustrate on-policy learning, we return to tabular methods, the family to which Q-learning also belongs. The SARSA algorithm is often cited here as a simple example of on-policy learning. SARSA stands for State-Action-Reward-State-Action, which describes the quantities used in the update formula

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t + 1}, a_{t + 1}) - Q(s_t, a_t) \right].$$

It is easy to see that the update formula only uses actions and observations that have been generated by the algorithm itself or by an exploration component. A classic example of such a component is $\epsilon$-greedy exploration.
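A tabular version of this update, together with an exploration component that picks a random action with a small probability, might look like the sketch below. The Q-table is assumed to be a `defaultdict` keyed by `(state, action)` pairs, and the function names, state labels and parameter values are illustrative assumptions.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Apply one SARSA step using the action a_next that was actually selected."""
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (td_target - q[(s, a)])


if __name__ == "__main__":
    actions = ["left", "right"]
    q = defaultdict(float)          # all values start at 0.0
    q[("s1", "right")] = 1.0        # hypothetical value learned earlier
    a_next = epsilon_greedy(q, "s1", actions, epsilon=0.1)
    sarsa_update(q, "s0", "right", r=0.0, s_next="s1", a_next=a_next,
                 alpha=0.5, gamma=0.9)
    print(a_next, q[("s0", "right")])
```

Note that the action passed as `a_next` is the same one the agent executes in the next step, which is what makes the update on-policy.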

Overall, on-policy algorithms have the advantage that the currently best action in any given step is chosen more often than other actions, so they can converge more quickly. The drawback of this greater use of learned knowledge is the risk that the model will “run” into a local optimum and get stuck there. 1 4

Q-Learning - Off-Policy

The Q-learning algorithm is a popular and simple example of off-policy learning. Its update formula is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a' \in A} Q(s_{t + 1}, a') - Q(s_t, a_t) \right].$$

A closer look reveals the difference between Behavior-Policy and Target-Policy: The Behavior-Policy generates the actions that are actually executed during the episode, including exploratory ones. For each executed time step, however, the state-action value is updated based on the maximum Q-table entry of the next state $s_{t + 1}$, regardless of which action the Behavior-Policy selects there.
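The Q-learning update can be sketched in the same tabular style. As before, the `defaultdict` Q-table keyed by `(state, action)`, the function name `q_learning_update` and the example values are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Apply one Q-learning step using the best value available in s_next."""
    # The maximum ignores which action the behavior policy will take in s_next.
    best_next = max(q[(s_next, a_prime)] for a_prime in actions)
    td_target = r + gamma * best_next
    q[(s, a)] += alpha * (td_target - q[(s, a)])


if __name__ == "__main__":
    actions = ["left", "right"]
    q = defaultdict(float)
    q[("s1", "left")] = 0.2    # hypothetical values learned earlier
    q[("s1", "right")] = 1.0
    q_learning_update(q, "s0", "right", r=0.0, s_next="s1", actions=actions,
                      alpha=0.5, gamma=0.9)
    print(q[("s0", "right")])
```

The function does not receive the next action at all: the exploration strategy can be changed freely without touching the update.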

The main difference between SARSA and Q-learning lies in a very central point:

  1. In SARSA, the update throughout the episode is based on the actions that were actually selected, via the term $Q(s_{t + 1}, a_{t + 1})$. The action used in the update is always the one the agent executes next. Such an update corresponds to the current strategy of the agent.

  2. In Q-learning, the update is based on the best action in the next state, which is given by the term $\max_{a' \in A} Q(s_{t + 1}, a')$. This action can differ from the one that is actually selected, for example when exploration picks an action with a lower value. The greedy choice is the target strategy. Such an update does not represent the current strategy of the agent, but a greedy estimate of the optimal strategy.
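The two points above can be made concrete by computing both update targets for the same transition. In this sketch the Q-values, the state name `s1` and the assumption that exploration picked the lower-valued action `left` are all hypothetical.

```python
def sarsa_target(q, r, s_next, a_next, gamma):
    """Target built from the action that the agent actually takes next."""
    return r + gamma * q[(s_next, a_next)]


def q_learning_target(q, r, s_next, actions, gamma):
    """Target built from the highest-valued action in the next state."""
    return r + gamma * max(q[(s_next, a)] for a in actions)


if __name__ == "__main__":
    actions = ["left", "right"]
    q = {("s1", "left"): 0.2, ("s1", "right"): 1.0}  # hypothetical Q-table
    a_next = "left"  # assumed exploratory choice of the behavior policy

    print("SARSA target:     ", sarsa_target(q, 0.0, "s1", a_next, gamma=0.9))
    print("Q-learning target:", q_learning_target(q, 0.0, "s1", actions, gamma=0.9))
```

Whenever the selected next action is also the greedy one, both targets coincide; they only differ on exploratory steps.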

Off-policy learning often gives a better estimate of which action is appropriate than on-policy learning does. This also reduces the risk of getting stuck in a local optimum. 1 2

Another reason for using off-policy algorithms (especially in the area of deep learning) is the ability to use experience replay, which increases the sample efficiency of such algorithms. For on-policy algorithms, experience replay is generally not applicable because the stored experience was generated by older versions of the strategy. In later phases of training, the current strategy would not necessarily follow these trajectories, so we would train on different data than actually intended.
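A very small experience replay memory could be sketched as follows. The class name `ReplayBuffer`, its methods and the transition layout `(state, action, reward, next_state, done)` are assumptions for this example rather than the interface of a specific library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of past transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest transition when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return up to batch_size distinct stored transitions."""
        k = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), k)

    def __len__(self):
        return len(self._storage)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3, seed=0)
    for step in range(5):
        buffer.add(step, "right", 0.0, step + 1, False)
    print(len(buffer))       # only the three newest transitions remain
    print(buffer.sample(2))  # a random mini-batch for an off-policy update
```

Because the sampled transitions may stem from an older version of the strategy, they fit an off-policy update such as Q-learning, whose target does not depend on the action the behavior policy chose next.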

Sample efficiency is a term that describes how much each collected data point contributes to the learning process of the algorithm.



  1. 2 3 4

  2. 2 3