Overview
Reinforcement learning is a complex topic with complex algorithms. Reinforcement learning algorithms have a large number of parameters and features that significantly determine their performance and efficiency. Today we deal with a relatively large area: on-policy vs. off-policy learning. Reinforcement learning algorithms are divided into these two learning classes, and we explain both in this post.
The basic idea of reinforcement learning
Reinforcement learning is an algorithmic approach in which one (or more) agents perform actions in an environment in order to optimize a predefined target variable. Typically, the goal is to maximize the reward or discounted reward, which can be defined as an objective function over various environment variables. Using this objective variable, a reinforcement learning algorithm can incorporate a penalty or reward into the underlying mathematical model via an update formula in order to (hopefully) achieve better performance in the next iteration of the environment than before. And here we come to the heart of the matter: the update formula.
Within the update formula, several distinctions can be made. For example, parameters such as the learning rate $\alpha$ or the discounting factor $\gamma$ are known to influence the learning process. Here, however, we are concerned with a different decision: Does the update formula follow the trajectory of observations and actions, or does it try to optimize the mathematical model in some other way? ^{1} ^{2} ^{3}
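The interaction loop described above can be sketched in a few lines. This is a minimal, illustrative sketch: the toy `CoinFlipEnv` environment, the gym-style `reset()`/`step()` interface, and the `update` hook are assumptions made for this example, not part of any specific algorithm. The `update` call is exactly the place where the update formula, and thus the on-policy/off-policy distinction, lives.

```python
import random

class CoinFlipEnv:
    """Toy environment: action 1 wins a reward of 1 with probability 0.7."""

    def reset(self):
        return 0  # single dummy state

    def step(self, action):
        reward = 1.0 if (action == 1 and random.random() < 0.7) else 0.0
        return 0, reward, True  # next_state, reward, done

def run_episode(env, policy, update):
    """Generic agent-environment loop: act, observe, update, repeat."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        update(state, action, reward, next_state)  # the update formula lives here
        total_reward += reward
        state = next_state
    return total_reward
```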
On-Policy vs. Off-Policy
The distinction between the two types of learning, on-policy and off-policy, arises from the "exploration vs. exploitation" dilemma. Ultimately, it is important to explore the space of all combinations of observations and actions with a sufficiently large sample without getting stuck in a local optimum. The distinction between on-policy and off-policy refers to how the agent uses its experience to determine its strategy (policy). Basically, we can distinguish the two learning classes as follows:

In on-policy learning, we have only one model (and thus only one resulting strategy) that simultaneously learns and explores the environment.

In off-policy learning, we theoretically have two separate models. One model is responsible for exploring the environment (behavior policy) and another for learning the corresponding state-action values (target policy or update policy).
In practice, the two models are not necessarily completely separate. In this case, it makes sense to update the behavior policy by adopting the values of the target policy or by performing an appropriate soft update.
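Such a soft update can be sketched in a few lines. Here the policy parameters are represented as plain lists of floats and `tau` is a hypothetical interpolation factor; both are illustrative assumptions for this sketch.

```python
# Sketch of a soft update: the behavior policy's parameters slowly adopt
# the values of the target policy. tau is a hypothetical interpolation
# factor (tau = 1.0 would correspond to a hard copy of the values).

def soft_update(behavior_params, target_params, tau=0.01):
    """Blend the target parameters into the behavior parameters in place."""
    for i, (b, t) in enumerate(zip(behavior_params, target_params)):
        behavior_params[i] = (1.0 - tau) * b + tau * t
```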
The advantage of on-policy learning is that it is consistent with the agent's experience and makes no assumptions about the optimal strategy. The disadvantage is that it can hinder convergence in some circumstances, because more exploration is required to find good or optimal solutions. This is partly because bad actions may be repeated more often before training suppresses them.
The advantage of off-policy learning is that it can converge faster, because it directly targets the optimal strategy and requires less exploration of the environment. The disadvantage of off-policy learning is that it can be inconsistent with the agent's experience and can exhibit high variance. ^{1} ^{2}
SARSA – On-Policy
To examine the nature of on-policy learning, we go back to the tabular methods, and thus in the direction of Q-learning. The SARSA algorithm is often cited here as a simple example of on-policy learning. SARSA stands for State-Action-Reward-State-Action and in this sense describes the update formula
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t + 1}, a_{t + 1}) - Q(s_t, a_t) \right].$
It is easy to see that the update formula only uses actions and observations that were generated by the algorithm itself or by an exploration component. A classic example is $\epsilon$-greedy exploration.
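The SARSA update with $\epsilon$-greedy exploration can be sketched in tabular form. The toy chain environment (four states, action 1 moves one state to the right, reward 1 on reaching the last state) and the parameter values are illustrative assumptions made for this example.

```python
import random

N_STATES, N_ACTIONS = 4, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # illustrative parameter choices

def step(state, action):
    """Toy chain: action 1 moves right, action 0 stays; reward at the end."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else state
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

def epsilon_greedy(Q, state):
    """Exploration component: random action with probability EPSILON."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def sarsa(episodes=500):
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(Q, state)  # chosen by the current policy
        done = False
        while not done:
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(Q, next_state)
            # On-policy update: uses the action actually selected next
            Q[state][action] += ALPHA * (
                reward + GAMMA * Q[next_state][next_action] - Q[state][action]
            )
            state, action = next_state, next_action
    return Q
```

Note that the update target $Q(s_{t+1}, a_{t+1})$ is evaluated at `next_action`, the action the agent will actually execute, which is what makes this on-policy.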
Overall, on-policy algorithms have the advantage that the optimal action in any given step is chosen more often than other actions, and thus the algorithm converges more quickly. The greater reliance on learned knowledge, however, also carries the risk that the model will "run" into a local optimum and get stuck there. ^{1} ^{4}
Q-Learning – Off-Policy
As a simple example of off-policy learning, the Q-learning algorithm has become popular. The update formula is
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a' \in A} Q(s_{t + 1}, a') - Q(s_t, a_t) \right].$
A closer look reveals the difference between the behavior policy and the target policy: the behavior policy is used to generate an action by exploration. It is used throughout the episode until the episode ends. Then, for each executed time step, the state-action values are updated based on the maximum in the Q-table at the respective time step.
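For comparison, tabular Q-learning can be sketched on the same kind of toy chain environment (again an illustrative assumption: four states, action 1 moves right, reward at the last state). The behavior policy still explores with $\epsilon$-greedy, but the update target always takes the maximum over the next state's actions, which is what makes it off-policy.

```python
import random

N_STATES, N_ACTIONS = 4, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # illustrative parameter choices

def step(state, action):
    """Toy chain: action 1 moves right, action 0 stays; reward at the end."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else state
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

def q_learning(episodes=500):
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behavior policy: epsilon-greedy exploration
            if random.random() < EPSILON:
                action = random.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
            next_state, reward, done = step(state, action)
            # Target policy: greedy maximum over the next state's actions,
            # regardless of which action the behavior policy takes next
            Q[state][action] += ALPHA * (
                reward + GAMMA * max(Q[next_state]) - Q[state][action]
            )
            state = next_state
    return Q
```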
The main difference between SARSA and Qlearning lies in a very central point:

In SARSA, we train over the entire episode on the actions actually selected, controlled by the term $Q(s_{t + 1}, a_{t + 1})$. The action on which the update is performed does not change over the episode. Such an update corresponds to the current strategy of the agent.

In Q-learning, the episode is trained over the best action, which is given by the term $\max_{a' \in A} Q(s_{t + 1}, a')$. It can happen that an action is assigned a higher value early in the episode, and thus a different action is updated than was actually selected. This is the target strategy. Such an update does not represent the current strategy of the agent, but an optimal strategy.
Off-policy learning often provides a better perspective on the appropriate action than on-policy learning does. This also reduces the risk of getting stuck in a local optimum. ^{1} ^{2}
Another reason for using off-policy algorithms (especially in the area of deep learning) is the ability to use experience replay, which increases the sample efficiency of such algorithms. For on-policy algorithms, experience replay is not possible, because an experience stores old information: the current strategy does not necessarily follow these trajectories in later phases of training, so we would be training on different data than intended.
Sample efficiency is a term that describes how much a given data point contributes to the learning process of the algorithm.
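A minimal experience replay buffer can be sketched as follows; the capacity, the stored transition tuple, and the uniform sampling scheme are illustrative choices for this sketch.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions once and samples them repeatedly for training."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # of the stored trajectories
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because the stored transitions come from older versions of the behavior policy, only off-policy methods, whose update target does not depend on the policy that generated the data, can reuse them safely.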