The texts in this article were partly composed with the help of artificial intelligence and corrected and revised by us.
Introduction
Sample efficiency and exploration stability are two of the problems that can arise with value-based reinforcement learning algorithms such as DQN or DDQN. The actor-critic algorithm was developed to (largely) eliminate these problems.
The actor-critic algorithm is an on-policy reinforcement learning algorithm that combines the advantages of both policy-based and value-based methods. The value function and the policy are learned simultaneously, allowing the agent to efficiently improve its behavior in complex and dynamic environments. Compared to other deep reinforcement learning algorithms, actor-critic can also handle non-stationary environments and high-dimensional observation spaces. In the following, I am going to explain this algorithm.
This algorithm is sometimes referred to as “REINFORCE with Baseline”. In this article, it will be referred to as the Actor-Critic algorithm.
Technical Background
In our previous articles on deep reinforcement learning, we limited ourselves to so-called value-based reinforcement learning. The basic idea there is to predict the reward of a certain action under a given state. In itself, this is not a bad approach, but it comes with some problems, especially regarding the efficiency of the algorithms.
In this article, we combine valuebased reinforcement learning with policybased reinforcement learning. For both parts, we use a neural network that fulfills the following tasks:

Actor Network (Policy): This neural network maps a parameterization of our action prediction. That is, given a state of the environment as input, it provides a probability distribution over the actions the agent should take. The goal of the agent is to maximize the discounted return it will receive in the future. The agent’s strategy is updated by feedback from the environment and the critic’s evaluations.

Critic Network (Value Function): In this neural network, we try to map a value function that estimates the future rewards for either a given state-action pair or just one state of the environment. The role of the critic is to provide the agent with an estimate of the expected future rewards.
To summarize, the policy tries to find an action to select, while the value function tries to tell us how good our selected action (or its result) is. By using both a policy and a value function (instead of just a value function as in DQN), we use our training information more efficiently by training both neural networks at the same time. ^{1}
In addition to this fundamental change, this algorithm does not use $\epsilon$-greedy for exploration, but a softmax activation function. The background is described in the chapter Advantages and Disadvantages.
Implementation
We will start by adding all the necessary imports to our script.
Then we will define an auxiliary function that will give us a way to calculate a moving average of the data.
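Such a helper might be sketched as follows; the function name and the default window size of $50$ (matching the smoothing used for the plots below) are assumptions:

```python
import numpy as np

def moving_average(data, window=50):
    """Simple moving average, used to smooth the episodic reward curve."""
    data = np.asarray(data, dtype=float)
    if len(data) < window:
        return data  # not enough samples to smooth yet
    kernel = np.ones(window) / window
    return np.convolve(data, kernel, mode="valid")
```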
The next step is to create an experience memory. In contrast to the experience memory of the Deep Q Network, for example, we are content with a replay buffer that simply keeps the information of an episode in the order in which it was generated. Batching or shuffling of samples is not intended. You can read more about this in the chapter Advantages and Disadvantages.
In the following, we define an ActorCritic class that contains all the necessary functions for initialization, sample storage, and training. We start with the initialization of the class and all functions that have to do with the replay memory:
In the `__init__([...])` method, the following happens:

We first provide the necessary information to determine the dimensions of our observation space and our action space.

The hyperparameters $\gamma$, $\alpha_1$, $\alpha_2$ are defined. $\gamma$ is the discount factor that determines how heavily future rewards are weighted in training. $\alpha_1$ and $\alpha_2$ are the learning rates of the two neural networks, where index $1$ refers to the actor and index $2$ to the critic.

In the next step, the two neural networks `self.actor_model` and `self.critic_model` are initialized together with the experience memory. It is important to make sure that the last layer of the actor has a `softmax` activation, as this converts the output into a probability distribution. For the critic, the last layer has only one neuron, since this network performs a one-dimensional regression.
Finally, the two optimizers are initialized with the two previously defined learning rates.
The two functions `store_episode([...])` and `clear_episode([...])` are used to manage the experience memory.
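Since the original listing is not reproduced here, the following is a minimal sketch of the class initialization and the memory functions described above; the hidden-layer sizes and the default hyperparameter values are illustrative assumptions, not the article's original values.

```python
import tensorflow as tf

class ActorCritic:
    def __init__(self, obs_dim, n_actions, gamma=0.99,
                 alpha_actor=1e-4, alpha_critic=5e-4):
        self.obs_dim = obs_dim
        self.n_actions = n_actions
        self.gamma = gamma  # discount factor for future rewards

        # Actor: softmax output -> probability distribution over actions
        self.actor_model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(obs_dim,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(n_actions, activation="softmax"),
        ])
        # Critic: single output neuron -> one-dimensional value regression
        self.critic_model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(obs_dim,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1),
        ])

        # Optimizers with the two previously defined learning rates
        self.actor_optimizer = tf.keras.optimizers.Adam(alpha_actor)
        self.critic_optimizer = tf.keras.optimizers.Adam(alpha_critic)

        # On-policy episode memory: ordered, no batching or shuffling
        self.clear_episode()

    def store_episode(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def clear_episode(self):
        self.states, self.actions, self.rewards = [], [], []
```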
Next, we look at how an action can be extracted from an input:
The `sample_action` function is made up of the following steps:

First, the input data is expanded by one dimension, because the neural network implementation used here expects its input as a batch of samples.

In the next step, the adjusted data is transferred into the neural network for further processing.

In the final step, the network output is used to draw an action from the probability distribution. This is done using a categorical distribution provided by the `tensorflow_probability` package.
If you are using an environment that expects continuous action values instead of discrete ones, you cannot use `tfp.distributions.Categorical`. Instead, a probability distribution designed for this use case must be used, such as a normal distribution.
In the last step of the definition of the ActorCritic class, we still need the functionality for training. An implementation might look like this:
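One possible sketch of such a training function, using discounted returns and the advantage $A = G - V(s)$ as in a typical actor-critic update (the exact loss formulation of the original listing may differ):

```python
import tensorflow as tf

def train(actor_model, critic_model, actor_opt, critic_opt,
          states, actions, rewards, gamma=0.99):
    """One-episode actor-critic update from an ordered episode memory."""
    # Discounted returns, computed backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = tf.convert_to_tensor(returns[::-1], dtype=tf.float32)

    states = tf.convert_to_tensor(states, dtype=tf.float32)
    actions = tf.convert_to_tensor(actions, dtype=tf.int32)

    with tf.GradientTape(persistent=True) as tape:
        values = tf.squeeze(critic_model(states), axis=-1)
        # Advantage: how much better the return was than the critic predicted.
        advantage = returns - values
        probs = actor_model(states)
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        log_probs = tf.math.log(tf.gather_nd(probs, idx) + 1e-8)
        # Actor maximizes advantage-weighted log-probability; the advantage
        # is treated as a constant baseline for the actor's gradient.
        actor_loss = -tf.reduce_mean(log_probs * tf.stop_gradient(advantage))
        critic_loss = tf.reduce_mean(tf.square(advantage))

    actor_grads = tape.gradient(actor_loss, actor_model.trainable_variables)
    critic_grads = tape.gradient(critic_loss, critic_model.trainable_variables)
    actor_opt.apply_gradients(zip(actor_grads, actor_model.trainable_variables))
    critic_opt.apply_gradients(zip(critic_grads, critic_model.trainable_variables))
    del tape
    return float(actor_loss), float(critic_loss)
```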
In the next step of our code definition, we use the `training_loop` function for the training loop. It is implemented in the same way as the training loops of the “Deep Q Network” algorithm.
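A minimal sketch of such a loop, assuming a hypothetical `agent` object with `sample_action`, `store_episode`, `clear_episode`, and `train` methods, and an environment whose `step` returns `(next_state, reward, done)` (real Gym environments return additional values):

```python
def training_loop(env, agent, n_episodes=1000):
    """Generic per-episode on-policy training loop."""
    episode_rewards = []
    for episode in range(n_episodes):
        state, done, total = env.reset(), False, 0.0
        agent.clear_episode()  # fresh on-policy memory each episode
        while not done:
            action = agent.sample_action(state)
            next_state, reward, done = env.step(action)
            agent.store_episode(state, action, reward)
            total += reward
            state = next_state
        agent.train()  # one update per finished episode
        episode_rewards.append(total)
    return episode_rewards
```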
Finally, all of the definitions and commands that are necessary for the start of the training session must be entered:
After executing the script defined above, we can evaluate all the calculated data. In the following, we have averaged the episodic reward over $50$ episodes and plotted it over the course of the training:
In order to better understand the data computed here, we have also prepared a visualization of the Actor-Critic data in comparison to that of the “Deep Q Network” algorithm.
From the data available here, it can be seen that the actor-critic algorithm represents a significant advance over the now relatively old “Deep Q Network” algorithm. There is already a clear difference in performance within the first $100$ episodes, and it increases over the remaining $900$ episodes.
Advantages and Disadvantages
The Actor-Critic algorithm is a popular algorithm in the field of deep reinforcement learning. Combining the value-based methods of the critic with the policy-based methods of the actor results in specific advantages and disadvantages.
Advantages
Efficiency and scalability
By using policy-based reinforcement learning, we can optimize high-dimensional problems more efficiently with this algorithm than with purely value-based methods. ^{2} ^{3}
Sample efficiency and stability
By using value-based reinforcement learning, we achieve very good sample efficiency and stability with this algorithm. This makes it a reliable option for reinforcement learning applications. ^{2}
Balance between exploration and exploitation
Using the softmax activation function, we can convert a vector of numerical values into a probability distribution vector. We can then randomly draw our actions from this vector. ^{4} We can describe the exploration and exploitation process as follows:

Exploration: Because of the conversion to a probability distribution vector, even actions with low probability retain a residual chance of being drawn. This encourages exploration, as the agent will over time try actions that the model would not normally have selected.

Exploitation: After applying the softmax function, the actions with the highest numerical values have the highest probabilities. Because of this property, the method will continue to prefer the actions that the model currently rates as best.
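Both effects can be illustrated with a small numpy sketch (the logit values are arbitrary examples):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw network outputs for 3 actions
probs = softmax(logits)

# Exploitation: the action with the largest logit gets the largest probability.
best_action = int(probs.argmax())  # action 0

# Exploration: every action keeps a non-zero residual probability,
# so sampling occasionally draws the "worse" actions as well.
rng = np.random.default_rng(0)
sampled = rng.choice(len(probs), size=1000, p=probs)
```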
Parallelizability
One way to improve the algorithm is through parallelization. This involves initializing multiple parallel instances of the environment, each with its own actor-critic model, which are periodically synchronized, often by a central model that is not trained directly on an instance and is used only for synchronization. This improvement is more stable and efficient because the data is decorrelated across instances, and more training information is generated than by a single instance. ^{5} ^{6}
For more information, see the paper on the Asynchronous Advantage Actor Critic (A3C).
Disadvantages
Convergence Difficulties
Convergence problems can occur with the actor-critic algorithm. Formally, the behavior of our agent is determined by the step size $\alpha$ and the update that results from the policy gradient:
$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Two types of problems can occur with such an update:

Overshooting: The update misses the maximum of the reward and ends up in a region of the parameter space where we get a suboptimal result in terms of the reward.

Undershooting: By using an unnecessarily small step size, we need a large number of updates to reach the optimum of our model parameters.
Overshooting is not too much of a problem in supervised learning: since the data is constant, the optimizer can correct an overshoot in the next training epoch. Undershooting, on the other hand, merely slows down convergence.
In deep reinforcement learning, however, overshooting in a policy gradient update is potentially dramatic. If a parameter update leads to poor model behavior, the agent collects poor data from then on, so that no useful information can be gained from future updates. Ultimately, a single bad update can mean that the model never improves again and thus no longer learns. ^{7}
Problems with Convergence Guarantee
The convergence guarantee is an important aspect of reinforcement learning: it ensures that the algorithm will eventually find an optimal or near-optimal solution. However, convergence can be difficult for actor-critic methods for several reasons: ^{1}

Two networks: Actorcritic methods use two separate networks (the actor network and the critic network) that must be trained simultaneously. This can lead to instability, as improvements or degradations in one network can affect the performance of the other.

High variance: Actorcritic methods can have a high variance in the gradient estimates. This can cause the algorithm to get stuck in local minima or fail to converge.
Hyperparameter sensitivity
The actor-critic algorithm is highly sensitive to its hyperparameters. This means that the choice of hyperparameters can strongly influence the training result and the resulting performance of the model. There are two main reasons for this:
Two networks: Instead of one neural network as in the Deep Q Network algorithm, two neural networks have to be trained. This results in a significantly higher number of hyperparameters relevant for training.
In principle, one can try to train a joint neural network for actor and critic: the data enters a shared network that later splits into two outputs, one for the actor’s probability distribution and one for the critic’s regression value. This is possible with the TensorFlow Functional API, for example, but it does not eliminate the problem of having more hyperparameters; it merely reduces the hyperparameter space. Furthermore, there is no guarantee that this actually leads to better results than two separate networks.
Higher variance of gradients: The combined estimation of the updates from both the actor network and the critic network potentially leads to a higher variance of the gradients.
Overall, the hyperparameter sensitivity of the algorithm means that hyperparameter optimization must be performed more frequently than with other algorithms to achieve optimal results. ^{8}
Lower sample efficiency
Compared to off-policy reinforcement learning algorithms, a different type of memory buffer is used here. In the memory buffer of off-policy algorithms, we can store and reuse a lot of information from the past. For on-policy algorithms, this is generally not useful, because reusing old information does not necessarily improve model performance. ^{9}
Author’s note: In my experiments with an off-policy memory buffer, the learning process even collapsed completely, so that the model stopped learning at all. This behavior can probably be explained by the policy gradient theorem.
It is important to note that despite these challenges, actorcritic methods often work well in practice and are used successfully for many tasks. There are also many variations and improvements to the basic actorcritic method that aim to reduce these problems.
TL;DR
In this post, we introduced a further development in the field of deep reinforcement learning: the Actor-Critic algorithm. It is characterized by using and training a policy network in addition to a value network. The advantages of this algorithm are the efficiency of the learning process and the ability to handle high-dimensional data. On the other hand, there is its sensitivity to hyperparameters and the potential convergence problems that can hinder the learning process.