Graphics from Shahadat Rahman –

The Actor Critic Algorithm: The Key to Efficient Reinforcement Learning

Actor-critic reinforcement learning is a significant advancement in the field of reinforcement learning. It combines the advantages of both policy-based and value-based methods, allowing for more efficient and effective learning in complex environments. In this post, I would like to introduce this algorithm.

Henrik Bartsch

The texts in this article were partly composed with the help of artificial intelligence and corrected and revised by us. The following services were used for the generation:


Sample efficiency and exploration stability are two of the many problems that can arise with value-based reinforcement learning algorithms such as DQN or D2QN. The actor-critic algorithm was developed in response to these problems and (largely) eliminates them.

The Actor-Critic algorithm is an on-policy reinforcement learning algorithm that combines the advantages of both policy-based and value-based methods. The value function and the policy are learned simultaneously, allowing the agent to efficiently improve its behavior in complex and dynamic environments. Compared to other deep reinforcement learning algorithms, Actor-Critic can also handle non-stationary environments and high-dimensional observation spaces. In the following, I am going to explain this algorithm.

This algorithm is sometimes referred to as “REINFORCE with Baseline”. In this article, it will be referred to as the Actor-Critic algorithm.

Technical Background

In our previous articles on the topic of deep reinforcement learning, we have limited ourselves to so-called value-based reinforcement learning. The basic idea is that we try to predict the expected return of a certain action in a given state. In itself, this is not a bad approach, but there are some problems, especially in terms of the efficiency of the algorithms.

In this article, we combine value-based reinforcement learning with policy-based reinforcement learning. For both parts, we use a neural network that fulfills the following tasks:

  1. Actor Network (Policy): This neural network represents the parameterized policy. Given a state of the environment as input, it outputs a probability distribution over the actions the agent can take. The goal of the agent is to maximize the discounted return it will receive in the future. The agent’s policy is updated using feedback from the environment and the critic’s evaluations.

  2. Critic Network (Value Function): This neural network approximates a value function that estimates the expected future rewards for either a given state-action pair or just a state of the environment. The role of the critic is to provide the agent with this estimate.

To summarize, the policy tries to find an action to select, while the value function tries to tell us how good our selected action (or its result) is. By using both a policy and a value function (instead of just a value function as in DQN), we use our training information more efficiently by training both neural networks at the same time. 1
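
The interplay between the two networks can be written compactly. The following is a sketch that mirrors the train method shown later in this article; the symbols $V_w$ (critic), $\pi_\theta$ (actor) and $\delta_t$ (temporal-difference error) are introduced here for illustration and do not appear in the original text:

```latex
\begin{aligned}
\delta_t &= r_t + \gamma V_w(s_{t+1}) - V_w(s_t) \\
L_{\text{critic}}(w) &= \delta_t^2 \\
L_{\text{actor}}(\theta) &= -\log \pi_\theta(a_t \mid s_t)\,\delta_t
\end{aligned}
```

For the last step of an episode, the term $\gamma V_w(s_{t+1})$ is dropped, because there is no successor state to bootstrap from. In the actor loss, $\delta_t$ is treated as a constant weight.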

In addition to this fundamental change, in this algorithm we do not use $\epsilon$-greedy for exploration, but a softmax activation function. The background to this is described in the chapter Advantages and Disadvantages.


We will start by adding all the necessary imports to our script.
import pandas as pd
import plotly.express as px  # assumption: the module name was lost; px is the usual alias for plotly.express
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np

import gym

# Public Keras API instead of the private tensorflow.python.keras modules
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import InputLayer, Dense
from tensorflow.keras.metrics import Mean

from tensorflow_probability import distributions

Then we will define an auxiliary function that will give us a way to calculate a moving average of the data.
def avg_n(list1, n = 50):
    # Average of the last n entries, or of the whole list if it is shorter
    if (len(list1) > n):
        return np.average(list1[len(list1) - n:len(list1)])
    else:
        return np.average(list1[0:len(list1)])

The next step is to create an experience memory. In contrast to the experience memories of the Deep Q Network, for example, we are content with a replay buffer that simply keeps the information about an episode in the order in which it was generated. Batching or shuffling of samples is not intended. You can read more about this in the chapter Advantages and Disadvantages.
class TrajectoryExperienceMemory:
    def __init__(self):
        self.cstate_memory, self.action_memory, self.reward_memory, self.pstate_memory = [], [], [], []

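    def record(self, cstate, action, reward, pstate):
        # Append one transition; the order within the episode is preserved
        self.cstate_memory.append(cstate)
        self.action_memory.append(action)
        self.reward_memory.append(reward)
        self.pstate_memory.append(pstate)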

    def flush_memory(self):
        self.cstate_memory, self.action_memory, self.reward_memory, self.pstate_memory = [], [], [], []

    def return_experience(self):
        batch_cstates = tf.convert_to_tensor(self.cstate_memory)
        batch_actions = tf.convert_to_tensor(self.action_memory)
        batch_rewards = self.reward_memory.copy()
        batch_pstates = tf.convert_to_tensor(self.pstate_memory)

        return (batch_cstates, batch_actions, batch_rewards, batch_pstates)

In the following, we define an Actor-Critic class that contains all the necessary functions for initialization, sample storage, and training. We start with the initialization of the class and all functions that have to do with the replay memory:
class ACAgent:
    def __init__(self, observation_size, action_size):
        self.observation_size = observation_size
        self.action_size = action_size

        self.gamma = 0.85 # Discount Factor
        self.alpha1 = 10e-5 # Learning Rate of the actor (10e-5 equals 1e-4)
        self.alpha2 = 10e-2 # Learning Rate of the critic (10e-2 equals 0.1)

        self.memory = TrajectoryExperienceMemory()

        self.actor_model = Sequential([
            Dense(units=64, activation='relu'),
            Dense(units=self.action_size, activation='softmax')  # action probabilities
        ])

        self.critic_model = Sequential([
            Dense(units=64, activation='relu'),
            Dense(units=64, activation='relu'),
            Dense(units=1, activation=None)  # scalar state-value estimate
        ])

        self.actor_optimizer = Adam(learning_rate=self.alpha1)
        self.critic_optimizer = Adam(learning_rate=self.alpha2)

    def store_episode(self, cstate, action, reward, pstate):
        cstate = tf.expand_dims(cstate, axis=0)
        pstate = tf.expand_dims(pstate, axis=0)
        self.memory.record(cstate, action, reward, pstate)

    def flush_memory(self):
        self.memory.flush_memory()  # discard the finished episode

In the __init__([...]) method described above, the following happens:

  1. We first provide the necessary information to determine the dimensions of our observation space and our action space.

  2. The hyperparameters $\gamma$, $\alpha_1$ and $\alpha_2$ are defined. $\gamma$ is the discount factor that determines how heavily future rewards are weighted in training. $\alpha_1$ and $\alpha_2$ are the learning rates of the two neural networks, where the index 1 refers to the actor and the index 2 to the critic.

  3. In the next step, the two neural networks self.actor_model and self.critic_model are initialized together with the experience memory. It is important to make sure that the actor has a softmax activation in its last layer, as this converts the output into a probability distribution. For the critic, the last layer has only one neuron, since this network performs a one-dimensional regression.

  4. Finally, the two optimizers are initialized. The two previously defined learning rates are passed.
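
To get a feeling for the discount factor from step 2, the following minimal sketch computes a discounted return in plain Python. The helper discounted_return is hypothetical and not part of the agent in this article, which bootstraps with the critic instead of summing up complete returns. It only illustrates how a value of 0.85 shrinks the weight of rewards that lie further in the future.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Iterate backwards so that each step needs only one multiplication
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.85))  # roughly 3.19
print(discounted_return(rewards, gamma=1.0))   # undiscounted sum of the rewards
```

With a discount factor of 1.0 every reward counts fully, while smaller values make the agent more short-sighted.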

The two functions store_episode([...]) and flush_memory([...]) are used to manage the experience memory.

The next step looks at an implementation of how to extract an action from an input:
def sample_action(self, observation):
    observation = tf.expand_dims(observation, axis=0)
    prob = self.actor_model(observation)

    distribution = tfp.distributions.Categorical(probs=prob, dtype=tf.float32)
    action = distribution.sample()
    return int(action[0])

The sample_action function is made up of the following steps:

  1. First, the input data is extended by one dimension. This is because the neural network implementation used here expects a batch of inputs.

  2. In the next step, the adjusted data is transferred into the neural network for further processing.

  3. In the final step, the output of the network is used to draw an action from the probability distribution. This is done using a categorical probability distribution provided by the tensorflow_probability package.

If you are using an environment that expects continuous action values instead of discrete values, you cannot use tfp.distributions.Categorical. Instead, a probability distribution designed specifically for this use case must be used, such as a normal distribution.
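
As a rough illustration of the continuous case, the sketch below samples an action from a normal distribution and evaluates its log-probability using only the Python standard library. The function names and the values of mean and std are assumptions made for this example; in an agent like the one above, tfp.distributions.Normal would take over this role, and the mean and standard deviation would come from the actor network.

```python
import math
import random


def sample_gaussian_action(mean, std, rng=random):
    """Draw a continuous action from a normal distribution N(mean, std^2)."""
    return rng.gauss(mean, std)


def gaussian_log_prob(action, mean, std):
    """Log-density of `action` under N(mean, std^2), as needed for the actor loss."""
    variance = std ** 2
    return (-((action - mean) ** 2) / (2 * variance)
            - math.log(std) - 0.5 * math.log(2 * math.pi))


rng = random.Random(0)
mean, std = 0.5, 0.2  # placeholder values; normally predicted by the actor
action = sample_gaussian_action(mean, std, rng)
print(action, gaussian_log_prob(action, mean, std))
```

The log-probability plays the same role as distribution.log_prob(action) in the discrete case.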

In the last step of the definition of the Actor-Critic class, we still need functionality regarding training. An implementation might look like this:
def actor_loss(self, probs, action, reward):
    distribution = tfp.distributions.Categorical(probs=probs, dtype=tf.float32)
    log_probs = distribution.log_prob(action)
    actor_loss = - log_probs * reward

    return actor_loss

def train(self, cstates, actions, discounted_rewards, pstates):
    ## Update critic network
    cstates = tf.squeeze(cstates, axis=1)
    pstates = tf.squeeze(pstates, axis=1)

    with tf.GradientTape() as tape1:
        c_value = tf.squeeze(self.critic_model(cstates, training=True))
        p_value = tf.squeeze(self.critic_model(pstates, training=True))

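        # mask is -1 for every step and 0 for the last one, so the bootstrapped
        # term gamma * V(s') is added for all steps except the final step.
        # Note: discounted_rewards holds the raw per-step rewards.
        mask = tf.eye(c_value.shape[0])[-1, :] - 1
        temp_difference = discounted_rewards - self.gamma * (mask * p_value) - c_value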

        critic_loss = tf.square(temp_difference)

    critic_gradients = tape1.gradient(critic_loss, self.critic_model.trainable_variables)
    self.critic_optimizer.apply_gradients(zip(critic_gradients, self.critic_model.trainable_variables))

    ## Update Actor Model incrementally
    cstates = tf.expand_dims(cstates, axis=1)
    for i in range(temp_difference.shape[0]):
        with tf.GradientTape() as tape2:
            probs = self.actor_model(cstates[i], training=True)
            actor_loss = self.actor_loss(probs, actions[i], temp_difference[i])

        actor_gradients = tape2.gradient(actor_loss, self.actor_model.trainable_variables)
        self.actor_optimizer.apply_gradients(zip(actor_gradients, self.actor_model.trainable_variables))

def update(self):
    cstates, actions, rewards, pstates = self.memory.return_experience()
    self.train(cstates, actions, tf.convert_to_tensor(rewards), pstates)
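
The temporal-difference error computed in train can be checked by hand on a tiny episode. The following sketch reproduces the same arithmetic in plain Python, under the assumption (also made by the mask in the code above) that the last recorded step ends the episode. The function td_errors and the example values are made up for illustration.

```python
def td_errors(rewards, values, next_values, gamma):
    """Temporal-difference errors for one episode whose last step is terminal."""
    errors = []
    last = len(rewards) - 1
    for t, (r, v, v_next) in enumerate(zip(rewards, values, next_values)):
        # The terminal step has no successor state, so nothing is bootstrapped
        bootstrap = 0.0 if t == last else gamma * v_next
        errors.append(r + bootstrap - v)
    return errors


rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 1.0]        # critic estimates for the current states
next_values = [1.5, 1.0, 0.7]   # critic estimates for the following states
print(td_errors(rewards, values, next_values, gamma=0.85))
```

A positive error means the step turned out better than the critic expected, so the actor loss increases the probability of the chosen action; a negative error decreases it.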

In the next step of our code definition, we use the training_loop function for the training loop. This is implemented in the same way as the training loops of the “Deep Q Network” algorithm.
def training_loop(env, agent: ACAgent, max_frames_episode):
  current_obs, _ = env.reset()
  episode_reward = 0


  for j in range(max_frames_episode):
    action = agent.sample_action(current_obs)  # sample_action already returns a Python int

    next_obs, reward, done, _, _ = env.step(action)
    next_obs = np.array(next_obs)

    agent.store_episode(current_obs, action, reward, next_obs)

    current_obs = next_obs
    episode_reward += reward

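    if done:
      break

  # Train once on the completed episode, then clear the trajectory memory
  agent.update()
  agent.flush_memory()

  return episode_reward, agent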

Finally, all of the definitions and commands that are necessary for the start of the training session must be entered:
n_episodes, max_frames_episode, current_episode, avg_length, evaluation_interval = 1000, 500, 0, 50, 10
episodic_reward, avg_reward, evaluation_rewards = [], [], []

env = gym.make("CartPole-v1")

seed = 69

n_actions = env.action_space.n
observation_shape = env.observation_space.shape[0]

agent = ACAgent(observation_shape, n_actions)
for i in range(n_episodes):
    current_episode += 1
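    episode_reward, agent = training_loop(env, agent, max_frames_episode)
    episodic_reward.append(episode_reward)  # keep the history for the moving average

    current_average = avg_n(episodic_reward, n=avg_length)
    avg_reward.append(current_average)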

    print(f"[i={i}] Episodic reward: {episode_reward} | Current running average reward: {current_average}")

After executing the script defined above, we can evaluate all the calculated data. In the following, we have averaged the episodic reward over 50 episodes and plotted it over the course of the training:

In order to be able to better understand the data computed here, we have also prepared a visualization of the data of the Actor Critic algorithm in comparison to that of the “Deep Q Network” algorithm.

The data suggests that, on this task, the Actor-Critic algorithm clearly outperforms the older “Deep Q Network” algorithm. There is already a significant difference in performance within the first 100 episodes, and the gap widens over the remaining 900 episodes.

Advantages and Disadvantages

The Actor-Critic algorithm is a popular algorithm in the field of deep reinforcement learning. The combination of value-based methods of the critic and policy-based methods of the actor results in specific advantages and disadvantages.

Advantages


Efficiency and scalability

By using policy-based reinforcement learning, we can optimize high-dimensional problems more efficiently with this algorithm than with value-based methods. 2 3

Sample efficiency and stability

The value estimates of the critic reduce the variance of the policy updates. Compared to a purely policy-based method, this gives the algorithm better sample efficiency and stability, which makes it a reliable option for reinforcement learning applications. 2

Balance between exploration and exploitation

Using the softmax activation function, we can convert a vector of numerical values into a probability distribution vector. We can then randomly draw our actions from this vector. 4 We can describe the exploration and exploitation process as follows:

  1. Exploration: By converting to a probability distribution vector, we always have a residual probability that an action will be drawn with a low probability. This encourages exploration, as the agent will try different actions over time that would not normally have been predicted by the model.

  2. Exploitation: By applying the softmax function, the actions with the highest numerical values will later have the highest probabilities. Because of this property, our method will continue to select the actions that the agent feels are best with a higher probability.
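
Both effects can be observed in a small experiment. The sketch below implements softmax with the Python standard library and draws many actions from the resulting distribution. The input values in logits are arbitrary example numbers, not outputs of a trained network.

```python
import math
import random


def softmax(values):
    """Convert arbitrary real values into a probability distribution."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)  # roughly [0.66, 0.24, 0.10]

rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10_000):
    action = rng.choices(range(len(probs)), weights=probs)[0]
    counts[action] += 1

print(probs)
print(counts)
```

The action with the highest value is drawn most often (exploitation), but the other actions are still selected from time to time (exploration).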


One way to improve the algorithm is through parallelization. This involves initializing multiple parallel instances of the environment, including individual actor-critic models that are periodically synchronized, often by a central model that is not trained directly on an instance and is used only for synchronization. This improvement is more stable and efficient because the data is decorrelated across instances, and thus more training information is generated than if it were generated by a single instance. 5 6
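
The synchronization idea can be sketched with a toy example. The code below is a serialized, heavily simplified illustration of the cycle "pull the central parameters, compute an update on your own data, push the update back". It is not an implementation of any specific published algorithm: the toy objective, the function toy_gradient and all values are assumptions made for this example, and real implementations run the workers in parallel.

```python
def toy_gradient(params, target):
    """Gradient of the toy objective -(p - target)^2 for every parameter p."""
    return [-2.0 * (p - target) for p in params]


central = [0.0, 0.0]         # central model, never run in an environment itself
worker_targets = [1.0, 3.0]  # stand-ins for the different experience of each worker
learning_rate = 0.1

for _ in range(3):
    for target in worker_targets:
        local = list(central)                # worker pulls the central parameters
        grads = toy_gradient(local, target)  # update computed from its own data
        # worker pushes its update to the central model (gradient ascent step)
        central = [p + learning_rate * g for p, g in zip(central, grads)]

print(central)
```

Because every worker contributes updates from different data, the central parameters settle between the individual optima instead of following a single instance.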

For more information, see the paper on the Asynchronous Advantage Actor Critic (A3C).

Disadvantages


Convergence Difficulties

Convergence problems can occur with the actor-critic algorithm. The behavior of our agent is determined by the step size $\alpha$ and the resulting parameter update, which follows the policy gradient update rule:

$\theta \leftarrow \theta + \alpha \nabla J(\theta)$

Two types of problems can occur with such an update:

  1. Overshooting: The update misses the maximum of the reward and ends up in a region of the parameter space where we get a suboptimal result in terms of the reward.

  2. Undershooting: By using an unnecessarily small step size, we need a large number of updates to reach the optimum of our model parameters.
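
Both effects can be reproduced with a one-dimensional toy problem. The sketch below applies the update rule to the objective J(theta) = -(theta - 3)^2, whose maximum lies at theta = 3. The objective and the three step sizes are chosen purely for illustration.

```python
def gradient_ascent(alpha, steps=20, theta=0.0):
    """Maximize J(theta) = -(theta - 3)^2 with a fixed step size alpha."""
    for _ in range(steps):
        grad = -2.0 * (theta - 3.0)   # derivative of J with respect to theta
        theta = theta + alpha * grad  # theta <- theta + alpha * grad J(theta)
    return theta


for alpha in (0.001, 0.3, 1.1):
    print(alpha, gradient_ascent(alpha))
```

With the smallest step size, theta has barely moved after 20 updates (undershooting). With the medium step size it ends up very close to 3. With the largest step size every update jumps past the maximum by more than the previous distance, so theta moves further and further away (overshooting).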

Overshooting is not too much of a problem in supervised learning: because the training data stays the same, the optimizer can correct an overshoot in the next training step. Undershooting, in turn, slows down convergence.

In deep reinforcement learning, however, overshooting in the policy gradient update is potentially dramatic, because the policy generates its own training data. If a parameter update leads to poor model behavior, it is possible that no useful information can be gained from future updates. Ultimately, this can lead to the model never improving again due to a single bad update, and thus no longer having a learning effect. 7

Problems with Convergence Guarantee

The convergence guarantee is an important aspect of reinforcement learning. The convergence guarantee ensures that the algorithm will eventually find an optimal or near optimal solution. However, convergence can be difficult for actor-critic methods for several reasons: 1

  1. Two networks: Actor-critic methods use two separate networks (the actor network and the critic network) that must be trained simultaneously. This can lead to instability, as improvements or degradations in one network can affect the performance of the other.

  2. High variance: Actor-critic methods can have a high variance in the gradient estimates. This can cause the algorithm to get stuck in local minima or fail to converge.

Hyperparameter sensitivity

The actor-critic algorithm is highly sensitive to the hyperparameters. This means that the choice of hyperparameters can have a strong influence on the result of the training and the resulting performance of the model. There are two main reasons for this:

  1. Instead of one neural network as in the Deep Q Network algorithm, two neural networks are available. This results in a significantly higher number of hyperparameters relevant for training.

In principle, an attempt can be made to train a joint neural network for actor and critic. The data is fed into a shared part of the network, which then splits into two outputs: the probability distribution for the actor and the regression value for the critic. This is possible with the TensorFlow Functional API, for example, but does not eliminate the problem of more hyperparameters. It merely attempts to reduce the hyperparameter space. Furthermore, there is no guarantee that this actually leads to better results than with two different networks.

  2. Higher variance of gradients: The combined estimation of the updates from both the actor network and the critic network can result in a higher variance of the gradients.

Overall, the hyperparameter sensitivity of the algorithm means that hyperparameter optimization must be performed more frequently than with other algorithms to achieve optimal results. 8

Lower sample efficiency

Compared to off-policy reinforcement learning algorithms, a different type of memory buffer is used here. In the memory buffer of off-policy algorithms, we can store and reuse a lot of information from the past. For on-policy algorithms, this is generally not useful because reusing old information does not necessarily improve model performance. 9
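
The difference between the two kinds of memory can be sketched as follows. Both classes are simplified, hypothetical stand-ins written for this comparison; they are not the TrajectoryExperienceMemory from above or the buffer of any particular library.

```python
import random
from collections import deque


class TrajectoryBuffer:
    """On-policy: keeps one episode in order and is emptied after each update."""

    def __init__(self):
        self.steps = []

    def record(self, step):
        self.steps.append(step)

    def pop_all(self):
        steps, self.steps = self.steps, []
        return steps


class ReplayBuffer:
    """Off-policy: keeps old transitions and returns shuffled mini-batches."""

    def __init__(self, capacity):
        self.steps = deque(maxlen=capacity)

    def record(self, step):
        self.steps.append(step)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.steps), batch_size)


trajectory, replay = TrajectoryBuffer(), ReplayBuffer(capacity=100)
for step in range(5):
    trajectory.record(step)
    replay.record(step)

print(trajectory.pop_all())  # the whole episode in order; the buffer is empty afterwards
print(replay.sample(3, random.Random(0)))  # three random old transitions
```

The trajectory buffer only ever contains data generated by the current policy, which is what an on-policy update expects. The replay buffer mixes data from many older versions of the policy.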

Author’s note: In my experiments with an off-policy memory buffer, the learning process even collapsed completely, so that the model stopped learning at all. This behavior can probably be explained by the policy gradient theorem.

It is important to note that despite these challenges, actor-critic methods often work well in practice and are used successfully for many tasks. There are also many variations and improvements to the basic actor-critic method that aim to reduce these problems.


In this post, we introduced a further development in the field of deep reinforcement learning: The Actor-Critic algorithm. It is characterized by using and training a policy network in addition to a value network. The advantages of this algorithm are the efficiency of the learning process and the ability to handle high-dimensional data. On the other hand, there is the sensitivity to hyperparameters or potential convergence problems that can occur and hinder the learning process of this algorithm.



  1. 2

  2. 2