
Hyperparameter optimization for neural networks

How exactly do you perform hyperparameter optimization? What do you have to pay attention to? What does a visualization of the results look like? All of these questions are answered here.

Henrik Bartsch

Overview

When working with neural networks, there are several obstacles to achieving good performance. The performance of a model - however this is defined - depends on various factors, for example the amount of information used for training, the number of training epochs, or the architecture and parameters of the model. Hyperparameter optimization is used especially for the architecture and parameters of the model, because the search space is too large to cover with a classical grid search. 1

Basics

What is the goal of hyperparameter optimization?

The goal of our hyperparameter optimization is to find a configuration of parameters that makes the model optimal in terms of performance. In practice, however, we will often only find a good configuration.

We usually only find a good configuration because optimizing a neural network is an extremely complex, mathematically non-linear task. This is still better than no optimization at all, and investing more time in the optimization will usually still yield slightly better results.

Implementation of a hyperparameter optimization

We will now implement two examples that use the optimization library Optuna. It offers a large number of features that are important for fast and efficient optimization (for example distributed optimization), as well as for visualizing the results. Optuna is also compatible with all state-of-the-art machine learning libraries such as TensorFlow (including Keras) and PyTorch.
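
As a first orientation, here is a minimal sketch of an Optuna optimization on a toy objective (the quadratic function and its bounds are chosen purely for illustration):

minimal_example.py
import optuna

def objective(trial):
  # Suggest a value for x in [-10, 10]; Optuna chooses the next candidate
  x = trial.suggest_float("x", -10.0, 10.0)
  return (x - 2.0) ** 2  # minimum at x = 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

print(study.best_params)  # should be close to {"x": 2.0}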

Hyperparameter optimization of a DQN

A fundamentally more complex example - which tends to be the rule in machine learning - is the hyperparameter optimization of a Deep Q-Network. The basis for this is the implementation of the Deep Q-Network presented a week ago. The imports in this case are as follows:

hyperparameter.py
import numpy as np
import tensorflow as tf

import gym
import optuna
import joblib
import sys
import datetime

from os.path import exists
from dqn_agent import DQNAgent, training_loop, evaluation_run

Joblib is a general-purpose library for serializing Python objects; here it is used to persist the Optuna studies between iterations.

Next, a set of global variables is defined that is used by several of the functions below:

hyperparameter.py
max_frames_episode, avg_length = 500, 50  # frames per episode, length of the averaging window
n_episodes = 5000                         # training episodes per trial
env_name = "CartPole-v1"                  # Gym environment to optimize on
opt_rounds, final_rounds = 50, 150        # evaluation episodes during / after the optimization

Then the objective function is defined:

hyperparameter.py
def objective(trial):
  env = gym.make(env_name)

  n_actions = env.action_space.n
  observation_shape = env.observation_space.shape[0]

  agent = DQNAgent(observation_shape, n_actions, trial)

  # Fix all random number generators for reproducible trials
  env.seed(69)
  tf.random.set_seed(69)
  np.random.seed(69)

  for current_episode in range(n_episodes):
    episode_reward, agent = training_loop(env, agent, max_frames_episode)

    # Write episodic score to tensorboard database
    with agent.score_writer.as_default():
      tf.summary.scalar('Episodic Score', episode_reward, step=current_episode)

    agent.train()

  # Criterion to maximize: mean score over a specified number of episodes
  # (true evaluation, without epsilon-greedy)
  scores = np.zeros((opt_rounds,))

  for i in range(opt_rounds):
    scores[i], agent = evaluation_run(env, agent, max_frames_episode)

    # Write evaluation score to tensorboard database
    with agent.score_writer.as_default():
      tf.summary.scalar('Evaluation Score', scores[i], step=i)

  return np.mean(scores)

Several things happen in this function:

  1. initialization of the agent and of the random seeds
  2. training of the agent
  3. evaluation of the agent via evaluation_run, i.e. greedily and without epsilon-greedy exploration.

A seed is a value used to initialize a pseudo-random number generator, as found in almost all computer applications. This also applies to Gym, TensorFlow and NumPy.

Based on this objective function, Optuna selects the hyperparameters which (under the fixed seed) yield the best results. The seed could be removed to potentially obtain different results.
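
The hyperparameters themselves are drawn inside DQNAgent via the trial object passed to its constructor. The constructor is not shown in this post; a hypothetical sketch of how it could query the trial (the parameter names match the plots below, but the ranges are purely illustrative) looks like this:

dqn_agent.py
# Hypothetical sketch - not the actual DQNAgent implementation.
class DQNAgent:
  def __init__(self, observation_shape, n_actions, trial):
    # Each suggest_* call registers a hyperparameter with Optuna
    self.gamma = trial.suggest_float("Gamma Parameter", 0.9, 0.999)
    self.epsilon = trial.suggest_float("Epsilon Parameter", 0.01, 0.3)
    self.learning_rate = trial.suggest_float("Learning Rate", 1e-5, 1e-2, log=True)
    self.batch_size = trial.suggest_int("Batch Size", 32, 256)
    # ... network construction, replay buffer, etc. omitted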

Afterwards, another function can be defined to train the configuration with the best parameters, this time without a fixed seed. Here, several agents are trained and the best one is returned.

hyperparameter.py
def detailed_objective(trial):
  env = gym.make(env_name)

  n_actions = env.action_space.n
  observation_shape = env.observation_space.shape[0]
  agent = DQNAgent(observation_shape, n_actions, trial)

  print("Starting Training of Trial {} ...".format(trial.number))
  for current_episode in range(n_episodes):
    episode_reward, agent = training_loop(env, agent, max_frames_episode)

    # Write episodic score to tensorboard database
    with agent.score_writer.as_default():
      tf.summary.scalar('Episodic Score', episode_reward, step=current_episode)

    agent.train()

  # Criterion to maximize: mean score over a specified number of episodes
  # (true evaluation, without epsilon-greedy)
  scores = np.zeros((final_rounds,))

  for i in range(final_rounds):
    scores[i], agent = evaluation_run(env, agent, max_frames_episode)

    # Write evaluation score to tensorboard database
    with agent.score_writer.as_default():
      tf.summary.scalar('Evaluation Score', scores[i], step=i)

  # Also return the agent so it can be saved if it achieves the highest score
  return np.mean(scores), agent

Finally, the initialization of the hyperparameter optimization and the caching of the results are still missing:

hyperparameter.py
if __name__ == "__main__":
  iteration = sys.argv[1]

  study_name, study_path = "DQN_CartPole", "study_dqn_cartpole{}.pkl".format(iteration)
  number_trials, amount_iterations = 5, 100
  study = None

  try:
    for i in range(amount_iterations):
      # Create a new study on the first run, otherwise resume the cached one
      if not exists(study_path):
        study = optuna.create_study(study_name=study_name, direction="maximize")
      else:
        study = joblib.load(study_path)

      study.optimize(objective, n_trials=number_trials)
      joblib.dump(study, study_path)

    trial = study.best_trial

    print("Mean Score of the best Trial: ", trial.value)
    print("Parameters of the best Trial: ")
    for key, value in trial.params.items():
      print("   {}: {}".format(key, value))

    # Train another set of networks with the best hyperparameters and
    # select the network with the highest average score
    num_iterations = 10
    agents, rewards = [], []
    for i in range(num_iterations):
      score, agent = detailed_objective(trial) # Train with best trial
      rewards.append(score)
      agents.append(agent)

    rewards = np.array(rewards)
    best_agent_v = np.argmax(rewards)

    best_agent = agents[best_agent_v]
    best_agent.model.save_model("optimization/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))

  except Exception as e:
    logname = "log{}.txt".format(iteration)

    # Append to an existing log file, otherwise create a new one
    with open(logname, "a" if exists(logname) else "w") as file:
      file.write(str(e))

Caching the results may be necessary because a hyperparameter optimization can take a very long time. A try-except block is set up here to catch errors under certain circumstances and to be able to read them afterwards. This is especially advantageous when running on virtual machines such as Google Colab in non-interactive mode.

The maximum number of trials per iteration and the number of iterations are very large here. A realistic search will typically not need this many trials, depending on what kind of results are considered acceptable. A reasonable budget could also be about 200 trials in total.

Visualization of results

For the visualization of optimization results, Optuna offers several functions, which live in the module optuna.visualization.

hyperparameter.py
import optuna
import joblib

study = joblib.load("study_dqn_cartpole.pkl")

An output of the best value and the corresponding hyperparameters is possible in a very simple way:

hyperparameter.py
print("---Evaluation: Best Trial---")

trial = study.best_trial
print("Mean Score of the best Trial: ", trial.value)
print("Parameters of the best Trial: ")
for key, value in trial.params.items():
    print("   {}: {}".format(key, value))

There are a number of visualizations which are very useful. They are demonstrated below using a study with a total of 100 trials as an example.

  1. plot_optimization_history:
hyperparameter.py
from optuna.visualization import plot_optimization_history

plot_optimization_history(study)
  2. plot_param_importances:
hyperparameter.py
from optuna.visualization import plot_param_importances

plot_param_importances(study)
  3. plot_contour:
hyperparameter.py
from optuna.visualization import plot_contour

plot_contour(study, params=["Gamma Parameter", "Epsilon Parameter"])

Any parameters from the definition of the hyperparameters can be used here; a single contour plot always compares exactly two parameters.

Alternatively, all parameters can be visualized in one plot by omitting the params argument.
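
Since the visualization functions return Plotly figures, the plots can also be exported as standalone interactive HTML files (the file name below is arbitrary):

visualize.py
from optuna.visualization import plot_optimization_history

fig = plot_optimization_history(study)
fig.write_html("optimization_history.html")  # interactive plot, viewable in any browser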

Further possibilities of Optuna

Sampler

The optimizations started so far were good in terms of results, but not yet optimal. This is partly due to the fact that only Optuna's default sampler has been used so far, which is not necessarily adapted to the problem.

A sampler is an algorithm that is responsible for suggesting the parameters of the current iteration.

Using a sampler that is better adapted to the problem may improve the efficiency of the optimization. A set of implemented samplers can be seen in the picture below.

Optuna - Samplers

  1. ✅: Feature is supported.

  2. ▲: Works, but inefficiently.

  3. ❌: Buggy or does not have an appropriate interface.

  4. d: Dimension of the search space.

  5. n: Number of completed trials.

  6. m: Number of objectives to be optimized.

  7. p: Size of the population (algorithm-specific).

It can be seen that, for example, the TPE sampler is a good choice for the problems implemented here. The documentation uses it as the default sampler. 2
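
A sampler is passed when the study is created. A minimal sketch for explicitly selecting the TPE sampler (the seed argument is optional and only fixes the sampler's own randomness):

hyperparameter.py
import optuna

sampler = optuna.samplers.TPESampler(seed=69)  # reproducible parameter suggestions
study = optuna.create_study(study_name="DQN_CartPole", direction="maximize", sampler=sampler)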

Pruner

It is possible to use several different pruning algorithms (matched to the sampler) to increase the time efficiency of the optimization. More about this here.

A pruner is an algorithm which is responsible for terminating unpromising trials early.
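
To use a pruner, the objective function has to report intermediate values. A sketch adapted to the training loop above, using Optuna's MedianPruner (reporting once per episode is an assumption, not part of the original code):

hyperparameter.py
import optuna

def objective(trial):
  # ... environment, agent and seed setup as above ...
  for current_episode in range(n_episodes):
    episode_reward, agent = training_loop(env, agent, max_frames_episode)
    agent.train()

    # Report the intermediate score and let the pruner stop hopeless trials
    trial.report(episode_reward, step=current_episode)
    if trial.should_prune():
      raise optuna.TrialPruned()
  # ... evaluation as above ...

study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())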

Constrained optimization

It is also possible to perform optimization on restricted sets or domains. There is an example for this here.
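
A brief sketch of the documented pattern: the objective stores its constraint values as a user attribute, and a constraint-aware sampler such as NSGAIISampler reads them via constraints_func, where values less than or equal to zero count as feasible (the constraint below is purely illustrative):

constrained.py
import optuna

def objective(trial):
  x = trial.suggest_float("x", -10.0, 10.0)
  # Illustrative constraint: feasible if x - 5 <= 0
  trial.set_user_attr("constraint", (x - 5.0,))
  return (x - 2.0) ** 2

def constraints_func(trial):
  return trial.user_attrs["constraint"]

sampler = optuna.samplers.NSGAIISampler(constraints_func=constraints_func)
study = optuna.create_study(direction="minimize", sampler=sampler)
study.optimize(objective, n_trials=100)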

Further visualizations

Optuna is not only a good platform for the optimization of complex problems, but also ships its own visualization environment, comparable to TensorBoard. This can be found here.
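
To use this dashboard, the study has to be stored in a relational database (for example SQLite) instead of a joblib pickle. A sketch under this assumption (the optuna-dashboard package is a separate install):

hyperparameter.py
import optuna

# Persist the study in an SQLite database instead of a joblib pickle
study = optuna.create_study(study_name="DQN_CartPole",
                            storage="sqlite:///dqn_cartpole.db",
                            direction="maximize",
                            load_if_exists=True)

# Afterwards, from a shell (requires `pip install optuna-dashboard`):
#   optuna-dashboard sqlite:///dqn_cartpole.db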

Further information

Basically, this introduction is aimed at reinforcement learning, but it can be applied equivalently to supervised learning. For supervised learning, however, there is a further set of optimization algorithms which are (partially) more efficient in that setting. A classic and simple example is the Keras Hyperband tuner.
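
For completeness, a minimal sketch of the Keras Hyperband tuner in a supervised setting (model architecture, search ranges and the data are purely illustrative):

hyperband_example.py
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
  # hp.Int / hp.Float register the hyperparameters with the tuner
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
      tf.keras.layers.Dense(10, activation="softmax"),
  ])
  model.compile(
      optimizer=tf.keras.optimizers.Adam(hp.Float("learning_rate", 1e-4, 1e-2, sampling="log")),
      loss="sparse_categorical_crossentropy",
      metrics=["accuracy"])
  return model

tuner = kt.Hyperband(build_model, objective="val_accuracy", max_epochs=30, factor=3)
# tuner.search(x_train, y_train, validation_split=0.2)  # training data assumed to exist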

Furthermore, a number of other hyperparameter optimization libraries exist, for example Hydra, among many others. 1

Changes

  1. [27.03.2023] Added interactive plots and removed less relevant plots to keep focus on the important plots.

Sources

Footnotes

  1. medium.com

  2. optuna.readthedocs.io