Classification
When working with neural networks, achieving good performance is difficult for several reasons. The performance of a model, however it is defined, depends on several factors, for example the amount of data used, the number of training epochs, and the architecture and parameters of the model. Hyperparameter optimization is used especially for the architecture and parameters of the model, because the search space is too large to cover with a classical grid search. 1
Basics
What is the goal of hyperparameter optimization?
The goal of hyperparameter optimization is to find a configuration of parameters that makes the model perform optimally. In practice, however, we will often find only a good configuration rather than the best one.
This is because optimizing a neural network is a very complex, highly non-linear task. A good configuration is still better than no optimization, and spending more time on the optimization will usually yield slightly better results.
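Written as a formula, the goal can be sketched as follows. The notation is an assumption that does not appear in the original: \Lambda stands for the search space of configurations and f for the chosen performance measure.

```latex
\lambda^{*} \in \operatorname*{arg\,max}_{\lambda \in \Lambda} f(\lambda),
\qquad
\text{in practice: find } \hat{\lambda} \in \Lambda
\text{ with } f(\hat{\lambda}) \text{ close to } f(\lambda^{*}).
```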
Implementation of a hyperparameter optimization
We will now implement two examples that use the optimization library Optuna. It offers a large number of features that are important for fast and efficient optimization (for example distributed optimization) as well as for visualizing the results. Optuna is also compatible with state-of-the-art machine learning libraries such as TensorFlow (including Keras) and PyTorch.
Hyperparameter optimization of a DQN
A fundamentally more complex example (which tends to be the case in machine learning) is the hyperparameter optimization of a Deep Q-Network. It builds on the implementation of the Deep Q-Network presented a week ago. The imports in this case are as follows:
Joblib is a library that is used here to store the Optuna studies between iterations.
Next, a set of global variables is defined that hold information used by various functions:
Then the objective function is defined:
Several things happen in this function:
- initialization of the agent and seeds
- training of the agent
- evaluation of the agent using evaluation=True.
The seed is the value used to initialize a pseudo-random number generator, as used in almost all computer applications. Gym, TensorFlow and NumPy each rely on such a generator as well.
With this function, the hyperparameters that give the best results under the given seed are selected. The seed could be removed, which would potentially lead to different results.
Then, another function can be defined that trains the configuration with the best parameters without a seed. Here, several agents are trained and the best iteration is returned.
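A minimal sketch of this idea is shown below. The function names, the default of five agents and the toy train_and_evaluate stand-in are all assumptions; in the real setup, train_fn would train and evaluate a Deep Q-Network.

```python
import random


def train_and_evaluate(learning_rate: float, gamma: float):
    """Hypothetical stand-in: 'trains' an agent and returns (agent, score)."""
    agent = {"learning_rate": learning_rate, "gamma": gamma}
    # No seed is set, so every call yields a slightly different score.
    score = gamma + random.gauss(0.0, 0.05)
    return agent, score


def train_best_configuration(best_params, n_agents=5, train_fn=train_and_evaluate):
    """Train several agents with the same parameters and keep the best one."""
    best_agent, best_score = None, float("-inf")
    for _ in range(n_agents):
        agent, score = train_fn(**best_params)
        if score > best_score:
            best_agent, best_score = agent, score
    return best_agent, best_score


best_agent, best_score = train_best_configuration(
    {"learning_rate": 1e-3, "gamma": 0.95}
)
print("best score over all agents:", best_score)
```

In practice, best_params would come from study.best_params of the finished study.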
Finally, what is still missing is the initialization of the hyperparameter optimization and the caching of results:
Caching the results may be necessary because a hyperparameter optimization can take a very long time. A try-except block was set up here to catch errors that occur under certain circumstances and to be able to read them out. This is especially advantageous when running on virtual machines such as Google Colab in non-interactive mode.
The maximum number of trials per iteration and the number of iterations are very large here. Depending on what results are considered acceptable, a realistic search will typically not need nearly as many iterations. A reasonable assumption here could be 200 iterations in total.
Visualization of results
For the visualization of optimization results, Optuna provides several functions in the module optuna.visualization.
An output of the best value and the corresponding hyperparameters is possible in a very simple way:
There are a number of visualizations which are very useful. These are demonstrated below by way of example, using a study with a total of 100 iterations.
plot_optimization_history:
plot_param_importances:
plot_contour:
Parameters from the definition of the hyperparameters can be selected here via the params argument. Each contour panel plots exactly two parameters against each other.
Alternatively, all parameters can be visualized in a plot by not using the params argument.
Further possibilities of Optuna
Sampler
The optimizations run so far produced good results, but not yet optimal ones. This is partly because only Optuna's default sampler has been used so far, which is not necessarily suited to the problem.
A sampler is an algorithm that is responsible for suggesting the parameters of the current iteration.
Using a better-suited sampler may improve the efficiency of the optimization. A set of implemented samplers can be seen in the picture below.
- ✅: Feature is supported.
- ▲: Works, but inefficiently.
- ❌: Buggy or does not have an appropriate interface.
- d: Dimension of the search space.
- n: Number of completed trials.
- m: Number of targets to be optimized.
- p: Size of the population (algorithm-specific).
It can be seen that, for example, the TPE sampler is a good choice for the problems implemented here. According to the documentation, Optuna uses it as the default sampler. 2
Pruner
Several different pruning algorithms can be used to increase the time efficiency of the optimization; the choice should be matched to the sampler. More about this here.
A pruner is an algorithm which is responsible for the early termination of iterations that are not very promising.
Constrained optimization
It is also possible to perform optimization on restricted sets or domains. There is an example for this here.
Further visualizations
Optuna is not only a good platform for optimizing complex problems, but also has its own visualization environment comparable to TensorBoard. This can be found here.
Further information
This introduction is aimed primarily at reinforcement learning, but it applies equally to supervised learning. For supervised learning, however, there is a further set of optimization algorithms that are (in part) more efficient there. A classic and simple example is the Keras Hyperband Tuner.
Furthermore, a number of other hyperparameter optimization libraries exist, for example Hydra, among many others. 1
Changes
- [27.03.2023] Added interactive plots and removed less relevant plots to keep focus on the important plots.