Model Hyperparameter Tuning
by Chee Yee Lim
Posted on 2021-03-27
Collection of notes on model hyperparameter tuning: a high-level overview of techniques used to tune models.
 In an optimisation problem over a model's hyperparameters, the aim is to identify \( x^* = \operatorname{argmin}_x f(x) \), where \( f \) represents the objective function (e.g. validation loss).
 An optimisation algorithm usually involves a trade-off between exploration and exploitation.
 Exploitation means picking the next points most likely to give better values; exploration means picking points in areas that have been explored less.
 An algorithm will typically emphasise exploration early on, before focusing on exploitation.
 Both model-free and model-based methods are usually combined with cross-validation.
 The relationship between the loss value and a hyperparameter's value is usually not known in closed form.
 Why is hyperparameter tuning difficult?
 Model evaluation is very computationally expensive.
 The configuration space consists of a high-dimensional mixture of continuous, categorical and conditional hyperparameters.
 It is unclear which hyperparameters need tuning and in what range.
 Gradients of the objective with respect to the hyperparameters may not be available, depending on the type of model.
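Since cross-validation underlies the scoring step of most tuning methods mentioned above, a minimal sketch of generating k-fold train/validation splits (hypothetical helper, standard library only) may help make the evaluation loop concrete:

```python
def k_fold_indices(n, k):
    """Split n sample indices into k (train, validation) fold pairs."""
    idx = list(range(n))
    fold = n // k
    for i in range(k):
        val = idx[i * fold : (i + 1) * fold]
        train = idx[: i * fold] + idx[(i + 1) * fold :]
        yield train, val

# Each candidate hyperparameter setting would be scored as the mean
# validation loss across the k folds.
folds = list(k_fold_indices(10, 5))
print(len(folds))
```

A tuner would wrap this in an outer loop over candidate configurations, refitting the model once per fold.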

Model-free methods
 Grid search
 Randomised search
 Less costly than grid search. Very useful when the search space is large and computational resources are limited.
 Population-based search (e.g. genetic algorithms, particle swarm optimisation, evolution strategies)
 Conceptually simple and easy to parallelise.
 E.g. Python packages: pycma (CMA-ES)
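To contrast the two simplest model-free methods, here is a self-contained sketch (toy objective and hyperparameter names are invented for illustration) of grid search versus randomised search over the same budget of 16 evaluations:

```python
import itertools
import random

def objective(lr, depth):
    """Hypothetical validation loss, minimised at lr=0.1, depth=5."""
    return (lr - 0.1) ** 2 + (depth - 5) ** 2 * 0.01

# Grid search: exhaustively evaluate every combination of listed values.
lrs = [0.01, 0.05, 0.1, 0.5]
depths = [3, 5, 7, 9]
grid_best = min(itertools.product(lrs, depths), key=lambda p: objective(*p))

# Randomised search: sample the same number of points from the full ranges,
# so continuous values not on the grid can also be tried.
random.seed(0)
samples = [(random.uniform(0.01, 0.5), random.randint(3, 9)) for _ in range(16)]
rand_best = min(samples, key=lambda p: objective(*p))

print(grid_best, rand_best)
```

Randomised search spends its budget on distinct values of every hyperparameter, which is why it tends to win when only a few hyperparameters actually matter.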

Model-based methods
 Bayesian optimisation
 Bayesian optimisation is an iterative algorithm with two key ingredients: a probabilistic surrogate model and an acquisition function to decide which point to evaluate next.
 In each iteration, the surrogate model is fitted to all observations of the target function made so far. Then the acquisition function, which uses the predictive distribution of the probabilistic model, determines the utility of different candidate points, trading off exploration and exploitation.
 Traditionally, Bayesian optimisation uses Gaussian processes to model the target function.
 Random forests (used by SMAC) and neural networks can be used in place of Gaussian processes as the surrogate model.
 E.g. Python packages: hyperopt (based on TPE)
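The surrogate-plus-acquisition loop described above can be sketched without any library, using a deliberately crude surrogate in place of a Gaussian process (the target function, surrogate and acquisition here are all illustrative assumptions, not any package's API):

```python
import math
import random

def f(x):
    """Expensive black-box target we want to minimise (toy stand-in)."""
    return (x - 2.0) ** 2 + math.sin(5 * x)

def surrogate(x, obs):
    """Crude surrogate: inverse-distance-weighted mean of observed losses,
    plus a distance-based uncertainty term (a stand-in for a GP posterior)."""
    weights = [1.0 / (abs(x - xi) + 1e-6) for xi, _ in obs]
    mean = sum(w * y for w, (_, y) in zip(weights, obs)) / sum(weights)
    sigma = min(abs(x - xi) for xi, _ in obs)  # more uncertain far from data
    return mean, sigma

def acquisition(x, obs, kappa=1.0):
    """Lower confidence bound: low predicted mean (exploitation) minus
    kappa times the uncertainty (exploration)."""
    mean, sigma = surrogate(x, obs)
    return mean - kappa * sigma

random.seed(0)
obs = [(x, f(x)) for x in (0.0, 2.0, 4.0)]  # initial design points
for _ in range(20):
    # The acquisition function is cheap, so optimise it by candidate sampling.
    cands = [random.uniform(0.0, 4.0) for _ in range(200)]
    x_next = min(cands, key=lambda x: acquisition(x, obs))
    obs.append((x_next, f(x_next)))  # one expensive evaluation per iteration

x_best, y_best = min(obs, key=lambda p: p[1])
print(x_best, y_best)
```

Each iteration refits the surrogate to all observations so far and spends exactly one expensive evaluation, which is the whole point when `f` is a model-training run.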
 Multi-fidelity methods
 An extension of Bayesian optimisation that uses multiple low- and/or high-fidelity approximations of the objective, giving cheaper computational costs.
 Fidelity levels can be created using data subsets, feature subsets, image downsampling and/or short runs of iterative algorithms.
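One common multi-fidelity scheme is successive halving (the building block of Hyperband): evaluate many configurations at low fidelity, keep the best half, and double the fidelity. A sketch, where the evaluation function and its noise model are assumed for illustration:

```python
import random

def eval_config(cfg, budget):
    """Hypothetical: validation loss of config cfg at a given training budget.
    Low budgets give noisy estimates that converge to the true loss."""
    true_loss = abs(cfg - 0.3)
    noise = random.gauss(0, 1.0 / budget)
    return true_loss + noise

random.seed(1)
configs = [random.random() for _ in range(16)]
budget = 10
# Successive halving: score all survivors cheaply, keep the best half,
# double the fidelity, and repeat until one configuration remains.
while len(configs) > 1:
    scored = sorted(configs, key=lambda c: eval_config(c, budget))
    configs = scored[: len(configs) // 2]
    budget *= 2

print(configs[0])
```

Poor configurations are discarded after only a cheap low-fidelity look, so most of the total budget goes to the promising ones.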
 Sequential Model-based Optimisation (SMBO)
 SMBO is the most common formalisation of Bayesian optimisation.
 SMBO runs trials one after another, each time trying better hyperparameters by applying Bayesian reasoning to update the probabilistic surrogate model.
 Tree-structured Parzen Estimator (TPE)
 TPE creates two hierarchical density estimators, \( l(x) \) and \( g(x) \), acting as generative models for all hyperparameter variables.
 These densities model the hyperparameter variables separately for observations whose objective value falls below and above a specified quantile \( y^* \).
 \( p(x \mid y) =
\begin{cases}
l(x) & \text{if } y < y^* \\
g(x) & \text{if } y \geq y^*
\end{cases}
\), where \( l(x) \) is the density formed from the observations \( x^{(i)} \) whose loss \( f(x^{(i)}) \) was less than \( y^* \), and \( g(x) \) is the density formed from the remaining observations.
 TPE aims to maximise an acquisition function called Expected Improvement.
 TPE works by drawing sample hyperparameters from \( l(x) \), evaluating them in terms of \( \frac{l(x)}{g(x)} \), and returning the set that yields the highest value under \( \frac{l(x)}{g(x)} \) corresponding to the greatest expected improvement.
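The split-and-ratio idea can be sketched in one dimension with hand-rolled Parzen (kernel density) estimators; the objective, quantile \( \gamma \) and bandwidth below are illustrative assumptions, not hyperopt's internals:

```python
import math
import random

def f(x):
    """Toy black-box objective to minimise (true optimum at x = 0.7)."""
    return (x - 0.7) ** 2

def parzen(x, points, bw=0.1):
    """Parzen (Gaussian kernel) density estimate from a list of observations."""
    return sum(math.exp(-0.5 * ((x - p) / bw) ** 2) for p in points) \
           / (len(points) * bw * math.sqrt(2 * math.pi))

random.seed(0)
obs = [(x, f(x)) for x in (random.random() for _ in range(20))]
for _ in range(30):
    # Split observations at the gamma-quantile y* of the losses seen so far.
    gamma = 0.25
    ys = sorted(y for _, y in obs)
    y_star = ys[int(gamma * len(ys))]
    good = [x for x, y in obs if y < y_star]   # forms l(x)
    bad = [x for x, y in obs if y >= y_star]   # forms g(x)
    # Draw candidates from l(x) (here: jittered good points) and keep the
    # one maximising l(x)/g(x), the proxy for expected improvement.
    cands = [random.gauss(random.choice(good), 0.1) for _ in range(50)]
    x_next = max(cands, key=lambda x: parzen(x, good) / (parzen(x, bad) + 1e-12))
    obs.append((x_next, f(x_next)))

x_best = min(obs, key=lambda p: p[1])[0]
print(x_best)
```

Note that candidates are only ever drawn from the "good" density, so the ratio \( l(x)/g(x) \) merely reranks them towards regions the "bad" density considers unlikely.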
 Advantages
 Faster runtime: scales linearly in the number of variables being optimised.
 Supports tuning conditional hyperparameters.
 Can tune both continuous and discrete variables.