Model Hyperparameter Tuning

by Chee Yee Lim


Posted on 2021-03-27



Collection of notes on model hyperparameter tuning - high level overview of techniques used to tune models.


Hyperparameter Tuning

Overview

  • In an optimisation problem regarding model's hyperparameters, the aim is to identify \( x^* = argmin_x f(x) \), where \( f \) represents the objective function.
  • An optimisation algorithm usually has a trade-off between exploration and exploitation.
    • Exploitation means to pick the next points most likely to give better values; exploration means to pick the next point in an area we have less explored.
    • Algorithm will emphasise on exploration early on, before focusing on exploitation.
  • Both model-free and model-based methods are usually combined with cross validation.
  • The relationship between loss value and parameter value is usually not known.
    • Loss Value vs Parameter Value
  • Why is hyperparameter tuning difficult?
    • Model evaluation is very computationally expensive.
    • The configuration space consists of high dimensional mixture of continuous, categorical and conditional hyperparameters.
    • It is unclear which hyperparameters need tuning and in what range.
    • The hyperparameter gradients may not be available depending on the type of model.

List of Methods

  • Model-free methods

    • Grid search
      • Exhaustive search
    • Randomised search
      • Less costly. Very useful if the search space is big and computational resource is limited.
      • E.g. Python packages: pycma (CMA-ES)
    • Population-based search (e.g. genetic algorithms, particle swarm optimisation)
      • Conceptually simple and easy to parallelise.
  • Model-based methods

    • Bayesian optimisation
      • Bayesian optimisation is an iterative algorithm with two key ingredients: a probabilistic surrogate model and an acquisition function to decide which point to evaluate next.
      • In each iteration, the surrogate model is fitted to all observations of the target function made so far. Then the acquisition function, which uses the predictive distribution of the probabilistic model, determines the utility of different candidate points, trading off exploration and exploitation.
      • Traditionally, Bayesian optimisation uses Gaussian processes to model the target function.
      • Random forest (used by SMAC) and neural network can be used to replace Gaussian process as the surrogate model.
      • E.g. Python packages: hyperopt (based on TPE)
    • Multi-fidelity methods
      • It is an extension of Bayesian optimisation using multiple low and/or high-fidelity approximations. Cheaper computational costs due to the use of approximations.
      • Multiple low and/or high-fidelity can be achieved by using data subsets, feature subsets, image downsampling and/or short runs of iterative algorithms.

Bayesian Optimisation Algorithms

  • Sequential Model-based Optimisation (SMBO)
    • SMBO is the most common formalization of Bayesian optimisation.
    • SMBO tries to run trials one after another, each time trying better hyperparameters by applying Bayesian reasoning and updating the probability surrogate model.
  • Tree Parzen Estimators (TPE)
    • TPE creates two hierarchical processes, \( l(x) \) and \( g(x) \) acting as generative models for all hyperparameter variables.
    • These processes model the hyperparameter variables separately when the objective function is below and above a specified quantile \( y^* \).
    • \( p(x|y) = \begin{cases} l(x) &\text{if $y < y^*$} \\\ g(x) &\text{if $y \geq y^*$} \end{cases} \), where \( l(x) \) is the density formed by using the observations \( x^i \) such that corresponding loss \( f(x^i) \) was less than \( y^* \) and \( g(x) \) is the density formed by using the remaining observations.
    • TPE aims to maximise an acquisition function called Expected Improvement.
    • TPE works by drawing sample hyperparameters from \( l(x) \), evaluating them in terms of \( \frac{l(x)}{g(x)} \), and returning the set that yields the highest value under \( \frac{l(x)}{g(x)} \) corresponding to the greatest expected improvement.
    • Advantages
      • Faster runtime. Scale linearly in number of variables being optimized.
      • Supports tuning conditional hyperparameters.
      • Can tune both continuous and discrete variables.