Model Evaluation

by Chee Yee Lim


Posted on 2021-03-17



Collection of notes on model evaluation - high level overview of techniques used to evaluate model performance.


Overview

  • 3 general ways to evaluate a model:
    • Compare a model with data (e.g. training, testing or verification data).
    • Compare a model with another fitted model.
    • Manually check model predictions by human experts.
  • Residuals are what is left over after fitting a model.
    • In many models, the residuals are equal to the difference between the observations and the corresponding fitted values.
    • \( Residuals = y - \hat{y} \)
    • Residuals are useful for checking whether a model has adequately captured the information in the data.
  • Bias-variance tradeoff
    • Expected test error can be decomposed into the following 3 components:
      • \( Expected\ Test\ Error = Bias^2 + Variance + Noise \)
    • Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
      • \( Bias = \mathbb{E}[\hat{y}] - y \)
    • Variance is the variability of model predictions for a given data point, i.e. how much the predictions would change if the model were trained on different samples of the data.
      • \( Var = \mathbb{E}[( \hat{y} - \mathbb{E}[\hat{y}] )^2] \)
    • Noise is independent of the underlying model and reflects the inherent noise in the problem (i.e. the irreducible error).
    • There is a tradeoff between a model's ability to minimize bias and variance.
    • High bias usually stems from overly simple model assumptions, while high variance stems from overly complex ones (see the simulation sketch after this list).
      (Figures: bias-variance visualization; bias-variance tradeoff.)
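
A minimal simulation sketch of this decomposition, assuming NumPy and scikit-learn are available. The data-generating function `true_f`, the noise level and the choice of `DecisionTreeRegressor` are illustrative assumptions, not part of any particular method; bias and variance are estimated by refitting the model on repeated training samples and inspecting its predictions at a single test point.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(3 * x)  # assumed ground-truth function (illustrative)

x_test = np.array([[0.5]])   # single test point at which the error is decomposed
noise_sd = 0.3               # assumed noise level
n_repeats, n_train = 200, 50

preds = []
for _ in range(n_repeats):
    # Draw a fresh training sample and refit the model each round
    x_train = rng.uniform(0, 1, size=(n_train, 1))
    y_train = true_f(x_train).ravel() + rng.normal(0, noise_sd, n_train)
    model = DecisionTreeRegressor(max_depth=4).fit(x_train, y_train)
    preds.append(model.predict(x_test)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x_test)[0, 0]) ** 2  # Bias^2
variance = preds.var()                                # Variance of the predictions
noise = noise_sd ** 2                                 # irreducible Noise
print(f"bias^2={bias_sq:.3f}, variance={variance:.3f}, noise={noise:.3f}, "
      f"expected test error ~ {bias_sq + variance + noise:.3f}")
```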

List of Methods (ML perspective)

  • For discrete outputs only (see the classification metrics sketch after this list)
    • Confusion matrix
      • It lists out the number of true positives, true negatives, false positives and false negatives.
      •                        Negative (Predicted)    Positive (Predicted)
        Negative (Actual)      True Negative           False Positive
        Positive (Actual)      False Negative          True Positive
    • Accuracy
      • It is defined as the number of correct predictions divided by the total number of predictions.
      • \( Accuracy = \frac{True\ Positive + True\ Negative}{True\ Positive + True\ Negative + False\ Positive + False\ Negative} \)
      • For imbalanced datasets, accuracy is not a good metric; precision and recall are more informative in this case.
    • Precision / Positive Predictive Value
      • It shows how often the model is correct when its prediction is positive.
      • \( Precision = \frac{True\ Positive}{True\ Positive + False\ Positive} \)
      • \( Precision = \frac{True\ Positive}{Total\ Predicted\ Positive} \)
      • It is a good metric to use when the cost of false positives is high.
        • E.g. email spam detection.
    • Recall / True Positive Rate
      • It shows how well the model identifies the actual positive cases.
      • \( Recall = \frac{True\ Positive}{True\ Positive + False\ Negative} \)
      • \( Recall = \frac{True\ Positive}{Total\ Actual\ Positive} \)
      • It is a good metric to use when the cost of false negatives is high.
        • E.g. sick patient detection, fraud detection.
    • F1 Score
      • It is the harmonic mean of precision and recall.
      • \( F1\ Measure = 2 \times \frac{Precision \times Recall}{Precision + Recall} \)
      • It is used when a balance between precision and recall is needed.
    • ROC / PR curves
      • ROC - plots the trade-off between the true positive rate (TPR) and the false positive rate (FPR) across classification thresholds.
      • PR - plots the trade-off between precision and recall across classification thresholds.
    • AUROC (Area under the receiver operating characteristic)
      • The area under the curve of TPR and FPR can be interpreted as the probability that the model ranks a random positive example higher than a random negative example.
      • The larger the area under the curve (i.e. closer to 1), the better the model is performing.
      • A model with an AUROC of 0.5 is useless, since its ranking of examples is no better than random guessing.
  • For continuous outputs only (see the regression metrics sketch after this list)
    • \( R^2 \) / coefficient of determination
      • \( R^2 \) value shows the total proportion of variance in the dependent variable explained by the model (i.e. independent variables).
      • \( R^2 = \frac{\sum{( \hat{y}_{t} - \bar{y} )^2}}{\sum{( y_t - \bar{y} )^2}} \)
      • It is also equivalent to the square of the correlation between the observed values and the predicted values.
      • Values range between 0 and 1. The higher the value, the better.
      • The value of \( R^2 \) will never decrease when adding an extra predictor to the model and this can lead to overfitting.
      • There are no set rules for what is a good \( R^2 \), so validating a model's performance on the test data is much better than measuring \( R^2 \) on the training data.
    • Explained variance
      • \( Explained\ variance = 1 - \frac{Var(y-\hat{y})}{Var(y)} \)
      • Best score is 1.0.
    • Mean absolute error (MAE)
      • MAE is the average of the absolute differences between the predicted and actual values (i.e. the absolute errors).
      • \( MAE = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} | y_i - \hat{y}_i | \)
      • Its unit scale will be the same as the target variable.
    • Mean squared error (MSE)
      • MSE is the mean of the squared errors.
      • \( MSE = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} ( y_i - \hat{y}_i )^2 \)
    • Root mean squared error (RMSE)
      • RMSE (variation of MSE) indicates how close the predicted values are to the actual values.
      • The lower the RMSE value, the better.
      • RMSE is the square root of the mean of the squared errors, \( \sqrt{ \frac{1}{n} \sum_{i=1}^n ( y_i - \hat{y}_i )^2 } \)
      • Its unit scale will be the same as the target variable.
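
A hedged sketch of the classification metrics above, assuming scikit-learn is available. The toy dataset from `make_classification` (deliberately imbalanced), the class weights and the `LogisticRegression` model are arbitrary illustrative choices; only the metric functions correspond to the list above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Toy, imbalanced binary-classification problem (illustrative)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)               # hard labels for the threshold-based metrics
y_score = clf.predict_proba(X_test)[:, 1]  # scores for the ranking metric (AUROC)

print(confusion_matrix(y_test, y_pred))    # rows = actual, columns = predicted
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("AUROC    :", roc_auc_score(y_test, y_score))
```

Because the classes are imbalanced, accuracy can look flattering here even when recall for the minority class is modest, which is the point made about imbalanced datasets above.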
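
A companion sketch for the continuous-output metrics, again assuming scikit-learn; the `make_regression` dataset and the `LinearRegression` model are illustrative stand-ins.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split

# Toy regression problem (illustrative)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("R^2               :", r2_score(y_test, y_pred))
print("explained variance:", explained_variance_score(y_test, y_pred))
print("MAE               :", mean_absolute_error(y_test, y_pred))
mse = mean_squared_error(y_test, y_pred)
print("MSE               :", mse)
print("RMSE              :", np.sqrt(mse))  # same units as the target variable
```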

List of Methods (Stats perspective)

  • Statistics for assessing goodness-of-fit (applicable across any GLM; see the statsmodels sketch after this list)
    • Pearson chi-square statistic \( X^2 \)
    • Deviance \( G^2 \)
    • Likelihood ratio test and statistic \( \Delta G^2 \)
    • Residual analysis (e.g. Pearson, deviance, adjusted residuals)
    • Overdispersion (i.e. observed variance is larger than assumed variance, \( Var(Y) = \varphi \mu \), where \( \varphi \) is a scale parameter)
      1. Can adjust for overdispersion by estimating \( \hat{\varphi} = \frac{X^2}{N - p} \).
      2. Alternatively, use another model, such as negative binomial regression in place of Poisson regression.
  • Adjusted R-squared (see the small NumPy sketch after this list, which also covers the residual standard error)
    • Tells how much of the total variance of the target outcome (i.e. dependent variable) is explained by the model (i.e. independent variables).
    • \( Adjusted\ R^2 = 1 - \frac{(1-R^2)(N-1)}{N-p-1} \), where \( N \) is sample size and \( p \) is the number of predictors.
    • The higher the value, the better.
    • As \( R^2 \) will keep increasing with the addition of independent variables, adjusted \( R^2 \) corrects for this by penalizing the number of input features used.
  • Residual standard error
    • This is essentially the standard deviation of the residuals, with the degrees of freedom adjusted for the number of estimated parameters (\( n - 2 \) for simple linear regression).
    • \( \sigma_{res} = \sqrt{ \frac{ \sum (y - \hat{y})^2 }{ n-2 } } \), where \( y \) = observed value, \( \hat{y} \) = estimated value, \( n \) = number of observations.
  • Likelihood ratio test
    • The likelihood ratio test compares the log-likelihood of an unrestricted model (UM) with that of a restricted model (RM) in which some parameters/coefficients are constrained (e.g. set to 0).
    • \( likelihood\ ratio\ test = -2(l(\beta_{RM}) - l(\beta_{UM})) \)
    • The test requires the restricted model to be nested within the unrestricted one, and both models to be fitted on the same observations.
    • A larger test statistic is stronger evidence against the restricted model; for each model on its own, a higher log-likelihood indicates a better fit. (The statistic is computed in the statsmodels sketch after this list.)
  • Deviance test
    • Deviance test compares the fit of the proposed model (PM) with the fit of a saturated model (SM).
    • \( deviance\ test = -2(l(\beta_{PM}) - l(\beta_{SM})) \)
    • Lower deviance is better.
  • Akaike Information Criterion (AIC) / Bayesian Information Criterion (BIC)
    • Both measures are based on the log-likelihood but penalize the number of parameters included in the model.
    • BIC penalizes the number of parameters more heavily than the AIC.
      • The model chosen by BIC will be either the same as AIC, or one with fewer terms.
    • AIC is defined as:
      • \( AIC = T \log \frac{SSE}{T} + 2(k+2) \), where \( T \) is the number of observations used for fitting model and \( k \) is the number of predictors in the model.
    • There is a corrected form of AIC, which compensates for AIC's tendency to select too many predictors when \( T \) is small.
      • \( AIC_c = AIC + \frac{ 2(k+2)(k+3) }{ T-k-3 } \)
    • For large values of \( T \), minimizing AIC is similar to minimizing the leave-one-out cross-validated MSE value.
    • AIC can also be defined in terms of likelihood:
      • \( AIC = -2l(\beta) + 2p \)
    • BIC is defined as:
      • \( BIC = T \log \frac{SSE}{T} + (k+2) \log (T) \), where \( T \) is the number of observations used for fitting model and \( k \) is the number of predictors in the model.
    • For large values of \( T \), minimizing BIC is similar to leave-v-out cross-validation when \( v = T [1 - \frac{1}{(\log (T) - 1)}] \)
    • BIC can also be defined in terms of likelihood:
      • \( BIC = -2l(\beta) + p \log (n) \)
    • Lower AIC/BIC indicates a better model.
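
A small NumPy sketch implementing the adjusted \( R^2 \) and residual standard error formulas above; the helper names and the example numbers are made-up illustrations.

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def residual_standard_error(y, y_hat):
    """sqrt( sum((y - y_hat)^2) / (n - 2) ), the simple-regression convention used above."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.sum((y - y_hat) ** 2) / (len(y) - 2))

# Illustrative numbers only
print(adjusted_r2(r2=0.81, n=100, p=4))
print(residual_standard_error([3.1, 2.8, 4.0, 5.2], [3.0, 3.0, 4.1, 5.0]))
```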
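
A sketch of the likelihood-based diagnostics (deviance, Pearson \( X^2 \), a dispersion estimate, AIC and the likelihood ratio test), assuming statsmodels and SciPy are available. The simulated Poisson data, the coefficients and the restricted-versus-full model comparison are illustrative assumptions, not a prescribed workflow.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)

# Simulated Poisson-regression data; coefficients are illustrative, and x2 has no true effect
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.poisson(np.exp(0.3 + 0.8 * x1))

X_full = sm.add_constant(np.column_stack([x1, x2]))
X_restricted = sm.add_constant(x1)  # restricted model: x2's coefficient fixed at 0

full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()
restricted = sm.GLM(y, X_restricted, family=sm.families.Poisson()).fit()

# Goodness-of-fit statistics reported by statsmodels
print("deviance       :", full.deviance)
print("Pearson chi^2  :", full.pearson_chi2)
print("dispersion est.:", full.pearson_chi2 / full.df_resid)  # phi-hat = X^2 / (N - p)
print("AIC            :", full.aic)                           # BIC is exposed similarly

# Likelihood ratio test: -2 * (l(restricted) - l(full)), chi^2 with 1 df here
lr_stat = -2 * (restricted.llf - full.llf)
print("LR statistic   :", lr_stat, "p-value:", stats.chi2.sf(lr_stat, df=1))
```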