Model Evaluation
by Chee Yee Lim
Posted on 2021-03-17
Collection of notes on model evaluation - a high-level overview of techniques used to evaluate model performance.
 3 general ways to evaluate a model:
 Compare a model with data (e.g. training, testing or verification data).
 Compare a model with another fitted model.
 Manually check model predictions by human experts.
 Residuals are what is left over after fitting a model.
 In many models, the residuals are equal to the difference between the observations and the corresponding fitted values.
 \( Residuals = y - \hat{y} \)
 Residuals are useful for checking whether a model has adequately captured the information in the data.
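 As a minimal sketch with toy numbers, residuals can be computed directly with NumPy:

```python
import numpy as np

# Toy observed and fitted values (hypothetical numbers for illustration).
y = np.array([3.1, 4.8, 6.2, 7.9])      # observations
y_hat = np.array([3.0, 5.0, 6.0, 8.0])  # fitted values from some model

residuals = y - y_hat
print(residuals)  # [ 0.1 -0.2  0.2 -0.1]
```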
 Bias-variance tradeoff
 Expected test error can be decomposed into the following 3 components:
 \( Expected\ Test\ Error = Bias^2 + Variance + Noise \)
 Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
 \( Bias = \mathbb{E}[\hat{y}] - y \)
 Variance is the variability of the model prediction for a given data point, i.e. the spread of predictions across models fitted on different training sets.
 \( Var = \mathbb{E}[( \hat{y} - \mathbb{E}[\hat{y}] )^2] \)
 Noise is independent of the underlying model and has to do with the inherent noise in the problem.
 There is a tradeoff between a model's ability to minimize bias and variance.
 High bias usually stems from oversimplified model assumptions, while high variance usually stems from overly complex ones.
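 A small simulation can make the decomposition concrete. The sketch below (a toy setup, assuming only NumPy) estimates the bias² and variance terms for an intentionally underfitting linear model at a single test point, by refitting it on many freshly drawn training sets; the noise term is left out since it does not depend on the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)  # the (usually unknown) target function

# Estimate bias^2 and variance of a degree-1 polynomial fit at one
# test point, by refitting on many freshly drawn training sets.
x_test, n_sims, noise_sd = 2.0, 500, 0.3
preds = []
for _ in range(n_sims):
    x_train = rng.uniform(0, np.pi, 30)
    y_train = true_f(x_train) + rng.normal(0, noise_sd, 30)
    coeffs = np.polyfit(x_train, y_train, deg=1)  # a line underfits sin(x)
    preds.append(np.polyval(coeffs, x_test))

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x_test)) ** 2  # (E[y_hat] - y)^2
variance = preds.var()                          # E[(y_hat - E[y_hat])^2]
print(f"bias^2 ~= {bias_sq:.4f}, variance ~= {variance:.4f}")
```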
 For discrete outputs only
 Confusion matrix
 It lists out the number of true positives, true negatives, false positives and false negatives.

|                   | Negative (Predicted) | Positive (Predicted) |
|-------------------|----------------------|----------------------|
| Negative (Actual) | True Negative        | False Positive       |
| Positive (Actual) | False Negative       | True Positive        |
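 A minimal sketch with scikit-learn (the labels below are toy data): for a binary problem, the four cells can be read off `confusion_matrix` directly.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # toy actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # toy predicted labels

# With labels=[0, 1], rows are actual [neg, pos] and columns are
# predicted [neg, pos], so ravel() yields TN, FP, FN, TP in order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3
```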
 Accuracy
 It is defined as the number of correct predictions divided by the total number of predictions.
 \( Accuracy = \frac{True\ Positive + True\ Negative}{True\ Positive + True\ Negative + False\ Positive + False\ Negative} \)
 For imbalanced datasets, accuracy is not a good metric. Precision and recall are better metrics in this case.
 Precision / Positive Predictive Value
 It shows how often the model is correct when its prediction is positive.
 \( Precision = \frac{True\ Positive}{True\ Positive + False\ Positive} \)
 \( Precision = \frac{True\ Positive}{Total\ Predicted\ Positive} \)
 It is a good metric to use when the cost of false positives is high.
 E.g. email spam detection.
 Recall / True Positive Rate
 It shows how well the model finds the actual positive cases.
 \( Recall = \frac{True\ Positive}{True\ Positive + False\ Negative} \)
 \( Recall = \frac{True\ Positive}{Total\ Actual\ Positive} \)
 It is a good metric to use when the cost of false negatives is high.
 E.g. sick patient detection, fraud detection.
 F1 Score
 It is the harmonic mean of precision and recall.
 \( F1\ Score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \)
 It is used when a balance between precision and recall is needed.
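 The four metrics above all have scikit-learn implementations; a minimal sketch with the same toy labels as before:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # toy actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # toy predicted labels

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall   :", recall_score(y_true, y_pred))     # 0.75
print("f1       :", f1_score(y_true, y_pred))         # 0.75
```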
 ROC / PR curves
 ROC - tradeoff between the true positive rate (TPR) and the false positive rate (FPR).
 PR - tradeoff between precision and recall.
 AUROC (Area under the receiver operating characteristic curve)
 The area under the curve of TPR and FPR can be interpreted as the probability that the model ranks a random positive example higher than a random negative example.
 The larger the area under the curve (i.e. closer to 1), the better the model is performing.
 A model with AUROC of 0.5 is useless since its predictive accuracy is just as good as random guessing.
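 A minimal sketch with scikit-learn, using hypothetical predicted probabilities; `roc_curve` returns the points of the curve and `roc_auc_score` the area under it:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # toy actual labels
y_score = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2, 0.9, 0.3]  # toy predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUROC:", roc_auc_score(y_true, y_score))  # 0.9375
```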
 For continuous outputs only
 \( R^2 \) / coefficient of determination
 \( R^2 \) value shows the total proportion of variance in the dependent variable explained by the model (i.e. independent variables).
 \( R^2 = \frac{\sum{( \hat{y}_{t} - \bar{y} )^2}}{\sum{( y_t - \bar{y} )^2}} \)
 It is also equivalent to the square of the correlation between the observed values and the predicted values.
 Values range between 0 and 1. The higher the value, the better.
 The value of \( R^2 \) will never decrease when adding an extra predictor to the model and this can lead to overfitting.
 There are no set rules for what is a good \( R^2 \), so validating a model's performance on the test data is much better than measuring \( R^2 \) on the training data.
 Explained variance
 \( Explained\ variance = 1 - \frac{Var(y - \hat{y})}{Var(y)} \)
 Best score is 1.0.
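 Both \( R^2 \) and explained variance are available in scikit-learn; a quick sketch with toy values:

```python
from sklearn.metrics import r2_score, explained_variance_score

y_true = [3.0, 5.0, 6.5, 8.0]  # toy observed values
y_pred = [2.8, 5.4, 6.0, 8.3]  # toy predicted values

print("R^2               :", r2_score(y_true, y_pred))
print("explained variance:", explained_variance_score(y_true, y_pred))
```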
 Mean absolute error (MAE)
 MAE is the average of the absolute differences between predicted and actual values (i.e. absolute errors).
 \( MAE = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} | y_i - \hat{y}_i | \)
 Its unit scale will be the same as the target variable.
 Mean squared error (MSE)
 MSE is the mean of the squared errors.
 \( MSE = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} ( y_i - \hat{y}_i )^2 \)
 Root mean squared error (RMSE)
 RMSE (a variant of MSE) indicates how close the predicted values are to the actual values.
 The lower the RMSE value, the better.
 RMSE is the square root of the mean of the squared errors, \( \sqrt{ \frac{1}{n} \sum_{i=1}^n ( y_i - \hat{y}_i )^2 } \)
 Its unit scale will be the same as the target variable.
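 A sketch of all three error metrics with toy values, assuming scikit-learn and NumPy (RMSE is computed as the square root of MSE):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 6.5, 8.0])  # toy observed values
y_pred = np.array([2.8, 5.4, 6.0, 8.3])  # toy predicted values

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back on the same unit scale as the target
print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}")
```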
 Statistics for assessing goodness-of-fit (applicable across any GLM)
 Pearson chi-square statistic \( X^2 \)
 Deviance \( G^2 \)
 Likelihood ratio test and statistic \( \Delta G^2 \)
 Residual analysis (e.g. Pearson, deviance, adjusted residuals)
 Overdispersion (i.e. observed variance is larger than assumed variance, \( Var(Y) = \varphi \mu \), where \( \varphi \) is a scale parameter)
 Can adjust for overdispersion where we estimate \( \varphi = \frac{X^2}{N - p} \)
 Or use another model, such as negative binomial regression, to replace Poisson regression.
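 As an illustration of the dispersion estimate \( \varphi = \frac{X^2}{N - p} \), the sketch below fits a Poisson GLM with statsmodels on simulated count data (the covariate and coefficients are hypothetical) and reads the Pearson chi-square statistic off the fitted results:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 200)              # toy covariate
y = rng.poisson(np.exp(0.5 + 0.8 * x))  # toy Poisson counts

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# phi = X^2 / (N - p); values well above 1 suggest overdispersion.
phi = fit.pearson_chi2 / fit.df_resid
print("estimated dispersion phi:", phi)  # ~1 for well-specified Poisson data
```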
 Adjusted R-squared
 Tells how much the total variance of the target outcome (i.e. dependent variable) is explained by the model (i.e. independent variables).
 \( Adjusted\ R^2 = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1} \), where \( N \) is sample size and \( p \) is the number of predictors.
 The higher the value, the better.
 As \( R^2 \) will keep on increasing with the addition of a new independent variable, adjusted \( R^2 \) corrects for this by taking the number of input features into account.
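 Adjusted \( R^2 \) is simple to compute by hand; a minimal sketch (the inputs are hypothetical):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(r2=0.85, n=100, p=5))  # ~0.842, slightly below R^2
```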
 Residual standard error
 This is equivalent to the standard deviation of the residuals.
 \( \sigma_{res} = \sqrt{ \frac{ \sum (y - \hat{y})^2 }{ n - 2 } } \), where \( y \) = observed value, \( \hat{y} \) = estimated value, \( n \) = number of observations.
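 A direct NumPy translation of the formula (the \( n - 2 \) in the denominator assumes a simple linear regression, i.e. two estimated parameters):

```python
import numpy as np

def residual_standard_error(y, y_hat):
    """Standard deviation of the residuals, with n - 2 degrees of freedom."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.sum((y - y_hat) ** 2) / (len(y) - 2))

print(residual_standard_error([3.0, 5.0, 6.5, 8.0], [2.8, 5.4, 6.0, 8.3]))
```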
 Likelihood ratio test
 The likelihood ratio test compares the log-likelihood of an unrestricted model (UM) with that of a restricted model (RM) where all parameters/coefficients are set to 0.
 \( likelihood\ ratio\ test = -2(l(\beta_{RM}) - l(\beta_{UM})) \)
 The test requires both models to be nested and fitted on the same observations.
 A higher log-likelihood indicates a better fit, so a large test statistic favours the unrestricted model.
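 A sketch of the test using SciPy's chi-square survival function; the log-likelihoods and degrees of freedom below are hypothetical stand-ins for what fitted models would report (e.g. the `.llf` attribute of statsmodels results):

```python
from scipy.stats import chi2

ll_um = -512.3  # hypothetical log-likelihood of the unrestricted model
ll_rm = -520.9  # hypothetical log-likelihood of the restricted model
df = 2          # number of parameters fixed at 0 in the restricted model

lr_stat = -2 * (ll_rm - ll_um)
p_value = chi2.sf(lr_stat, df)  # small p-value favours the unrestricted model
print(f"LR statistic={lr_stat:.2f}, p-value={p_value:.4f}")
```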
 Deviance test
 Deviance test compares the fit of the proposed model (PM) with the fit of a saturated model (SM).
 \( deviance\ test = -2(l(\beta_{PM}) - l(\beta_{SM})) \)
 Lower deviance is better.
 Akaike Information Criterion (AIC) / Bayesian Information Criterion (BIC)
 Both measures are based on the log-likelihood but penalize the number of parameters included in the model.
 BIC penalizes the number of parameters more heavily than the AIC.
 The model chosen by BIC will be either the same as AIC, or one with fewer terms.
 AIC is defined as:
 \( AIC = T \log \frac{SSE}{T} + 2(k+2) \), where \( T \) is the number of observations used for fitting model and \( k \) is the number of predictors in the model.
 There is a corrected form of AIC (\( AIC_c \)), which compensates for AIC's tendency to select too many predictors when \( T \) is small.
 \( AIC_c = AIC + \frac{ 2(k+2)(k+3) }{ T - k - 3 } \)
 For large values of \( T \), minimizing AIC is similar to minimizing the leave-one-out cross-validated MSE value.
 AIC can also be defined in terms of likelihood:
 \( AIC = -2l(\beta) + 2p \), where \( p \) is the number of parameters in the model.
 BIC is defined as:
 \( BIC = T \log \frac{SSE}{T} + (k+2) \log (T) \), where \( T \) is the number of observations used for fitting model and \( k \) is the number of predictors in the model.
 For large values of \( T \), minimizing BIC is similar to leave-v-out cross-validation when \( v = T[1 - \frac{1}{\log(T) - 1}] \)
 BIC can also be defined in terms of likelihood:
 \( BIC = -2l(\beta) + \log(n) p \)
 Lower AIC/BIC indicates a better model.
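 A minimal sketch implementing the SSE-based formulas above (the inputs are hypothetical):

```python
import numpy as np

def aic_bic_from_sse(sse: float, T: int, k: int):
    """AIC and BIC from the sum of squared errors, for T observations
    and k predictors, following the formulas above."""
    aic = T * np.log(sse / T) + 2 * (k + 2)
    bic = T * np.log(sse / T) + (k + 2) * np.log(T)
    return aic, bic

aic, bic = aic_bic_from_sse(sse=42.0, T=100, k=3)
print(f"AIC={aic:.2f}, BIC={bic:.2f}")
```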