# Model Evaluation

by Chee Yee Lim

Posted on 2021-03-17

Collection of notes on model evaluation: a high-level overview of techniques used to evaluate model performance.

## Overview

• 3 general ways to evaluate a model:
• Compare a model with data (e.g. training, testing or verification data).
• Compare a model with another fitted model.
• Manually check model predictions by human experts.
• Residuals are what is left over after fitting a model.
• In many models, the residuals are equal to the difference between the observations and the corresponding fitted values.
• $$Residuals = y - \hat{y}$$
• Residuals are useful for checking whether a model has adequately captured the information in the data.
• Expected test error can be decomposed into the following 3 components:
• $$Expected\ Test\ Error = Bias^2 + Variance + Noise$$
• Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
• $$Bias = \mathbb{E}[\hat{y}] - y$$
• Variance is the variability of the model's prediction for a given data point, i.e. how much the prediction changes when the model is refit on different training sets.
• $$Var = \mathbb{E}[ ( \hat{y} - \mathbb{E}[\hat{y}] )^2 ]$$
• Noise is independent of the underlying model; it is the irreducible error inherent in the problem.
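The decomposition above can be checked numerically. The sketch below is an illustration under assumed settings, not part of the original notes: it repeatedly draws training sets around a hypothetical true value, fits a toy estimator (the sample mean scaled by a `shrinkage` factor), and measures the bias and variance of the resulting predictions. Because the error is measured against the true value rather than against noisy observations, the noise term does not appear, and the mean squared error splits exactly into $$Bias^2 + Variance$$. All names (`TRUE_VALUE`, `simulate_predictions`, `decompose`) are made up for this example.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 2.0    # hypothetical quantity we are trying to predict
NOISE_SD = 1.0      # standard deviation of the observation noise
N_POINTS = 10       # observations per simulated training set
N_DATASETS = 5000   # number of simulated training sets


def simulate_predictions(shrinkage):
    """Refit the toy model on many training sets and collect its predictions."""
    predictions = []
    for _ in range(N_DATASETS):
        sample = [random.gauss(TRUE_VALUE, NOISE_SD) for _ in range(N_POINTS)]
        # Toy estimator: the sample mean, shrunk towards zero.
        predictions.append(shrinkage * statistics.fmean(sample))
    return predictions


def decompose(predictions):
    """Return (bias, variance, mse), all measured against TRUE_VALUE."""
    mean_prediction = statistics.fmean(predictions)
    bias = mean_prediction - TRUE_VALUE
    variance = statistics.pvariance(predictions)
    mse = statistics.fmean([(p - TRUE_VALUE) ** 2 for p in predictions])
    return bias, variance, mse


for shrinkage in (1.0, 0.5):
    bias, variance, mse = decompose(simulate_predictions(shrinkage))
    print(f"shrinkage={shrinkage}: bias^2={bias ** 2:.4f} "
          f"variance={variance:.4f} mse={mse:.4f}")
```

With `shrinkage=1.0` the estimator is essentially unbiased but has the larger variance; with `shrinkage=0.5` the variance drops while the bias grows.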
• There is a tradeoff between a model's ability to minimize bias and variance.
• High bias usually stems from oversimplified model assumptions, while high variance stems from overly complex ones.

## List of Methods (ML perspective)

• For discrete outputs only
• Confusion matrix
• It lists out the number of true positives, true negatives, false positives and false negatives.

| | Negative (Predicted) | Positive (Predicted) |
|---|---|---|
| Negative (Actual) | True Negative | False Positive |
| Positive (Actual) | False Negative | True Positive |

• Accuracy
• It is defined as the number of correct predictions divided by the total number of predictions.
• $$Accuracy = \frac{True\ Positive + True\ Negative}{True\ Positive + True\ Negative + False\ Positive + False\ Negative}$$
• For imbalanced datasets, accuracy can be misleading, because a model that always predicts the majority class still scores highly. Precision and recall are better metrics in this case.
• Precision / Positive Predictive Value
• It shows how often the model is correct when its prediction is positive.
• $$Precision = \frac{True\ Positive}{True\ Positive + False\ Positive}$$
• $$Precision = \frac{True\ Positive}{Total\ Predicted\ Positive}$$
• It is a good metric to use when the cost of false positives is high.
• E.g. email spam detection.
• Recall / True Positive Rate
• It shows how many of the actual positive cases the model finds.
• $$Recall = \frac{True\ Positive}{True\ Positive + False\ Negative}$$
• $$Recall = \frac{True\ Positive}{Total\ Actual\ Positive}$$
• It is a good metric to use when the cost of false negatives is high.
• E.g. sick patient detection, fraud detection.
• F1 Score
• It is the harmonic mean of precision and recall.
• $$F1\ Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
• It is used when a balance between precision and recall is needed.
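The four metrics above all derive from the same confusion-matrix counts. The following is a minimal sketch in plain Python, assuming binary labels coded as 0 (negative) and 1 (positive); the function names and the toy data are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the notes prescribe.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(actual, predicted):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumption: report 0.0 when a ratio is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: 4 actual positives, 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

for name, value in classification_metrics(actual, predicted).items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.700
# precision: 0.667
# recall: 0.500
# f1: 0.571
```

In this toy example accuracy looks acceptable at 0.7 even though half of the actual positives are missed, which is the situation where precision and recall are more informative.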
• ROC / PR curves
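• ROC - trade-off between TPR (true positive rate) and FPR (false positive rate) as the classification threshold varies.
• PR - trade-off between precision and recall as the classification threshold varies.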
• AUROC (Area under the receiver operating characteristic)
• The area under the curve of TPR and FPR can be interpreted as the probability that the model ranks a random positive example higher than a random negative example.
• The larger the area under the curve (i.e. closer to 1), the better the model is performing.
• A model with an AUROC of 0.5 is useless, since it ranks examples no better than random guessing.
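The ranking interpretation of AUROC can be computed directly, without tracing the curve. The sketch below is a hypothetical helper (`auroc` is not a library function) that compares every positive example with every negative example and counts how often the positive one receives the higher score. Counting a tie as half a win is a common convention and is an assumption here. The pairwise loop is quadratic, so this is for illustration only.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

Of the four positive/negative pairs in this toy data, three are ranked correctly, giving 0.75.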
• For continuous outputs only
• $$R^2$$ / coefficient of determination
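• $$R^2$$ value shows the proportion of the total variance in the dependent variable that is explained by the model (i.e. independent variables).
• $$R^2 = \frac{\sum{( \hat{y}_{t} - \bar{y} )^2}}{\sum{( y_t - \bar{y} )^2}}$$
• For linear least-squares regression with an intercept, it is also equal to the square of the correlation between the observed values and the predicted values.
• In that setting, on the training data, the value ranges between 0 and 1. The higher the value, the better. On held-out data, or when computed as $$1 - \frac{\sum{( y_t - \hat{y}_{t} )^2}}{\sum{( y_t - \bar{y} )^2}}$$, it can be negative for a model that predicts worse than the mean.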
• The value of $$R^2$$ will never decrease when adding an extra predictor to the model and this can lead to overfitting.
• There are no set rules for what is a good $$R^2$$, so validating a model's performance on the test data is much better than measuring $$R^2$$ on the training data.
• Explained variance
• $$Explained\ variance = 1 - \frac{Var(y-\hat{y})}{Var(y)}$$
• Best score is 1.0.
• Mean absolute error (MAE)
• MAE is the average of the absolute differences between the actual and predicted values (i.e. the absolute errors).
• $$MAE = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} | y_i - \hat{y}_i |$$
• Its unit scale will be the same as the target variable.
• Mean squared error (MSE)
• MSE is the mean of the squared errors.
• $$MSE = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} ( y_i - \hat{y}_i )^2$$
• Root mean squared error (RMSE)
• RMSE (variation of MSE) indicates how close the predicted values are to the actual values.
• The lower the RMSE value, the better.
• RMSE is the square root of the mean of the squared errors, $$\sqrt{ \frac{1}{n} \sum_{i=1}^n ( y_i - \hat{y}_i )^2 }$$
• Its unit scale will be the same as the target variable.
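The continuous-output metrics above can be computed side by side from the same pair of vectors. This is a minimal sketch using only the standard library; the function name and the toy data are hypothetical. Note one deliberate substitution: $$R^2$$ is computed here as $$1 - \frac{SS_{res}}{SS_{tot}}$$, which coincides with the ratio form given above for least-squares fits with an intercept but also remains meaningful for arbitrary predictions.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R^2 and explained variance as a dict."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_y = statistics.fmean(y_true)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1.0 - ss_res / ss_tot  # 1 - SS_res / SS_tot form (see lead-in)

    # Explained variance = 1 - Var(y - y_hat) / Var(y)
    explained_variance = (
        1.0 - statistics.pvariance(errors) / statistics.pvariance(y_true)
    )
    return {"mae": mae, "mse": mse, "rmse": rmse,
            "r2": r2, "explained_variance": explained_variance}


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

for name, value in regression_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
# mae: 0.5000
# mse: 0.3750
# rmse: 0.6124
# r2: 0.9486
# explained_variance: 0.9572
```

Explained variance and $$R^2$$ differ here because the errors have a non-zero mean; the two agree when the mean error is zero.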

## List of Methods (Stats perspective)

• Statistics for assessing goodness-of-fit (applicable across any GLM)
• Pearson chi-square statistic $$X^2$$
• Deviance $$G^2$$
• Likelihood ratio test and statistic $$\Delta G^2$$
• Residual analysis (e.g. Pearson, deviance, adjusted residuals)
• Overdispersion (i.e. observed variance is larger than assumed variance, $$Var(Y) = \varphi \mu$$, where $$\varphi$$ is a scale parameter)
1. Adjust for overdispersion by estimating $$\varphi = \frac{X^2}{N - p}$$, where $$N$$ is the number of observations and $$p$$ is the number of fitted parameters.
2. Use another model, such as negative binomial regression in place of Poisson regression.
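The dispersion estimate from the Pearson chi-square statistic can be sketched as follows. This assumes a Poisson model, for which $$X^2 = \sum \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$$, and uses the simplest possible fit: an intercept-only model whose fitted mean is the sample mean (one parameter). The function names and the counts are invented for illustration.

```python
def pearson_chi_square(observed, fitted_means):
    """Pearson X^2 for a Poisson model: sum of (y - mu)^2 / mu."""
    return sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted_means))


def dispersion_estimate(observed, fitted_means, n_params):
    """Estimate the scale parameter phi = X^2 / (N - p)."""
    n = len(observed)
    return pearson_chi_square(observed, fitted_means) / (n - n_params)


# Hypothetical count data with many zeros and a few large values.
counts = [0, 1, 0, 12, 2, 9, 0, 15, 1, 0]

# Intercept-only Poisson fit: every fitted mean equals the sample mean.
mean_count = sum(counts) / len(counts)
fitted = [mean_count] * len(counts)

phi = dispersion_estimate(counts, fitted, n_params=1)
print(f"X^2 = {pearson_chi_square(counts, fitted):.1f}, phi = {phi:.2f}")
# X^2 = 74.0, phi = 8.22
```

A value of $$\varphi$$ well above 1, as in this toy data, suggests that the variance is much larger than a Poisson model assumes.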
• Adjusted $$R^2$$
• Tells how much of the total variance of the target outcome (i.e. dependent variable) is explained by the model (i.e. independent variables), adjusted for the number of predictors.
• $$Adjusted\ R^2 = 1 - \frac{(1-R^2)(N-1)}{N-p-1}$$, where $$N$$ is sample size and $$p$$ is the number of predictors.
• The higher the value, the better.
• $$R^2$$ never decreases when a new independent variable is added, even an uninformative one. Adjusted $$R^2$$ corrects for this by taking the number of input features into account.
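A short numeric illustration of the adjusted $$R^2$$ formula above. The helper name and the $$R^2$$ values are hypothetical, chosen to show that a model with a slightly higher $$R^2$$ but many more predictors can end up with a lower adjusted $$R^2$$.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Two hypothetical models fitted on the same 50 observations.
small_model = adjusted_r2(r2=0.80, n_samples=50, n_predictors=5)
large_model = adjusted_r2(r2=0.82, n_samples=50, n_predictors=20)

print(f"5 predictors:  {small_model:.4f}")   # 0.7773
print(f"20 predictors: {large_model:.4f}")   # 0.6959
```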
• Residual standard error
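• This is approximately the standard deviation of the residuals.
• $$\sigma_{res} = \sqrt{ \frac{ \sum (y - \hat{y})^2 }{ n-2 } }$$, where $$y$$ = observed value, $$\hat{y}$$ = estimated value, $$n$$ = number of observations. The $$n-2$$ denominator applies to simple linear regression (one predictor plus an intercept); with $$p$$ predictors it becomes $$n-p-1$$.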
• Likelihood ratio test
• The likelihood ratio test compares the log-likelihood of an unrestricted model (UM) with that of a restricted model (RM), in which some or all parameters/coefficients are set to 0.
• $$likelihood\ ratio\ test = -2(l(\beta_{RM}) - l(\beta_{UM}))$$
• The test requires the two models to be nested (the RM is a special case of the UM) and fitted to the same observations.
• A higher log-likelihood indicates a better fit, so a large test statistic is evidence in favour of the unrestricted model.
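A worked example of the likelihood ratio statistic, kept deliberately small. It assumes a Bernoulli model for 14 successes in 20 trials: the unrestricted model uses the maximum-likelihood success probability, while the restricted model fixes the logit intercept at 0, i.e. a probability of 0.5. Under the usual large-sample approximation the statistic follows a chi-square distribution with 1 degree of freedom (one restricted parameter), whose tail probability can be written with `math.erfc`. The function names and the data are hypothetical.

```python
import math


def bernoulli_log_likelihood(successes, trials, p):
    """Log-likelihood of the data under success probability p.

    The binomial coefficient is omitted: it is the same for both models
    and cancels in the likelihood ratio.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)


def likelihood_ratio_statistic(loglik_restricted, loglik_unrestricted):
    """-2 * (l(RM) - l(UM)), as in the formula above."""
    return -2.0 * (loglik_restricted - loglik_unrestricted)


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


successes, trials = 14, 20
ll_rm = bernoulli_log_likelihood(successes, trials, 0.5)
ll_um = bernoulli_log_likelihood(successes, trials, successes / trials)

lr = likelihood_ratio_statistic(ll_rm, ll_um)
p_value = chi2_sf_1df(lr)
print(f"l(RM)={ll_rm:.3f}  l(UM)={ll_um:.3f}  LR={lr:.3f}  p={p_value:.3f}")
```

The unrestricted model always has a log-likelihood at least as high as the restricted one, so the statistic is non-negative; the test asks whether the improvement is larger than chance alone would explain.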
• Deviance test
• Deviance test compares the fit of the proposed model (PM) with the fit of a saturated model (SM).
• $$deviance\ test = -2(l(\beta_{PM}) - l(\beta_{SM}))$$
• Lower deviance is better.
• Akaike Information Criterion (AIC) / Bayesian Information Criterion (BIC)
• Both measures are based on the log-likelihood but penalize the number of parameters included in the model.
• BIC penalizes the number of parameters more heavily than the AIC.
• The model chosen by BIC will be either the same as AIC, or one with fewer terms.
• AIC is defined as:
• $$AIC = T \log \frac{SSE}{T} + 2(k+2)$$, where $$T$$ is the number of observations used to fit the model, $$SSE$$ is the sum of squared errors and $$k$$ is the number of predictors in the model.
• There is a form of corrected AIC, which compensates for when AIC selects too many predictors for small values of $$T$$.
• $$AIC_c = AIC + \frac{ 2(k+2)(k+3) }{ T-k-3 }$$
• For large values of $$T$$, minimizing AIC is similar to minimizing the leave-one-out cross-validated MSE value.
• AIC can also be defined in terms of likelihood:
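• $$AIC = -2l(\beta) + 2p$$, where $$l(\beta)$$ is the maximized log-likelihood and $$p$$ is the number of estimated parameters.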
• BIC is defined as:
• $$BIC = T \log \frac{SSE}{T} + (k+2) \log (T)$$, where $$T$$ is the number of observations used to fit the model, $$SSE$$ is the sum of squared errors and $$k$$ is the number of predictors in the model.
• For large values of $$T$$, minimizing BIC is similar to leave-v-out cross-validation when $$v = T [1 - \frac{1}{(\log (T) - 1)}]$$
• BIC can also be defined in terms of likelihood:
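• $$BIC = -2l(\beta) + p \log (n)$$, where $$l(\beta)$$ is the maximized log-likelihood, $$p$$ is the number of estimated parameters and $$n$$ is the number of observations.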
• Lower AIC/BIC indicates a better model.
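The SSE-based definitions above translate directly into code. The sketch below compares two hypothetical regression models fitted to the same 30 observations; the helper names and the SSE values are invented for illustration. The larger model has the lower SSE, but both criteria still prefer the smaller one once the penalty is added.

```python
import math


def aic(sse, n_obs, n_predictors):
    """AIC = T log(SSE / T) + 2(k + 2)."""
    return n_obs * math.log(sse / n_obs) + 2 * (n_predictors + 2)


def aicc(sse, n_obs, n_predictors):
    """Corrected AIC = AIC + 2(k + 2)(k + 3) / (T - k - 3)."""
    k = n_predictors
    return aic(sse, n_obs, k) + 2 * (k + 2) * (k + 3) / (n_obs - k - 3)


def bic(sse, n_obs, n_predictors):
    """BIC = T log(SSE / T) + (k + 2) log(T)."""
    return (n_obs * math.log(sse / n_obs)
            + (n_predictors + 2) * math.log(n_obs))


T = 30
# Hypothetical candidates: (label, number of predictors, SSE).
candidates = [("small", 2, 120.0), ("large", 5, 110.0)]

for label, k, sse in candidates:
    print(f"{label}: AIC={aic(sse, T, k):.2f} "
          f"AICc={aicc(sse, T, k):.2f} BIC={bic(sse, T, k):.2f}")
# small: AIC=49.59 AICc=52.79 BIC=55.19
# large: AIC=52.98 AICc=58.07 BIC=62.79
```

The BIC penalty per parameter, $$\log (T)$$, exceeds the AIC penalty of 2 whenever $$T > e^2 \approx 7.4$$, which is why BIC tends to select the same model as AIC or a smaller one.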