by Chee Yee Lim

Posted on 2021-03-17

Collection of notes on model interpretation - high level overview of techniques used to interpret model prediction.

- Interpretability is the degree to which a human can understand the cause of a decision.
- Why interpretability for machine learning is important
- For certain problems, it is not enough to get the prediction right. The model must also explain how it came to the prediction.
- Interpretability is especially important in a high risk environment.
- For example, it may not be important to explain why a recommender system is recommending a set of songs to a user. But it is very important for a model to explain why it is suggesting that a patient is suffering from a specific disease.

- Methods to make machine learning interpretable
- (Intrinsic) Use interpretable models, such as linear regression, logistic regression and decision trees.
- (Post-hoc) Use post-hoc model-agnostic interpretation tools that can be applied to any supervised models.
- Model-agnostic methods work by changing the input of the machine learning model and measuring changes in the prediction output.
- Some methods explain global model behaviour (e.g. permutation feature importance); others explain individual predictions (e.g. Shapley values).

- The advantage of using post-hoc approach is its flexibility. This means that users are free to use any machine learning model that best solves a task, and can use any interpretation method on the model.
- There is typically an inverse relationship between explainability and model performance.
- This is especially true for intrinsically interpretable models, which tend to have weaker performance.

- Linear regression
- The linearity of the learned relationship is what makes the interpretation easy (although it also limits the power of the model).
- The weight of each feature tells us its importance. Estimated weights also come with confidence intervals.
- The intercept \( \beta_0 \) only has meaning when the features have been standardised (mean of zero, standard deviation of one). In this case the intercept reflects the predicted outcome of an instance where all features are at their mean value.
- The importance of a feature in a linear regression model can be measured by the absolute value of its t-statistic.
- \( t_{\hat{\beta}_j} = \frac{ \hat{\beta}_j }{ SE(\hat{\beta}_j) } \)
- The formula tells us that a feature is more important if it has higher weight but lower variance in the weight.

- A weight plot or an effect plot can be used to visualise a linear regression model.
- A weight plot is a plot of the estimated weights of all features.
- An effect plot is a plot of the estimated weights multiplied by the feature values.
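
The t-statistic and the effects described above can be sketched for the simplest case of a single feature. This is a minimal illustration on made-up numbers, not a replacement for a statistics package; the function name `simple_ols_t_statistic` and the data are assumptions.

```python
import math

def simple_ols_t_statistic(x, y):
    """Fit y = b0 + b1 * x by least squares; return (b0, b1, se_b1, t_b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual variance with n - 2 degrees of freedom (two estimated weights).
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b0, b1, se_b1, b1 / se_b1

# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, se_b1, t_b1 = simple_ols_t_statistic(x, y)

# Effects as used in an effect plot: weight times feature value, per instance.
effects = [b1 * xi for xi in x]
```

A large weight with a small standard error gives a large absolute t-statistic, which matches the reading of the formula given above.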

- Logistic regression
- The interpretation of the weights in logistic regression differs from the interpretation of the weights in linear regression, since the outcome in logistic regression is a probability between 0 and 1.
- The weights do not influence the probability linearly due to the logistic function.
- Logistic regression is defined as:
- \( P(y=1) = \frac{1}{ 1 + \exp ( -( \beta_0 + \beta_1 x_1 + ... + \beta_p x_p) ) } \)

- To better interpret logistic regression, we need to reformulate the equation such that linear terms are on the right side of the formula.
- \( \log ( \frac{P(y=1)}{1 - P(y=1)} ) = \log ( odds ) = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p \)
- This is known as the log odds (LHS of the formula) representation.
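
One consequence of the log odds form is that increasing a feature by one unit multiplies the odds by \( \exp(\beta_j) \). The sketch below checks this numerically; the weights, intercept and instance are hypothetical values chosen for illustration.

```python
import math

def predict_proba(weights, intercept, x):
    """Logistic regression probability P(y=1) for one instance."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

# Hypothetical fitted weights and instance.
weights = [0.8, -0.5]
intercept = -1.0
x = [2.0, 1.0]
x_plus = [3.0, 1.0]  # same instance with the first feature increased by one

odds_ratio = odds(predict_proba(weights, intercept, x_plus)) / odds(
    predict_proba(weights, intercept, x)
)
# odds_ratio matches math.exp(weights[0]) up to floating point error.
```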

- Generalized linear model (GLM)
- For outcomes that are not Gaussian distributed, GLM can be used.
- The core concept of any GLM is to keep the weighted sum of the features, but allow non-Gaussian outcome distributions by connecting the expected mean of this distribution and the weighted sum through a possibly nonlinear function (i.e. link function).
- The assumed distribution together with the link function determines how the estimated weights are interpreted.
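
As a rough illustration of the role of the link function, the sketch below maps the same weighted sum to an expected mean under three common links. The dictionary `INVERSE_LINKS`, the function `glm_mean` and all numbers are assumptions made for this example.

```python
import math

# Inverse link functions: they map the weighted sum to the expected mean.
INVERSE_LINKS = {
    "identity": lambda eta: eta,                        # Gaussian outcome
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # Bernoulli outcome
    "log": lambda eta: math.exp(eta),                   # Poisson count outcome
}

def glm_mean(weights, intercept, x, link):
    """Expected outcome of a GLM for one instance under the given link."""
    eta = intercept + sum(w * xi for w, xi in zip(weights, x))
    return INVERSE_LINKS[link](eta)

# Hypothetical weights and instance; the weighted sum here is 0.1.
weights, intercept, x = [0.5, -0.25], 0.1, [2.0, 4.0]
means = {link: glm_mean(weights, intercept, x, link) for link in INVERSE_LINKS}
```

The weighted sum is identical in all three cases; only the link changes how a weight translates into a change of the expected outcome.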

- Decision tree
- To interpret a decision tree, plot it. Starting from the root node, move to the next nodes; each node tells us which feature and threshold are used to split the data into further subsets.
- The importance of a feature in a decision tree can be computed by summing how much each split on that feature reduces the variance (regression) or Gini index (classification) compared to the parent node.
- The sum of all importances is then scaled to 100.

- Decision trees are very interpretable, as long as they are short. The number of terminal nodes increases quickly with depth.
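
The importance calculation described above can be sketched for a tiny hand-built classification tree. The tree, the feature names `x1` and `x2`, and the weighting of each split by the share of samples reaching the node are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - children

# Hypothetical tree: the root splits on x1, its left child splits on x2.
root = ["a", "a", "a", "b", "a", "b", "b", "b"]
left1, right1 = ["a", "a", "a", "b"], ["a", "b", "b", "b"]
left2, right2 = ["a", "a", "a"], ["b"]
splits = [("x1", root, left1, right1), ("x2", left1, left2, right2)]

raw = {}
for feature, parent, left, right in splits:
    share = len(parent) / len(root)  # fraction of samples reaching this node
    raw[feature] = raw.get(feature, 0.0) + share * gini_decrease(parent, left, right)

# Scale so that all importances sum to 100.
total = sum(raw.values())
importance = {feature: 100.0 * value / total for feature, value in raw.items()}
```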

- Naive Bayes classifier
- For each feature, Naive Bayes classifier calculates the probability for a class depending on the value of the feature.
- This is done for each feature independently, which is equivalent to a strong (i.e. naive) assumption of conditional independence of the features.

- Naive Bayes is an interpretable model because of the independence assumption, which makes it easy to understand how much each feature contributes to a prediction.

- k-nearest neighbours
- k-nearest neighbours does not learn any parameters and it captures only local patterns.
- The only way to interpret the model is to look at the k neighbours used for a prediction and try to understand them.

- Partial dependence plot (PDP)
- PDP shows the marginal effect one or two features have on the predicted outcome of a machine learning model.
- PDP can show whether the relationship between the target and a feature is linear, monotonic or more complex.
- PDP works by marginalizing the machine learning model output over the distribution of the features in set C, so that the function shows the relationship between the features in set S that we are interested in and the predicted outcome.
- PDP is defined as:
- \( PDP_S = \frac{1}{n} \sum_{i=1}^n \hat{f} (x_S, x_{Ci}) \), where \( \hat{f} \) is the machine learning model, \( x_S \) are features of interest, \( x_C \) are other features.
- PDP is estimated by calculating averages in the training data (i.e. Monte Carlo method).

- PDP at a particular feature value represents the average prediction if we force all data points to assume that feature value.
- Pros:
- Easy to understand for lay people.
- Interpretation is clear (for uncorrelated features).
- Easy to implement.
- Has causal interpretation w.r.t. model (not real world).

- Cons:
- Maximum number of features to be examined is 2.
- The PD plot needs to be interpreted in conjunction with the feature distribution, because we may otherwise overinterpret regions with almost no data.
- PDP assumes independence between features.
- PDP cannot be trusted when features are strongly correlated.
- Heterogeneous effects might be hidden because PD plot only shows the average marginal effects.
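
The Monte Carlo estimate in the PDP definition above can be sketched in a few lines. The function `model` is a hypothetical stand-in for a fitted black box, and the data and grid are made up.

```python
def model(x1, x2):
    """Hypothetical stand-in for a fitted black box model."""
    return 2.0 * x1 + x1 * x2

# Made-up dataset of (x1, x2) pairs.
data = [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0), (4.0, 3.0)]

def partial_dependence(predict, data, grid):
    """For each grid value, force x1 to that value for every data point,
    keep the observed x2, and average the predictions."""
    return [sum(predict(v, x2) for _, x2 in data) / len(data) for v in grid]

grid = [1.0, 2.0, 3.0]
pdp = partial_dependence(model, data, grid)
# pdp == [3.5, 7.0, 10.5], i.e. 2*v + mean(x2)*v with mean(x2) = 1.5
```

Note that every grid value is combined with every observed `x2`, including combinations that never occur together in the data. This is where the independence assumption listed in the cons comes from.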

- Individual conditional expectation (ICE) plot
- ICE is the equivalent of a PDP for individual data points. (i.e. plot distributions instead of average)
- ICE plot visualizes the dependence of the prediction on a feature for each data point separately, resulting in one line per data point, compared to one line overall in PDP.
- Pros:
- More intuitive than PDP to understand.
- Able to uncover heterogeneous relationships (unlike PDP).
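
A small sketch of how ICE curves can reveal what a PDP hides. The model below is a hypothetical pure interaction, chosen so that the individual curves point in opposite directions while their average is flat.

```python
def model(x1, x2):
    """Hypothetical black box with a pure interaction between x1 and x2."""
    return x1 * x2

# Made-up dataset of (x1, x2) pairs; x2 is -1 for half the points and +1 for the rest.
data = [(0.5, -1.0), (1.5, -1.0), (0.5, 1.0), (1.5, 1.0)]
grid = [0.0, 1.0, 2.0]

def ice_curves(predict, data, grid):
    """One curve per data point: vary x1 over the grid, keep that point's x2."""
    return [[predict(v, x2) for v in grid] for _, x2 in data]

curves = ice_curves(model, data, grid)

# The PDP is the pointwise average of the ICE curves.
pdp = [sum(curve[k] for curve in curves) / len(curves) for k in range(len(grid))]
```

Here two curves fall with `x1` and two rise, while `pdp` is zero everywhere.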

- Accumulated local effects (ALE) plot
- ALE describes how features influence the prediction of a machine learning model on average, by examining small window differences in predictions instead of averages.
- ALE plots are centered at zero. This makes them easy to interpret, because the value at each point of the ALE curve is the difference to the mean prediction.
- As a general rule, use ALE instead of PDP.
- Pros:
- Unbiased, which means they still work when features are correlated.
- Faster to compute than PDP.
- Interpretation is clear. Conditional on a given value, the relative effect of changing the feature on the prediction can be read from the ALE plot.

- Cons:
- A bit shaky (many small ups and downs) with a high number of intervals.
- Unlike PDP, ALE plots are not accompanied by ICE curves.
- Second-order ALE estimates have varying stability across the feature space.
- Second-order effect plots can be annoying to interpret, as you always have to keep the main effects in mind.
- The implementation of ALE plots is much more complex and less intuitive.
- Interpretation remains difficult when features are strongly correlated.
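
The "small window differences" idea behind ALE can be sketched for a single feature as follows. This is a simplified first-order estimate under stated assumptions: the model, the data, the interval edges and the function name `ale_first_order` are all made up for illustration.

```python
def model(x1, x2):
    """Hypothetical stand-in for a fitted black box model."""
    return 2.0 * x1 + x1 * x2

# Made-up data in which x1 and x2 are perfectly correlated.
data = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5), (3.5, 3.5)]
edges = [0.0, 1.0, 2.0, 3.0, 4.0]  # interval boundaries for x1

def ale_first_order(predict, data, edges):
    """Centred first-order ALE of x1, evaluated at the upper edge of each interval."""
    local_effects, counts = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Points with x1 in (lo, hi]; the first interval also includes its lower edge.
        members = [(a, b) for a, b in data if lo < a <= hi or a == lo == edges[0]]
        # Prediction difference across the interval, using each point's own x2.
        diffs = [predict(hi, b) - predict(lo, b) for _, b in members]
        local_effects.append(sum(diffs) / len(diffs) if diffs else 0.0)
        counts.append(len(members))
    # Accumulate the local effects over the intervals.
    accumulated, running = [], 0.0
    for effect in local_effects:
        running += effect
        accumulated.append(running)
    # Centre so that the average effect over the data is zero.
    mean_effect = sum(a * c for a, c in zip(accumulated, counts)) / sum(counts)
    return [a - mean_effect for a in accumulated]

ale = ale_first_order(model, data, edges)
```

Unlike the PDP, each difference only uses data points that actually fall into the interval, so unrealistic feature combinations are avoided.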

- Friedman's H-statistic
- H-statistic estimates the strength of feature interaction, by measuring how much of the variation of the prediction depends on the interaction of the features.
- The H-statistic can be slow to calculate, which means using a subset of the data may be required.
- Pros:
- Supported by an underlying theory through the partial dependence decomposition.
- Meaningful interpretation - the interaction is defined as the share of variance that is explained by the interaction.
- Dimensionless, which makes comparison across features or across models easy.
- Detects all kinds of interactions, regardless of the nature of interactions.
- Can analyse arbitrary higher interactions such as interaction strength between 3 or more features.

- Cons:
- Computationally expensive.
- Results can be unstable if only a subset of the data is used.
- Cannot be used if the inputs are pixels (i.e. images).
- May be difficult to tell if an interaction is significantly greater than 0.
- Assumes features are independent.
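
A minimal sketch of the two-feature H-statistic for a model with exactly two inputs, in which case the joint partial dependence is the centred prediction itself. The function name, the two toy models and the data are assumptions; the returned value is the squared statistic.

```python
def h_squared_two_features(predict, data):
    """Share of the prediction variance that the additive partial dependence
    functions cannot explain, for a model of exactly two features."""
    n = len(data)

    def centred(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    # Partial dependence of each feature, evaluated at the observed points.
    pd_1 = centred([sum(predict(x1, b) for _, b in data) / n for x1, _ in data])
    pd_2 = centred([sum(predict(a, x2) for a, _ in data) / n for _, x2 in data])
    # With only two features, the joint partial dependence is the prediction.
    pd_12 = centred([predict(x1, x2) for x1, x2 in data])

    numerator = sum((f - a - b) ** 2 for f, a, b in zip(pd_12, pd_1, pd_2))
    denominator = sum(f ** 2 for f in pd_12)
    return numerator / denominator

# Made-up data on a small grid.
data = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]

h_additive = h_squared_two_features(lambda x1, x2: 2.0 * x1 + 3.0 * x2, data)
h_interaction = h_squared_two_features(lambda x1, x2: x1 * x2, data)
```

The additive model gives 0 (no interaction) and the pure product gives 1 (all variance comes from the interaction). Each partial dependence value needs a pass over the data, which is why the statistic is expensive on large datasets.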

- Permutation feature importance
- Permutation feature importance measures the increase in the prediction error of the model after we permuted the feature's values.
- A feature is important if shuffling its values increases the model error; while a feature is unimportant if shuffling its values leaves the model error unchanged.
- Pros:
- Dimensionless, so comparable across different problems.
- Takes into account all interactions with other features. This is because by permuting the feature, you also destroy the interaction effects with other features.
- Does not require retraining the model.

- Cons:
- Unclear if feature importance should be calculated using training or testing data.
- Linked to the error of the model. Model variance (explained by the features) and feature importance correlate strongly when the model generalises well.
- Need access to the true outcome.
- Results might vary greatly between runs because of the random shuffling of the data.
- Can be biased by unrealistic data points if features are correlated. E.g. height and weight of a person. By shuffling one of the features, we may create physically impossible combinations which we use to measure the importance.
- Adding a correlated feature can decrease the importance of the associated feature by splitting the importance between both features.
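
The permutation procedure can be sketched as below. The model, the data, the choice of mean squared error and the use of the error difference (rather than the ratio) are assumptions for illustration; several shuffles are averaged to reduce the run-to-run variation mentioned in the cons.

```python
import random

def mse(predict, rows, targets):
    """Mean squared error of the model on the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, column, repeats=5, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    increases = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in place of the original.
        permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
        increases.append(mse(predict, permuted, targets) - baseline)
    return sum(increases) / repeats

# Hypothetical model that uses only the first feature.
def model(row):
    return 3.0 * row[0]

# Made-up data: targets follow 3 * x1 plus a little noise; x2 is irrelevant.
rows = [(1.0, 5.0), (2.0, 3.0), (3.0, 8.0), (4.0, 1.0),
        (5.0, 7.0), (6.0, 2.0), (7.0, 6.0), (8.0, 4.0)]
targets = [3.1, 5.9, 9.2, 11.8, 15.1, 17.9, 21.2, 23.8]

importance = {
    "x1": permutation_importance(model, rows, targets, column=0),
    "x2": permutation_importance(model, rows, targets, column=1),
}
```

Shuffling `x1` raises the error, while shuffling `x2` leaves it unchanged because the model never reads that column. No retraining is involved, but the true targets are required.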

- Global surrogate
- A global surrogate model is an interpretable model that is trained to approximate the predictions of a black box model. We then draw conclusions about the black box model by interpreting the surrogate model.
- How well a global surrogate model approximates a black box model can be measured by \( R^2 \).
- Pros:
- Flexible. Can exchange both the interpretable model and the underlying black box model.

- Cons:
- Note that the surrogate model never sees the real outcome, as it is trained on the black box predictions.
- Surrogate model may be very close for one subset of the dataset, but widely divergent for another subset.
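
A minimal sketch of the surrogate workflow, assuming a hypothetical black box and a one-feature linear surrogate. The surrogate is fitted to the black box predictions, and \( R^2 \) measures how well it replicates them.

```python
def black_box(x):
    """Hypothetical black box model."""
    return x ** 2

# Made-up inputs; the surrogate is trained on black box predictions, not true labels.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
bb_preds = [black_box(x) for x in xs]

# Fit a linear surrogate by least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_p = sum(bb_preds) / n
slope = sum((x - mean_x) * (p - mean_p) for x, p in zip(xs, bb_preds)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_p - slope * mean_x
surrogate_preds = [intercept + slope * x for x in xs]

# R^2 between surrogate predictions and black box predictions.
sse = sum((s - p) ** 2 for s, p in zip(surrogate_preds, bb_preds))
sst = sum((p - mean_p) ** 2 for p in bb_preds)
r_squared = 1.0 - sse / sst
```

An \( R^2 \) close to 1 means the surrogate mimics the black box well on these inputs. It says nothing about how well either model fits the real outcome.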

- Local surrogate (LIME)
- Local surrogate models are interpretable models that are used to explain individual predictions of black box models.
- A local surrogate model is trained on a subset of the original data combined with locally perturbed data, to understand how the black box model deals with specific data points.
- E.g. it can be used to understand why an image is classified as a bagel or a strawberry.
- Pros:
- Useful for giving simple human-friendly explanations.

- Cons:
- Instability of the explanations.
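
The local surrogate idea can be sketched in one dimension: perturb the instance, weight the perturbed points by their proximity, and fit a weighted linear model. This is a heavily simplified illustration and not the LIME library itself; the black box, the fixed perturbation offsets, the exponential kernel and its width are all assumptions.

```python
import math

def black_box(x):
    """Hypothetical black box model."""
    return x ** 2

def local_linear_surrogate(predict, x0, offsets, kernel_width=0.5):
    """Weighted least squares line fitted to perturbed points around x0."""
    samples = [x0 + d for d in offsets]
    preds = [predict(x) for x in samples]
    # Points closer to x0 get a larger weight.
    weights = [math.exp(-(d ** 2) / kernel_width ** 2) for d in offsets]
    w_sum = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, samples)) / w_sum
    mean_y = sum(w * y for w, y in zip(weights, preds)) / w_sum
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, samples))
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, samples, preds))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Explain the prediction at x0 = 2 using symmetric, deterministic perturbations.
offsets = [-0.5, -0.25, 0.0, 0.25, 0.5]
slope, intercept = local_linear_surrogate(black_box, 2.0, offsets)
```

The local slope is about 4, which is the derivative of the black box at `x0 = 2`. In practice the perturbations are random, so different runs or kernel widths can yield different explanations, which is the instability listed under cons.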

- Scoped rules (anchors)
- The anchors approach uses a perturbation-based strategy to generate local explanations for predictions of black box models. However, instead of the surrogate models used by LIME, the resulting explanations are expressed as IF-THEN rules.
- This approach uses reinforcement learning techniques instead of fitting surrogate models.
- Pros:
- Rules are easy to interpret.

- Cons:
- Many hyperparameters to tune.

- Shapley values
- The Shapley value is a method for assigning payouts to players depending on their contribution to the total payout.
- The Shapley value is the average marginal contribution of a feature value across all possible coalitions.
- To interpret a Shapley value: given the current set of feature values, the contribution of a feature value to the difference between the actual prediction and the mean prediction is the estimated Shapley value.
- Pros:
- The difference between the prediction and the average prediction is fairly distributed among the feature values of the instance. (Shapley value might be the only method to deliver a full explanation.)
- The Shapley value allows contrastive explanations.
- Only explanation method with a solid theory.

- Cons:
- Requires a lot of computing time.
- Can be misinterpreted. The Shapley value of a feature value is not the difference of the predicted value after removing the feature from the model training.
- Always uses all the features when making explanations. It does not work with a subset of features.
- Returns a simple value per feature with no prediction model.
- Similar to other permutation-based methods, it suffers from inclusion of unrealistic data instances when features are correlated.
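
The "average marginal contribution across all possible coalitions" can be computed exactly when there are only a few features. In the sketch below the model, the instance and the baseline are hypothetical, and absent features are replaced by a single baseline value instead of being averaged over a dataset, which is a simplification.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values given a coalition value function."""
    phi = []
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {j}) - value(coalition))
        phi.append(total)
    return phi

def model(x):
    """Hypothetical black box: one additive term and one interaction."""
    return 2.0 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # stand-in for "feature absent" (an assumption)

def value(coalition):
    """Prediction when only the features in the coalition take the instance's values."""
    x = [instance[k] if k in coalition else baseline[k] for k in range(len(instance))]
    return model(x)

phi = shapley_values(value, len(instance))
# phi is approximately [2.0, 3.0, 3.0]
```

The values sum to the difference between the prediction for the instance and the baseline prediction (8 minus 0), and the interaction term is split equally between the two features involved. The number of coalitions doubles with each added feature, which explains the computing time listed under cons.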

- SHapley Additive exPlanations (SHAP)
- SHAP computes Shapley values and represents them as an additive feature attribution method, i.e. a linear model.
- Pros:
- All advantages of Shapley values apply.
- Connects LIME and Shapley values.
- Fast implementation for tree-based models (TreeSHAP).

- Cons:
- All disadvantages of Shapley values also apply.
- TreeSHAP can produce unintuitive feature attributions.