by Chee Yee Lim

Posted on 2021-05-01

Collection of notes on data preprocessing - a high-level overview of the main topics in preparing data for modelling.

- Data Properties
- Data Preprocessing Steps
- Dealing with missing data
- Dealing with outliers
- Handling categorical data
- Normalising/transforming data
- Handling non-independent data
- Handling multicollinearity
- Checking autocorrelations (time series)
- Ensuring stationarity (time series)
- Ensuring homogeneity of variance (homoscedasticity)
- Handling class imbalance in data

- Data properties
- Nominal scale
- Categorical, with no ordering between levels.

- Ordinal scale
- Categorical, with ordering between levels.
- Differences between ranks convey direction but not magnitude.

- Interval scale
- Numeric values where differences between them are meaningful (magnitude is preserved).
- There is no true zero point, so ratios of values are not meaningful (e.g. temperature in °C).

- Ratio scale
- Discrete or continuous values. Preserves information on magnitude and has a true zero point (e.g. weight, duration).

- Dealing with missing data
- Solutions
- Remove observations with missing values.
- Replace with summary values (e.g. quantitative - mean, median; categorical - mode).
- Replace with forward fill values (on time series data).
- Use a predictive model to predict missing values.
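The summary-value and forward-fill strategies above can be sketched with pandas (the column names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, None, 31.0, 40.0],             # quantitative column
    "city": ["london", "paris", None, "paris"],  # categorical column
})

# Quantitative: replace missing values with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical: replace missing values with the mode.
df["city"] = df["city"].fillna(df["city"].mode()[0])

# On time series data, forward fill would propagate the last seen value instead:
# df["age"] = df["age"].ffill()
```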

- Dealing with outliers
- Tests
- Standard deviations
- Interquartile range / Box plot method
- Any value more than 1.5*IQR above the upper quartile or below the lower quartile is an outlier.

- Normalised residual vs leverage plot
- Observations with both high leverage and a high residual are outliers; this can be confirmed by running a Bonferroni outlier test.
- Leverage measures how far an observation's independent variable values are from those of the other observations.

- Cook's D
- Tests for the influence of an observation on the fitted model.
- Can be combined into an influence plot (residual vs leverage).

- Clustering using DBScan
- Look for unclustered data points and/or unexpected clusters.

- Generative adversarial network (e.g. GAN-AD)
- Trained on regular data to learn the normal data distribution; newly observed data is then flagged as anomalous when the generator and discriminator disagree on it.

- Solutions
- Remove outlier observations.
- Replace outlier values with a maximum/minimum cap on values (i.e. winsorisation).
- Keep outliers, if they represent rare but important observations.
- Can be improved by adding an additional dummy variable that corresponds to outlier status (i.e. 1 for outlier, 0 for non-outlier).
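The IQR rule and winsorisation described above can be combined in a short numpy sketch (the data are made up):

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 50.0])  # 50.0 is an obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the 1.5*IQR fences as outliers.
outliers = (x < lower) | (x > upper)

# Winsorisation: cap values at the fences instead of dropping them.
x_wins = np.clip(x, lower, upper)
```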

- Handling categorical data
- Solutions
- Convert into dummy variables with one-hot encoding.
- Always use one less (n-1) dummy variable than the number of categories; otherwise the design matrix will not have full column rank (the dummy variable trap).

- Convert into special categorical data types.
- Useful for models such as random forests that perform worse with one-hot encoding.

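A minimal illustration of n-1 dummy coding with pandas, where `drop_first=True` drops the first category as the baseline:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# drop_first=True keeps n-1 dummies to avoid the dummy variable trap.
dummies = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)
# Three categories -> two dummy columns ("blue" becomes the baseline).
```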

- Normalising/transforming data
- Tests (for normal distribution)
- Q-Q (quantile-quantile) plot
- Observed and expected values should form a straight diagonal line.

- Skewness and kurtosis
- Skewness between \( \pm \) 2, kurtosis between \( \pm \) 7.

- Shapiro-Wilk’s W test
- Kolmogorov-Smirnov test

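The skewness/kurtosis thresholds and the Shapiro-Wilk test above can be checked with scipy, sketched here on synthetic normal data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=500)

skew = stats.skew(x)
kurt = stats.kurtosis(x)  # excess kurtosis; roughly 0 for normal data

# Shapiro-Wilk: a p-value above 0.05 fails to reject normality.
w_stat, p_value = stats.shapiro(x)
```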
- Solutions
- Min-max scaling (per variable/data).
- Only handles scale; it does not address skewness.

- Standardisation
- Transform variables to have zero mean and standard deviation of one (i.e. into z-scores).
- Works best for normally distributed data.

- Log transformation
- Transform values into the log scale by logging the values. \( w = \log{y} \)
- This transformation is easy to interpret because changes in a log value are relative changes on the original scale.
- It also constrains the values to stay positive on the original scale.

- Power transformation
- Apply power (square, cube etc) on each value. \( w = y^p \)

- Box-Cox transformation
- It is a family that includes both logarithms and power transformations.
- It has a parameter \( \lambda \) and is defined as follows:
- \( w = \begin{cases} \log{y} &\text{if $\lambda=0$;} \\ \frac{(y^\lambda-1)}{\lambda} &\text{otherwise} \end{cases} \)

- The value of \( \lambda \) can be determined by computing the correlations of data transformed with multiple \( \lambda \) values vs a normal distribution.
- The Box-Cox normality plot can then be made, which contains the correlation coefficients vs \( \lambda \) values.
- The \( \lambda \) value that gives the maximum correlation should be chosen as the ideal value to perform transformation.

- After forecasting, it is important to reverse-transform the forecasts back onto the original scale.
- The reverse Box-Cox transformation is given by:
- \( y = \begin{cases} \exp{w} &\text{if $\lambda=0$;} \\ (\lambda w + 1)^\frac{1}{\lambda} &\text{otherwise} \end{cases} \)

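A sketch of the Box-Cox round trip with scipy; note that by default `scipy.stats.boxcox` picks \( \lambda \) by maximum likelihood, while `scipy.stats.boxcox_normmax(y, method='pearsonr')` corresponds to the correlation-based approach described above:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)  # positive, right-skewed data

# Transform; lambda is estimated when not supplied.
w, lam = stats.boxcox(y)

# Reverse the transformation back onto the original scale.
y_back = inv_boxcox(w, lam)
```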

- Handling non-independent data
- Solutions
- Mixed effect models
- Fit both fixed effects (the usual beta coefficients) and random effects (additional group-level variables).

- Generalised estimation equation (GEE)
- Accounts for a possibly unknown correlation structure between outcomes.


- Handling multicollinearity
- Overview
- Multicollinearity is a situation where one or more of the independent variables are strongly correlated with each other.
- In such a situation, we should use only one variable among each set of correlated independent variables.

- Tests
- Correlation matrix of independent variables
- Anything higher than 0.8 is highly correlated.
- Note that a correlation matrix tests for correlation between one pair of variables at a time, so it does not take into account interactions among multiple variables.

- Variance inflation factor (VIF)
- If VIF > 10, then high degree of correlation exists.
- If VIF > 1 but < 5, moderate correlation exists.
- If VIF = 1, this means there is no correlation.
- \( VIF_{i} = \frac{1}{1-R_{i}^2} \), where \( R_{i}^2 \) is the coefficient of determination from regressing variable \( X_{i} \) on all the other independent variables.

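The VIF formula can be computed directly from its definition; a numpy sketch on synthetic data with two nearly collinear variables (in practice `variance_inflation_factor` from statsmodels computes the same quantity):

```python
import numpy as np

def vif(X, i):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing X_i on the others."""
    y = X[:, i]
    A = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])  # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)  # nearly collinear with x1
X = np.column_stack([x1, x2, x3])
# vif(X, 0) and vif(X, 2) come out large (>10); vif(X, 1) stays near 1.
```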
- Solutions
- Use dimensionality reduction transformation on data
- By definition, the dimensions of dimensionally reduced data are orthogonal (e.g. PCA).

- Remove variables with large VIF (>10)
- Has to be done iteratively, as the VIFs of all variables change when a variable is included or excluded.

- Run L2 regularisation during model fitting
- Use condition number/indexes
- Looks at a set of variables, rather than individual variables.


- Checking autocorrelations (time series)
- Tests
- Autocorrelation and partial autocorrelation function (ACF/PACF) plots
- Durbin Watson test
- The test looks for the presence of first order autocorrelation (i.e. \( AR(1) \)) between the residuals.
- The test statistic is calculated using the formula: \( DW = \frac{ \sum_{t=2}^T ( e_{t} - e_{t-1} )^2 }{ \sum_{t=1}^T e_{t}^2 } \), where \( e_{t} \) are residuals from an OLS linear regression.
- It is considered archaic and requires a special statistics table to interpret.
- It always has a value between 0 and 4.
- A value of 2 means no autocorrelation.
- Values between 1.5 and 2.5 are generally considered acceptable.
- Values below 2 indicate positive autocorrelation.
- Values above 2 indicate negative autocorrelation (which is rare in time series).

- Lagrange Multiplier test
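The Durbin-Watson statistic above is easy to compute directly; a sketch comparing independent residuals with positively autocorrelated AR(1) residuals:

```python
import numpy as np

def durbin_watson(e):
    """DW = sum_{t=2}^{T} (e_t - e_{t-1})^2 / sum_{t=1}^{T} e_t^2."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)
e_white = rng.normal(size=500)  # independent residuals -> DW near 2

e_ar1 = np.empty(500)           # AR(1) residuals -> DW well below 2
e_ar1[0] = rng.normal()
for t in range(1, 500):
    e_ar1[t] = 0.9 * e_ar1[t - 1] + rng.normal()
```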

- Solutions
- Data differencing (i.e. take the absolute or percentage difference between consecutive values).
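Both flavours of differencing are one-liners in pandas, shown here on a made-up series growing 10% per step:

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 121.0, 133.1])

abs_diff = s.diff()        # absolute difference between consecutive values
pct_diff = s.pct_change()  # percentage difference between consecutive values
# abs_diff grows over time, while pct_diff is a constant 0.10.
```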

- Ensuring stationarity (time series)
- Tests
- Dickey Fuller test
- The null hypothesis is that the data are non-stationary (a unit root is present), so a test result of p-value > 0.05 means the data cannot be declared stationary.

- Solutions
- Remove the trend (e.g. rolling mean) and seasonality (e.g. periodic mean) components from the series.
- Log transform data (to reduce variability).
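A rough sketch of removing a rolling-mean trend and a periodic-mean seasonal component from a synthetic series (the period and coefficients are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
t = np.arange(80)
# Linear trend + seasonal cycle of period 4 + small noise.
s = pd.Series(0.5 * t + 3.0 * np.sin(2 * np.pi * t / 4) + 0.1 * rng.normal(size=80))

detrended = s - s.rolling(window=4, center=True).mean()                  # remove trend
deseasonalised = detrended - detrended.groupby(t % 4).transform("mean")  # remove seasonality
# What remains should be close to stationary noise.
```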

- Ensuring homogeneity of variance (homoscedasticity)
- Tests
- Residuals by predicted values plot
- Levene's test
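Levene's test is available in scipy; a sketch on two synthetic groups with clearly different variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(scale=1.0, size=100)
group_b = rng.normal(scale=5.0, size=100)  # much larger variance

stat, p_value = stats.levene(group_a, group_b)
# A small p-value rejects the null hypothesis of equal variances,
# i.e. it signals heteroscedasticity.
```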

- Solutions
- Box-Cox transformation
- Weighted least squares

- Handling class imbalance in data
- Solutions
- Over-sample minority class
- Can use random sampling or stratified sampling.

- Generate new synthetic observations for the minority class via interpolation (e.g. SMOTE)
- Can be integrated into the model fitting process by generating new synthetic observations for wrongly classified observations or observations located at decision boundaries.
- Warning - can lead to overfitting or be heavily influenced by outliers.

- Under-sample majority class
- By generating new data points sampled from k-means centroids.
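Random over-sampling of the minority class can be sketched with numpy (the class sizes here are made up; libraries such as imbalanced-learn wrap this and the interpolation-based variants):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)  # 90:10 class imbalance

# Resample the minority class with replacement until classes are balanced.
minority_idx = np.flatnonzero(y == 1)
extra = rng.choice(minority_idx, size=90 - 10, replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```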
