Data Preprocessing

by Chee Yee Lim

Posted on 2021-05-01

Collection of notes on data preprocessing - a high-level overview of anything related to data preprocessing.

Data Properties

Data Scales of Measurement

• Nominal scale
• Categorical, with no ordering between levels.
• Ordinal scale
• Categorical, with ordering between levels.
• Differences between ranks convey direction (order) but not magnitude.
• Interval scale
• Numeric values where differences (intervals) are meaningful and preserve information on magnitude.
• There is no true zero point (e.g. temperature in Celsius).
• Ratio scale
• Discrete or continuous values. Preserves information on magnitude and relates to a true zero point, so ratios are meaningful (e.g. weight, length).

Data Preprocessing Steps

Dealing with missing data

• Solutions
1. Remove observations with missing values.
2. Replace with summary values (e.g. quantitative - mean, median; categorical - mode).
3. Replace with forward fill values (on time series data).
4. Use a predictive model to predict missing values.
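The first three options above can be sketched with pandas; the DataFrame here is a hypothetical toy example.

```python
import pandas as pd
import numpy as np

# Toy data with missing values (hypothetical example).
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["KL", "SG", None, "SG"],
})

# 1. Remove observations with missing values.
dropped = df.dropna()

# 2. Replace with summary values: mean for quantitative, mode for categorical.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# 3. Forward fill (suitable for time series data).
ffilled = df.ffill()

print(filled["age"].tolist())  # mean of 25, 31, 40 = 32.0 fills the gap
```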

Dealing with outliers

• Tests
1. Standard deviations
2. Interquartile range / Box plot method
• Any value more than 1.5*IQR above the upper quartile, or more than 1.5*IQR below the lower quartile, is an outlier.
3. Normalised residual vs leverage plot
• Observations that have both high leverage and high residuals are potential outliers; this can be confirmed by running a Bonferroni outlier test.
• Leverage is a measure of how far the independent variable values of an observation are from those of the other observations.
4. Cook's D
• Tests for the influence of a value.
• Can be combined to give an influence plot (residual vs leverage).
5. Clustering using DBScan
• Look for unclustered data points and/or unexpected clusters.
6. Generative adversarial networks (GANs)
• Train on regular data to learn the normal distribution, then check newly observed data for agreement between the generator and discriminator.
• Solutions
1. Remove outlier observations.
2. Replace outlier values with a maximum/minimum cap on values (i.e. winsorisation).
3. Keep outliers, if they represent rare but important observations.
• Can be improved by adding an additional dummy variable that corresponds to outlier status (i.e. 1 for outlier, 0 for non-outlier).
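The IQR test and the winsorisation fix can be sketched together with NumPy; the values below are illustrative only.

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0, 10.0])

# Interquartile range method: flag values beyond 1.5*IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Winsorisation: cap values at the computed bounds instead of removing them.
winsorised = np.clip(values, lower, upper)

print(values[is_outlier])  # the extreme value 95.0 is flagged
```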

Handling categorical data

• Solutions
1. Convert into dummy variables with one-hot encoding.
• Always use one fewer (n-1) dummy variables than the number of categories; otherwise the design matrix will not have full column rank (the dummy variable trap).
2. Convert into special categorical data types.
• Useful for models such as random forest, which tend to perform worse with one-hot encoding.
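A minimal pandas sketch of both options: one-hot encoding with n-1 dummies, and the categorical dtype.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# drop_first=True yields n-1 dummy variables, avoiding the dummy variable trap.
dummies = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)
print(list(dummies.columns))  # 3 categories -> 2 dummy columns

# Alternative: pandas' categorical dtype, which some tree-based libraries
# can consume directly without one-hot encoding.
cat = df["colour"].astype("category")
```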

Normalising/transforming data

• Tests (for normal distribution)
1. Q-Q (quantile-quantile) plot
• Observed and expected values should form a straight diagonal line.
2. Skewness and kurtosis
• Skewness between $$\pm 2$$, kurtosis between $$\pm 7$$.
3. Shapiro-Wilk’s W test
4. Kolmogorov-Smirnov test
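The Shapiro-Wilk and skewness/kurtosis checks can be sketched with SciPy; the sample here is synthetic normal data, so the W statistic should be close to 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk: W is close to 1 for normally distributed data.
w_stat, p_value = stats.shapiro(sample)

# Skewness and excess kurtosis as quick rules of thumb.
skew = stats.skew(sample)
kurt = stats.kurtosis(sample)  # Fisher definition: 0 for a normal distribution
```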
• Solutions
1. Min-max scaling (per variable/data).
• Only handles scale; it does not address skewness.
2. Standardisation
• Transform variables to have zero mean and standard deviation of one (i.e. into z-scores).
• Works best for normally distributed data.
3. Log transformation
• Transform values into the log scale by logging the values. $$w = \log{y}$$
• This transformation is easy to interpret because changes in a log value are relative changes on the original scale.
• It also constrains the values to stay positive on the original scale.
4. Power transformation
• Apply power (square, cube etc) on each value. $$w = y^p$$
5. Box-Cox transformation
• It is a family that includes both logarithms and power transformations.
• It has a parameter $$\lambda$$ and is defined as follows:
• $$w = \begin{cases} \log{y} &\text{if } \lambda=0; \\ \frac{y^\lambda-1}{\lambda} &\text{otherwise} \end{cases}$$
• The value of $$\lambda$$ can be determined by computing the correlations of data transformed with multiple $$\lambda$$ values vs a normal distribution.
• The Box-Cox normality plot can then be made, which contains the correlation coefficients vs $$\lambda$$ values.
• The $$\lambda$$ value that gives the maximum correlation should be chosen as the ideal value to perform transformation.
• After performing forecast, it is important to reverse the transformed data back onto the original scale.
• The reverse Box-Cox transformation is given by:
• $$y = \begin{cases} \exp(w) &\text{if } \lambda=0; \\ (\lambda w + 1)^{1/\lambda} &\text{otherwise} \end{cases}$$
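A sketch of these transformations using SciPy. Note that `scipy.stats.boxcox` chooses $$\lambda$$ by maximum likelihood rather than the correlation-based normality plot described above; the data here are synthetic and right-skewed.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.5, size=300)  # positive, right-skewed data

# Min-max scaling and standardisation (z-scores) for comparison.
minmax = (y - y.min()) / (y.max() - y.min())
zscores = (y - y.mean()) / y.std()

# Box-Cox transformation; scipy picks lambda by maximum likelihood.
w, lam = stats.boxcox(y)

# Reverse the transformation back onto the original scale.
y_back = inv_boxcox(w, lam)
```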

Handling non-independent data

• Solutions
1. Mixed effect models
• Fit both fixed effects (the usual beta coefficients) and random effects (newly added group-level variables).
2. Generalised estimation equation (GEE)
• Accounts for possible unknown correlation between outcomes.

Handling multicollinearity

• Overview
• Multicollinearity is a situation where one or more of the independent variables are strongly correlated with each other.
• In such a situation, we should use only one variable among a set of correlated independent variables.
• Tests
1. Correlation matrix of independent variables
• Anything higher than 0.8 is highly correlated.
• Note that a correlation matrix tests for correlation between one pair of variables at a time, so it does not take into account interactions among multiple variables.
2. Variance inflation factor (VIF)
• If VIF > 10, then high degree of correlation exists.
• If VIF > 1 but < 5, moderate correlation exists.
• If VIF = 1, this means there is no correlation.
• $$VIF_{i} = \frac{1}{1-R_{i}^2}$$, where $$R_{i}^2$$ is the coefficient of determination from regressing $$X_{i}$$ on all the other independent variables.
• Solutions
1. Use dimensionality reduction transformation on data
• By definition, the dimensions of dimensionally reduced data are orthogonal (e.g. PCA).
2. Remove variables with large VIF (>10)
• Has to be done iteratively, as the VIFs of all variables are affected by the inclusion/exclusion of the variable under consideration.
3. Run L2 regularisation during model fitting
4. Use condition numbers/indexes as an additional diagnostic
• Looks at a set of variables jointly, rather than at individual variables.
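The VIF formula can be implemented directly with NumPy by regressing each variable on the others; this is a sketch (statsmodels also provides a `variance_inflation_factor` helper).

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i of X on all the other columns (with an intercept)."""
    n, p = X.shape
    vifs = np.empty(p)
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs[i] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)               # independent of x1 -> VIF near 1
x3 = x1 + 0.05 * rng.normal(size=500)   # nearly collinear with x1 -> VIF >> 10
X = np.column_stack([x1, x2, x3])
print(vif(X))
```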

Checking autocorrelations (time series)

• Tests
1. Autocorrelation and partial autocorrelation function (ACF/PACF) plots
2. Durbin Watson test
• The test looks for the presence of first order autocorrelation (i.e. $$AR(1)$$) between the residuals.
• The test statistic is calculated using the formula: $$DW = \frac{ \sum_{t=2}^T ( e_{t} - e_{t-1} )^2 }{ \sum_{t=1}^T e_{t}^2 }$$, where $$e_{t}$$ are residuals from an OLS linear regression.
• It is considered somewhat archaic and requires a special statistical table for critical values.
• It always has a value between 0 and 4.
• A value of 2 means no autocorrelation.
• A value between 1.5 and 2.5 is generally considered acceptable.
• A value below 2 indicates positive autocorrelation.
• A value above 2 indicates negative autocorrelation (which is rare in time series).
3. Lagrange Multiplier test
• Solutions
1. Data differencing (i.e. take the raw or percentage difference between consecutive values).
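The DW statistic is simple to compute directly from its formula; a NumPy sketch on synthetic residuals, comparing independent noise with strongly autocorrelated AR(1) residuals.

```python
import numpy as np

def durbin_watson(e):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)

# Independent residuals: DW should be close to 2.
white = rng.normal(size=500)

# AR(1) residuals with strong positive autocorrelation: DW well below 2
# (roughly 2 * (1 - 0.9) = 0.2).
ar1 = np.empty(500)
ar1[0] = rng.normal()
for t in range(1, 500):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()

print(durbin_watson(white), durbin_watson(ar1))
```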

Ensuring stationarity (time series)

• Tests
1. Dickey Fuller test
• The null hypothesis is that the series has a unit root (i.e. is non-stationary), so a test result of p-value > 0.05 means the data cannot be considered stationary.
• Solutions
1. Remove trend (e.g. via a rolling mean) and seasonality (e.g. via the periodic mean) components from the series.
2. Log transform data (to reduce variability).
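Both fixes can be sketched with pandas on a synthetic series; the window of 12 assumes monthly data with yearly seasonality.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 120
trend = np.linspace(10, 30, n)                       # upward trend
season = 3 * np.sin(2 * np.pi * np.arange(n) / 12)   # yearly seasonality
series = pd.Series(trend + season + rng.normal(scale=0.5, size=n))

# 1. Remove trend with a rolling mean, then seasonality with the periodic mean.
detrended = series - series.rolling(window=12, center=True).mean()
deseasoned = detrended - detrended.groupby(np.arange(n) % 12).transform("mean")

# 2. Log transform (for positive data) to reduce variability.
logged = np.log(series)
```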

Ensuring homogeneity of variance (homoscedasticity)

• Tests
1. Residuals by predicted values plot
2. Levene's test
• Solutions
1. Box-Cox transformation
2. Weighted least squares
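Levene's test is available in SciPy; a minimal sketch on two synthetic groups with very different variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(scale=1.0, size=100)
group_b = rng.normal(scale=5.0, size=100)  # much larger variance

# Levene's test: the null hypothesis is that all groups have equal variance.
stat, p_value = stats.levene(group_a, group_b)
# A small p-value (< 0.05) rejects homogeneity of variance.
```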

Handling class imbalance in data

• Solutions
1. Over-sample minority class
• Can use random sampling or stratified sampling.
2. Generate new synthetic observations for minority class via interpolation
• Can be integrated into model fitting process by generating new synthetic observations for wrongly classified observations or observations located at decision boundaries.
• Warning - can lead to overfitting, or be heavily influenced by outliers.
3. Under-sample majority class
• By generating new data points sampled from k-means centroids.
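Random over-sampling of the minority class (solution 1) can be sketched with plain pandas; interpolation-based synthesis such as SMOTE (solution 2) is provided by the imbalanced-learn library.

```python
import pandas as pd

df = pd.DataFrame({
    "feature": range(10),
    "label": ["majority"] * 8 + ["minority"] * 2,
})

minority = df[df["label"] == "minority"]
majority = df[df["label"] == "majority"]

# Random over-sampling with replacement until the classes are balanced.
oversampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, oversampled], ignore_index=True)

print(balanced["label"].value_counts().to_dict())
```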