Data Preprocessing

by Chee Yee Lim


Posted on 2021-05-01



Collection of notes on data preprocessing - high level overview of anything related to data preprocessing.


Data Properties

Data Scales of Measurement

  • Nominal scale
    • Categorical, with no ordering between levels.
  • Ordinal scale
    • Categorical, with ordering between levels.
    • Differences between ranks convey direction (order) but not magnitude.
  • Interval scale
    • Numeric values over a range. Preserve information on magnitude.
    • Equal intervals between values are meaningful, but there is no true zero point.
  • Ratio scale
    • Discrete or continuous values. Preserve information on magnitude and relate to a true zero point, so ratios are meaningful.

Data Preprocessing Steps

Dealing with missing data

  • Solutions
    1. Remove observations with missing values.
    2. Replace with summary values (e.g. quantitative - mean, median; categorical - mode).
    3. Replace with forward fill values (on time series data).
    4. Use a predictive model to predict missing values.

Dealing with outliers

  • Tests
    1. Standard deviations
    2. Interquartile range / Box plot method
      • Any value more than 1.5*IQR above the upper quartile (or below the lower quartile) is an outlier.
    3. Normalised residual vs leverage plot
      • Observations with both high leverage and a large residual are outliers; this can be confirmed by running a Bonferroni outlier test.
      • Leverage is a measure of how far the independent variable values of an observation are from those of the other observations.
    4. Cook's D
      • Tests for the influence of an observation.
      • Can be combined to give an influence plot (residual vs leverage).
    5. Clustering using DBScan
      • Look for unclustered data points and/or unexpected clusters.
    6. Generative adversarial network (e.g. GAN-AD)
      • Trained on normal data to learn its distribution; newly observed data are then flagged as anomalous when the generator and discriminator disagree on them.
  • Solutions
    1. Remove outlier observations.
    2. Replace outlier values with a maximum/minimum cap on values (i.e. winsorisation), as in the sketch after this list.
    3. Keep outliers, if they represent rare but important observations.
      • Can be improved by adding an additional dummy variable that corresponds to outlier status (i.e. 1 for outlier, 0 for non-outlier).

Handling categorical data

  • Solutions
    1. Convert into dummy variables with one-hot encoding.
      • Always use one fewer (n-1) dummy variables than the number of categories; otherwise the design matrix will not have full column rank (the dummy variable trap).
    2. Convert into special categorical data types.
      • Useful for models such as random forests that perform worse with one-hot encoding.

Normalising/transforming data

  • Tests (for normal distribution)
    1. Q-Q (quantile-quantile) plot
      • Observed and expected values should form a straight diagonal line.
    2. Skewness and kurtosis
      • Skewness within \( \pm 2 \) and kurtosis within \( \pm 7 \) are generally acceptable.
    3. Shapiro-Wilk’s W test
    4. Kolmogorov-Smirnov test
  • Solutions
    1. Min-max scaling (per variable/data).
      • Only handles scale; does not address skewness.
    2. Standardisation
      • Transform variables to have zero mean and standard deviation of one (i.e. into z-scores).
      • Works best for normally distributed data.
    3. Log transformation
      • Transform values into the log scale by logging the values. \( w = \log{y} \)
      • This transformation is easy to interpret because changes in a log value are relative changes on the original scale.
      • It also constrains the values to stay positive on the original scale.
    4. Power transformation
      • Apply power (square, cube etc) on each value. \( w = y^p \)
    5. Box-Cox transformation
      • It is a family that includes both logarithms and power transformations.
      • It has a parameter \( \lambda \) and is defined as follows:
        • \( w = \begin{cases} \log{y} &\text{if $\lambda=0$;} \\ \frac{y^\lambda-1}{\lambda} &\text{otherwise} \end{cases} \)
      • The value of \( \lambda \) can be determined by computing the correlations of data transformed with multiple \( \lambda \) values vs a normal distribution.
        • The Box-Cox normality plot can then be made, which contains the correlation coefficients vs \( \lambda \) values.
        • The \( \lambda \) value that gives the maximum correlation should be chosen as the ideal value to perform transformation.
      • After forecasting on the transformed scale, it is important to reverse the transformation to get back onto the original scale.
      • The reverse Box-Cox transformation is given by:
        • \( y = \begin{cases} \exp(w) &\text{if $\lambda=0$;} \\ (\lambda w + 1)^{1/\lambda} &\text{otherwise} \end{cases} \)

Handling non-independent data

  • Solutions
    1. Mixed effect models
      • Fit both fixed effects (the usual regression coefficients) and random effects (additional group-level terms).
    2. Generalised estimation equation (GEE)
      • Accounts for possible unknown correlation between outcomes within a cluster.

Handling multicollinearity

  • Overview
    • Multicollinearity is a situation where two or more of the independent variables are strongly correlated with each other.
    • In such a situation, we should keep only one variable among each set of correlated independent variables.
  • Tests
    1. Correlation matrix of independent variables
      • An absolute correlation higher than 0.8 indicates high correlation.
      • Note that the correlation matrix tests for correlation between one pair of variables at a time, so it does not take interactions among multiple variables into account.
    2. Variance inflation factor (VIF)
      • If VIF > 10, then high degree of correlation exists.
      • If VIF > 1 but < 5, moderate correlation exists.
      • If VIF = 1, this means there is no correlation.
      • \( VIF_{i} = \frac{1}{1-R_{i}^2} \), where \( R_{i}^2 \) is the coefficient of determination of variable \( X_{i} \).
  • Solutions
    1. Use dimensionality reduction transformation on data
      • The dimensions of dimensionally reduced data from methods such as PCA are orthogonal by construction, so they are uncorrelated.
    2. Remove variables with large VIF (>10)
      • Has to be done iteratively (removing one variable at a time), as the VIFs of all remaining variables change when a variable is included/excluded; see the sketch after this list.
    3. Run L2 regularisation during model fitting
    4. Use condition number/indexes
      • Looks at a set of variables, rather than individual variables.

Checking autocorrelations (time series)

  • Tests
    1. Autocorrelation and partial autocorrelation function (ACF/PACF) plots
    2. Durbin Watson test
      • The test looks for the presence of first order autocorrelation (i.e. \( AR(1) \)) between the residuals.
      • The test statistic is calculated using the formula: \( DW = \frac{ \sum_{t=2}^T ( e_{t} - e_{t-1} )^2 }{ \sum_{t=1}^T e_{t}^2 } \), where \( e_{t} \) are residuals from an OLS linear regression.
      • It is considered somewhat archaic and requires a special statistical table for critical values.
      • It always has a value between 0 and 4.
        • A value of 2 means no autocorrelation.
        • Values between 1.5 and 2.5 are generally considered acceptable.
        • Values below 2 indicate positive autocorrelation.
        • Values above 2 indicate negative autocorrelation (which is rare in time series).
    3. Lagrange Multiplier test
  • Solutions
    1. Data differencing (i.e. take the difference or percentage difference between consecutive values), as sketched below.

Ensuring stationarity (time series)

  • Tests
    1. Dickey Fuller test
      • The null hypothesis is that the data are non-stationary (a unit root is present), so a p-value > 0.05 means non-stationarity cannot be rejected.
  • Solutions
    1. Remove trend (i.e. rolling mean) and seasonality (i.e. periodic mean) components from the series.
    2. Log transform data (to reduce variability).

Ensuring homogeneity of variance (homoscedasticity)

  • Tests
    1. Residuals by predicted values plot
    2. Levene's test
  • Solutions
    1. Box-Cox transformation
    2. Weighted least squares

Handling class imbalance in data

  • Solutions
    1. Over-sample minority class
      • Can use random sampling or stratified sampling.
    2. Generate new synthetic observations for minority class via interpolation
      • Can be integrated into model fitting process by generating new synthetic observations for wrongly classified observations or observations located at decision boundaries.
      • Warning - can lead to overfitting or be heavily influenced by outliers.
    3. Under-sample majority class
      • E.g. by replacing majority class observations with new data points generated from their k-means centroids.