Data Exploration and Feature Engineering

by Chee Yee Lim


Posted on 2021-03-16



A collection of notes on exploratory data analysis and feature selection/engineering: a high-level overview of the key steps in extracting insights from data before model training.


Exploratory Data Analysis (EDA)

Overview

  • EDA is performed to (1) check for data quality issues such as missing data and outliers, and (2) discover any existing relationships or patterns in the data.
  • EDA is best done with (1) plots and (2) statistical tests.
  • Insights extracted from data during EDA are often very helpful in guiding model setup and training.

Univariate Analysis

  1. Summary statistics per variable
  2. Check data quality indicators such as the proportion of missing values per variable
  3. Plot distributions (e.g. histogram, density plot, box plot)
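
The three univariate steps above can be sketched with pandas and NumPy. This is a minimal illustration, not a prescribed workflow: the toy data frame and its column names (`age`, `income`) are made up for the example, and the histogram is computed as bin counts rather than drawn, so that the snippet has no plotting dependency.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names and values are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 41, np.nan, 29, 52, 38, 95],
    "income": [32000, 48000, np.nan, 51000, 39000, np.nan, 45000, 47000],
})

# 1. Summary statistics per variable (count, mean, std, min, quartiles, max).
summary = df.describe()

# 2. Proportion of missing values per variable.
missing_share = df.isna().mean()

# 3. Distribution of one variable, expressed as histogram bin counts.
counts, bin_edges = np.histogram(df["age"].dropna(), bins=4)

# Outlier check using the 1.5 * IQR rule that box plot whiskers are based on.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(summary)
print(missing_share)
print(outliers)
```

With a plotting backend available, `df["age"].plot.hist()` or `df.boxplot()` would give the visual equivalents of step 3.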

Multivariate Analysis

  1. Pair plot between variables (e.g. scatter, density plot).
  2. Correlation matrix with heatmap plot.
  3. Dimensionality reduction of variables (e.g. PCA).
  4. Unsupervised clustering of variables.
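
As a rough sketch of steps 2 and 3, the snippet below computes a correlation matrix and a two-component PCA on synthetic data. The variables `x1`, `x2` and `x3` are invented for the example (with `x2` deliberately built to be almost collinear with `x1`), and scikit-learn is assumed to be installed. Standardising before PCA is a common choice, not a requirement stated in these notes.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): x2 nearly duplicates x1,
# while x3 is independent noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.1, size=n),
    "x3": rng.normal(size=n),
})

# Pairwise Pearson correlation matrix; this is the table a heatmap would show.
corr = df.corr()

# PCA on standardised variables, so that no variable dominates through scale.
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
explained = pca.explained_variance_ratio_

print(corr.round(2))
print(explained.round(3))
```

A strong off-diagonal entry in `corr`, like the one between `x1` and `x2` here, is the kind of redundancy that later motivates feature selection. The matrix can be passed to a heatmap function such as `seaborn.heatmap` for a visual check.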

Feature Selection

Overview

  • Feature selection is an important preprocessing step in identifying important features and eliminating irrelevant or redundant features.
  • It can improve prediction performance, improve model training efficiency and reduce dimensionality.
  • The initial step in doing feature selection is to rely on domain knowledge to come up with a list of relevant features.
    • These features can then be evaluated using algorithms to verify their importance.

List of Algorithms

  • Filter methods
    • Assign ranking to each feature using statistical techniques.
    • E.g. chi-squared test, correlation coefficients and information gain.
  • Wrapper methods
    • Train a model on candidate subsets of features and use its performance to decide which features to add or remove.
    • E.g. recursive feature elimination, backward elimination and forward selection.
  • Embedded methods
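    • Perform feature selection as part of model training itself, combining qualities of both filter and wrapper methods.
    • E.g. LASSO regression, regularised trees and random multinomial logit. Ridge regression is often listed here too, although it only shrinks coefficients and does not set them to zero.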
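
To make the three families concrete, here is a minimal sketch using scikit-learn on a synthetic regression problem. The dataset, the choice of three features to keep, and the LASSO penalty `alpha=1.0` are all assumptions made for illustration. The filter step uses `f_regression`, a univariate test derived from the correlation between each feature and the target; other scoring functions could be swapped in.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (illustrative only): 10 features, 3 of which drive the target.
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Filter: score each feature independently of any model, keep the top 3.
filter_sel = SelectKBest(score_func=f_regression, k=3).fit(X, y)
filter_idx = filter_sel.get_support(indices=True)

# Wrapper: recursive feature elimination repeatedly refits the model and
# drops the weakest feature until 3 remain.
wrapper_sel = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X, y)
wrapper_idx = wrapper_sel.get_support(indices=True)

# Embedded: the L1 penalty in LASSO drives some coefficients to zero during
# training; features with non-zero coefficients are kept.
embedded_sel = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
embedded_idx = embedded_sel.get_support(indices=True)

print("filter:  ", filter_idx)
print("wrapper: ", wrapper_idx)
print("embedded:", embedded_idx)
```

The filter and wrapper selectors are told how many features to keep, whereas the number kept by the embedded selector depends on the penalty strength, so `alpha` would normally be tuned, for example by cross-validation.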