Data Exploration and Feature Engineering
by Chee Yee Lim
Posted on 2021-03-16
Collection of notes on exploratory data analysis (EDA) and feature selection/engineering - a high-level overview of the key steps in extracting insights from data before model training.
- EDA is performed to (1) check for data quality issues such as missing data and outliers, and (2) discover any existing relationships or patterns in the data.
- EDA is best done with (1) plots and (2) statistical tests.
- Insights extracted from data during EDA are often very helpful in guiding model setup and training.
- Summary statistics per variable
- Check proportions of missing values etc.
- Plot distributions (e.g. histogram, density plot, box plot)
- Pair plot between variables (e.g. scatter, density plot).
- Correlation matrix with heatmap plot.
- Dimensionality reduction of variables (e.g. PCA).
- Unsupervised clustering of variables.
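The first few checks above (summary statistics, missing-value proportions and a correlation matrix) can be sketched with the Python standard library alone. This is a minimal illustration rather than a recommended workflow: the `data` dictionary, its column names and the helper functions `missing_proportion`, `summarise` and `pearson` are all hypothetical, and missing entries are assumed to be encoded as `None`. Plotting is left out.

```python
import math
import statistics

# Hypothetical toy dataset: column name -> values; None marks a missing entry.
data = {
    "height_cm": [150.0, 160.0, 170.0, None, 190.0],
    "weight_kg": [50.0, 58.0, 66.0, 74.0, None],
    "age_years": [30.0, 25.0, 41.0, 35.0, 28.0],
}


def missing_proportion(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def summarise(values):
    """Basic summary statistics over the non-missing entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = statistics.mean([x for x, _ in pairs])
    mean_y = statistics.mean([y for _, y in pairs])
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


missing = {name: missing_proportion(col) for name, col in data.items()}
summary = {name: summarise(col) for name, col in data.items()}
# Correlation matrix stored as a nested dict: corr[a][b].
corr = {a: {b: pearson(data[a], data[b]) for b in data} for a in data}

for name in data:
    print(name, "missing:", missing[name], summary[name])
for a in corr:
    print(a, [round(corr[a][b], 2) for b in corr])
```

On real datasets a dataframe library would normally replace these helpers; the point here is only to show what each check computes.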
- Feature selection is an important preprocessing step in identifying important features and eliminating irrelevant or redundant features.
- It improves prediction performance, improves model training efficiency and reduces dimensionality.
- The initial step in doing feature selection is to rely on domain knowledge to come up with a list of relevant features.
- These features can then be evaluated using algorithms to verify their importance.
- Filter methods
- Assign ranking to each feature using statistical techniques.
- E.g. chi-squared test, correlation coefficients and information gain.
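As a rough sketch of a filter method, the snippet below ranks features by the absolute value of their Pearson correlation with the target, without training any model. The feature names, the toy values and the `pearson` helper are hypothetical and chosen only for illustration; correlation is just one of the possible scoring statistics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: three candidate features and a numeric target.
features = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "weak": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "noise": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

# Score every feature independently of any model, then sort by score.
scores = {name: abs(pearson(col, target)) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

A cut-off (top k features, or a minimum score) would then decide which features are kept; that threshold is a modelling choice, not something the ranking provides.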
- Wrapper methods
- Train a model on different subsets of features and compare model performance to pick the best subset.
- E.g. recursive feature elimination, backward elimination and forward selection.
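Forward selection, one of the wrapper methods listed above, can be sketched as a greedy loop that keeps adding the feature that most improves a model's score. The sketch below assumes a 1-nearest-neighbour classifier scored by leave-one-out accuracy; that model choice, the toy data and the function names `loo_accuracy` and `forward_selection` are illustrative assumptions, and any model and validation scheme could be substituted.

```python
def loo_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the features listed in feature_idx."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # the held-out row cannot be its own neighbour
            dist = sum((row[k] - other[k]) ** 2 for k in feature_idx)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def forward_selection(rows, labels):
    """Greedily add the feature that most improves the score;
    stop as soon as no remaining feature gives a strict improvement."""
    remaining = list(range(len(rows[0])))
    selected, best_score = [], 0.0
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical toy data: column 0 separates the classes, columns 1-2 are noise.
X = [
    [0.0, 5.0, 1.0],
    [0.1, 1.0, 9.0],
    [0.2, 8.0, 4.0],
    [0.3, 3.0, 7.0],
    [1.0, 4.0, 8.0],
    [1.1, 9.0, 2.0],
    [1.2, 2.0, 5.0],
    [1.3, 7.0, 3.0],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected, score = forward_selection(X, y)
print("selected feature indices:", selected, "score:", score)
```

Because every candidate subset requires re-evaluating the model, wrapper methods tend to be more expensive than filter methods, which is the trade-off this loop makes visible.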
- Embedded methods
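- Combine qualities of both filter and wrapper methods: feature selection is built into the training of the modelling algorithm itself.
- E.g. LASSO and ridge regression (ridge shrinks coefficients but does not set them exactly to zero), regularised trees and random multinomial logit.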
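To illustrate how an embedded method selects features during training, here is a from-scratch sketch of LASSO fitted by coordinate descent, minimising (1/(2n))·||y − Xb||² + alpha·||b||₁. It assumes centred data and no intercept; the toy data, the `alpha` value and the function names are hypothetical, and in practice a library implementation would normally be used instead.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, target, alpha, n_iter=100):
    """Cyclic coordinate descent for LASSO without an intercept."""
    n, p = len(rows), len(rows[0])
    coefs = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for row, y_i in zip(rows, target):
                pred_without_j = sum(coefs[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * (y_i - pred_without_j)
            rho /= n
            scale = sum(row[j] ** 2 for row in rows) / n
            coefs[j] = soft_threshold(rho, alpha) / scale
    return coefs


# Hypothetical toy data: target is roughly 2 * x1 + 0.1 * x2.
names = ["x1", "x2"]
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]

coefs = lasso_coordinate_descent(X, y, alpha=0.5)
# Features whose coefficient was driven exactly to zero are dropped.
kept = [name for name, c in zip(names, coefs) if c != 0.0]
print("coefficients:", coefs, "kept:", kept)
```

The L1 penalty zeroes the weakly related `x2` coefficient while only shrinking the `x1` coefficient, so the selection falls out of the fit itself rather than from a separate ranking or search step.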