by Chee Yee Lim

Posted on 2021-03-16

Collection of notes on exploratory data analysis (EDA) and feature selection/engineering: a high-level overview of the key steps in extracting insights from data before model training.

- EDA is performed to (1) check for data quality issues, such as missing data and outliers, and (2) discover any existing relationships or patterns in the data.
- EDA is best done with (1) plots and (2) statistical tests.
- Insights extracted from data during EDA are often very helpful in guiding model setup and training.

- Summary statistics per variable
- Check proportions of missing values etc.
- Plot distributions (e.g. histogram, density plot, box plot)
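
The per-variable checks above can be sketched with pandas and NumPy. This is a minimal illustration on made-up data: the column names `age` and `income`, the injected missing values and the injected outlier are all assumptions for the example. Histogram counts stand in for a plot so the snippet runs without a plotting library.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns (names are illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, size=200),
    "income": rng.lognormal(10, 0.5, size=200),
})
df.loc[df.index[:20], "income"] = np.nan  # inject 10% missing values
df.loc[df.index[-1], "age"] = 150.0       # inject one obvious outlier

# Summary statistics per variable (count, mean, std, min, quartiles, max).
summary = df.describe()

# Proportion of missing values per variable.
missing_ratio = df.isna().mean()

# Histogram bin counts: the numbers behind a histogram plot of "age".
counts, bin_edges = np.histogram(df["age"], bins=10)

# Flag potential outliers with the 1.5 * IQR rule, the same rule a
# standard box plot uses to draw points beyond its whiskers.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(summary)
print(missing_ratio)
print(len(outliers), "potential outliers in 'age'")
```

In a notebook, `df.hist()` or `df.plot.box()` would draw the same distributions, provided matplotlib is installed.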

- Pair plot between variables (e.g. scatter, density plot).
- Correlation matrix with heatmap plot.
- Dimensionality reduction of variables (e.g. PCA).
- Unsupervised clustering of variables.
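
As a rough sketch of the multivariate steps above, the snippet below computes a correlation matrix, runs PCA, and clusters the variables themselves. The data is synthetic and the column names are hypothetical. Hierarchical clustering on the distance `1 - |correlation|` with a cut-off of 0.5 is one possible choice, assumed here for illustration rather than taken from the original notes.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two independent signals, each with a noisy near-duplicate.
rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.1, size=n),
    "b": b,
    "b_noisy": b + rng.normal(scale=0.1, size=n),
})

# Pearson correlation matrix (the values a heatmap would display).
corr = df.corr()

# PCA on standardised variables; two components should capture most of
# the variance because there are only two underlying signals.
X = StandardScaler().fit_transform(df)
pca = PCA(n_components=2).fit(X)
explained = pca.explained_variance_ratio_

# Cluster the variables: strongly correlated variables are "close".
dist = 1.0 - corr.abs().to_numpy()
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist, checks=False)
labels = fcluster(linkage(condensed, method="average"), t=0.5, criterion="distance")
variable_clusters = dict(zip(df.columns, labels))

print(corr.round(2))
print("explained variance ratio:", explained.round(3))
print(variable_clusters)
```

Variables that land in the same cluster are candidates for redundancy, which feeds directly into feature selection.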

- Feature selection is a preprocessing step that identifies informative features and eliminates irrelevant or redundant ones.
- It improves prediction performance, improves model training efficiency and reduces dimensionality.
- The initial step in doing feature selection is to rely on domain knowledge to come up with a list of relevant features.
- These features can then be evaluated using algorithms to verify their importance.

- Filter methods
- Assign a score or ranking to each feature using statistical techniques, independently of any model.
- E.g. chi-squared test, correlation coefficients and information gain.
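
A minimal sketch of filter-style scoring with scikit-learn, using the bundled iris dataset as an assumed example. Mutual information stands in for information gain here. The Pearson correlation against the integer class label is included only to illustrate the idea, since treating class labels as numbers is not generally meaningful.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Score every feature independently of any downstream model.
chi2_scores, chi2_pvalues = chi2(X, y)  # chi2 requires non-negative features
mi_scores = mutual_info_classif(X, y, random_state=0)
corr_scores = X.corrwith(y).abs()       # illustrative only for class labels

ranking = pd.DataFrame(
    {
        "chi2": chi2_scores,
        "mutual_info": mi_scores,
        "abs_corr": corr_scores.to_numpy(),
    },
    index=X.columns,
).sort_values("chi2", ascending=False)

# Keep the two features with the highest chi-squared score.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])

print(ranking.round(3))
print("selected:", selected)
```

Because each feature is scored on its own, filter methods are cheap to run but cannot detect features that are only useful in combination.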

- Wrapper methods
- Train a model on candidate subsets of features and use the resulting model to decide which features to add or remove.
- E.g. recursive feature elimination, backward elimination and forward selection.
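
The wrapper idea can be sketched with scikit-learn's `RFE` (recursive feature elimination) and `SequentialFeatureSelector` (forward selection). The iris dataset, the logistic regression model and the target of two features are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = load_iris()
# Standardise so that coefficient magnitudes are comparable across features.
X = StandardScaler().fit_transform(iris.data)
y = iris.target
names = iris.feature_names

model = LogisticRegression(max_iter=1000)

# Recursive feature elimination: fit, drop the weakest feature, refit,
# and repeat until only the requested number of features remains.
rfe = RFE(estimator=model, n_features_to_select=2, step=1).fit(X, y)
rfe_selected = [name for name, keep in zip(names, rfe.support_) if keep]

# Forward selection: start from no features and greedily add the one
# that most improves the cross-validated score.
sfs = SequentialFeatureSelector(
    model, n_features_to_select=2, direction="forward", cv=5
).fit(X, y)
forward_selected = [name for name, keep in zip(names, sfs.get_support()) if keep]

print("RFE ranking (1 = kept):", dict(zip(names, rfe.ranking_)))
print("RFE selected:", rfe_selected)
print("forward selected:", forward_selected)
```

Passing `direction="backward"` to `SequentialFeatureSelector` gives backward elimination. Wrapper methods refit the model many times, so they cost more than filter methods.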

- Embedded methods
- Combine qualities of both filter and wrapper methods: feature selection happens as part of model training itself.
- E.g. LASSO and ridge regression, regularised trees and random multinomial logit. Note that ridge shrinks coefficients but, unlike LASSO, does not set them exactly to zero.
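
A small sketch of an embedded method, assuming synthetic data in which only the first two of six features drive the target. The coefficients, noise level and `alpha=0.1` are arbitrary choices for illustration. LASSO's L1 penalty drives the irrelevant coefficients to zero during training, while ridge is shown alongside for contrast.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 6 features, only features 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# L1 penalty: selection happens as a side effect of fitting the model.
lasso = Lasso(alpha=0.1).fit(X, y)

# L2 penalty: coefficients shrink towards zero but stay non-zero.
ridge = Ridge(alpha=0.1).fit(X, y)

# Keep the features whose LASSO coefficient is non-zero.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
selected_idx = np.flatnonzero(selector.get_support())

print("lasso coefficients:", lasso.coef_.round(3))
print("ridge coefficients:", ridge.coef_.round(3))
print("selected feature indices:", selected_idx)
```

The strength of the penalty (`alpha`) controls how aggressively features are dropped, so in practice it is usually tuned by cross-validation, for example with `LassoCV`.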