Comparing EDA libraries

by Chee Yee Lim


Posted on 2021-08-28



Comparing Python packages for data processing and/or exploratory data analysis


Overview

I have tested and reviewed a few Python packages for data processing and/or exploratory data analysis (EDA). Most of these packages attempt to automate parts of the data processing and/or EDA process, or provide a suite of functions to manipulate and visualize data.

The main objective here is to review and explore Python packages that will shorten the time needed for data processing and/or exploratory data analysis.

In summary, sweetviz may be the best option in a business setting (where the focus is on business understanding), while autoviz and pandas-profiling are solid choices in an R&D setting (where the focus is on deep-dive analysis).

Comparison Table

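| Package                         | sweetviz | pandas-profiling | autoviz | lux-api | dtale  | dataprep |
|---------------------------------|----------|------------------|---------|---------|--------|----------|
| Version                         | 2.1.3    | 3.0.0            | 0.0.83  | 0.3.2   | 1.56.0 | 0.3.0    |
| Recommended for exploration     | Yes      | No               | Yes     | No      | No     | Yes      |
| Recommended for production      | No       | No               | No      | No      | No     | No       |
| Ease of use                     | Yes      | Yes              | Yes     | Yes     | Yes    | Yes      |
| Computation speed               | Fast     | Medium           | Fast    | Fast    | Fast   | Fast     |
| Installation complexity         | Low      | Low              | Low     | Low     | Medium | Low      |
| Target variable-centric         | Yes      | No               | Yes     | Yes     | No     | Yes      |
| Missing data check              | Yes      | Yes              | No      | No      | Yes    | Yes      |
| Per variable summary statistics | Yes      | Yes              | No      | No      | Yes    | Yes      |
| AutoEDA focus                   | No       | Yes              | Yes     | No      | No     | Yes      |
| Score                           | 5        | 3                | 3       | 1       | 4      | 4        |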
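
For reference, the "Missing data check" and "Per variable summary statistics" rows correspond roughly to what can be done by hand in plain pandas. The sketch below is only a baseline for comparison, not how any of the reviewed packages implement these checks; the toy DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "age": [34, 51, None, 29],
        "segment": ["a", "b", "b", None],
        "churned": [0, 1, 0, 1],
    }
)

# Missing data check: number of missing values in each column.
missing_counts = df.isna().sum()

# Per variable summary statistics: count/mean/quartiles for numeric columns,
# count/unique/top/freq for non-numeric columns.
summary = df.describe(include="all")

print(missing_counts)
print(summary)
```

The packages in the table mostly wrap checks of this kind in a report or GUI, which is where the time saving comes from.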

Key points

  • sweetviz
    • Beautiful and simple visualisation that is good for business explanation.
    • Limited features beyond simple visualisation.
  • pandas-profiling
    • Good for generic data quality checks.
    • No option to specify a target variable for tailored EDA.
  • autoviz
    • Holistic visualisations that are good for deep dive analysis.
    • Claims to select plots/analyses intelligently, but I have yet to see the impact.
  • lux-api
    • Generates very simple single-variable or pairwise distribution plots.
    • Plots of interest have to be selected manually.
    • Charts are available only as widgets, not as regular notebook output.
  • dtale
    • Very good as a Python-based data manipulation tool with GUI.
    • Contains a full suite of tools for data manipulation/analysis/visualisation, almost matching similar commercial data analytics tools.
    • Too much manual effort is required to set up all the analyses/plots needed.
    • No dashboard deployment support.
    • No option to specify a target variable for tailored EDA.
    • Complex dependencies on many libraries.
  • dataprep
    • EDA module is like an extension to pandas-profiling. Very comprehensive.
    • Contains three modules: one to collect data, one to explore data and one to clean data.
    • Potentially faster on large data due to use of Dask.
    • Possible to investigate selected columns/variables in depth.
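
To make the target-variable distinction in the key points concrete, here is a minimal sketch of how three of the packages are typically invoked. The wrapper function names, default file paths and report title are hypothetical. The underlying calls (`sweetviz.analyze` with `target_feat`, `pandas_profiling.ProfileReport(...).to_file`, and `dataprep.eda.plot`) are the public entry points as I understand them, and their signatures may differ from the versions listed in the table. Imports are deferred into each function so that only the package actually used needs to be installed.

```python
def sweetviz_report(df, target=None, path="sweetviz_report.html"):
    """Build a sweetviz HTML report, optionally centred on a target column."""
    import sweetviz as sv  # deferred import: only needed if this is called

    report = sv.analyze(df, target_feat=target)
    report.show_html(path, open_browser=False)
    return path


def profiling_report(df, path="profile_report.html"):
    """Build a pandas-profiling HTML report (no target variable option)."""
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df, title="Profile report")
    profile.to_file(path)
    return path


def dataprep_column_plot(df, column):
    """Investigate a single column in depth with dataprep's EDA module."""
    from dataprep.eda import plot

    # Returns a renderable object that displays inline in a notebook.
    return plot(df, column)
```

In a notebook, something like `sweetviz_report(df, target="churned")` would produce a report in which every variable is shown against the (hypothetical) `churned` column, whereas `profiling_report(df)` treats all columns equally.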

Jupyter notebooks