Introducing data-describe: an awesome package for exploring data
data-describe is an exciting new open source project for Exploratory Data Analysis (EDA). Read the press release here; the GitHub project can be found here.
Released in beta in late October 2020 by the Data Science Team at Maven Wave Partners, the effort was led by well-known open source contributor, author, and leader Yuan Tang (XGBoost, Apache MXNet, Kubeflow, TensorFlow, and ElasticDL). Yuan comments:
the data-describe project fills a gap in almost every data scientist’s tool belt.
In this article I’ll cover: the story behind this awesome package; what features are available today to give you an edge; similar packages out there; what I would do to improve the package; and finally what you (Yes You!) can do to help.
The Story of data-describe
From the data-describe website:
data-describe is a Python toolkit for inspecting, illuminating, and investigating enormous amounts of unknown data with mixed relationships.
There are several problems and annoyances in the world of EDA that can hinder your productivity, and data-describe sets out to resolve them:
- Packaging Hell
- Missing or Disjointed Documentation
- Overly Repetitive Tasks
- Large Data Sizes
- Handling Sensitive Data
- Disjointed Community
- Missing Examples
Even though you use packages like sklearn, pandas, seaborn, gensim, and shapley all the time, something sometimes happens: a version changes, and welcome to Package Hell! data-describe solves this problem by shipping with a frozen set of dependencies that have been vetted and thoroughly tested by the project team before each release.
It’s surprising how many great open source projects have poor documentation. data-describe has excellent documentation, found here in the User Guide.
Ever find yourself copying the same cell of code from one Jupyter notebook to the next, doing basically the same thing over and over? Consider the work involved in manually mapping schemas, counting nulls, and all those other repetitive tasks. data-describe follows the DRY (Don’t Repeat Yourself) principle and eliminates these unnecessary repetitive tasks.
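To make the point concrete, here is a sketch (in plain pandas, not data-describe's actual API) of the summary boilerplate many notebooks repeat and that a single library call can replace; the `manual_summary` helper name is my own:

```python
import pandas as pd

def manual_summary(df: pd.DataFrame) -> pd.DataFrame:
    """The boilerplate most notebooks repeat: dtypes, null counts, uniques."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # schema mapping by hand
        "nulls": df.isna().sum(),         # counting nulls by hand
        "unique": df.nunique(),           # cardinality by hand
    })

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", "y"]})
print(manual_summary(df))
```

Every project reinvents some version of this table; a shared, tested implementation is exactly the DRY win described above.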
Large data can slow you down: datasets bigger than your laptop can handle, and even with cloud computing (data-describe supports GCP natively), big data can take a long time to compute and display. data-describe solves this by backing the project with Modin, on top of Apache Arrow via Ray, or Dask, and by visualizing high-dimensional data using PCA and t-SNE.
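To illustrate the idea behind that high-dimensional visualization, here is a minimal PCA projection in plain NumPy; data-describe wraps this kind of dimensionality reduction for you, so the `pca_2d` helper below is purely illustrative:

```python
import numpy as np

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)                      # center each feature
    # SVD of the centered matrix; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # 2-D coordinates for plotting

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                   # 100 points in 10 dimensions
Z = pca_2d(X)
print(Z.shape)  # (100, 2)
```

The resulting 2-D coordinates can then be scatter-plotted, which is how tools in this space make wide datasets visible at a glance.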
Then there is the issue of privacy when machine learning data contains sensitive information. Now the EDA process can handle that sensitive data upfront: data-describe identifies and handles things like PII (Personally Identifiable Information).
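data-describe's actual sensitive-data detection is more sophisticated, but a toy regex scan conveys the idea; the `flag_pii_columns` name and the two patterns below are illustrative only, not the library's API:

```python
import re
import pandas as pd

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def flag_pii_columns(df: pd.DataFrame) -> dict:
    """Return {column: [kinds of PII found]} for string columns."""
    hits = {}
    for col in df.select_dtypes(include="object"):
        kinds = [kind for kind, pat in PII_PATTERNS.items()
                 if df[col].astype(str).str.contains(pat).any()]
        if kinds:
            hits[col] = kinds
    return hits

df = pd.DataFrame({"notes": ["call me", "a@b.com"], "id": ["123-45-6789", "ok"]})
print(flag_pii_columns(df))  # {'notes': ['email'], 'id': ['ssn']}
```

Flagging columns like this upfront lets you mask or drop them before any plots or reports are shared.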
While fitting in nicely with other popular open-source projects like TensorFlow, Kubeflow, and TFX, data-describe has its own community that uses the tools every day. The internal scripts that data science teams tend to rely on from project to project are not open source, and they lack any community beyond those who happen to use them. Open source is a proven model, and that is why Maven Wave has shared it.
One of the most potent knowledge-sharing opportunities comes from the fact that EDA generally does not have many easy-to-follow examples. data-describe ships with its own examples, as well as a companion project, Awesome Data Science Models, which presents a library of more extensive examples (Lending Club, Census Income, Black Friday, Cellular Imaging, …), all of which use data-describe for EDA.
Similar Projects
data-describe is not the only project out there that covers some aspect of EDA or data inspection. If you are really interested in this topic, please also check out these projects:
- pandas-profiling: Create HTML profiling reports from pandas DataFrame objects https://github.com/pandas-profiling/pandas-profiling
- dataprep.eda: The goal of the dataprep.eda module is to simplify EDA and allow users to explore as many important characteristics as possible via only a few APIs. https://sfu-db.github.io/dataprep/eda/introduction.html
- sweetviz: Sweetviz is an open source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. https://github.com/fbdesignpro/sweetviz
- Pandas Plotly: a third-party wrapper library around Plotly, inspired by the Pandas .plot() API. https://plotly.com/python/pandas-backend/
- Pandas Bokeh: Pandas-Bokeh provides a Bokeh plotting backend for Pandas, GeoPandas and Pyspark DataFrames, similar to the already existing Visualization feature of Pandas. https://github.com/PatrikHlobil/Pandas-Bokeh
- Facets: The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive. https://github.com/PAIR-code/facets
- Tributary: more of a pipelines library. https://github.com/timkpaine/tributary
- TextHero: an NLP toolkit. https://github.com/jbesomi/texthero
- Lantern: plotting with JupyterLab extensions and mock data. https://github.com/timkpaine/lantern
- ipywidgets: interactive visuals in Jupyter. https://ipywidgets.readthedocs.io/en/latest/
- Voila: turns Jupyter notebooks into interactive standalone applications. https://github.com/voila-dashboards/voila
- Dabl: The data analysis baseline library. https://dabl.github.io/dev/
- klib: klib is a Python library for importing, cleaning, analyzing and preprocessing data. While the focus is on these steps, future versions will include modules and functions for model creation and optimization to provide more of an end-to-end solution. https://github.com/akanz1/klib
- scattertext: A tool for finding distinguishing terms in corpora, and presenting them in an interactive, HTML scatter plot. Points corresponding to terms are selectively labeled so that they don’t overlap with other labels or points. https://github.com/JasonKessler/scattertext
- quickda: Simple & Easy-to-use python modules to perform Quick Exploratory Data Analysis for any structured dataset! https://github.com/sid-the-coder/QuickDA
- exploripy: ExploriPy reduces a data analyst’s efforts significantly in the initial EDA. It is designed in a way to perform automated EDA, and statistical tests including Analysis of Variance, Chi-Square Test of Independence, Weight of Evidence, Information Value and Tukey Honest Significant Difference. It provides easy interpretation of these statistical test results, based on industry-standard assumptions. It expects a Pandas DataFrame, along with a list of categorical variables, as input. The output will be a presentable HTML document, with the result of analysis and statistical tests, represented through several interactive charts, and tables (option to download as CSV). https://github.com/exploripy/exploripy
Let us know what features would help make data-describe even more awesome, and see below for how to contribute. To make it easy, I’ve added some commonly requested features below.
Requested features
- GEO / GIS: maps and other location-based features.
- JupyterLab extension: embed data-describe closer to the popular notebook-style server.
- Full-page reporting: while Jupyter is great, it would be even more helpful to have a way to generate a full report from code.
- Long-operation warnings and progress indicators: give more feedback when interacting with larger data, indicating when a request may run long and showing its progress.
- Exported procedures: export a series of operations so they can be reused elsewhere, such as in more sophisticated machine learning models.
Accepting Contributions
It’s easy and fun to contribute. You will for sure unlock your inner MEOW!
Contributing Read Me: https://github.com/data-describe/data-describe/blob/master/CONTRIBUTING.md
Contributing guide: https://data-describe.ai/docs/master/_notebooks/developer_guide.html#Contributing-Guide
Code of Conduct: https://github.com/data-describe/data-describe/blob/master/CODE_OF_CONDUCT.md
Again, if you want more info on data-describe, please check out the website https://data-describe.ai or join the Slack channel.