Introducing data-describe: an awesome package for exploring data

Brian Ray
5 min read · Nov 10, 2020

data-describe is an exciting new open-source project for Exploratory Data Analysis (EDA). Read the press release here; the GitHub project can be found at https://github.com/data-describe/data-describe.

Released in beta in late October 2020 by the Data Science Team at Maven Wave Partners, the effort was led by Yuan Tang, a well-known open source contributor, author, and leader (XGBoost, Apache MXNet, Kubeflow, TensorFlow, and ElasticDL). Yuan comments:

the data-describe project fills a gap in almost every data scientist’s tool belt.

In this article I’ll cover: the story behind this awesome package; what features are available today to give you an edge; similar packages out there; what I would do to improve the package; and finally what you (yes, you!) can do to help.

The Story of data-describe

From the data-describe website:

data-describe is a Python toolkit for inspecting, illuminating, and investigating enormous amounts of unknown data with mixed relationships.

There are a handful of problems and annoyances in the world of EDA that can hinder your productivity, and data-describe sets out to resolve them:

  • Packaging Hell
  • Missing or disjointed documentation
  • Overly repetitive tasks
  • Large data sizes
  • Handling sensitive data
  • Disjointed community
  • Missing examples

You use packages like sklearn, pandas, seaborn, gensim, and shapley all the time, but then something happens: a version changes, and welcome to Packaging Hell! data-describe solves this problem by shipping with a frozen set of dependencies that have been vetted and thoroughly tested by the project team before each release.
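To make that concrete, here is a hypothetical requirements file showing the difference. The individual version numbers below are illustrative only, not data-describe's actual pins:

```text
# requirements.txt: instead of pinning every EDA dependency yourself...
#   scikit-learn==0.23.2
#   seaborn==0.11.0
#   gensim==3.8.3
# ...you pin one package whose maintainers have already vetted and
# tested the whole dependency set together (version is illustrative):
data-describe==0.1.0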

It’s surprising how many really great open source projects have really poor documentation. data-describe, by contrast, has some really great documentation in its User Guide.

Ever find yourself copying the same cell of code from notebook to notebook, a cell that does the exact same thing every time? Consider the work involved in manually mapping schemas, counting nulls, and all those other repetitive chores. data-describe follows the DRY (Don’t Repeat Yourself) principle and eliminates these unnecessary repetitive tasks.
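For a sense of the boilerplate being eliminated, here is a minimal sketch of the kind of schema-and-null summary many of us keep rewriting by hand. The function and names below are illustrative, not data-describe's API:

```python
# A toy version of the repetitive EDA boilerplate data-describe automates.
# Illustrative only -- not data-describe's actual implementation.

def summarize(rows):
    """Summarize a list of dict records: per-column null counts and types."""
    summary = {}
    for row in rows:
        for col, value in row.items():
            info = summary.setdefault(col, {"nulls": 0, "types": set()})
            if value is None:
                info["nulls"] += 1
            else:
                info["types"].add(type(value).__name__)
    return summary

rows = [
    {"age": 34, "city": "Chicago"},
    {"age": None, "city": "Boston"},
    {"age": 29, "city": None},
]
print(summarize(rows))
```

With data-describe, a one-line call replaces this hand-rolled cell, and the output comes back formatted and visualized.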

Large data sizes can slow you down: bigger than your laptop can handle, and even with cloud computing (data-describe supports GCP natively), big data can take a long time to compute and display. data-describe addresses this by building on Modin (on top of Apache Arrow via Ray) or Dask, and by visualizing high-dimensional data using PCA and t-SNE.
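The idea behind those PCA visualizations is worth seeing in miniature. This is a hand-rolled sketch using NumPy's SVD, not data-describe's own code: it projects high-dimensional points onto their top two principal components so they become plottable:

```python
import numpy as np

# Hand-rolled PCA sketch: reduce 10-dimensional points to 2 dimensions
# for plotting. Illustrates the idea behind data-describe's
# dimensionality-reduced visualizations; not its implementation.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # 200 samples, 10 features

Xc = X - X.mean(axis=0)          # center each feature
# SVD of the centered data: rows of Vt are the principal directions,
# ordered by decreasing explained variance.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T               # project onto the top-2 components

print(X2.shape)                  # a (200, 2) array, ready to scatter-plot
```

data-describe wires this kind of reduction directly into its plots, so you get a 2-D view of wide data without writing the linear algebra yourself.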

Then there is the issue of privacy when machine learning datasets contain sensitive data. With data-describe, the EDA process can deal with that sensitive data up front: it identifies and handles things like personally identifiable information (PII).
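As a toy illustration of what "identifies and handles" can mean, here is a regex-based redaction sketch. data-describe's actual sensitive-data handling is more sophisticated; these patterns and names are illustrative only:

```python
import re

# Toy PII detection/redaction during EDA. Illustrative only --
# data-describe's sensitive-data features go well beyond two regexes.

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace each matched PII value with a <TYPE> placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}>", text)
    return text

print(redact("Contact jane.doe@example.com, SSN 123-45-6789"))
```

Doing this at EDA time, before data fans out into notebooks and models, is exactly the workflow point data-describe targets.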

While fitting in nicely with other popular open-source projects like TensorFlow, Kubeflow, and TFX, data-describe has its own community that uses the tools every day. The internal scripts that data science teams tend to carry from project to project are not open source, and they have no community beyond whoever happens to use them. Open source is a proven model, and that is why Maven Wave has shared data-describe.

One of the most potent knowledge-sharing opportunities comes from the fact that EDA generally does not have many easy-to-follow examples. data-describe ships with its own examples, and a companion project, Awesome Data Science Models, provides a library of more extensive examples (Lending Club, Census Income, Black Friday, Cellular Imaging, and more), all of which use data-describe for EDA.

Similar Projects

data-describe is not the only project out there covering some aspect of EDA or data inspection. If you are really interested in this topic, please check out these projects as well:

Let us know what features would help make data-describe even more awesome, and check out the section below on how to contribute. To make it easy, I’ve listed some commonly requested features below.

Requested features

GEO / GIS: maps and other location-based features.

JupyterLab extension: embed data-describe closer to the popular notebook-style server.

Full-page reporting: while Jupyter is great, it would be even more helpful to have a way to generate a full report directly from code.

Long-operation warnings and progress indicators: more informative feedback when interacting with larger data, flagging requests that may run long and showing their progress.

Exported procedures: export a series of operations so they can be reused elsewhere, such as in more sophisticated machine learning models.

Accepting Contributions

It’s easy and fun to contribute. You will for sure unlock your inner MEOW!

Contributing Read Me: https://github.com/data-describe/data-describe/blob/master/CONTRIBUTING.md

Contributing guide: https://data-describe.ai/docs/master/_notebooks/developer_guide.html#Contributing-Guide

Code of Conduct: https://github.com/data-describe/data-describe/blob/master/CODE_OF_CONDUCT.md

Again, if you want more info on data-describe, please check out the website https://data-describe.ai or join the Slack channel.


Brian Ray

Long time Python-isto, Inquisitor, Solver, Data Science in Cognitive/AI/Machine Learning Frequent Flyer