04 Data Prep & Exploratory Analysis in your Learning Dataset

This is the 4th article in our Best Practices for Building ML Learning Datasets series.

In this article you’ll learn:

  • When your Learning Dataset should be cleaned.
  • Examples of data that should be cleaned.
  • Data Prep Protips.

If you read the first article in this series, you’ll remember that 80% of a data scientist’s time is spent finding, cleaning, and reorganizing data. But there’s great news: DataRobot has a data prep product, DataRobot Paxata, that empowers you to significantly reduce the amount of time you spend preparing your data.

The purpose of this article is to help you quickly assess whether you’re ready to start prepping your Learning Dataset, and to point you to more help content when you’re ready to use DataRobot Paxata for your prep work.


When should my data be cleaned?

Your Learning Dataset should be cleaned before you begin any feature engineering. With DataRobot Paxata you can clean your data and then prep it to add and remove features, all in a single project.


What data should be cleaned?

If you’re looking for a starting point, here are a few common cases where you’ll need to clean your data.

  • Deduplication: you have a particular value represented in various ways and you need to standardize on one value, for example “New York”, “NY”, “New York City”, “NYC”, etc.

  • Remove leading values: for example, you need to remove leading zeros for Eastern US zip codes.

  • Standardizing date formats: for example, you have a dates column but the values in that column are represented in various formats, such as “mm/dd/yyyy” and “dd/mm/yy”.
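
To make the three cases above concrete, here is a minimal sketch in plain Python. It is not DataRobot Paxata functionality; the alias table, the list of date formats, and all function names are assumptions made for illustration. In practice the canonical values and accepted formats would come from your own data.

```python
from datetime import datetime

# Hypothetical mapping from observed variants to one canonical value.
CITY_ALIASES = {
    "new york": "New York",
    "ny": "New York",
    "new york city": "New York",
    "nyc": "New York",
}

# Assumed candidate formats, tried in order. An ambiguous value such as
# "03/04/2020" is parsed by the first format that accepts it, so order matters.
DATE_FORMATS = ("%m/%d/%Y", "%d/%m/%y")


def standardize_city(value: str) -> str:
    """Map known variants to the canonical name; leave unknown values as-is."""
    cleaned = value.strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)


def strip_leading_zeros(value: str) -> str:
    """Remove leading zeros, keeping a single "0" if nothing else remains."""
    return value.lstrip("0") or "0"


def standardize_date(value: str) -> str:
    """Return the date as yyyy-mm-dd, or raise ValueError if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")
```

Raising an error on an unrecognized date, rather than guessing, keeps bad values visible so you can decide how to handle them.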

Data Prep Protips before you begin your prep

  • Consider before you aggregate. When you aggregate rows in your data, you are actually losing signals from the detailed records. If you think you must aggregate, then take the opportunity to use feature engineering to represent the data in another way and restore some of those lost signals. For example, you can add sums, means, standard deviations, etc. to create new features.

  • Consider outliers carefully before removing them from your data. Ask yourself whether those observations are valuable for the model to learn from. Ultimately, you want to optimize for features that are important at prediction time because this results in faster computations with lower memory consumption.
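
The aggregation tip above can be sketched as follows: when detailed rows are collapsed to one row per group, summary statistics are kept as new features so that some of the lost signal survives. This is an illustrative sketch in plain Python, not a DataRobot Paxata feature; the function name, the column names, and the sample transactions are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_features(rows, key, value):
    """Collapse rows to one entry per group, keeping summary statistics."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])

    features = {}
    for group, values in groups.items():
        features[group] = {
            "count": len(values),
            "sum": sum(values),
            "mean": mean(values),
            # Sample standard deviation needs at least two values;
            # a single-row group is assigned 0.0 here by assumption.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return features


# Hypothetical detailed records: one row per transaction.
transactions = [
    {"customer_id": "a", "amount": 10.0},
    {"customer_id": "a", "amount": 30.0},
    {"customer_id": "b", "amount": 5.0},
]

features = aggregate_features(transactions, "customer_id", "amount")
```

Each customer now has a single row, but the spread of the original amounts is still represented through the `std` feature rather than being discarded.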

Ready to start prepping your data? Visit DataRobot Paxata and Data Prep for Data Science where you’ll learn how to use our data prep product to:

Last update: 06-29-2020 04:07 PM