03 Building a "Learning Dataset" for your ML model


This is the 3rd article in our Best Practices for Building ML Learning Datasets series.

In this article you’ll learn:

  • Three important questions to ask yourself before building your Learning Dataset.
  • The steps and process for building a Learning Dataset.

Important questions to ask before you begin assembling the data

When you are ready to start building your Learning Dataset, ask yourself:

  1. What do I want to predict?
  2. For what or whom?
  3. When do I want to make this prediction?

Let’s look at how to answer these questions for concrete examples.

  • Hospital readmission: if you’ve been following this series of articles, you are already familiar with the hospital readmission example. You want to predict whether a patient will be readmitted to the hospital for an issue related to diabetes (what), for diabetic patients (for whom), at the time a patient is discharged from the hospital (when).

  • Customer churn: anyone working with customer renewals is aware of the importance of anticipating customer churn rates for service renewals. In this example you want to predict the churn probability (what), for customers who subscribe to a SaaS offering (for whom), during the next three weeks (when).

Notice that we have a time component in both concrete examples above. Also notice that in the second example, we are defining a window of time, not a specific moment in time. 
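
One way to keep yourself honest about the three questions is to write the answers down in a small structure before touching any data. The sketch below is only an illustration: the `PredictionSpec` class and its field names are hypothetical, and the "prediction point" for the churn example is an assumption, since the text only states the three-week window.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class PredictionSpec:
    """Answers to the three questions, recorded before assembling data."""
    target: str                  # What do I want to predict?
    unit_of_analysis: str        # For what or whom?
    prediction_point: str        # When is the prediction made?
    window: Optional[timedelta]  # Window of time the prediction covers, if any


readmission = PredictionSpec(
    target="readmission for a diabetes-related issue",
    unit_of_analysis="diabetic patient",
    prediction_point="at hospital discharge",
    window=timedelta(days=30),
)

churn = PredictionSpec(
    target="churn probability",
    unit_of_analysis="customer subscribed to a SaaS offering",
    prediction_point="scoring date (assumed)",
    window=timedelta(weeks=3),
)
```

A problem with no time component, like the real estate example that follows, would simply leave `window` as `None`.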

However, what you want to predict may not always require a time feature. For example, a classic linear regression problem that forecasts the future cost of real estate (at no specific time in the future, just in the ongoing window of time) does not require a “for when” to be answered. But as a best practice, it’s always good to ask yourself the “when” for your prediction to ensure you’re very precise in understanding and articulating the problem you want to solve. All of these dimensions regarding time are important to consider as you begin to assemble the data for teaching your model.

With the three questions answered, you’re ready to start building your ML Learning Dataset.

Steps for building your ML Learning Dataset

When you’re ready, the following steps provide a guide for how to start building your Learning Dataset.

  1. Find appropriate data. 
  2. Merge data into a single table to create your Primary Table, enrich it with secondary tables, and create your Target Variable.
  3. Conduct exploratory data analysis.
  4. Remove any target leakage.

This article primarily focuses on step 2 in the process outlined above. Refer to the other articles in this series that detail the remaining steps above.

1. Find appropriate data

When you’re ready to start sourcing data to teach your ML models, it’s imperative that you source good data from which your models can learn. You can be the most skilled data scientist in the room with access to a ton of data, but if your data isn’t “good data,” meaning the kind of data that’s required for your models to learn well, then your ML project won’t succeed. For more details on this important first step, refer to the second article in this series, Best practices for sourcing data to teach your ML models.

2. Merge data into a single “Primary Table”, enrich with secondary tables, and create your Target Variable


2a) The Primary Table for building your ML Learning Dataset

After you’ve sourced all of the data you want to use for your business problem, Xavier Conort, DataRobot’s Chief Data Scientist and one of the world’s leading data scientists, advises you to create a “primary table,” which he defines as: 

“a cleaned version of a learning example... it should exclude all of the information that is not available at prediction time--except the Target Variable (column).”

Applying Xavier’s advice to the hospital readmission example, the first step is to source a dataset in which you have a solid learning example or unit of analysis. If you read the first article in this series, Welcome to data science, you’ll remember that a unit of analysis is the “for whom” or “what” of your prediction. So for the hospital readmission primary table, each row corresponds to a single patient and each column corresponds to features for each patient.

The next step is to create a “cleaned” version of this dataset: if any columns (features) in the dataset provide data that would only be known after the prediction time, remove those columns. The one exception is the target feature, which is also referred to as the “Target Variable.”
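
As a minimal sketch of this cleaning step, the snippet below drops every column flagged as "known only after prediction time" while always keeping the target. It uses plain Python lists of dictionaries so it runs without any dependencies; a dataframe library would do the same job. The function name and all column names (`readmit_diagnosis`, `readmit_length_of_stay`, and so on) are hypothetical and are not taken from the actual dataset.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def drop_unavailable_columns(
    rows: Iterable[Row], post_prediction_columns: Iterable[str], target: str
) -> List[Row]:
    """Remove columns known only after prediction time, but always keep the target."""
    to_drop = set(post_prediction_columns) - {target}
    return [{col: val for col, val in row.items() if col not in to_drop} for row in rows]


# Hypothetical rows; the column names are illustrative only.
raw_rows = [
    {"patient_id": "P1", "discharge_date": "2020-01-10",
     "readmit_diagnosis": "hypoglycemia", "readmit_length_of_stay": 3, "readmitted": "Y"},
    {"patient_id": "P2", "discharge_date": "2020-02-02",
     "readmit_diagnosis": None, "readmit_length_of_stay": None, "readmitted": "N"},
]

primary_table = drop_unavailable_columns(
    raw_rows,
    post_prediction_columns=["readmit_diagnosis", "readmit_length_of_stay", "readmitted"],
    target="readmitted",
)
```

Passing the target in the list of post-prediction columns is deliberate here: the target is also only known after the fact, and the function shows that it survives the cleaning anyway.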

In the hospital readmission example, the primary table would look something like this—one row for each unique patient instance:


Next, following Xavier’s advice, identify any data that needs to be removed because it only becomes available after the prediction time. This kind of data is known as target leakage, and it will skew your model’s ability to learn correctly. Put another way, leakage is kind of like cheating with your learning data because you’re providing the model with a feature whose value cannot be known at prediction time. We cover target leakage in more detail in the final article in this series.

For our hospital readmission data, notice there are two columns of data that capture information about each patient after the patient has returned to the hospital for readmission. These columns constitute leakage and so must be removed from our Primary Table:


IMPORTANT NOTE: what if one of the columns in the data already provides the answer (the Target Variable) to the question you are asking of the data? Sometimes you do have data that contains the answer you want to predict, especially if the data is extracted from a table of past events. If the Target Variable already exists in your data, then you’re in luck and won’t need to perform the operational step(s) to create it. In the following example, the Target Variable already exists in the initial data we sourced for hospital readmissions. This column answers, for each patient: was the patient readmitted within 30 days of initial discharge?


If you don’t find Target Variable data in any of the data you source, not to worry. It’s common practice to actually create that variable yourself after you’ve enriched the Primary Table. We’ll cover that step a little later in this section.

Here are some other tips for building a good Primary Table:

  • Use a distribution of example rows that reflects the distribution you expect at prediction time. In other words, use real-world data that reflects the real-world problem your prediction aims to solve.
    • Example: if you have a classification problem you are aiming to solve regarding whether or not a customer will purchase a particular product, then you need healthy examples of customers who did and customers who did not make a purchase. Additionally, if there is a seasonal component to the product, then you need to have enough event (purchase) history to cover multiple season cycles.

  • Avoid ‘example overlap’—for the hospital readmission example, there should be only one row per patient—not multiple instances of the same patient within the same time window.

  • Don’t perform fill-downs or aggregations on the data. In the screenshot example above, notice there are blank cells with missing information. By attempting to resolve those blanks, you are actually preventing the model from learning how to associate this missing information with other variables in the dataset. And as you continue enriching your Primary Table, the missing information may provide key correlations with other data you join into the Primary Table.

  • If there is a time or seasonal component to your prediction (like “within 30 days” in the hospital example, or “within three weeks” in the customer churn example), then ensure there is enough history in the data to provide your model with enough examples.

  • If your Primary Table does not have any sort of date feature and you anticipate working with date-type data, then it's advisable to create one in your Primary Table. This lets you know concretely the point beyond which leakage can occur. A date feature also gives you the flexibility to do computations with the date.

  • The Primary Table, and in fact any data that’s used for teaching models, should not contain Target Leakage. This is such an important topic that we’ve devoted an entire best practices article to it, Target Leakage: how to recognize and prevent it.
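
Several of the tips above can be checked with a few lines of code before you go further. The sketch below is a hedged illustration in plain Python: it reports the target distribution, flags keys that appear on more than one row (example overlap), and counts missing values without filling them in. The helper names and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, object]


def target_distribution(rows: List[Row], target: str) -> Dict[object, float]:
    """Share of rows per target value, to compare with what you expect at prediction time."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def duplicated_keys(rows: List[Row], key: str) -> List[object]:
    """Keys that appear on more than one row, a sign of example overlap."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def missing_counts(rows: List[Row]) -> Dict[str, int]:
    """Count blanks per column. The blanks are reported, not filled in."""
    counts: Counter = Counter()
    for row in rows:
        for col, val in row.items():
            if val is None or val == "":
                counts[col] += 1
    return dict(counts)


# Hypothetical sample rows for illustration.
sample_rows = [
    {"patient_id": "P1", "weight": 80, "readmitted": "Y"},
    {"patient_id": "P2", "weight": None, "readmitted": "N"},
    {"patient_id": "P2", "weight": 72, "readmitted": "N"},
    {"patient_id": "P3", "weight": 65, "readmitted": "Y"},
]

distribution = target_distribution(sample_rows, "readmitted")
overlaps = duplicated_keys(sample_rows, "patient_id")
blanks = missing_counts(sample_rows)
```

In this sample, patient `P2` appears twice, so you would need to decide which row is the right learning example before continuing.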

2b) Enrich with secondary tables

When your clean Primary Table is complete, start enriching it with secondary tables—which are datasets that include the additional features you think are fundamental to teaching the models. Again, using the hospital readmission example, you may have datasets that include additional, important details about many, if not all, of the patients in your Primary Table—for example their age, gender, current medications, etc. In this case, you should join the data on a common key for both your Primary and secondary tables—for example Patient ID. In this way, you are building out your data from the Primary and enhancing it with new features.
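
The join described above can be sketched as a left join: every row of the Primary Table is kept, and columns from the secondary table are attached where the key matches. This is a minimal illustration in plain Python that assumes the secondary table has at most one row per key and shares no column names with the Primary Table other than the key. The `left_join` helper and the sample data are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def left_join(primary: List[Row], secondary: List[Row], key: str) -> List[Row]:
    """Keep every primary row; attach secondary columns where the key matches."""
    extra_cols = sorted({col for row in secondary for col in row} - {key})
    lookup = {row[key]: row for row in secondary}
    joined = []
    for row in primary:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in extra_cols:
            # Unmatched rows get None: the gap stays visible rather than being filled.
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


# Hypothetical tables keyed on patient_id.
primary = [
    {"patient_id": "P1", "readmitted": "Y"},
    {"patient_id": "P2", "readmitted": "N"},
    {"patient_id": "P3", "readmitted": "N"},
]
demographics = [
    {"patient_id": "P1", "age": 54, "gender": "F"},
    {"patient_id": "P3", "age": 61, "gender": "M"},
]

enriched = left_join(primary, demographics, key="patient_id")
```

A left join is used so that the number of rows in the Primary Table never changes during enrichment. Patient `P2` has no demographic record, so its new columns stay empty.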

Additionally, as you enrich the Primary with more data, you may also find that you can generate additional desired features (columns) by performing operations such as sums, subtraction, division, or cosine similarity on columns within the dataset.
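
As a small, hedged example of such computed features, the sketch below derives a sum and a ratio from existing columns. All column names are hypothetical. Missing inputs (and division by zero) produce a missing result rather than a guessed value, in keeping with the advice not to fill in blanks at this stage.

```python
from typing import Dict, Optional

Row = Dict[str, object]


def safe_sum(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Sum two values; stay missing if either input is missing."""
    if a is None or b is None:
        return None
    return a + b


def safe_ratio(num: Optional[float], den: Optional[float]) -> Optional[float]:
    """Divide two values; stay missing if an input is missing or the denominator is zero."""
    if num is None or den is None or den == 0:
        return None
    return num / den


def add_derived_features(row: Row) -> Row:
    """Return a copy of the row with two illustrative computed columns."""
    out = dict(row)
    out["total_prior_visits"] = safe_sum(
        row.get("inpatient_visits"), row.get("outpatient_visits")
    )
    out["medications_per_day"] = safe_ratio(
        row.get("num_medications"), row.get("days_in_hospital")
    )
    return out


complete = add_derived_features(
    {"patient_id": "P1", "inpatient_visits": 1, "outpatient_visits": 2,
     "num_medications": 12, "days_in_hospital": 4}
)
incomplete = add_derived_features(
    {"patient_id": "P2", "inpatient_visits": None, "outpatient_visits": 2,
     "num_medications": 5, "days_in_hospital": 0}
)
```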

Finally, when creating a Primary Table and enriching it, you should not prep your data as you enrich it. The objective of the enrichment step is to generate more features for the Learning Dataset—not to clean the data up as you go.

2c) Create your Target Variable, if it’s not already in your data

Once you have what you believe is a good dataset with enough rows, with enough variety of examples, and enough essential features (columns), it’s time to create the Target Variable column, if it doesn’t already exist in the data.

Sometimes the target is created through a simple lookup operation with another table that has the data. Sometimes it needs to be generated through a calculated column operation. And sometimes it needs to be generated by a complex SQL script; in the customer churn example, for instance, the target must capture whether a customer churned within a three-week window of time. For the hospital readmission example, in which you have both an admit date and a readmit date, you can create a Target Variable column based on the logic: if the readmit date is within thirty days of the admit date, then Y; if the readmit date is more than thirty days after the admit date, or there is no readmit date, then N.
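
The readmission logic above can be sketched as a calculated column. This is an illustration under stated assumptions, not a definitive rule: the function name is hypothetical, a readmission on exactly the thirtieth day is counted as Y, a patient with no readmit date is counted as N, and a readmit date earlier than the admit date is treated as a data error.

```python
from datetime import date
from typing import Optional


def readmitted_within(admit: date, readmit: Optional[date], days: int = 30) -> str:
    """Return "Y" if the patient was readmitted within `days` of the admit date, else "N"."""
    if readmit is None:
        # Assumption: no readmit date means the patient was never readmitted.
        return "N"
    elapsed = (readmit - admit).days
    if elapsed < 0:
        raise ValueError("readmit date is earlier than admit date")
    # Assumption: the boundary day itself counts as a readmission.
    return "Y" if elapsed <= days else "N"


soon = readmitted_within(date(2020, 1, 1), date(2020, 1, 20))
late = readmitted_within(date(2020, 1, 1), date(2020, 3, 1))
never = readmitted_within(date(2020, 1, 1), None)
boundary = readmitted_within(date(2020, 1, 1), date(2020, 1, 31))
```

Whichever boundary rule you choose, apply it the same way to every row so the Target Variable means the same thing throughout the Learning Dataset.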

3. Perform your data prep steps and exploratory analysis

After you have pulled all of your data together into a single Learning Dataset, that’s when you want to begin your data prep and exploration. Your data prep steps will likely include things like standardizing date formats and removing unwanted observations. Then you’ll perform your own exploratory analysis on the data to gain additional insights into how best you can finally prepare the data before you start feeding it to the models.
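
As one example of a data prep step, here is a minimal sketch for standardizing date formats. The list of accepted formats is an assumption (it reads ambiguous dates month-first), and the function name is hypothetical. Blank values are left missing, and anything unrecognized raises an error so that it is looked at instead of silently guessed.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats, tried in order. Ambiguous dates are read month-first.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y")


def standardize_date(value: Optional[str]) -> Optional[str]:
    """Convert a date string in any known format to ISO 8601 (YYYY-MM-DD)."""
    if value is None or not value.strip():
        return None  # keep blanks as missing
    text = value.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


iso_already = standardize_date("2020-06-29")
slashes = standardize_date("06/29/2020")
dashes = standardize_date(" 06-29-2020 ")
blank = standardize_date("   ")
```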

The next article in this best practices series, Data Prep and Exploratory Analysis on your Learning Dataset, takes a closer look at these exercises.

4. Find and remove Target Leakage

The final article in this best practices series, Target Leakage: how to recognize and prevent it, provides guidance on how to avoid introducing data leakage into your Learning Dataset.

Last update: 06-29-2020 04:06 PM