This is the 3rd article in our Best Practices for Building ML Learning Datasets series.
In this article you’ll learn:
When you are ready to start building your Learning Dataset, ask yourself:
Let’s look at how to answer these questions for concrete examples.
Notice that we have a time component in both concrete examples above. Also notice that in the second example, we are defining a window of time, not a specific moment in time.
However, what you want to predict may not always require a time component. For example, a classic linear regression problem that forecasts the future cost of real estate—at no specific time in the future, just in the ongoing window of time—does not require a “for when” to be answered. As a best practice, though, it’s always good to ask yourself the “when” of your prediction to ensure you’re very precise in understanding and articulating the problem you want to solve. All of these dimensions of time are important to consider as you begin to assemble the data for teaching your model.
With the three questions answered, you’re ready to start building your ML Learning Dataset.
When you’re ready, the following steps provide a guide for how to start building your Learning Dataset.
This article primarily focuses on step 2 in the process outlined above. Refer to the other articles in this series that detail the remaining steps above.
When you’re ready to start sourcing data to teach your ML models, it’s imperative that you source good data from which your models can learn. You can be the most skilled data scientist in the room with access to a ton of data, but if your data isn’t “good data”—meaning the kind of data that’s required for your models to learn well—then your ML project won’t succeed. For more details on this important first step, refer to the second article in this series, Best practices for sourcing data to teach your ML models.
After you’ve sourced all of the data you want to use for your business problem, Xavier Conort, DataRobot’s Chief Data Scientist and one of the world’s leading data scientists, advises you to create a “primary table,” which he defines as:
“a cleaned version of a learning example... it should exclude all of the information that is not available at prediction time—except the Target Variable (column).”
Applying Xavier’s advice to the hospital readmission example, the first step is to source a dataset in which you have a solid learning example or unit of analysis. If you read the first article in this series, Welcome to data science, you’ll remember that a unit of analysis is the “for whom” or “what” of your prediction. So for the hospital readmission primary table, each row corresponds to a single patient and each column corresponds to features for each patient.
The next step is to create a “cleaned” version of this dataset: if there are columns (features) in the dataset providing data that would only be known after the prediction time, then remove those columns—except for the target feature, which is also referred to as the “Target Variable.”
In the hospital readmission example, the primary table would look something like this—one row for each unique patient instance:
Next, following Xavier’s advice, identify any data that needs to be removed because it is only available after the prediction point. This kind of data is known as target leakage because it will distort your model’s ability to learn correctly. Put another way, leakage is like cheating with your learning data because you’re providing the model with a feature whose value cannot be known at prediction time. We’ll cover target leakage in more detail later in this article.
For our hospital readmission data, notice there are two columns of data that capture information about each patient after the patient has returned to the hospital for readmission. These columns constitute leakage and so must be removed from our Primary Table:
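As a rough illustration, here’s how removing those leakage columns might look in pandas. The table and column names (`patient_id`, `readmit_diagnosis`, `readmit_length_of_stay`) are hypothetical stand-ins for the columns described above:

```python
import pandas as pd

# Hypothetical primary table for the hospital readmission example.
primary = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "admit_date": ["2023-01-05", "2023-01-10", "2023-02-01"],
    "num_procedures": [2, 1, 4],
    # These two columns describe the readmission visit itself, so their
    # values cannot be known at prediction time -- they are target leakage.
    "readmit_diagnosis": ["infection", None, "cardiac"],
    "readmit_length_of_stay": [3, None, 5],
})

# Drop the leakage columns to produce the cleaned Primary Table.
leakage_columns = ["readmit_diagnosis", "readmit_length_of_stay"]
primary = primary.drop(columns=leakage_columns)
```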
IMPORTANT NOTE: What if one of the columns in the data already provides the answer (the Target Variable) for the question you are asking? Sometimes you may actually have data that contains the answer you want to predict—especially if the data you have is extracted from a table of past events. If the Target Variable already exists in your data, then you’re in luck and won’t need to perform the operational step(s) to create it. In the following example, the Target Variable already exists in the initial data we sourced for hospital admissions—this column has the data we need to answer, for each patient: was the patient readmitted within 30 days of initial discharge?
If you don’t find Target Variable data in any of the data you source, not to worry. It’s common practice to actually create that variable yourself after you’ve enriched the Primary Table. We’ll cover that step a little later in this section.
Here are some other tips for building a good Primary Table:
When your clean Primary Table is complete, start enriching it with secondary tables—datasets that include the additional features you think are fundamental to teaching the models. Again, using the hospital readmission example, you may have datasets that include additional, important details about many, if not all, of the patients in your Primary Table—for example, their age, gender, current medications, etc. In this case, join the data on a key common to both your Primary Table and the secondary table—for example, Patient ID. In this way, you are building out from the Primary Table and enhancing it with new features.
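A minimal sketch of this enrichment step in pandas, assuming a hypothetical `demographics` secondary table keyed on `patient_id`:

```python
import pandas as pd

# Cleaned Primary Table (hypothetical).
primary = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "num_procedures": [2, 1, 4],
})

# Secondary table with additional patient details (hypothetical).
demographics = pd.DataFrame({
    "patient_id": [101, 102, 104],
    "age": [54, 67, 41],
    "gender": ["F", "M", "F"],
})

# Join on the common key. A left join keeps every patient in the
# Primary Table, even if the secondary table is missing some of them.
enriched = primary.merge(demographics, on="patient_id", how="left")
```

The left join preserves the Primary Table’s rows; patients absent from the secondary table simply get missing values, which you can handle later during data prep.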
Additionally, as you enrich the Primary Table with more data, you may find that you can generate additional desired features (columns) by performing operations such as sums, differences, ratios, or cosine similarity on existing columns within the dataset.
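For example, a summed feature and a ratio feature might be derived from existing columns like this (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical columns from the enriched table.
df = pd.DataFrame({
    "num_inpatient_visits": [1, 0, 3],
    "num_outpatient_visits": [4, 2, 0],
    "total_charges": [1200.0, 300.0, 5400.0],
})

# A summed feature and a ratio feature computed from existing columns.
df["total_visits"] = df["num_inpatient_visits"] + df["num_outpatient_visits"]
df["charge_per_visit"] = df["total_charges"] / df["total_visits"]
```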
Finally, when creating a Primary Table and enriching it, you should not prep your data as you enrich it. The objective of the enrichment step is to generate more features for the Learning Dataset—not to clean the data up as you go.
Once you have what you believe is a good dataset with enough rows, with enough variety of examples, and enough essential features (columns), it’s time to create the Target Variable column, if it doesn’t already exist in the data.
Sometimes the target is created through a simple lookup against another table that has the data. Sometimes it needs to be generated through a calculated column. And sometimes it needs to be generated by a more complex SQL script—for example, in the customer churn scenario, where we want to make a churn prediction for a three-week window of time. For the hospital readmission example, in which you have both an admit date and a readmit date, you can create the Target Variable column with logic like: if the readmit date is less than thirty days after the admit date, then Y; otherwise (including patients who were never readmitted), N.
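Applying that thirty-day logic in pandas might look like the following sketch (the dates are hypothetical). Note that a patient with no readmit date naturally falls into the “N” case:

```python
import pandas as pd

visits = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "admit_date": pd.to_datetime(["2023-01-05", "2023-01-10", "2023-02-01"]),
    # Patient 102 was never readmitted (missing date becomes NaT).
    "readmit_date": pd.to_datetime(["2023-01-20", None, "2023-03-15"]),
})

# Days between initial admission and readmission (NaN if never readmitted).
days_to_readmit = (visits["readmit_date"] - visits["admit_date"]).dt.days

# A missing value compares as False, so never-readmitted patients get "N".
visits["readmitted_30d"] = (days_to_readmit < 30).map({True: "Y", False: "N"})
```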
After you have pulled all of your data together into a single Learning Dataset, that’s when you want to begin your data prep and exploration. Your data prep steps will likely include things like standardizing date formats and removing unwanted observations. Then you’ll perform your own exploratory analysis on the data to gain additional insights into how best you can finally prepare the data before you start feeding it to the models.
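A small sketch of what those prep steps might look like in pandas, using hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [101, 102, 102, 104],
    "admit_date": ["2023-01-05", "2023-01-10", "2023-01-10", None],
})

# Standardize the date column into a proper datetime dtype.
df["admit_date"] = pd.to_datetime(df["admit_date"])

# Remove unwanted observations: exact duplicate rows and rows
# missing an admit date.
df = df.drop_duplicates().dropna(subset=["admit_date"])
```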
The next article in this best practices series, Data Prep and Exploratory Analysis on your Learning Dataset, takes a closer look at these exercises.
The final article in this best practice series, Target Leakage: how to recognize and prevent it, provides guidance on how to avoid introducing data leakage into your Learning Dataset.