02 Best practices: Sourcing data to teach your ML models


This is the second article in our Best Practices for Building ML Learning Datasets series.

In this article you’ll learn:

  • How to source and organize the data for your Learning Dataset.
  • How to ensure your data is diverse and large enough.
  • Appropriate data types and file formats for your Learning Dataset.

When you’re ready to start sourcing data to train your ML models, it’s imperative that the data is good enough for your models to learn from. You can be the most skilled data scientist in the room, with access to large volumes of data, but if that data is not ‘good data’ (the kind of data required to teach your models well), your ML project won’t succeed. Following some fundamental best practices helps ensure your models are fed ‘good data.’ This article reviews some of those practices.

Sourcing and organizing the data for your Learning Dataset

Once you have clearly articulated your business problem, prediction task and unit of analysis, you’re ready to start looking for data to teach your model. That data comes in the form of a “dataset”—which is simply a single database table or a data matrix in which every column represents a variable, or a “feature” of the dataset, and every row corresponds to a single observation or occurrence. Let’s use an example here to illustrate. If you’ve been following our other articles in this series, you’ll recognize the hospital readmission example:

[Figure: dataprep-02-dataset1.png, example dataset for the hospital readmission problem]
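To make the "columns are features, rows are observations" idea concrete, here is a minimal sketch using only the Python standard library. The column names and values are hypothetical, invented for illustration, and are not taken from the article's dataset; the unit of analysis is assumed to be one hospital admission per row.

```python
# Hypothetical column names and values, invented purely for illustration.
# Assumed unit of analysis: one row per hospital admission.
COLUMNS = [
    "admission_id",
    "age",
    "num_prior_admissions",
    "length_of_stay_days",
    "readmitted",
]

raw_rows = [
    (1001, 67, 2, 4, True),
    (1002, 54, 0, 2, False),
    (1003, 71, 5, 9, True),
    (1004, 45, 1, 3, False),
]

# Each row becomes one observation, keyed by column name.
dataset = [dict(zip(COLUMNS, row)) for row in raw_rows]

# Separate the identifier, the features, and the target the model should predict.
ID_COLUMN = "admission_id"
TARGET_COLUMN = "readmitted"
feature_columns = [c for c in COLUMNS if c not in (ID_COLUMN, TARGET_COLUMN)]

print(f"{len(dataset)} observations, {len(feature_columns)} features: {feature_columns}")
```

In practice a dataframe library would usually hold such a table; plain dictionaries are used here only to keep the sketch dependency-free.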

 

Data diversity and depth

The next important consideration for your Learning Dataset is its diversity. Keep in mind that the Learning Dataset is yours to build, meaning that the data need not come from just one table or file. You are essentially collating various data sources into a single, unique dataset that contains the features you believe are required to teach a model to make accurate predictions. Your Learning Dataset doesn’t have to be a specific size, but it must contain enough features (represented as columns) and enough rows (units of occurrence) for a model to learn from. Here are some suggested guidelines for the size of the dataset you ultimately create:

  • Start small: use data sampling techniques to work with a subset of your data first.

  • If you’re having trouble finding enough data, consider techniques discussed in Breaking the Curse of Small Datasets in Machine Learning to extract the most value from the data that is available to you.

  • No matter the size of your dataset, it’s important that your data is balanced. Otherwise, your model is fed a skewed representation of the outcomes it is meant to predict, a problem known as “class imbalance.” For example, if only a small fraction of the rows in a readmission dataset are readmissions, a model can look accurate simply by always predicting “not readmitted.” For a deeper dive on this issue, check out “Dealing with Imbalanced Classes in Machine Learning.”
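The first and last guidelines above can be sketched together: check how the target classes are distributed, then draw a smaller sample that keeps the same class mix. This is a minimal sketch using only the standard library; the helper names `class_shares` and `stratified_sample`, the `readmitted` column, and the 90/10 split are all hypothetical and chosen for illustration.

```python
import random
from collections import Counter, defaultdict


def class_shares(rows, target):
    """Return the fraction of rows that fall in each class of the target column."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def stratified_sample(rows, target, fraction, seed=0):
    """Sample the same fraction from every class so the class mix is preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    sample = []
    for members in by_class.values():
        # Keep at least one row per class so rare classes are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 900 rows not readmitted, 100 rows readmitted.
rows = [{"row_id": i, "readmitted": i >= 900} for i in range(1000)]

full_shares = class_shares(rows, "readmitted")
sample = stratified_sample(rows, "readmitted", fraction=0.1)
sample_shares = class_shares(sample, "readmitted")

print("full dataset:", full_shares)
print("10% sample:  ", sample_shares)
```

Sampling within each class, rather than across the whole table, is one simple way to keep a small sample from accidentally dropping a rare class. It does not fix an imbalance that already exists in the full dataset.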

Appropriate data types and file formats

Bearing all of the above in mind, where can you begin looking for the data you require to build your Learning Dataset? There are three major sources of data:

  • Internal to your organization: this type of data is the basis for most modeling projects. It’s usually highly relevant and, hopefully, easy to obtain.

  • External third-party data: this includes data you can source online for a fee, for example marketing survey results or credit reports from reporting agencies like Experian.

  • Public data sources: these include ‘open sources’ for data, for example census data, economic indicator data from FRED (Federal Reserve Economic Data), weather information, LinkedIn, etc.

Finally, what data formats can you use to create your Learning Dataset? Check out our Quick Help for Data Import article for details.

Once you’ve identified all of the data that you want to use for teaching a model, it’s time to collate it all into a single dataset. There are some important rules to follow when creating that dataset, and we have an article dedicated to that topic when you’re ready: Building a Learning Dataset for your ML model.
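As a rough illustration of collating sources, the sketch below attaches columns from a second source to an internal table by matching on a shared key, keeping one row per original observation. It uses only the standard library. The `left_join` helper, the `region` key, and all values are hypothetical, and this is not a statement of the rules covered in the article referenced above.

```python
def left_join(base_rows, lookup_rows, key):
    """Add the lookup columns to every base row; unmatched rows get None."""
    lookup = {row[key]: row for row in lookup_rows}
    extra_columns = [c for c in lookup_rows[0] if c != key] if lookup_rows else []
    joined = []
    for row in base_rows:
        match = lookup.get(row[key], {})
        merged = dict(row)  # copy so the source rows are left unchanged
        for column in extra_columns:
            merged[column] = match.get(column)
        joined.append(merged)
    return joined


# Hypothetical internal data: one row per admission.
admissions = [
    {"admission_id": 1001, "region": "A", "readmitted": True},
    {"admission_id": 1002, "region": "B", "readmitted": False},
    {"admission_id": 1003, "region": "C", "readmitted": True},
]

# Hypothetical public data keyed by region (values invented for illustration).
region_stats = [
    {"region": "A", "median_income": 50000},
    {"region": "B", "median_income": 60000},
]

learning_dataset = left_join(admissions, region_stats, key="region")

for row in learning_dataset:
    print(row)
```

A left join is used so the number of rows, and therefore the unit of analysis, stays the same after the merge. Region "C" has no match in the second source, so its `median_income` is left as `None` rather than the row being dropped.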

Last update: 06-29-2020 04:05 PM