How to source and organize the data for your Learning Dataset.
How to ensure your data is diverse and large enough.
Appropriate data types and file formats for your Learning Dataset.
When you’re ready to start sourcing data to train your ML models, it’s imperative that the data be good. You can be the most skilled data scientist in the room, with access to a ton of data, but if it is not ‘good data’ (meaning the kind of data required to teach your models well), your ML project won’t succeed. This article reviews some fundamental best practices you can follow to ensure your models are fed ‘good data.’
Sourcing and organizing the data for your Learning Dataset
Once you have clearly articulated your business problem, prediction task and unit of analysis, you’re ready to start looking for data to teach your model. That data comes in the form of a “dataset”—which is simply a single database table or a data matrix in which every column represents a variable, or a “feature” of the dataset, and every row corresponds to a single observation or occurrence. Let’s use an example here to illustrate. If you’ve been following our other articles in this series, you’ll recognize the hospital readmission example:
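As a rough sketch of that row-and-column layout in Python (the column names and values below are invented purely for illustration and are not taken from the series), a dataset can be pictured as a list of rows that all share the same set of columns:

```python
# Hypothetical Learning Dataset: each dict is one row (one observation),
# each key is one column (one feature). All names and values are made up.
rows = [
    {"admission_id": 1, "age": 67, "length_of_stay_days": 4, "readmitted": True},
    {"admission_id": 2, "age": 45, "length_of_stay_days": 2, "readmitted": False},
    {"admission_id": 3, "age": 72, "length_of_stay_days": 7, "readmitted": False},
]


def describe(dataset):
    """Return (number of rows, sorted list of column names)."""
    columns = sorted({key for row in dataset for key in row})
    return len(dataset), columns


n_rows, columns = describe(rows)
print(f"{n_rows} observations, {len(columns)} columns: {columns}")
```

Here the "readmitted" column plays the role of the value the model would learn to predict, and the remaining columns are candidate features.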
Data diversity and depth
The next important consideration for your Learning Dataset is its diversity. Keep in mind that the Learning Dataset is yours to build, meaning the data doesn’t have to come from just one table or file. You are essentially collating various data sources into a single, unique dataset that has the features you believe are required to teach a model to make accurate predictions. This doesn’t necessarily mean your Learning Dataset has to be a specific size, but you must have enough features (represented as columns) and enough rows (units of occurrence) to feed a model. Let’s take a look at some suggested guidelines for how this equates to the size of the dataset you ultimately create:
No matter the size of your dataset, it’s important that your data is balanced. If one outcome heavily outnumbers the others, the model sees too few examples of the rarer outcome to learn it well. This is known as “class imbalance.” For a deeper dive on this issue, check out “Dealing with Imbalanced Classes in Machine Learning.”
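One simple way to spot a possible imbalance is to look at what share of the rows falls into each class of the column you want to predict. The sketch below uses only the Python standard library. The labels are made up, and what counts as "too imbalanced" depends on your problem, so no threshold is assumed here.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the labels as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up target column: 1 = readmitted, 0 = not readmitted.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
shares = class_shares(labels)
print(shares)
```

If one class accounts for only a small fraction of the rows, that is a cue to read up on techniques for handling imbalanced classes before training.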
Appropriate data types and file formats
Bearing all of the above in mind, where can you begin looking for the data you require to build your Learning Dataset? There are three major types of data source:
Internal to your organization: this type of data is the basis for most modeling projects. It’s usually highly relevant and, hopefully, easy to obtain.
External 3rd party data: this includes data you can source online for a fee, such as marketing survey results or credit reports from reporting agencies like Experian.
Public data sources: these include ‘open sources’ of data, such as census data, economic indicator data from FRED (Federal Reserve Economic Data), weather information, LinkedIn, etc.
Finally, what data formats can you use to create your Learning Dataset? Check out our Quick Help for Data Import article for details.
Once you’ve identified all of the data that you want to use for teaching a model, it’s time to collate it all into a single dataset. There are some important rules to follow when creating that dataset, and we have an article dedicated to that topic when you’re ready: Building a Learning Dataset for your ML model.
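To make the idea of collating concrete, here is a minimal sketch of combining two sources on a shared key, using plain Python. The table names, column names, and the `left_join` helper are all hypothetical, and this is only one of several ways to do it. It is not a substitute for the rules covered in the dedicated article.

```python
def left_join(primary, secondary, key):
    """Attach columns from `secondary` to each row of `primary` that shares `key`.

    Rows in `primary` with no match are kept unchanged, so the number of rows
    (observations) stays the same while the number of columns (features) can grow.
    If `secondary` repeats a key, the last matching row wins.
    """
    lookup = {row[key]: row for row in secondary}
    joined = []
    for row in primary:
        extra = lookup.get(row[key], {})
        joined.append({**row, **extra})
    return joined


# Made-up internal data: one row per patient.
admissions = [
    {"patient_id": "A", "length_of_stay_days": 4},
    {"patient_id": "B", "length_of_stay_days": 2},
]
# Made-up second source that covers only some patients.
demographics = [
    {"patient_id": "A", "age": 67},
]

learning_dataset = left_join(admissions, demographics, key="patient_id")
print(learning_dataset)
```

Note that patient "B" ends up without an "age" value. Gaps like this are common when sources are combined, and they need to be handled before training.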