This is the 1st article in our Best Practices for Building ML Learning Datasets series.
In this article you’ll learn:
Are you beginning your professional journey as a citizen data scientist? Perhaps your background as a business intelligence professional or SQL analyst has led you to your current citizen data scientist role? Or, are you thinking about a career in data science? If so, you probably won’t be surprised to hear the Gartner group predicts that “citizen data scientists will surpass data scientists in the amount of advanced analysis they produc...
The purpose of this article is to address some of the fundamental questions that every citizen data scientist must ask before even beginning to work on building predictive models—questions that allow you to clearly understand and articulate the business problem you want to solve through AI and predictive analytics.
Before jumping into the business problem you need to solve and the data you’ll need to teach your Machine Learning models, let’s take a big step back and look at the entire machine learning life cycle:
DataRobot and DataRobot Paxata can help you at almost every step of this life cycle. But before you can get the life cycle for your project off the ground, you need to define some key project objectives. The three highlighted above are the ones we’ll review in this article.
It’s essential to define the business problem that you want to solve—and define it in very specific ways. In short, the problem statement that you articulate will define the data to acquire for the purpose of teaching your models. So, when you’re thinking about your problem statement, ask yourself:
We’ve reviewed a couple of solid examples for business problems above. But sometimes it’s also helpful to see an example of something that misses the mark. Here’s an example of a problem statement that is not specific enough:
Readmissions cost our hospital $65M last year, and we don’t have a way to determine which patients are at risk of readmission.
Notice that the problem statement is missing actionable details. Sure, readmissions are expensive, so it would be great to identify at-risk patients in advance. But how would you go about this? In the examples of good problem statements, we are predicting readmission “for the same issue”, which leads us to ask: what’s the issue? If we can get insights or find data that highlight patterns of readmission (for example, diabetic patients have a high readmission rate for issues related to managing their diabetes), then our problem statement is becoming very precise. That, in turn, informs the next step in our project objectives: acquiring subject matter expertise.
Expert insights are essential before you even begin to build datasets to teach your models. These insights can come in the form of existing data that may be available to you or through persons in the organization who have particular business knowledge required to solve the problem.
Let’s go back to the well-defined readmission example:
I want to predict if a discharged patient will be readmitted to hospital during the 30 days following discharge—for the same issue—and I want to predict this at the time of the patient’s discharge.
Looking at readmissions data alone is not sufficient. A patient can be readmitted to hospital for an entirely different health issue: perhaps diabetes management brought a patient in for the first admission, but a car accident brought the patient back. With no pattern or common variable for readmission, it’s difficult, if not impossible, to spot which types of patients are likely to be readmitted. This is your cue to speak with subject matter experts at the hospital, perhaps even people who work in the admissions department, to see if they have noticed a pattern in the types of patients who are being readmitted. Any pattern they notice can then be vetted against data that you request. If you’re told that diabetic patients seem to be readmitted often for issues related to managing diabetes, you can use the data to check that claim. With the supposition backed up by actual data, you can move on to the very interesting business problem of predicting which of those diabetic patients will be readmitted to the hospital, for issues related to their diabetes, within 30 days of their initial discharge. Now that you know your business problem precisely, you’re ready to define your prediction target and unit of analysis.
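To make the vetting step concrete, here is a minimal Python sketch of how you might check an expert’s claim against admissions data. It is an illustration only: the record layout, the sample rows, and the function name `readmission_rate_by_diagnosis` are all hypothetical, and the sketch assumes a readmission counts only if the same patient is admitted again with the same diagnosis 1 to 30 days after discharge.

```python
from collections import defaultdict
from datetime import date

# Hypothetical admission records: (patient_id, diagnosis, admitted, discharged).
ADMISSIONS = [
    ("p1", "diabetes", date(2023, 1, 2), date(2023, 1, 5)),
    ("p1", "diabetes", date(2023, 1, 20), date(2023, 1, 23)),
    ("p2", "diabetes", date(2023, 2, 1), date(2023, 2, 4)),
    ("p2", "trauma", date(2023, 2, 10), date(2023, 2, 12)),
    ("p3", "asthma", date(2023, 3, 1), date(2023, 3, 2)),
    ("p4", "diabetes", date(2023, 3, 5), date(2023, 3, 8)),
    ("p4", "diabetes", date(2023, 3, 30), date(2023, 4, 2)),
]


def readmission_rate_by_diagnosis(admissions, window_days=30):
    """Share of discharges per diagnosis followed by a same-diagnosis readmission."""
    discharges = defaultdict(int)
    readmitted = defaultdict(int)
    for patient, diagnosis, _, discharged in admissions:
        discharges[diagnosis] += 1
        for other_patient, other_diag, other_admit, _ in admissions:
            days = (other_admit - discharged).days
            # Assumption: only a later admission of the same patient for the
            # same diagnosis, within the window, counts as a readmission.
            if (
                other_patient == patient
                and other_diag == diagnosis
                and 0 < days <= window_days
            ):
                readmitted[diagnosis] += 1
                break
    return {d: readmitted[d] / n for d, n in discharges.items()}


rates = readmission_rate_by_diagnosis(ADMISSIONS)
for diagnosis, rate in sorted(rates.items()):
    print(f"{diagnosis}: {rate:.0%} of discharges readmitted for the same issue")
```

In this made-up sample, the second admission for patient p2 (the car-accident case) does not count toward the diabetes rate, which is exactly the distinction the “for the same issue” wording is meant to capture.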
Now it’s time to dig into some data science terminology. With your business problem clearly articulated, you need to define your prediction task and unit of analysis because these directly inform the data you need to source in order to build your Learning Dataset. But what are these?
i. A diabetes patient
ii. Who has been discharged from the hospital
Notice how specifically we articulate the unit of analysis. This precisely defines the kind of data we’ll need to source: data for diabetic patients who have been recently discharged, in which each row of your data represents a single record for a patient. And for each row, our prediction will be a binary “yes” or “no” regarding a patient’s readmission status to the hospital.
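As a rough illustration of how the unit of analysis and the binary prediction translate into rows of data, the Python sketch below builds a tiny learning table. Everything in it is an assumption made for the example: the `Admission` record, the column names, the sample history, and the choice of one row per discharge event (one reading of “a single record for a patient”). It is not a DataRobot API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Admission:
    """Hypothetical hospital admission record."""
    patient_id: str
    diagnosis: str
    admitted: date
    discharged: date


def build_learning_rows(admissions, diagnosis="diabetes", window_days=30):
    """Return one row per discharge for `diagnosis`, with a binary target."""
    rows = []
    for adm in admissions:
        if adm.diagnosis != diagnosis:
            continue  # outside the unit of analysis
        # Target: was the same patient admitted again, for the same issue,
        # within `window_days` after this discharge?
        readmitted = any(
            other.patient_id == adm.patient_id
            and other.diagnosis == diagnosis
            and 0 < (other.admitted - adm.discharged).days <= window_days
            for other in admissions
        )
        rows.append({
            "patient_id": adm.patient_id,
            "discharge_date": adm.discharged.isoformat(),
            # A feature known at the time of discharge.
            "length_of_stay_days": (adm.discharged - adm.admitted).days,
            "readmitted_30d": "yes" if readmitted else "no",
        })
    return rows


history = [
    Admission("p1", "diabetes", date(2023, 1, 2), date(2023, 1, 5)),
    Admission("p1", "diabetes", date(2023, 1, 20), date(2023, 1, 23)),
    Admission("p2", "diabetes", date(2023, 2, 1), date(2023, 2, 4)),
    Admission("p2", "trauma", date(2023, 2, 10), date(2023, 2, 12)),
]
rows = build_learning_rows(history)
for row in rows:
    print(row)
```

Note that every feature column holds only information available at discharge, because that is when the prediction is made. In a real dataset you would also want to leave out discharges with fewer than 30 days of follow-up, since their “no” labels cannot yet be trusted.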
Note: Though you may come across “unit of observation” as a term that is defined as a subset of “unit of analysis,” these two terms are used interchangeably when working with DataRobot.
Up to this point we’ve used hospital readmission as our working example. Here are a few more examples of business problems that are well-suited to predictive analytics. For each problem, see if you can identify: a strategy for solving the problem, the prediction task, and a unit of analysis.
When you’re ready to start sourcing and preparing the data you’ll need to satisfy your prediction task and unit of analysis, carry on to the next article in this series: Best practices for sourcing data to teach your ML models.
If you want a deeper dive on more data science concepts, be sure to explore the other articles here in our Community and also check out the Artificial Intelligence wiki.