01 Welcome to data science: Where do I begin?




This is the first article in our Best Practices for Building ML Learning Datasets series.

In this article you’ll learn:

  • How to articulate the business problem you need to solve.
  • How to acquire subject matter expertise to help you create a strategy for solving the problem.
  • How to define the essential elements required to build your first ML Learning Dataset.

Are you beginning your professional journey as a citizen data scientist? Perhaps your background as a business intelligence professional or SQL analyst has led you to your current citizen data scientist role? Or, are you thinking about a career in data science? If so, you probably won’t be surprised to hear the Gartner group predicts that “citizen data scientists will surpass data scientists in the amount of advanced analysis they produc...

The purpose of this article is to address some of the fundamental questions that every citizen data scientist must ask before even beginning to work on building predictive models—questions that allow you to clearly understand and articulate the business problem you want to solve through AI and predictive analytics.

Before jumping into the business problem you need to solve and the data you’ll need to teach your Machine Learning models, let’s take a big step back and look at the entire machine learning life cycle:

[Figure: Machine Learning Life Cycle]

DataRobot and DataRobot Paxata can help you at almost every step of this Life Cycle. But before you can even get the Life Cycle for your project off the ground, you’ve got to define some very salient project objectives. The three that are highlighted above are the ones we’ll review in this article.

1. Specify the business problem

It’s essential to define the business problem that you want to solve—and define it in very specific ways. In short, the problem statement that you articulate will define the data to acquire for the purpose of teaching your models. So, when you’re thinking about your problem statement, ask yourself:

  • What do I want to predict?
  • For whom?
  • When?

For example:

  • I want to predict where a 10th grade student will attend university in the fall of 2022, and I want to predict that now.
  • I want to predict if a discharged patient will be readmitted to the hospital during the 30 days following discharge—for the same issue—and I want to predict this at the time of the patient’s discharge.

We’ve reviewed a couple of solid examples for business problems above. But sometimes it’s also helpful to see an example of something that misses the mark. Here’s an example of a problem statement that is not specific enough:

Readmissions cost our hospital $65M last year, and we don’t have a way to determine which patients are at risk of readmission.

Notice that the problem statement is missing actionable details. Sure, it’s a fact that readmissions are expensive, so it would be great to have a way to identify at-risk patients in advance. But how do you go about this? Notice that in the good hospital example above, we are predicting readmission “for the same issue,” which leads us to ask: what’s the issue? If we can get insights or find data that highlight patterns in readmission rates (for example, diabetic patients have a high readmission rate for issues related to managing their diabetes), then we know we are getting very precise in our problem statement. That, in turn, informs the next step in our project objectives: acquiring subject matter expertise.
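One way to make the three questions concrete is to record the answers as explicit fields and check that none is left empty. The following is a minimal sketch, not a DataRobot feature: the `ProblemStatement` class, its field names, and the `missing_parts` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProblemStatement:
    """Hypothetical container for the three questions in this section."""
    what: str      # What do I want to predict?
    for_whom: str  # For whom?
    when: str      # When is the prediction made?

    def missing_parts(self):
        """Return the names of the questions that were left unanswered."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# The well-defined readmission example from above.
specific = ProblemStatement(
    what="readmission for the same issue within 30 days of discharge",
    for_whom="a discharged patient",
    when="at the time of the patient's discharge",
)

# The vague statement: it names a concern but not for whom or when.
vague = ProblemStatement(
    what="which patients are at risk of readmission",
    for_whom="",
    when="",
)

print(specific.missing_parts())
print(vague.missing_parts())
```

The specific statement reports nothing missing, while the vague one is flagged for the unanswered “for whom” and “when” questions.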

2. Acquire subject matter expertise to assist in creating a strategy for solving the problem

Expert insights are essential before you even begin to build datasets to teach your models. These insights can come in the form of existing data that may be available to you or through persons in the organization who have particular business knowledge required to solve the problem.

Let’s go back to the well-defined readmission example:

I want to predict if a discharged patient will be readmitted to hospital during the 30 days following discharge—for the same issue—and I want to predict this at the time of the patient’s discharge.

It’s not sufficient to look at readmissions data alone. In fact, any patient can be readmitted to the hospital for an entirely different health issue. Perhaps diabetes management brought a patient in for the first admission, but a car accident brought the patient back. With no pattern or common variable for readmission, it’s difficult, if not impossible, to spot which types of patients are likely to be readmitted. This is your cue to speak with subject matter experts at the hospital (perhaps even people who work in the admissions department) to see if they have noticed a pattern in the types of patients who are being readmitted.

The pattern they notice can then be vetted with data that you request. If you’re told that diabetic patients seem to be readmitted often for issues related to managing diabetes, then you can use the data to test that claim. And with the supposition backed up by actual data, you can move on to the very interesting business problem of predicting which of those diabetic patients will be readmitted to the hospital, for issues related to their diabetes, within 30 days of their initial discharge. Now that you know your business problem precisely, you’re ready to define your prediction task and unit of analysis.
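Vetting an expert’s hunch against data can be as simple as comparing same-issue readmission rates across admission reasons. The sketch below assumes a small, made-up list of discharge records; the field names (`admit_reason`, `readmit_reason`) and all values are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical discharge records, invented for illustration only.
# readmit_reason is None when the patient was not readmitted.
discharges = [
    {"patient_id": 1, "admit_reason": "diabetes", "readmit_reason": "diabetes"},
    {"patient_id": 2, "admit_reason": "diabetes", "readmit_reason": "car accident"},
    {"patient_id": 3, "admit_reason": "diabetes", "readmit_reason": "diabetes"},
    {"patient_id": 4, "admit_reason": "diabetes", "readmit_reason": None},
    {"patient_id": 5, "admit_reason": "fracture", "readmit_reason": None},
    {"patient_id": 6, "admit_reason": "fracture", "readmit_reason": "diabetes"},
]


def same_issue_readmission_rates(records):
    """Share of discharges, per admission reason, readmitted for that same reason."""
    totals = defaultdict(int)
    same_issue = defaultdict(int)
    for record in records:
        reason = record["admit_reason"]
        totals[reason] += 1
        # Count only readmissions whose reason matches the first admission.
        if record["readmit_reason"] == reason:
            same_issue[reason] += 1
    return {reason: same_issue[reason] / totals[reason] for reason in totals}


rates = same_issue_readmission_rates(discharges)
print(rates)
```

In this toy data, the patient who returned after a car accident does not count toward the diabetes rate, which mirrors the “for the same issue” condition in the problem statement.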

3. Define your prediction task and unit of analysis for your Learning Dataset

Now it’s time to dig into some data science terminology. With your business problem clearly articulated, you need to define your prediction task and unit of analysis because these directly inform the data you need to source in order to build your Learning Dataset. But what are these?

  • Prediction task: This is *what you want to predict*. Using the hospital readmission example, your prediction task is to identify if a diabetic patient will be readmitted within 30 days of discharge for issues related to diabetes. The task therefore is answering “Yes” or “No” for each patient in the prediction analysis.

    Note that you’ll sometimes hear “prediction task” used interchangeably with “prediction target” and “target variable.” The prediction target is simply what you want to predict with your prediction task. The target variable is the variable (or column) in a Learning Dataset that provides the historical record of what actually happened: for example, whether a person actually was readmitted to the hospital. That’s how the model learns, by seeing examples of what occurred. For a few more technical details, see the Target Variable wiki page.
  • Unit of analysis: This is the *for whom* or *what* of your prediction. Again, circling back to the hospital readmission example, the *for whom* is:

    i.  A diabetic patient
    ii. Who has been discharged from the hospital

Notice how specifically we articulate the unit of analysis. This precisely defines the kind of data we’ll need to source: data for diabetic patients who have been recently discharged, in which each row represents a single record for a patient. And for each row, our prediction will be a binary “yes” or “no” regarding a patient’s readmission to the hospital.

Note: Though you may come across “unit of observation” as a term that is defined as a subset of “unit of analysis,” these two terms are used interchangeably when working with DataRobot.

  • Learning Dataset: This is the initial dataset you create in order to feed your models so that they can learn. After the models are fed data from your Learning Dataset, DataRobot presents a leaderboard that lists the top-performing models. You can explore these models to compare their accuracy scores and then further explore each model to review the importance, or impact, each variable (feature) has on making the prediction. You will then begin iterating on your initial Learning Dataset to create a Training Dataset that ultimately becomes the Prediction Dataset you deploy to production.

    Note that the articles in this series are directed at assisting you in building your first Learning Dataset. When you are ready to start refining that data to continue training models from the Leaderboard, then you’ll want to begin your journey towards understanding the DataRobot Models.
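To see how the prediction task, unit of analysis, and target variable fit together, here is a minimal sketch that turns raw records into rows with a “Yes”/“No” target. Everything in it is an assumption for illustration: the record fields, the dates, the `label_row` helper, and the choice to count day 30 as inside the window. A real Learning Dataset would also carry feature columns, which are omitted here.

```python
from datetime import date

# Hypothetical raw records, invented for illustration only.
raw = [
    {"patient_id": "A", "is_diabetic": True, "discharged": date(2020, 1, 10),
     "readmitted": date(2020, 1, 25), "diabetes_related": True},
    {"patient_id": "B", "is_diabetic": True, "discharged": date(2020, 1, 12),
     "readmitted": date(2020, 3, 1), "diabetes_related": True},
    {"patient_id": "C", "is_diabetic": True, "discharged": date(2020, 2, 1),
     "readmitted": date(2020, 2, 5), "diabetes_related": False},
    {"patient_id": "D", "is_diabetic": False, "discharged": date(2020, 2, 3),
     "readmitted": date(2020, 2, 10), "diabetes_related": False},
    {"patient_id": "E", "is_diabetic": True, "discharged": date(2020, 2, 7),
     "readmitted": None, "diabetes_related": False},
]


def label_row(record, window_days=30):
    """Target variable: readmitted within the window for a diabetes-related issue?"""
    if record["readmitted"] is None:
        return "No"
    days = (record["readmitted"] - record["discharged"]).days
    # Assumption: the window is inclusive of day 30.
    in_window = 0 <= days <= window_days
    return "Yes" if in_window and record["diabetes_related"] else "No"


# Unit of analysis: one row per discharged diabetic patient.
learning_rows = [
    {"patient_id": r["patient_id"], "readmitted_within_30_days": label_row(r)}
    for r in raw
    if r["is_diabetic"]
]

for row in learning_rows:
    print(row)
```

Patient D is excluded because the unit of analysis is a diabetic patient; patients B and C receive “No” because the readmission was either outside the 30-day window or unrelated to diabetes.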

Putting it all together

Up to this point we’ve used hospital readmission as our working example. Here are a few more examples of business problems that are well-suited to predictive analytics. For each problem, see if you can identify: a strategy for solving the problem, the prediction task, and a unit of analysis.

[Figure: example business problems, part 1 (dataprep-01-critique-1.png)]

[Figure: example business problems, part 2 (dataprep-01-critique-2.png)]

When you’re ready to start sourcing and preparing the data you’ll need to satisfy your prediction task and unit of analysis, carry on to the next article in this series: Best practices for sourcing data to teach your ML models.

If you want a deeper dive on more data science concepts, be sure to explore the other articles here in our Community and also check out the Artificial Intelligence wiki.

Last update: 06-29-2020 04:05 PM