Hospital Readmissions (Binary Classification)


You can find the latest information in the DataRobot public documentation. Also, click ? in-app to access the full platform documentation for your version of DataRobot.


(Updated October 2020)

Problem Framing and Historical Data

This article summarizes how to solve a classification problem with DataRobot. Specifically, you’ll learn about importing data, exploratory data analysis, target selection, as well as modeling options, evaluation, interpretation, and deployment.

For this example, we are using a dataset from a readmissions use case where a hospital is trying to predict whether or not a patient will be readmitted within 30 days of a diabetic event. The hospital wants to predict this so that it can avoid discharging patients too early.

This is a historical dataset with a known outcome for our target feature. Within this dataset, different rows represent patients, and columns (or features) represent information about those patients. Some of these columns represent demographic features while others represent clinical features. The target column, “readmitted,” is a binary true/false variable, which gives us a binary classification problem. DataRobot will identify the different feature types and perform the appropriate preprocessing steps before building models. Notably, you can do anything shown in the GUI with our Python, R, or REST APIs. You can find resources for this on our Community GitHub.

Figure 1. Snapshot of the training dataset

Importing and Exporting Data

There are a number of different ways to get data into DataRobot. The first is to connect to a data source, which can be essentially anything with a JDBC connection. The second is to use a URL, such as an AWS S3 bucket. The third is to connect to Hadoop, and the fourth is to upload a local file. Finally, there is also the AI Catalog, where you can store and share your data.

Figure 2. Data import options

Exploratory Data Analysis

After you import the data, you can scroll down in the Data tab to see the features present within the imported dataset. DataRobot identifies the different feature types and provides some summary statistics about them.

Figure 3. EDA

If you’re curious about any feature in the dataset, simply click on it and a distribution will drop down with details.

Figure 4. Histogram

Use the Data Quality Assessment to help you better understand your data. The toggle lets you filter data to focus on columns with potential issues.

Figure 5. Data Quality

Target Selection

When you are done importing and exploring your features, the next step is to identify the target. To do this, simply scroll up and enter it into the text field (as shown below). DataRobot will identify the problem type and give you a distribution of the target.

Figure 6. Target Selection example

Modeling Options

At this point, you could simply hit the Start button to run Quick Autopilot; however, there are some defaults that you can customize before building models.

For example, under Advanced Options > Advanced, you can change the optimization metric:

Figure 7. Advanced Options

Under the Partitioning tab you can also change the partitioning. By default, DataRobot uses five-fold cross-validation and a 20% holdout. This controls for both sampling bias and overfitting.

Figure 8. Partitioning
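For intuition, here is a minimal plain-Python sketch (with a made-up row count) of how a 20% holdout plus five-fold CV scheme divides row indices. DataRobot handles this internally; this is only an illustration of the partitioning idea.

```python
import random

def partition(n_rows, holdout_frac=0.2, n_folds=5, seed=0):
    """Split row indices into a holdout set plus cross-validation folds,
    mimicking the default 20% holdout / 5-fold CV scheme."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_holdout = int(n_rows * holdout_frac)
    holdout, remaining = idx[:n_holdout], idx[n_holdout:]
    # Round-robin assignment gives five nearly equal folds.
    folds = [remaining[i::n_folds] for i in range(n_folds)]
    return holdout, folds

holdout, folds = partition(1000)
```

The key property is that the holdout rows never overlap the CV folds, so the holdout score is an untouched final check against overfitting.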

Model Building

Once you are happy with the modeling options and have pressed Start, DataRobot creates 30–40 models through a process of building blueprints (see the following figure). Blueprints are a set of preprocessing steps and modeling techniques specifically assembled to best fit the shape and distribution of your data. Every model that the platform creates contains a blueprint.

Figure 9. Blueprint

DataRobot selects the blueprints it is going to create from a repository of open source and proprietary algorithms. This includes models like XGBoost, random forests, LightGBM, neural networks, and more. The platform will start running an array of models on a small portion of the data; the models that do the best will survive the first round of modeling and get fed more data. The models that do well from the next round will get fed even more data, and so on. In this way, you will test out a variety of modeling approaches to find the best solution for your problem. In addition to different algorithms, DataRobot will also try out different preprocessing strategies and hyperparameter settings for the models.
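The survival-of-the-fittest process above can be sketched as a toy tournament (the `score_fn` callable and the data fractions are invented for illustration; the real Autopilot logic is more involved):

```python
def run_rounds(models, score_fn, fractions=(0.16, 0.32, 0.64)):
    """Toy Autopilot-style tournament: score every surviving model on a
    growing fraction of the data and keep the top half each round."""
    survivors = list(models)
    for frac in fractions:
        ranked = sorted(survivors, key=lambda m: score_fn(m, frac), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
    return survivors
```

Here `score_fn(model, fraction)` stands in for training the model on that sample and computing the validation metric.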

Model Evaluation

The Leaderboard uses the optimization metric to rank the models that were built. DataRobot uses this metric to measure the performance of the model as it tries out different techniques as well as different hyperparameter settings.


Figure 10. Leaderboard

If you select one of the models and click Evaluate > ROC Curve, you will find a collection of data science metrics typically used to evaluate models. This includes the confusion matrix and associated metrics, the ROC Curve, and Prediction Distributions. (You can find more information on model evaluation here.)


Figure 11. ROC Curve
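As a refresher on what this chart is computing, the confusion-matrix metrics can be derived from predicted probabilities and a threshold like so (a from-scratch sketch, not DataRobot code):

```python
def confusion_metrics(y_true, y_prob, threshold=0.5):
    """Count confusion-matrix cells and derive common rates at a threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t and p >= threshold)
    fp = sum(1 for t, p in zip(y_true, y_prob) if not t and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t and p < threshold)
    tn = sum(1 for t, p in zip(y_true, y_prob) if not t and p < threshold)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # true negative rate
    }
```

Sweeping the threshold from 0 to 1 and plotting sensitivity against (1 minus specificity) traces out the ROC curve.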

There is also a Profit Curve tool that you can use to optimize where you're putting the prediction threshold. You can find this under Evaluate > Profit Curve. Adjusting the threshold here allows you to adjust sensitivity of the model and see the impact on profit. You can also add custom values for the different outcomes in the confusion matrix to really optimize your specific scenario.

Figure 12. Profit Curve
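Conceptually, the Profit Curve assigns a value to each confusion-matrix outcome and finds the threshold that maximizes total payoff. A minimal sketch of that idea (the payoff numbers here are invented for illustration):

```python
def total_profit(y_true, y_prob, threshold, payoffs):
    """Sum per-outcome payoffs; e.g., a correctly flagged readmission (tp)
    saves money, while a missed one (fn) is costly."""
    profit = 0.0
    for t, p in zip(y_true, y_prob):
        flagged = p >= threshold
        if flagged and t:
            profit += payoffs["tp"]
        elif flagged and not t:
            profit += payoffs["fp"]
        elif not flagged and t:
            profit += payoffs["fn"]
        else:
            profit += payoffs["tn"]
    return profit

def best_threshold(y_true, y_prob, payoffs):
    """Grid-search the threshold that maximizes total payoff."""
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda th: total_profit(y_true, y_prob, th, payoffs))
```

With custom per-outcome values plugged in, the optimal threshold often lands well away from the default 0.5.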

Model Insights

Once you have evaluated your model, the next step is to understand how the features are impacting your predictions. You can find a set of interpretability tools in the Understand tab.

Below you can see an image of Feature Impact. You can find this in the Understand > Feature Impact tab. This tool allows you to see which features are most important in your model. There are no black boxes in DataRobot. The platform uses model-agnostic approaches to determine feature impact. This means that for every model that you build within DataRobot, you have the option to create a feature impact analysis.

Figure 13. Feature Impact
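The model-agnostic idea behind this kind of analysis is permutation importance: scramble one column, re-score, and see how much worse the error gets. A self-contained sketch with a toy model (feature names invented; for determinism this sketch rotates the column instead of shuffling randomly):

```python
def permutation_impact(predict, rows, y_true, feature):
    """Scramble one feature's values across rows and measure the increase
    in mean absolute error -- a model-agnostic impact score."""
    def mae(rs):
        return sum(abs(predict(r) - t) for r, t in zip(rs, y_true)) / len(rs)

    baseline = mae(rows)
    values = [r[feature] for r in rows]
    rotated = values[1:] + values[:1]  # a real implementation shuffles randomly
    scrambled = [dict(r, **{feature: v}) for r, v in zip(rows, rotated)]
    return mae(scrambled) - baseline
```

A feature the model ignores scores zero, because scrambling it changes nothing; the bigger the error jump, the more the model relies on that feature.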

Under the Understand > Feature Effects tab you can see how the different features are impacting your predictions. DataRobot achieves this by calculating another model-agnostic metric called partial dependence.

Figure 14. Feature Effects
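Partial dependence can be sketched in a few lines: fix a feature to each value on a grid for every row, and average the model's predictions (the toy model and feature names below are invented for illustration):

```python
def partial_dependence(predict, rows, feature, grid):
    """For each candidate value, overwrite the feature in every row and
    average the predictions -- the kind of curve shown in Feature Effects."""
    curve = []
    for value in grid:
        preds = [predict(dict(r, **{feature: value})) for r in rows]
        curve.append(sum(preds) / len(preds))
    return curve
```

Because it only calls `predict`, this works for any model, which is what makes the metric model-agnostic.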

Feature Impact and Feature Effects show us the global impact of features on our target. Another interpretation tool called Prediction Explanations shows us the local impact of the features on our target. You can find this under the Understand > Prediction Explanations tab. Here you will find a sample of row-by-row explanations that tell you the reason for the prediction, which is very useful for communicating modeling results to non-data scientists. Someone who has domain expertise should be able to look at these specific examples and understand what is happening. You can get these for every row within your dataset.

Figure 15. Prediction Explanations
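One simple way to build intuition for row-level explanations is to measure how far a prediction moves when each feature is reset to a baseline value. This is only a rough sketch of the idea with invented feature names; DataRobot's actual explanation algorithm is more sophisticated:

```python
def explain_row(predict, row, baselines):
    """Per-row explanation sketch: each feature's contribution is the
    change in prediction when that feature is reset to its baseline."""
    prediction = predict(row)
    contributions = {
        feature: prediction - predict(dict(row, **{feature: base}))
        for feature, base in baselines.items()
    }
    # Strongest reasons first, as in a Prediction Explanations readout.
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

The output reads like a reason list ("number_inpatient pushed the prediction up the most"), which is what makes this style of explanation accessible to non-data scientists.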

Deployment

Once you understand your model, the next step is to make predictions on new data where you don't know the outcome. You do this in the Predict tab.

There are a few different ways to get predictions out of DataRobot. The first way is the simplest. You can use the GUI to import the data directly from a local file or data source under the Predict > Make Predictions tab.

Figure 16. Make Predictions

Then you can simply calculate the predictions and download them from the GUI. This is typically used for ad-hoc analysis or situations where you don't have to run the predictions on a regular basis.

DataRobot also gives you the ability to export scoring code in Java or Python using Codegen. You can find this under the Predict > Downloads or Understand > Downloads tabs. You can use the downloaded code to score data outside of DataRobot. Customers who want to score their data off-network or at very low latency tend to use this option.

Figure 17. Codegen

Creating a Deployment object is the most common way to set up your prediction workflow, and it provides a very fast way to get models into production. This allows you to deploy to an API endpoint. You can host this REST endpoint as a Docker container yourself or use a DataRobot dedicated prediction server. With either approach you get a deployment object and can track things like service health and data drift.

If you click on the Predict > Deploy tab, you can create a deployment.

Figure 18. Deploy

Monitoring Deployments with MLOps

When you create a deployment object, you unlock the functionality of DataRobot MLOps. MLOps allows you to monitor and replace your deployments from the Deployments tab. Here you can monitor the number of deployments you have as well as the number of predictions you are making. You also have a summary of service health, data drift, and accuracy.

Figure 19. Deployments tab

You can see the details of your deployments by clicking on them. If you click on one of your deployments, you are immediately taken to an overview page that gives you a summary, the content, and the version history of the deployment. 

Figure 20. Deployment Overview tab

You can very easily make predictions directly from the GUI under the Predictions tab of your deployment. 

Figure 21. Deployment Predictions tab

Once you’ve made your predictions you can monitor the service health, data drift, and accuracy of the deployment. Importantly, you can set up notifications that tell you when your deployment needs attention and set up governance procedures for reviews and approvals.

Figure 22. Service Health

Service Health tells you how much the deployment is being used and if any errors are occurring. 

Figure 23. Data Drift

Data Drift tracks whether the data you are scoring on is fundamentally different from the data you trained your model on. This allows you to retrain your models strategically, based on the data, rather than on a monthly or quarterly basis. 
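A common way to quantify this kind of drift is the Population Stability Index (PSI), which compares binned distributions of training versus scoring data. Here is a minimal sketch of PSI; note that DataRobot computes its own drift metrics, and thresholds like 0.2 are only rules of thumb:

```python
import math

def psi(training, scoring, bins=10):
    """Population Stability Index: 0 means identical distributions;
    values above roughly 0.2 are often treated as meaningful drift."""
    lo, hi = min(training), max(training)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1
        # Floor at a tiny value so the log below is always defined.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(training), proportions(scoring)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Scoring data drawn from the same distribution as training yields a PSI near zero; a shifted population drives it up, which is the signal that retraining may be worthwhile.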

Figure 24. Accuracy

Accuracy tracks how accurate your model is over time. This allows you to communicate and track the value of your models to key stakeholders. 

More Information

See the related video that shows how to solve a classification problem with DataRobot (Release 6.2).

Comments
Devire
Jumper Wires

How does the Make Predictions tab work? Would I upload the same dataset but without the TRUE/FALSE field?

emily
Data Scientist

Hi Devire, 

 

There are a few ways to make predictions on this tab. You can upload a new dataset without a target column. Then you can compute and download the predictions in the GUI.

If you want to compute predictions on the training dataset, it's best to click the "compute predictions" button for the training dataset. If you do it this way, DataRobot will know that you are using the training data and will do something called "stacked predictions". This ensures your predictions are out of sample, even though you trained the model on that dataset. There is an article on this as well: Deployment—Make Predictions Tab.


I hope this helps! Feel free to ask any more questions.

 

Emily

Version history
Last update: 09-13-2021 02:05 PM