You can find the latest information in the DataRobot public documentation. Also, click ? in-app to access the full platform documentation for your version of DataRobot.
(Updated October 2020)
This article summarizes how to solve a classification problem with DataRobot. Specifically, you’ll learn about importing data, exploratory data analysis, target selection, as well as modeling options, evaluation, interpretation, and deployment.
For this example, we are using a dataset from a readmissions use case in which a hospital is trying to predict whether a patient will be readmitted within 30 days of a diabetic event. The hospital wants this prediction so that it can avoid discharging patients too early.
This is a historical dataset with a known outcome for our target feature. Within this dataset, different rows represent patients, and columns (or features) represent information about those patients. Some of these columns represent demographic features while others represent clinical features. The target column, “readmitted,” is a binary true/false variable, and provides us with a binary classification problem. DataRobot will identify the different feature types and apply the appropriate preprocessing steps before building models. Notably, you can do anything shown in the GUI with our Python, R, or REST APIs. You can find resources for this on our Community GitHub.
There are a number of different ways that you can get data into DataRobot. The first is to connect to a data source, which can be essentially anything with a JDBC connection. The second is to use a URL, like an AWS S3 bucket. The third is to connect to Hadoop, and the fourth is to upload a local file. Finally, we also have an AI Catalog where you can store and share your data.
After you import the data, you can scroll down in the Data tab to see the features present within the imported dataset. DataRobot identifies the different feature types and provides some summary statistics about them.
If you’re curious about any feature in the dataset, simply click on it and a distribution will drop down with details.
Use the Data Quality Assessment to help you better understand your data. The toggle lets you filter data to focus on columns with potential issues.
When you are done importing and exploring your features, the next step is to identify the target. To do this, simply scroll up and enter it into the text field (as shown below). DataRobot will identify the problem type and give you a distribution of the target.
At this point, you could simply hit the Start button to run Quick Autopilot; however, there are some defaults that you can customize before building models.
For example, under Advanced Options > Advanced, you can change the optimization metric:
Under the Partitioning tab you can also change the partitioning. By default, DataRobot uses five-fold cross-validation with a 20% holdout. This controls for both sampling bias and overfitting.
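To make the default scheme concrete, here is a sketch of it using scikit-learn on synthetic data: a 20% holdout is carved off first, then five-fold cross-validation runs on the remaining 80%. This is only an illustration of the partitioning idea, not DataRobot's implementation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)  # stand-in for a binary target like "readmitted"

# Set aside a 20% holdout, stratified on the target
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Five-fold cross-validation on the remaining 80%
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(valid_idx) for _, valid_idx in cv.split(X_train, y_train)]

print(len(X_holdout))  # 200 rows held out, never seen during model selection
print(fold_sizes)      # five validation folds of roughly 160 rows each
```

Every model is scored on validation folds it was not trained on, and the holdout stays untouched until you unlock it for a final check.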
Once you are happy with the modeling options and have pressed Start, DataRobot creates 30–40 models; it does this through a process of building something called blueprints (see the following figure). Blueprints are a set of preprocessing steps and modeling techniques specifically assembled to best fit the shape and distribution of your data. Every model that the platform creates contains a blueprint.
DataRobot selects the blueprints it is going to create from a repository of open source and proprietary algorithms. This includes models like XGBoost, random forests, LightGBM, neural networks, and more. The platform will start running an array of models on a small portion of the data; the models that do the best will survive the first round of modeling and get fed more data. The models that do well from the next round will get fed even more data, and so on. In this way, you will test out a variety of modeling approaches to find the best solution for your problem. In addition to different algorithms, DataRobot will also try out different preprocessing strategies and hyperparameter settings for the models.
The Leaderboard uses the optimization metric to rank the models it has built. DataRobot uses this metric to measure the performance of each model as it tries out different techniques as well as different hyperparameter settings.
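For binary classification, a common default optimization metric is LogLoss, which penalizes confident wrong predictions heavily. Here it is computed by hand and checked against scikit-learn; the toy labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # predicted P(readmitted)

# LogLoss = -mean( y*log(p) + (1-y)*log(1-p) ); lower is better
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(round(manual, 4))                      # 0.2603
print(round(log_loss(y_true, y_prob), 4))    # matches the manual value
```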
If you select one of the models and click Evaluate > ROC Curve, you will find a collection of data science metrics typically used to evaluate models. This includes the confusion matrix and associated metrics, the ROC Curve, and Prediction Distributions. (You can find more information on model evaluation here.)
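The quantities behind that tab can be reproduced with scikit-learn: pick a prediction threshold, build the confusion matrix, and compute ROC AUC from the raw probabilities. The data below is a tiny made-up sample; DataRobot computes these in-platform.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.9, 0.1])

threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)   # classify at the threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate

print(tp, fp, tn, fn)                        # 3 1 3 1
print(round(roc_auc_score(y_true, y_prob), 3))  # 0.875
```

The ROC curve traces sensitivity against (1 - specificity) as the threshold sweeps from 0 to 1; AUC summarizes that whole curve, independent of any single threshold choice.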
There is also a Profit Curve tool that you can use to optimize where you place the prediction threshold. You can find this under Evaluate > Profit Curve. Adjusting the threshold here allows you to adjust the sensitivity of the model and see the impact on profit. You can also add custom values for the different outcomes in the confusion matrix to really optimize your specific scenario.
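The underlying calculation can be sketched in a few lines: assign a payoff to each confusion-matrix outcome, sweep the threshold, and keep the threshold with the highest expected profit. All payoff values and data here are hypothetical.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.92, 0.31, 0.72, 0.41, 0.22, 0.61, 0.83, 0.12, 0.52, 0.44])

# Hypothetical payoffs: catching a readmission early saves money (TP),
# unnecessary interventions cost a little (FP), missed readmissions are
# very expensive (FN), and correct negatives cost nothing (TN).
payoff = {"tp": 500, "fp": -100, "fn": -800, "tn": 0}

def profit(threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return (payoff["tp"] * tp + payoff["fp"] * fp
            + payoff["fn"] * fn + payoff["tn"] * tn)

thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=profit)
print(round(float(best), 2), profit(best))
```

Because a missed readmission costs far more than an unnecessary intervention here, the profit-maximizing threshold lands well below the default 0.5, making the model more sensitive.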
Once you have evaluated your model, the next thing you want to do is understand how the features are impacting your predictions. You can find a set of interpretability tools in the Understand tab.
Below you can see an image of Feature Impact. You can find this in the Understand > Feature Impact tab. This tool allows you to see which features are most important in your model. There are no black boxes in DataRobot. The platform uses model-agnostic approaches to determine feature impact. This means that for every model that you build within DataRobot, you have the option to create a feature impact analysis.
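One widely used model-agnostic approach of this kind is permutation importance: shuffle one column at a time and measure how much the model's score degrades. The sketch below uses scikit-learn on synthetic data to show the idea; it is not DataRobot's exact Feature Impact algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 10 times and record the average score drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Normalize so the most impactful feature scores 1.0, as in the
# Feature Impact chart
impact = result.importances_mean / result.importances_mean.max()
for i in np.argsort(impact)[::-1]:
    print(f"feature_{i}: {impact[i]:.2f}")
```

Because this only needs predictions from a fitted model, it works the same way for a linear model, a gradient-boosted tree, or a blender.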
Under the Understand > Feature Effects tab you can see how the different features are impacting your predictions. DataRobot achieves this by calculating another model-agnostic metric called partial dependence.
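Partial dependence is also straightforward to sketch: vary one feature over a grid of values while holding everything else fixed, and average the model's predictions at each grid point. The code below illustrates the calculation on synthetic data; it is a simplified version of the idea, not DataRobot's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average predicted probability as `feature` sweeps over `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # force every row to this grid value
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
pd_values = partial_dependence(model, X, feature=0, grid=grid)
print(np.round(pd_values, 3))  # one averaged prediction per grid point
```

Plotting `pd_values` against `grid` gives the kind of curve shown in Feature Effects: how the predicted probability moves as that one feature changes.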
Feature Impact and Feature Effects show us the global impact of features on our target. Another interpretation tool called Prediction Explanations shows us the local impact of the features on our target. You can find this under the Understand > Prediction Explanations tab. Here you will find a sample of row-by-row explanations that tell you the reason for the prediction, which is very useful for communicating modeling results to non-data scientists. Someone who has domain expertise should be able to look at these specific examples and understand what is happening. You can get these for every row within your dataset.
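To see what a local, row-level attribution means, here is a deliberately crude sketch: replace each feature of one row with its dataset average and measure how far the prediction moves. This is only meant to convey the concept; DataRobot's Prediction Explanations use its own methodology, not this calculation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

row = X[0:1]
base_pred = model.predict_proba(row)[0, 1]  # this row's predicted probability

contributions = {}
for j in range(X.shape[1]):
    perturbed = row.copy()
    perturbed[0, j] = X[:, j].mean()        # neutralize this one feature
    new_pred = model.predict_proba(perturbed)[0, 1]
    contributions[f"feature_{j}"] = base_pred - new_pred

# Features sorted by how strongly they pushed this row's prediction
for name, delta in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {delta:+.3f}")
```

A positive delta means the feature's actual value pushed this particular prediction up relative to an "average" patient, which is the kind of per-row reasoning a domain expert can sanity-check.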
Once you understand your model, the next step is to make predictions on new data where you don't know the outcome. You do this in the Predict tab.
There are a few different ways to get predictions out of DataRobot. The first way is the simplest. You can use the GUI to import the data directly from a local file or data source under the Predict > Make Predictions tab.
Then you can simply calculate the predictions and download them from the GUI. This is typically used for ad-hoc analysis or situations where you don't have to run the predictions on a regular basis.
DataRobot also gives you the ability to export scoring code in Java or Python using Codegen. You can find this under the Predict > Downloads or Understand > Downloads tabs. You can use the downloaded code to score data outside of DataRobot. Customers who want to score data off the network or at very low latency tend to use this option.
Creating a Deployment object is the most common way to set up your prediction workflow, and it provides a very fast way to get models into production. This allows you to deploy to an API endpoint. You can host this REST endpoint as a Docker container yourself or use a DataRobot dedicated prediction server. With either approach you get a deployment object and can track things like service health and data drift.
If you click on the Predict > Deploy tab, you can create a deployment.
When you create a deployment object, you unlock the functionality of DataRobot MLOps. MLOps allows you to monitor and replace your deployments from the Deployments tab. Here you can monitor the number of deployments you have as well as the number of predictions you are making. You also have a summary of service health, data drift, and accuracy.
You can see the details of your deployments by clicking on them. If you click on one of your deployments, you are immediately taken to an overview page that gives you a summary, the content, and the version history of the deployment.
You can very easily make predictions directly from the GUI under the Predictions tab of your deployment.
Once you’ve made your predictions you can monitor the service health, data drift, and accuracy of the deployment. Importantly, you can set up notifications that tell you when your deployment needs attention and set up governance procedures for reviews and approvals.
Service Health tells you how much the deployment is being used and if any errors are occurring.
Data Drift tracks whether the data you are scoring on is fundamentally different from the data you trained your model on. This allows you to retrain your models strategically, based on the data, rather than on a monthly or quarterly basis.
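A common way to quantify this kind of drift is the Population Stability Index (PSI): bin a feature's training distribution, bin the same feature at scoring time, and compare the two. The sketch below uses a made-up "patient age" feature; PSI is shown as a general drift metric, not necessarily the exact statistic DataRobot computes.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # floor each bin proportion to avoid log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_ages = rng.normal(60, 10, size=5000)  # distribution at training time
same_dist = rng.normal(60, 10, size=5000)   # scoring data, no drift
shifted = rng.normal(48, 10, size=5000)     # scoring data, population got younger

print(round(psi(train_ages, same_dist), 3))  # small value: stable
print(round(psi(train_ages, shifted), 3))    # large value: drift alert
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, which is the point at which retraining is worth considering.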
Accuracy tracks how accurate your model is over time. This allows you to communicate and track the value of your models to key stakeholders.
See the related video that shows how to solve a classification problem with DataRobot (Release 6.2).