The DataRobot Automated Machine Learning product accelerates your AI success by combining cutting-edge machine learning technology with the team you have in place. The platform incorporates the knowledge, experience, and best practices of the world's leading data scientists, delivering unmatched levels of automation, accuracy, transparency, and collaboration to help your business become an AI-driven enterprise.
This guide will demonstrate the basics of how to build, select, deploy, and monitor a regression or classification model using the automated machine learning capabilities of DataRobot. The potential application of these capabilities spans many industries: banking, insurance, healthcare, retail, and many more. The use case highlighted throughout these examples comes from the healthcare industry.
Healthcare providers understand that high hospital readmission rates spell trouble for patient outcomes. But excessive rates may also threaten a hospital’s financial health, especially in a value-based reimbursement environment. Readmissions are already one of the costliest episodes to treat, with hospital costs reaching $41.3 billion for patients readmitted within 30 days of discharge, according to the Agency for Healthcare Research and Quality (AHRQ).
The training dataset used throughout this document is from a research study that can be found online at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3996476/. The resulting models predict the likelihood that a discharged hospital patient will be readmitted within 30 days of their discharge.
To get started with DataRobot, you will log in and load a prepared training dataset. DataRobot currently supports .csv, .tsv, .dsv, .xls, .xlsx, .sas7bdat, .bz2, .gz, .zip, .tar, and .tgz file types, plus Apache Avro, Parquet, and ORC (Figure 1). (Note: If you wish to use Avro or Parquet data, contact your DataRobot representative for access to the feature.)
These files can be uploaded locally, from a URL or Hadoop/HDFS, or read directly from a variety of enterprise databases via JDBC. Directly loading data from production databases for model building allows you to quickly train and retrain models, and eliminates the need to export data to a file for ingestion.
DataRobot supports any database that provides a JDBC driver—meaning most databases in the market today can connect to DataRobot. Drivers for Postgres, Oracle, MySQL, Amazon Redshift, Microsoft SQL Server, and Hadoop Hive are most commonly used.
After you load your data, DataRobot performs exploratory data analysis (EDA), which detects the data type of each feature and reports the number of unique and missing values along with the mean, median, standard deviation, minimum, and maximum. This information helps you get a sense of the shape and distribution of the dataset.
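The per-feature statistics listed above can be sketched in a few lines of Python. This is only an illustration of the quantities involved, not DataRobot's implementation; the column name and values are hypothetical, and None stands in for a missing value.

```python
import statistics

def summarize_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "unique": len(set(present)),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
    }

# Hypothetical column: length of stay in days for six patients.
time_in_hospital = [3, 5, None, 2, 5, 9]
summary = summarize_column(time_in_hospital)
```

Here summary reports four unique values, one missing value, and a median of 5 days.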
Next, select a prediction target (the name of the column in your data that captures what you are trying to predict) from the uploaded dataset (Figure 2). DataRobot will analyze your training dataset and automatically determine the type of analysis (in this case, classification).
DataRobot automatically partitions your data. If you want to customize the model building process, you can modify a variety of advanced parameters, optimization metrics, feature lists, transformations, partitioning, and sampling options. The default modeling mode is “Autopilot,” which employs the full breadth of DataRobot’s automation capabilities. For more control over which algorithms DataRobot runs, there are manual and quick-run options.
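As a rough illustration of what partitioning means, the sketch below sets aside a holdout slice and deals the remaining rows into cross-validation folds. The 20% holdout, the five folds, and the function name are assumptions chosen for the example; the actual defaults and options are configured in the product.

```python
import random

def partition(n_rows, holdout_fraction=0.2, n_folds=5, seed=0):
    """Return (holdout_indices, folds) for a dataset with n_rows rows.

    Rows are shuffled once, the first slice is set aside as a holdout,
    and the remainder is dealt round-robin into n_folds validation folds.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = int(n_rows * holdout_fraction)
    holdout = indices[:n_holdout]
    remaining = indices[n_holdout:]
    folds = [remaining[i::n_folds] for i in range(n_folds)]
    return holdout, folds

holdout, folds = partition(100)
```

Every row lands in exactly one partition, so no row is used both to train and to validate within the same fold.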
Click the Start button to begin training models. Once the modeling process begins, the platform further analyzes the training data to create the Importance column (Figure 3). This Importance grading provides a quick cue to better understand the most influential variables for your chosen prediction target.
If DataRobot detects target leakage (i.e., information that would probably not be available at the time of prediction), the feature is marked with a warning flag in the Importance column. If the leaky feature is significantly correlated with the target, DataRobot will automatically omit it from the model building process. It also might flag features that suggest partial target leakage. For these features, you should ask yourself whether the information would be available at the time of prediction; if it will not, then remove it from your dataset to limit the risk of skewing your analysis.
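One simple way to reason about leakage yourself is to look for features that track the target almost perfectly. The sketch below flags such features using Pearson correlation; the threshold, feature names, and data are hypothetical, and this heuristic is not a description of how DataRobot performs its own detection.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def flag_possible_leakage(features, target, threshold=0.95):
    """Return names of features whose |correlation| with the target exceeds threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) > threshold]

readmitted = [0, 1, 0, 1, 1, 0]
features = {
    "num_medications": [10, 12, 15, 9, 14, 11],
    # Hypothetical field that would only be known after the outcome occurred.
    "readmission_claim_filed": [0, 1, 0, 1, 1, 0],
}
flagged = flag_possible_leakage(features, readmitted)
```

A flagged feature is only a candidate: the deciding question remains whether its value would be available at prediction time.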
You can easily see how many features contain useful information, and edit feature lists used for modeling.
You can also drill down on variables to view distributions, add features, and apply basic transformations.
DataRobot supports popular advanced machine learning techniques and open source tools such as Apache Spark, H2O, Scala, Python, R, TensorFlow, Facebook Prophet, Keras, DeepAR, Eureqa, and XGBoost. During the automated modeling process, it analyzes the characteristics of the training data and the selected prediction target and selects the most appropriate machine learning algorithms to apply. DataRobot automatically prepares the data for each algorithm, performing operations like one-hot encoding, missing value imputation, text mining, and standardization to transform features for optimal results.
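For readers unfamiliar with the preprocessing steps named above, here is a minimal sketch of one-hot encoding, median imputation, and standardization in plain Python. The column names and values are hypothetical, and median imputation is just one common choice; the source does not say which strategy DataRobot applies for a given algorithm.

```python
import statistics

def one_hot(values):
    """Encode a categorical column as 0/1 indicator vectors."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

def impute_median(values):
    """Replace missing values (None) with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

admission_type = ["Emergency", "Elective", "Emergency", "Urgent"]
encoded, categories = one_hot(admission_type)

lab_procedures = [40, None, 60, 50]
imputed = impute_median(lab_procedures)
scaled = standardize(imputed)
```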
DataRobot streamlines model development by automatically ranking models (or ensembles of models) based on the techniques advanced data scientists use, including boosting, bagging, random forests, kernel-based methods, generalized linear models, deep learning, and many others. By cost-effectively evaluating a near-infinite combination of data transformations, features, algorithms, and tuning parameters in parallel across a large cluster of commodity servers, DataRobot delivers the best predictive model in the shortest amount of time.
After automated modeling is complete, the Leaderboard will rank each machine learning model so you can evaluate and select the one you want to use (Figure 4). The models that are “Most Accurate,” “Fast & Accurate,” and “Recommended for Deployment” are tagged so you can begin your analysis there, or just move forward with them to deployment.
When you select a model, you see options for Evaluate, Understand, Describe, and Predict. To estimate possible model performance, the Evaluate options include the industry-standard Lift Chart, ROC Curve, Confusion Matrix, Feature Fit, and Advanced Tuning. There are also options for comparing models using Learning Curves, Speed versus Accuracy, and Comparisons. The interactive evaluation charts are very detailed, but you don't need a background in data science to understand what they convey.
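To make two of these evaluation tools concrete, the sketch below computes a confusion matrix at a fixed threshold and the area under the ROC curve from predicted probabilities. The labels and scores are made up for illustration, and the 0.5 threshold is an assumption.

```python
def confusion_matrix(actual, predicted_prob, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts

def roc_auc(actual, predicted_prob):
    """AUC as the chance that a random positive is scored above a random negative."""
    positives = [p for y, p in zip(actual, predicted_prob) if y == 1]
    negatives = [p for y, p in zip(actual, predicted_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

actual = [1, 0, 1, 0, 1, 0]                 # 1 = readmitted within 30 days
scores = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]     # hypothetical model outputs

matrix = confusion_matrix(actual, scores)
auc = roc_auc(actual, scores)
```

Moving the threshold trades false positives against false negatives, which is the trade-off the ROC curve displays across all thresholds.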
DataRobot offers superior transparency, interpretability, and explainability so you easily understand how models were built, and have the confidence to explain why a model made the prediction it did.
In the Describe tab, you can view the end-to-end model blueprint containing details of the specific feature engineering tasks and algorithms DataRobot uses to run the model (Figure 5). You can also review the size of the model and how long it ran, which may be important if you need to do low-latency scoring.
In the Understand tab, popular exploratory capabilities include Feature Impact, Feature Effects, Prediction Explanations, and Word Cloud. These all help you understand what drives the model’s predictions.
Feature Impact measures how much each feature contributes to the overall accuracy of the model (Figure 6). For example, the reason why a patient was discharged from a hospital is directly related to the likelihood of a patient being readmitted. This insight can be invaluable for guiding your organization to focus on what matters most.
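A common, model-agnostic way to estimate this kind of contribution is permutation importance: shuffle one feature's values across rows and measure how much accuracy drops. The sketch below illustrates the idea with a toy stand-in model; the feature names, data, and model are hypothetical, and it should not be read as the exact computation behind the Feature Impact chart.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_impact(model, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling one feature's values across rows."""
    baseline = accuracy(model, rows, labels)
    shuffled = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted_rows = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    return baseline - accuracy(model, permuted_rows, labels)

# Toy stand-in for a trained model: it looks only at prior inpatient visits.
def toy_model(row):
    return 1 if row["number_inpatient"] >= 2 else 0

rows = [
    {"number_inpatient": 0, "age": 45},
    {"number_inpatient": 3, "age": 70},
    {"number_inpatient": 1, "age": 62},
    {"number_inpatient": 4, "age": 55},
    {"number_inpatient": 0, "age": 81},
    {"number_inpatient": 2, "age": 38},
]
labels = [0, 1, 0, 1, 0, 1]

impact = {name: permutation_impact(toy_model, rows, labels, name)
          for name in ("number_inpatient", "age")}
```

Because the toy model ignores age, shuffling age cannot change its predictions, so its impact is zero; shuffling number_inpatient can only hurt accuracy.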
The Feature Effects chart displays model details on a per-feature basis (a feature's effect on the overall prediction), depicting how a model understands the relationship between each variable and the target (Figure 7). It provides specific values within each column that are likely large factors in determining whether someone will be readmitted to the hospital.
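One widely used technique for this kind of per-feature view is partial dependence: force a feature to each value on a grid for every row, and average the model's predictions. The sketch below uses a toy scoring function in place of a trained model; the names, coefficients, and data are hypothetical, and treating Feature Effects as partial dependence is an assumption made for illustration.

```python
def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        predictions = [model({**row, feature: value}) for row in rows]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve

# Toy stand-in for a trained model returning a readmission probability.
def toy_model(row):
    score = 0.10 + 0.15 * row["number_inpatient"] + 0.01 * row["time_in_hospital"]
    return min(score, 1.0)

rows = [
    {"number_inpatient": 0, "time_in_hospital": 2},
    {"number_inpatient": 1, "time_in_hospital": 5},
    {"number_inpatient": 3, "time_in_hospital": 9},
]
curve = partial_dependence(toy_model, rows, "number_inpatient", grid=[0, 1, 2, 3])
```

Each point in curve pairs a feature value with the average predicted probability, which is what a per-feature effect plot draws.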
Prediction Explanations reveal the reasons why DataRobot generated a particular prediction for a data point so you can back up decisions with detailed reasoning (Figure 8). They provide a quantitative indicator of variable effect on individual predictions.
The Insights tab provides more graphical representations of your model. There are tree-based variable rankings, hotspots, and variable effects to illustrate the magnitude and direction of a feature's effect on a model's predictions, and also text mining charts, anomaly detection, and a word cloud of keyword relevancy.
The Word Cloud tab provides a graphic of the most relevant words and short phrases in a word cloud format (Figure 9). The tab is only available for models trained with data that contains unstructured text.
DataRobot can automatically generate model compliance documentation—a detailed report containing an overview of the model development process, with full insight into the model assumptions, limitations, performance, and validation detail. This feature is ideal for organizations in highly regulated industries that have compliance teams that need to review all aspects of a model before it can be put into production. Of course, having this degree of transparency into a model has clear benefits for organizations in any industry.
Every model built in DataRobot is immediately ready for deployment. You can:
A. Upload a new dataset to DataRobot to be scored in batch and downloaded (Figure 10).
B. Create a REST API endpoint to score data directly from applications (Figure 11). An independent prediction server is available to support low latency, high throughput prediction requirements.
C. Export the model for in-place scoring in Hadoop (Figure 12).
D. Download scoring code, either as editable source code or self-contained executables, to embed directly in applications to speed up computationally intensive operations (Figure 13).
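For option B, an application typically sends records to the scoring endpoint as an HTTP POST with a JSON body. The sketch below only builds such a request with the Python standard library and does not send it. The URL, token, header scheme, and payload shape are placeholders assumed for illustration; take the real endpoint, authentication headers, and request format from your deployment's integration details.

```python
import json
import urllib.request

def build_scoring_request(endpoint_url, api_token, records):
    """Build (but do not send) a JSON POST request carrying rows to score."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Bearer-token authentication is an assumption for this sketch.
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )

request = build_scoring_request(
    "https://example.com/placeholder-scoring-endpoint",  # placeholder URL
    "PLACEHOLDER_TOKEN",                                  # placeholder credential
    [{"number_inpatient": 2, "time_in_hospital": 5}],     # hypothetical record
)
# Sending would be: urllib.request.urlopen(request), then parsing the JSON reply.
```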
With DataRobot you can proactively monitor and manage all deployed machine learning models (including models created outside of DataRobot) to maintain peak prediction performance. This ensures that the machine learning models driving your business are accurate and consistent throughout changing market conditions.
At a glance you can view a summary of metrics from all models in production, including the number of requests (predictions) and key health statistics:
If you decide to replace a model that has drifted, simply paste the URL of a retrained DataRobot model (a model trained on more recent data from the same data source) or of a model that has compatible features. After DataRobot validates that the model matches, you can select a reason for the replacement, which is kept in a permanent archive. From this point forward, new prediction requests go to the new model with no impact on downstream processes. If you ever decide to restore the previous model, you can do so through the same process.
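Drift of the kind that prompts a replacement can be quantified by comparing the distribution of a feature at training time with its distribution in recent scoring data. The sketch below uses the Population Stability Index (PSI), one common drift statistic; the source does not state which measure DataRobot uses, and the bin edges and data here are hypothetical.

```python
import math

def bin_proportions(values, bin_edges, floor=1e-6):
    """Share of values in each bin; a small floor avoids log(0) for empty bins."""
    counts = [0] * (len(bin_edges) + 1)
    for v in values:
        counts[sum(v >= edge for edge in bin_edges)] += 1
    return [max(c / len(values), floor) for c in counts]

def population_stability_index(baseline, recent, bin_edges):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = bin_proportions(baseline, bin_edges)
    actual = bin_proportions(recent, bin_edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Hypothetical length-of-stay values at training time and in recent requests.
training_stay = [1, 2, 2, 3, 3, 4, 5, 6, 7, 9]
recent_stay = [4, 5, 6, 6, 7, 8, 8, 9, 10, 12]
edges = [3, 6]

psi_same = population_stability_index(training_stay, training_stay, edges)
psi_shifted = population_stability_index(training_stay, recent_stay, edges)
```

Identical distributions give a PSI of zero, and the value grows as recent data moves away from the training data; the cutoff at which to retrain is a judgment call for your use case.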
Once you have deployed a model you can launch a prediction application. Simply go to the Applications tab and select an application (Figure 17).
This example shows how to launch the Predictor application. The first step is to click Launch in the Applications Gallery and fill out the fields below. Click Model from Deployment and indicate the deployment from which you want to make predictions (Figure 18). Then click Launch.
Once the launch is complete, you will be taken to your Current Applications page, where your application is listed; it may take a moment to finish launching. You can then open the application and make predictions by filling out the relevant fields (Figure 19).
DataRobot’s regression and classification capabilities are available as a fully-managed software service (SaaS), or in several Enterprise configurations to match your business needs and IT requirements. All configurations feature a constantly expanding set of diverse, best-in-class algorithms from R, Python, H2O, Spark, and other sources, giving you the best set of tools for your machine learning and AI challenges.
(This community post references research from this article: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.)
Attachment: We've attached a PDF file of this article.