Automated Machine Learning (AutoML) Walkthrough

Overview

The DataRobot Automated Machine Learning product accelerates your AI success by combining cutting-edge machine learning technology with the team you have in place. The platform incorporates the knowledge, experience, and best practices of the world's leading data scientists, delivering unmatched levels of automation, accuracy, transparency, and collaboration to help your business become an AI-driven enterprise.

Use Case

This guide will demonstrate the basics of how to build, select, deploy, and monitor a regression or classification model using the automated machine learning capabilities of DataRobot. The potential application of these capabilities spans many industries: banking, insurance, healthcare, retail, and many more. The use case highlighted throughout these examples comes from the healthcare industry.

Healthcare providers understand that high hospital readmission rates spell trouble for patient outcomes. But excessive rates may also threaten a hospital’s financial health, especially in a value-based reimbursement environment. Readmissions are already one of the costliest episodes to treat, with hospital costs reaching $41.3 billion for patients readmitted within 30 days of discharge, according to the Agency for Healthcare Research and Quality (AHRQ).

The training dataset used throughout this document is from a research study that can be found online at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3996476/. The resulting models predict the likelihood that a discharged hospital patient will be readmitted within 30 days of their discharge.

Automated Regression & Classification Modeling

STEP 1: Load and Profile Your Data

To get started with DataRobot, you will log in and load a prepared training dataset. DataRobot currently supports .csv, .tsv, .dsv, .xls, .xlsx, .sas7bdat, .bz2, .gz, .zip, .tar, and .tgz file types, plus Apache Avro, Parquet, and ORC (Figure 1). (Note: If you wish to use Avro or Parquet data, contact your DataRobot representative for access to the feature.)

These files can be uploaded locally, from a URL or Hadoop/HDFS, or read directly from a variety of enterprise databases via JDBC. Directly loading data from production databases for model building allows you to quickly train and retrain models, and eliminates the need to export data to a file for ingestion.

Figure 1. Data Import

DataRobot supports any database that provides a JDBC driver—meaning most databases in the market today can connect to DataRobot. Drivers for Postgres, Oracle, MySQL, Amazon Redshift, Microsoft SQL Server, and Hadoop Hive are most commonly used.

After you load your data, DataRobot performs exploratory data analysis (EDA): it detects each feature's data type and computes summary statistics such as the number of unique and missing values, mean, median, standard deviation, minimum, and maximum. This information is helpful for getting a sense of the dataset's shape and distribution.
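To make that summary concrete, here is a rough sketch in pandas of the same statistics computed on a single column (the column name and values are invented for illustration; this is not DataRobot's EDA implementation):

```python
import pandas as pd

# Hypothetical slice of a readmissions training set
df = pd.DataFrame({
    "num_lab_procedures": [41, 59, 11, None, 70],
    "readmitted": [0, 1, 0, 0, 1],
})

def profile(series: pd.Series) -> dict:
    """Summary statistics similar to what an EDA pass reports."""
    return {
        "unique": series.nunique(),
        "missing": int(series.isna().sum()),
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),
        "min": series.min(),
        "max": series.max(),
    }

stats = profile(df["num_lab_procedures"])
```

Running `profile` over every column gives the kind of per-feature table DataRobot displays after ingestion.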

Start Modeling

STEP 2: Select a Prediction Target

Next, select a prediction target (the name of the column in your data that captures what you are trying to predict) from the uploaded dataset (Figure 2). DataRobot will analyze your training dataset and automatically determine the type of analysis (in this case, classification).
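As a deliberately simplified illustration of this inference (not DataRobot's actual rules), the analysis type can be guessed from the values in the target column:

```python
def infer_problem_type(target_values):
    """Guess the analysis type from the target column's values.

    A simplified illustration of the kind of rule involved, not
    DataRobot's actual detection logic.
    """
    distinct = {v for v in target_values if v is not None}
    if len(distinct) == 2:
        return "binary classification"
    if all(isinstance(v, (int, float)) for v in distinct):
        return "regression"
    return "multiclass classification"

print(infer_problem_type([0, 1, 0, 1, 1]))       # -> binary classification
print(infer_problem_type([2.5, 3.1, 4.0, 5.2]))  # -> regression
```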

Figure 2. Target Selection

DataRobot automatically partitions your data. If you want to customize the model building process, you can modify a variety of advanced parameters, optimization metrics, feature lists, transformations, partitioning, and sampling options. The default modeling mode is “Autopilot,” which employs the full breadth of DataRobot’s automation capabilities. For more control over which algorithms DataRobot runs, there are manual and quick-run options.

STEP 3: Begin the Modeling Process

Click the Start button to begin training models. Once the modeling process begins, the platform further analyzes the training data to create the Importance column (Figure 3). This Importance grading provides a quick cue to better understand the most influential variables for your chosen prediction target.

Figure 3. Features

Target Leakage

If DataRobot detects target leakage (i.e., information that would probably not be available at the time of prediction), the feature is marked with a warning flag in the Importance column. If the leaky feature is significantly correlated with the target, DataRobot will automatically omit it from the model building process. It also might flag features that suggest partial target leakage. For these features, you should ask yourself whether the information would be available at the time of prediction; if it will not, then remove it from your dataset to limit the risk of skewing your analysis.
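As a toy illustration of why this matters (this is not DataRobot's actual detection logic), a feature that is almost perfectly correlated with the target is a strong leakage suspect. In the hypothetical frame below, "discharge_code" is a deterministic function of the target:

```python
import pandas as pd

def flag_leaky_features(df: pd.DataFrame, target: str, threshold: float = 0.95) -> list:
    """Flag numeric features suspiciously correlated with the target.

    A toy stand-in for leakage detection: a near-perfect correlation
    often means the column encodes the answer.
    """
    corr = df.corr()[target].drop(target)
    return sorted(corr[corr.abs() >= threshold].index)

# Hypothetical data: "discharge_code" leaks the target exactly
df = pd.DataFrame({
    "age": [52, 67, 45, 71, 59, 63],
    "discharge_code": [0, 1, 0, 1, 0, 1],
    "readmitted": [0, 1, 0, 1, 0, 1],
})
print(flag_leaky_features(df, "readmitted"))  # -> ['discharge_code']
```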

You can easily see how many features contain useful information, and edit feature lists used for modeling.

You can also drill down on variables to view distributions, add features, and apply basic transformations.

DataRobot Modeling Strategy

DataRobot supports popular advanced machine learning techniques and open source tools such as Apache Spark, H2O, Scala, Python, R, TensorFlow, Facebook Prophet, Keras, DeepAR, Eureqa, and XGBoost. During the automated modeling process, it analyzes the characteristics of the training data and the selected prediction target, and selects the most appropriate machine learning algorithms to apply. DataRobot optimizes data automatically for each algorithm, performing operations like one-hot encoding, missing value imputation, text mining, and standardization to transform features for optimal results.
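Two of those preparation steps can be sketched in a few lines of pandas; the column names and values below are hypothetical, and this is only a generic illustration of the techniques, not DataRobot's preprocessing code:

```python
import pandas as pd

# Toy frame with a categorical column and a missing numeric value
df = pd.DataFrame({
    "admission_type": ["Emergency", "Elective", "Emergency", None],
    "num_medications": [13.0, None, 21.0, 8.0],
})

# Missing-value imputation: fill numerics with the column median
df["num_medications"] = df["num_medications"].fillna(df["num_medications"].median())

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=["admission_type"], dummy_na=True)
```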

DataRobot streamlines model development by automatically ranking models (or ensembles of models) based on the techniques advanced data scientists use, including boosting, bagging, random forests, kernel-based methods, generalized linear models, deep learning, and many others. By cost-effectively evaluating a near-infinite combination of data transformations, features, algorithms, and tuning parameters in parallel across a large cluster of commodity servers, DataRobot delivers the best predictive model in the shortest amount of time.

STEP 4: Evaluate the Results of Automated Modeling

After automated modeling is complete, the Leaderboard will rank each machine learning model so you can evaluate and select the one you want to use (Figure 4). The models that are “Most Accurate,” “Fast & Accurate,” and “Recommended for Deployment” are tagged so you can begin your analysis there, or just move forward with them to deployment.

Figure 4. Leaderboard

When you select a model, you see options for Evaluate, Understand, Describe, and Predict. To estimate model performance, the Evaluate options include the industry-standard Lift Chart, ROC Curve, Confusion Matrix, Feature Fit, and Advanced Tuning. There are also options for measuring models by Learning Curves, Speed versus Accuracy, and Comparisons. The interactive charts used to evaluate models are very detailed, but they don't require a background in data science to understand what they convey.

Transparency

STEP 5: Review how your Chosen Model Works

DataRobot offers superior transparency, interpretability, and explainability, so you can easily understand how models were built and confidently explain why a model made the prediction it did.

In the Describe tab, you can view the end-to-end model blueprint containing details of the specific feature engineering tasks and algorithms DataRobot uses to run the model (Figure 5). You can also review the size of the model and how long it ran, which may be important if you need to do low-latency scoring.

Figure 5. Blueprint

In the Understand tab, popular exploratory capabilities include Feature Impact, Feature Effects, Prediction Explanations, and Word Cloud. These all help you understand what drives the model’s predictions.

Interpreting Models: Global Impact

Feature Impact measures how much each feature contributes to the overall accuracy of the model (Figure 6). For example, the reason why a patient was discharged from a hospital is directly related to the likelihood of a patient being readmitted. This insight can be invaluable for guiding your organization to focus on what matters most.
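The intuition behind this kind of measurement is permutation-style importance: shuffle one feature's values and see how much the model's score degrades. The sketch below illustrates that general idea on a toy scorer; it is not DataRobot's implementation, and the data is invented:

```python
import random

def permutation_impact(score, X, y, feature_idx, seed=0):
    """Drop in score after shuffling one feature's column.

    If shuffling a feature hurts the score badly, the model relies on
    that feature. `score(X, y)` is any callable returning a quality
    metric where higher is better.
    """
    baseline = score(X, y)
    column = [row[feature_idx] for row in X]
    random.Random(seed).shuffle(column)
    shuffled = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                for row, v in zip(X, column)]
    return baseline - score(shuffled, y)

# Toy "model": predict readmission directly from feature 0
X = [[0, 7], [1, 7], [0, 7], [1, 7], [0, 7], [1, 7]]
y = [0, 1, 0, 1, 0, 1]

def accuracy(X, y):
    return sum(row[0] == target for row, target in zip(X, y)) / len(y)

# Feature 1 is constant, so shuffling it changes nothing (impact 0.0)
print(permutation_impact(accuracy, X, y, 1))
```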


Figure 6. Feature Impact

The Feature Effects chart displays model details on a per-feature basis (a feature's effect on the overall prediction), depicting how a model understands the relationship between each variable and the target (Figure 7). It provides specific values within each column that are likely large factors in determining whether someone will be readmitted to the hospital.

Figure 7. Feature Effects

Interpreting Models: Local Impact

Prediction Explanations reveal the reasons why DataRobot generated a particular prediction for a data point so you can back up decisions with detailed reasoning (Figure 8). They provide a quantitative indicator of variable effect on individual predictions.

Figure 8. Prediction Explanations

The Insights tab provides more graphical representations of your model. There are tree-based variable rankings, hotspots, and variable effects to illustrate the magnitude and direction of a feature's effect on a model's predictions, and also text mining charts, anomaly detection, and a word cloud of keyword relevancy.

Interpreting Text Features

The Word Cloud tab provides a graphic of the most relevant words and short phrases in a word cloud format (Figure 9). The tab is only available for models trained with data that contains unstructured text.

Figure 9. Word Cloud

STEP 6: Generate Model Documentation

DataRobot can automatically generate model compliance documentation—a detailed report containing an overview of the model development process, with full insight into the model assumptions, limitations, performance, and validation detail. This feature is ideal for organizations in highly regulated industries that have compliance teams that need to review all aspects of a model before it can be put into production. Of course, having this degree of transparency into a model has clear benefits for organizations in any industry.

Making Predictions

STEP 7: Make Predictions

Every model built in DataRobot is immediately ready for deployment. You can:

A. Upload a new dataset to DataRobot to be scored in batch and downloaded (Figure 10).

Figure 10. GUI Predictions

B. Create a REST API endpoint to score data directly from applications (Figure 11). An independent prediction server is available to support low latency, high throughput prediction requirements.
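For illustration, a REST scoring request might be assembled like this with the Python standard library. The URL, deployment ID, token, and feature names are placeholders that you would take from your own deployment's integration details:

```python
import json
from urllib import request

# Placeholder endpoint and credentials -- substitute the values shown
# in your deployment's integration settings
url = "https://example.datarobot.com/predApi/v1.0/deployments/DEPLOYMENT_ID/predictions"
rows = [{"number_diagnoses": 9, "num_lab_procedures": 41}]  # hypothetical features

req = request.Request(
    url,
    data=json.dumps(rows).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_TOKEN",
    },
    method="POST",
)
# request.urlopen(req) would submit the request and return scored rows
```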

Figure 11. Deployment

C. Export the model for in-place scoring in Hadoop (Figure 12).

Figure 12. Hadoop

D. Download scoring code, either as editable source code or self-contained executables, to embed directly in applications to speed up computationally intensive operations (Figure 13).

Figure 13. Scoring Code

Monitor and Manage your models

STEP 8: Monitor and Manage Deployed Models

With DataRobot you can proactively monitor and manage all deployed machine learning models (including models created outside of DataRobot) to maintain peak prediction performance. This ensures that the machine learning models driving your business are accurate and consistent throughout changing market conditions.
At a glance you can view a summary of metrics from all models in production, including the number of requests (predictions) and key health statistics:

  • Service Health looks at core performance metrics from an operations or engineering perspective: latency, throughput, errors, and usage (Figure 14).

    Figure 14. Service Health
  • Data Drift proactively looks for changes in the data characteristics over time to let you know if there are trends that could impact model reliability (Figure 15).

    You can also analyze data drift to assess if the model is reliable, even before you get the actual values back. You’re essentially analyzing how the data you’ve scored this model on differs from the data the model was trained on. DataRobot compares the most important features in the model (as measured by its Feature Impact score) and how different each feature’s distribution is from the training data.

    Green dots indicate features that haven't changed much. Yellow dots indicate features that have changed but aren't very important. You should examine these, but changes with these features don't necessarily mandate action, especially if you have lots of models. Red dots indicate important features that have drifted. The more red dots you have, the greater the likelihood that your model needs to be replaced.

    Figure 15. Data Drift
  • Accuracy compares actual values (or ground truth) against the corresponding predictions so you can assess model performance using standard machine learning metrics (Figure 16).

    Figure 16. Accuracy
    From here you can apply DataRobot's embedded data science expertise to review model performance and detect model decay. By clicking on a model you can see how its predictions have changed over time. Dramatic changes here can indicate that your model has gone off track.
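Drift between the training distribution and the recently scored distribution is commonly quantified with a metric such as the Population Stability Index (PSI). The sketch below illustrates the idea; it is not DataRobot's exact computation, and the bin proportions are invented:

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are lists of bin proportions that each sum
    to 1. A common rule of thumb: < 0.1 stable, 0.1-0.25 moderate
    drift, > 0.25 significant drift.
    """
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

training = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
scoring = [0.10, 0.20, 0.30, 0.40]   # distribution of recently scored data

drift = psi(training, scoring)  # ~0.23: moderate drift for this feature
```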

Replacing a Model

If you decide to replace a model that has drifted, simply paste the URL of a retrained DataRobot model (one trained on more recent data from the same data source) or of another model with compatible features. After DataRobot validates that the model matches, you can select a reason for the replacement, which is kept in a permanent archive. From that point forward, new prediction requests go to the new model with no impact on downstream processes. If you ever decide to restore the previous model, you can easily do so through the same process.

Prediction Applications

Once you have deployed a model you can launch a prediction application. Simply go to the Applications tab and select an application (Figure 17).

Figure 17. Applications

This example shows how to launch the Predictor application. The first step is to click Launch in the Applications Gallery and fill out the fields below. Click Model from Deployment and indicate the deployment from which you want to make predictions (Figure 18). Then click Launch.

Figure 18. Launch Deployment

Once the launch is complete you will be taken to your Current Applications page, where your application is listed (it may take a moment to finish). You can then open the application and make predictions by filling out the relevant fields (Figure 19).

Figure 19. Create New Record

Conclusion

DataRobot’s regression and classification capabilities are available as a fully-managed software service (SaaS), or in several Enterprise configurations to match your business needs and IT requirements. All configurations feature a constantly expanding set of diverse, best-in-class algorithms from R, Python, H2O, Spark, and other sources, giving you the best set of tools for your machine learning and AI challenges.

(This community post references research from this article: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.)


Comments
NiCd Battery

Thank you so much for this Walkthrough. However, I couldn't download the research study cited at the following link:

https://www.hindawi.com/journals/bmri/2014/781670/sup/

Do you have another option to download this study?

Once again, thank you very much for your time

Best!

Youness

NiCd Battery

Emily, I have another question: what are the disadvantages and advantages of running DataRobot manually?

Thank you very much

Have a great day!

Youness

Data Scientist

Hi Dr. Youness,

 

That page is down temporarily.  You can find an abstract for the article on pubmed: https://pubmed.ncbi.nlm.nih.gov/24804245/. The full article is also here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3996476/ (we updated this community article with a link to the latter location).

 

By running DataRobot manually, are you referring to using the GUI vs. the API, or Manual Mode vs. Autopilot?

 

Thanks, 

 

Emily

NiCd Battery

Thank you very much Emily. 

By running DataRobot manually I am referring to using the:
GUI vs. API
Manual Mode vs. Autopilot

On the other hand, how can I download all the documentation about DataRobot's algorithms?

Once again, thank you very much for your time

Best regards

Dr. Youness El Hamzaoui

Community Team

Hi Doctor Youness - is your question perhaps answered by Duncan here: https://community.datarobot.com/t5/ai-platform-trial/codes/m-p/7415#M16 ?

“Unfortunately we don't offer that capability and likely will not in the short term for the above reasons I described. However, if you click on the Describe -> Blueprint tab you'll see our blueprints outlined as a series of rectangles. If you click on any of those rectangles, you'll link to our documentation, which cites academic papers that describe the techniques we use.

It's worth noting that our blueprints implement an entire ML pipeline not just a single algorithm. As a simple example, one of our blueprints uses unsupervised learning techniques, a supplementary linear classifier, and then feeds it all into a gradient boosted tree. 

We have had multiple researchers publish papers using DataRobot, and they have either cited our platform directly or gone further and cited the references we provide in our documentation. ”

———

That was tagged as a solution to the question. Trying to understand your question! Thanks 

-Linda 

NiCd Battery

Thank you very much lhaviland for your answer. However, you didn't answer my question!
My question is: how can I download all the documentation about DataRobot's algorithms, regardless of whether you're running an exercise or example?
And on the other hand, what are the disadvantages and advantages of running DataRobot manually? I mean:
GUI vs. API
Manual Mode vs. Autopilot
Once again, thank you very much for your time
Best regards
Dr. Youness

Community Team

Sorry for the misunderstanding. I really was only attempting to answer this part of your question: "How can I download all the documentation about DataRobot's algorithms?" -- with a pointer to Duncan's reply on the other question.

As for the rest, someone else will have to chime in!

 

Data Scientist

Dr. Youness, 

Those are great questions.  

First let's talk about Manual Mode vs. Autopilot. Manual Mode allows you to select individual blueprints from the repository, while Autopilot tries out a wide variety of algorithms at different sample sizes. Generally, my advice is to run Autopilot first to see which types of algorithms rise to the top of the Leaderboard, and then run the same data in Manual Mode using all of the blueprints from the top modeling type.

As far as interacting with the GUI vs. the API, there are strengths to either approach. If you use the GUI you don't have to write any code, and it's pretty easy to sail through a project using the built-in evaluation/interpretability tools. You can also easily score and download data.

The API has both R and Python packages you can use to interact programmatically with DataRobot. Interacting this way allows you to do much more complex projects, as well as generate easily reproducible results for those projects. For example, you can create a model factory if you interact programmatically. We also have a GitHub page with code examples and tutorials to help you get started.

I hope this answers your questions, thanks for reaching out  

 

Emily

NiCd Battery

Thank you so much Emily,

Could you explain in more detail, please: what do you mean by using the built-in evaluation/interpretability tools? Also, could you give me an example?

On the other hand, I would like to work using API, so what is the starting point?

Once again, thank you very much Emily for your time

Best regards

Dr.Youness

NiCd Battery

Hi!

What is the difference between:
Feature Impact and Feature Effects, according to the GUI of DataRobot?

On the other hand, can you explain the red numbers, the blue numbers, and the characters (+++; ++; +; -; --; ---) in "Prediction Explanations", according to the figure enclosed with this comment?

Thank you very much

 

NiCd Battery

Hi!

I have already run my database about hydraulic concrete using Eureqa. I have found the analytical formula:

RESISTENCIA REAL (kg/cm2) = exp(High Cardinality and Text features Modeling +0.01*(REVENIMIENTO TEÓRICO (cm)) + 0.00*(TEMPERATURA (°C)) + 0.00*(EDAD DE ENSAYO (días)) + 1.71114674903652e-5*(CARGA DE RUPTURA (kg)) - 0.74 - 0.00*(RESISTENCIA DE DISEÑO (kg/cm2)) - 0.01*(REVENIMIENTO REAL (cm))).

My question is: what does exp(High Cardinality and Text features Modeling) mean?

Awaiting your response

Thank you very much

 

Data Scientist

Hi Dr. Youness, 

Evaluation and interpretation tools include a number of features in DataRobot.

Under the Evaluate tab you have the Lift Chart and ROC Curve tabs for classification problems, or a Residuals plot for regression problems. You can find more information about these tools here.

Under the Understand tab you can find Feature Impact, Feature Effects and Prediction Explanations.  We have an overview of this here

To address your other question on Prediction Explanations: the visual you pasted is simply a summary that shows the records ranked as the top three and bottom three by the evaluation metric. The rows with the red probability are at the highest end of the prediction distribution and are most likely to achieve the target, while the probabilities in blue are on the opposite end of the spectrum and are least likely to achieve the target. The characters (+++, ---) indicate whether the explanations for that particular row of data are pushing the score upwards (+) or downwards (-). The number of symbols indicates the strength of this relationship: more symbols means stronger predictive power.
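That symbol convention can be pictured as a small mapping from a signed explanation strength to +/- characters (an illustrative sketch with made-up thresholds, not DataRobot's exact cutoffs):

```python
def strength_symbols(strength: float) -> str:
    """Map a signed explanation strength to +/- symbols.

    More symbols mean a stronger push on the prediction; '+' pushes
    the score up, '-' pushes it down. Thresholds here are illustrative.
    """
    count = 1 if abs(strength) < 0.33 else 2 if abs(strength) < 0.66 else 3
    return ("+" if strength >= 0 else "-") * count

print(strength_symbols(0.8))   # -> +++
print(strength_symbols(-0.2))  # -> -
```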

If you wanted to use the API, then the best place to start depends on how you are going to interact with it.  Are you planning on using R, Python or Curl requests to interact with DataRobot?

I hope this helps  

 

Emily

Data Scientist

Hello again Dr. Youness, 

Your Eureqa question is a bit trickier.

When there are features with high cardinality (categorical features with many values, or text fields), DataRobot does preprocessing on those features. To get the coefficient for each category/text term, you will need to export the results to a CSV. You can then see the coefficients you need to reproduce the model.

 


 

Does this help? 

 

Emily

NiCd Battery

Thank you so much Emily for your reply.

On other hand, 

1. Why couldn't I carry out Prediction Explanations? I received this message:
"There are not enough rows in the validation partition. Please make sure the validation partition contains at least 100 rows" ?

2. I have DataRobot Prime. Why couldn't I generate and run the code on the Python platform?


Community Team

Hi @Doctor Youness - The in-app Platform documentation may have the answer to your first question:

There must be at least 100 rows in the validation set for Prediction Explanations to compute.

This can be found in the doc here (for the trial).

NiCd Battery

Thank you very much

Best regards

NiCd Battery

Hi!

They said:

"DataRobot Prime allows you to download scoring code (Python, Java) for any model on your leaderboard and use it directly in your application. When run, it creates an approximation for that model".

but it is not true: I already have DataRobot Prime, and I have tried to generate the code and download it, but I have not been able to open it so that I can start editing it, as I do with Matlab!

I need help with this issue, please!

I would like to attach the code to this comment, but the system does not allow me!

Awaiting your reply. Please!

 

