Automated Machine Learning Walkthrough—for Business Analysts

Overview

The DataRobot Automated Machine Learning product accelerates your AI success by combining cutting-edge machine learning technology with the team you have in place. The product incorporates the knowledge, experience, and best practices of the world's leading data scientists, delivering unmatched levels of automation, accuracy, transparency, and collaboration to help your business become an AI-driven enterprise.

This guide demonstrates to you, a business analyst, the basics of how to build a regression or classification model using the automated machine learning capabilities of DataRobot, and how to uncover insights into your data using the various functionality available in the product. Note that this guide does not cover the deployment of models; for that information, refer to A Complete Deployment Workflow for DataRobot Models.

This guide is not intended to teach you data science; rather, it is intended to facilitate your use of DataRobot and enable you to incorporate it into your workflow without getting overwhelmed by data science principles. We recognize that it is impossible to anticipate every question you may have or the next step you’d like to take in this journey. We also understand that any single topic in this guidebook could be a full-fledged book. Nevertheless, we hope that you find this guide to be a simple roadmap to the core functionality of DataRobot as it relates to two very common types of models: regression and classification analyses (typically using historical data stored in Excel spreadsheets). The potential applications of these types of models are vast and span many industries—banking, insurance, healthcare, retail, and many more—and any organizational function, from human resources, to marketing, to the business strategy team. Undoubtedly, you have many opportunities to use these types of models within your organization.

At the end of this guide we provide a glossary to help you quickly grasp DataRobot terminology and data science concepts. Terms in blue throughout the guide are in the Glossary. 

DataRobot’s Automation

In a nutshell, here’s how DataRobot works: a user uploads a dataset to DataRobot and picks a target variable—the name of one of the columns—based on the practical business problem they wish to solve. The product automatically applies best practices for data preparation and preprocessing, feature engineering, and model training and validation. It then selects the most appropriate algorithms based on the data and target variable (problem being addressed). After training models, DataRobot ranks them according to their accuracy with the most appropriate model for the business problem at the top of the list. Feature-rich interpretation tools ensure business analysts can understand the models. DataRobot automatically uncovers potentially overlooked insights and keeps guardrails in place so you can trust the predictions made.

DataRobot:

  • Supports automatic feature engineering on numeric, categorical, text, and date fields.
  • Automatically partitions data into the necessary training, validation, and holdout datasets. 
  • Provides automated model training and validation in Autopilot mode, guiding or automatically selecting the best variables for input into a predictive algorithm (the process or set of rules that the computer will follow).
  • Maintains a Leaderboard that shows how all trained models perform against an accuracy measuring statistic like LogLoss or area under the curve (AUC).
  • Includes a number of built-in accuracy evaluation visualizations to determine if a model can be trusted. These include confusion matrix, Lift Chart, and ROC Curve visualizations. It then presents the charts in a downloadable folder.
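Although this guide focuses on the point-and-click interface, the same upload-a-dataset, pick-a-target, run-Autopilot workflow described above can also be scripted. The sketch below is a minimal, illustrative example assuming the DataRobot Python client; the endpoint, API token, file name, and target column are placeholders, and exact method names can vary by client version.

    # Minimal sketch assuming the DataRobot Python client; names and versions may differ.
    import datarobot as dr

    # Placeholder credentials -- use your own endpoint and API token.
    dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

    # Upload a dataset; DataRobot creates a project and runs its initial EDA.
    project = dr.Project.create(sourcedata="hospital_readmissions.csv",
                                project_name="Readmissions walkthrough")

    # Pick the target and kick off Autopilot, the fully automated modeling mode.
    project.set_target(target="readmitted", mode=dr.AUTOPILOT_MODE.FULL_AUTO)
    project.wait_for_autopilot()

    # The Leaderboard: trained models ranked by the project's optimization metric.
    for model in project.get_models():
        print(model.model_type, model.metrics[project.metric]["validation"])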

What is a regression model? 

A model describes the relationship between a target variable (which is what the model is predicting) and other features (aka independent variables). We call the model a regression model when the target variable is numerical (or continuous), such as sales volume for a particular product. In other words, a regression model predicts a quantity.

What is a classification model?

On the other hand, when the target is categorical (or discrete), we often call the model a classification model. A classification model, as the name suggests, helps classify a particular instance into different categories. When there are two categories, we call this a binary classification model; if the number of categories is greater than two, it is often referred to as a multiclass classification model. Typical examples include predicting whether a patient is going to be readmitted in the coming month or not (binary classification) or what type of product a customer is more likely to buy (multiclass classification). In other words, a classification model predicts a label.
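To make the distinction concrete, here is a small, self-contained scikit-learn sketch (with made-up numbers, unrelated to the hospital example) in which the same features feed either a regression model that predicts a quantity or a classification model that predicts a label.

    # Illustrative only: the same features with two different kinds of targets.
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = [[1, 20], [2, 35], [3, 50], [4, 70]]          # features, e.g., prior visits and age

    # Regression: the target is a quantity (e.g., length of stay in days).
    y_quantity = [2.0, 3.5, 5.0, 7.5]
    reg = LinearRegression().fit(X, y_quantity)
    print(reg.predict([[2, 40]]))                      # -> a number

    # Binary classification: the target is a label (e.g., readmitted or not).
    y_label = [0, 0, 1, 1]
    clf = LogisticRegression().fit(X, y_label)
    print(clf.predict([[2, 40]]), clf.predict_proba([[2, 40]]))  # -> a label and its probability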

Example dataset and use case: hospital readmissions

The use case that will be highlighted throughout this guide comes from the healthcare industry. Healthcare providers understand that high hospital readmission rates spell trouble for patient outcomes. Excessive rates may also threaten a hospital’s financial health, especially in a value-based reimbursement environment. Readmissions are already one of the costliest episodes to treat, with hospital costs reaching $41.3 billion for patients readmitted within 30 days of discharge, according to a research study by the Agency for Healthcare Research and Quality (AHRQ). 

The training dataset used throughout this guide is from the study and can be found under “Supplementary Materials” at https://www.hindawi.com/journals/bmri/2014/781670/sup/. The resulting models predict the likelihood that a discharged hospital patient will be readmitted within 30 days of their discharge.


Step 1: Getting Ready

But first, put aside the sample use case and think about how you will incorporate AI and DataRobot into your business analyst workflow. The most successful projects are those that include some up-front planning and framing of the question you seek to solve. The work you do before pushing the Start button in DataRobot is as important as what comes after. This section touches briefly on these topics.  

(For deeper dives into each topic, see these community posts: Best Practices for Building ML “Learning Datasets” and Paxata for AI/ML Use-cases: Best Practices.)

Framing Your Business Problem

To get the most value from AI, you need to understand what you are trying to accomplish and why.  While AI is very interesting and exciting, you don't want to be doing experiments just for the sake of doing them. You want to determine if an automated process will help you make better predictions or get richer insights. 

The first step is defining the problem. Here are the criteria that you should use:

  • State a problem using the language of business (without using technical jargon).
  • Specify actions that might result.
  • Include specific details (number of customers affected, costs, etc.).
  • Explain the impact to the bottom line.

You also need to determine your target. This is what you seek to predict or the column in the dataset about which you want to gain a deeper understanding. It is also the business outcome that you want to be able to predict in the future.

Note: You may hear the terms target variable, dependent variable, response, and outcome. They are frequently used interchangeably and in this guide (and in-app Platform Documentation) we generally use “target variable.”

Identifying Your Team 

Aside from you, think about who else should be involved. Be specific about roles and responsibilities.

  • Once you have new insights or information from your model, who are the people within your organization who will use it or rely on it?
  • Do you need additional subject matter experts (SMEs)?
  • Do you need external partners, such as a DataRobot Data Scientist or Success Manager?
  • Who owns the data you want to use? If not you, do you need the cooperation and/or expertise of the data owners to help you obtain the data and understand it?

Preparing Your Data  

It’s important to evaluate your data before loading it into DataRobot for several reasons.  

You want to make sure the data contains the information needed to address the business problem that you are seeking to solve. You also will want to ensure that the data is formatted and labeled in a way that makes it easier to understand later.

DataRobot will use the names of the column headings throughout the project. Consider whether to relabel your columns to make them more meaningful. For example, in a dataset of dairy products, a column labeled “Product12x557h” is rather cryptic; you could change it to better describe the product, such as “Plain_Yogurt_8 oz.” This makes the information easier to grasp for downstream users; for example, if you export documents or charts out of DataRobot, you want consumers of that analysis to understand the data. As one customer put it, “the language of data science as used within DataRobot can be new to some users, and so better descriptions within your column names before the data is loaded help bring some simplicity and familiarity to the project screens in the interface.”

Consider whether you might have redundant information in your data that is labeled differently in different columns. Which column is better to use?  Can the other one be removed? DataRobot will look for data redundancy and advise you if it occurs; however, it makes things go faster if you remove it before loading and then you can avoid making changes after the data has already gone through the modeling process.  
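If you prepare your data in Python rather than Excel, both suggestions above take only a couple of pandas calls before uploading. The file and column names in this sketch are hypothetical.

    import pandas as pd

    df = pd.read_excel("dairy_products.xlsx")    # hypothetical source file

    # Relabel cryptic columns so downstream consumers of exported charts understand them.
    df = df.rename(columns={"Product12x557h": "Plain_Yogurt_8oz"})

    # Drop a column that duplicates information already carried by another column.
    df = df.drop(columns=["product_code_legacy"])

    df.to_csv("dairy_products_clean.csv", index=False)   # ready to load into DataRobot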

Tip: You may like to explore a dataset using different targets. There’s a simple way to streamline that work: make a copy of the project (from the Manage Projects page) and re-run it with the second target.  

Automated Data Preparation in DataRobot

When you load an Excel file (or other supported file), DataRobot performs some basic work on the dataset, described below. If DataRobot has any challenges reading the dataset, it alerts you so that you have the opportunity to correct the problem and reload the data.

  • For each column, DataRobot will calculate and inform you of the number of rows that have missing values.
    • Missing values that are fed into the model-building process will be automatically processed by DataRobot using suitable modeling techniques.
  • Duplicate columns or columns with missing titles are flagged by DataRobot for your awareness.
  • DataRobot will attempt to infer the data type for each column (numeric, categorical, date, text, etc.); you can review and override this automation on the project Data page.
  • The result of calculated Excel formulas will be interpreted as static, raw values.
  • Identifier columns that have a unique value for every row are marked for your awareness.
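DataRobot runs these checks for you automatically. As a rough analogue, the pandas sketch below shows how you could spot missing values, inferred data types, duplicate columns, and identifier-like columns yourself before uploading (the file name is hypothetical).

    import pandas as pd

    df = pd.read_csv("hospital_readmissions.csv")    # hypothetical file name

    print(df.isna().sum())                  # missing values per column
    print(df.dtypes)                        # inferred data type for each column
    print(df.columns[df.T.duplicated()])    # columns whose values duplicate another column
    print([c for c in df.columns if df[c].nunique() == len(df)])   # identifier-like columns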

Step 2: Load Your Data 

Figure 1 shows the DataRobot home page that appears when you log in. As discussed previously, to solve a business problem using machine learning you need data, and this is exactly what DataRobot asks you to provide to get started.

DataRobot currently supports .csv, .tsv, .dsv, .xls, .xlsx, .sas7bdat, .bz2, .gz, .zip, .tar, and .tgz file types, plus Apache Avro, Parquet, and ORC. (If you wish to use Avro or Parquet data, contact your DataRobot representative for access to the feature.) Files of these types can be uploaded from a local drive location, from a URL or Hadoop/HDFS, or read directly from a variety of enterprise databases via JDBC. Directly loading data from production databases for model building allows you to quickly train and retrain models, and eliminates the need to export data to a file for ingestion. 

Figure 1. Data Import

DataRobot supports any database that provides a JDBC driver; this means that most databases in the market today can connect to DataRobot. Drivers for Postgres, Oracle, MySQL, Amazon Redshift, Microsoft SQL Server, and Hadoop Hive are most commonly used. 

As soon as you load your data, DataRobot creates a new project and then does exactly what an experienced data scientist would do: perform an Exploratory Data Analysis (EDA). During this step in the data science process, you inspect the data you're going to use to build models, evaluate its quality, and explore the different variables (or features) in your data to get better insights. Using DataRobot for this exploratory data analysis step enables automated, instant insights that would take much longer to produce with traditional tools like Microsoft Excel. This information helps you get a sense of the dataset's shape and distribution, further understand your data, and confirm that there are no data quality issues before beginning the predictive modeling process.

When the initial EDA completes, DataRobot displays the Start screen (Figure 2). Although you can specify the target feature to use for predictions at this point, the best practice is to review the EDA insights first. This will ensure that you are comfortable with the data that you have imported and allow you to confirm that there are no data quality issues. For this you can scroll down or click the Explore (dataset name) link at the bottom to view a data summary.  

Figure 2. Start screen

Exploring your Data

The Data page presents your data in a table format (Figure 3). In the far left column, you'll see each feature name. In the Var Type column, you can see the data type that DataRobot automatically assigned. Further to the right, you'll see some basic summary statistics, such as the number of unique and missing values in each feature, as well as the minimum, maximum, and mean for the numeric features.

Figure 3. Data Exploration

You can see more detailed information by clicking on any feature in the list. This opens a histogram that shows the feature values grouped into bins on the X-axis, and a count of the number of rows within each bin on the Y-axis (Figure 4). Essentially, this shows you the distribution of that feature. You can also view a plot of the most frequent values, and can view the same information in a table format, ordered by most frequent value count.

Figure 4. Histogram
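DataRobot builds these histograms for you; if you ever want to reproduce a comparable view outside the product, a short pandas/matplotlib sketch does the same binning (the file name is hypothetical; number_inpatient is a column in the readmissions dataset).

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("hospital_readmissions.csv")     # hypothetical file name

    # Group the feature's values into 10 bins and count the rows in each bin.
    df["number_inpatient"].plot.hist(bins=10)
    plt.xlabel("number_inpatient")
    plt.ylabel("row count")
    plt.show()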

Step 3: Select a Prediction Target 

Now that you've explored your features, you are ready to pick the feature you want to use for the prediction as the target (the name of the column in your data that captures what you are trying to predict) from the uploaded dataset (Figure 5).

DataRobot analyzes your training dataset and automatically determines the type of problem you are trying to solve (in this case, a classification problem); based on this, DataRobot recommends an appropriate optimization metric. Optimization is one of the essential ingredients in the recipe of machine learning algorithms. It starts with defining some kind of loss function/cost function and ends with minimizing it. In simpler words, you can think of the optimization metric as the variable that the machine learning models will try to minimize (or maximize) so that you end up with the best performing (lowest error/highest accuracy) machine learning models. 

Figure 5. Target Selection
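For a binary classification project like this one, the recommended metric is LogLoss. As a concrete illustration of what the models try to minimize, the sketch below computes LogLoss for two sets of hypothetical predicted probabilities: confident, mostly correct predictions score low (good), while uncertain or wrong predictions score high.

    # LogLoss for a binary target: -mean( y*log(p) + (1-y)*log(1-p) ), where p is the
    # predicted probability of the positive class. Smaller is better.
    from sklearn.metrics import log_loss

    y_true = [1, 0, 1, 0]

    good_probs = [0.9, 0.1, 0.8, 0.2]   # confident and mostly correct
    poor_probs = [0.4, 0.6, 0.3, 0.7]   # uncertain or leaning the wrong way

    print(log_loss(y_true, good_probs))  # low error
    print(log_loss(y_true, poor_probs))  # higher error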

Additionally, after the target is entered, the Show Advanced Options link appears at the bottom of the page (Figure 6).  

Figure 6. Show Advanced Options

The Advanced Options (Figure 7) allow you to set a variety of configurations, including the optimization metric to use for modeling, different partitioning schemes, downsampling, and many more. Please note that the default settings provide guardrails enabling less experienced data scientists, engineers, analysts, etc. to proceed with building excellent models without advanced understanding or additional configuration. However, DataRobot does provide fine-grained control for users who would like to manually specify those settings.

Figure 7. Advanced Options

Step 4: Begin the Modeling Process 

To begin the modeling process, simply click the Start button. This will kick off the default modeling mode, which we call “Autopilot.” This employs the full breadth of DataRobot’s automation capabilities. (Note that you can also customize Autopilot through Advanced Options.)

During this step, DataRobot looks back across the dataset to gather information about it: number of rows, number of columns and column types, information on what is inside each of these columns, and so on. DataRobot will create the training and testing partitions and initiate the second round of Exploratory Data Analysis (EDA2). At this time it also analyzes the target column and how the target relates to the other features of the dataset. DataRobot then uses all of this information to dynamically create 30 to 40 different modeling strategies that will work well with the dataset.

Before we go into the details of the modeling strategies, let’s explore EDA2. Having identified a target allows DataRobot to compare each feature to the target and derive additional target-based insights, namely the degree of correlation of each feature to the target and to other features.

A new column, Importance, appears in the middle of the page with a green bar that indicates the degree of correlation with the target—the more green, the higher the correlation. Notice that the feature order has been sorted in order of correlation (Figure 8). (Tip: You can change sort order by clicking on any header name.) If you hover over a green bar you will see the computed correlation values, along with a link to further documentation.  

Feature importance is a great way to identify which features are most related to the target. When you start with tens or hundreds of features, by using feature importance you can quickly identify a handful of features that have the strongest relationships to your target feature.

Figure 8. EDA2

If DataRobot detects target leakage (information that would probably not be available at the time of prediction), the feature is marked with a warning flag in the Importance column. If the leaky feature is significantly correlated with the target, DataRobot will automatically omit it from the model building process. It also might flag features that suggest partial target leakage. For those, you should ask yourself whether the information would be available at the time of prediction and if not, exclude it so that you limit the risk of skewing your analysis.  (For some help determining target leakage, see What is Target Leakage and How do I Avoid it?)

You can easily see how many features contain useful information, and edit feature lists used for modeling. 

Feature Association Matrix

What is it? 

The Feature Associations pane provides a matrix to help you track and visualize associations within your data. This information is derived from different metrics that: 1) help determine the extent to which features depend on each other; and 2) provide a protocol that partitions features into separate clusters or “families.”

Where is it?  

Click the Feature Association tab on the Data page.

To see relationships in terms of correlations between features, click Feature Associations in the upper left corner of the page.  The table displayed is a matrix that compares every pair combination of the top features from the Informative Features list, as well as groups of features denoted by color.  Each cell for a given feature pair is a color gradient from black to a bright color.

The page displays a matrix with an accompanying details pane for more specific information on clusters, general associations, and association pairs. From the details pane, you can view associations and relationships between specific feature pairs. Below the matrix is a set of matrix controls to modify the view.  

The Feature Associations matrix (Figure 9) provides information on association strength between pairs of numeric and categorical features (that is, num/cat, num/num, cat/cat) and feature clusters. Clusters, families of features denoted by color on the matrix, are features partitioned into groups based on their similarity.  Clustering allows you to quickly understand the strength and nature of the associations and detect families of pairwise association clusters.  You can click and drag to select a subsection of the matrix for a zoomed-in view.

Figure 9. Feature Associations Matrix

If you want to further understand the relationship between two of your features, you can do so by clicking View Feature Association Pairs (Figure 10). This displays plots of the individual association between the two features of a feature pair. From the resulting insights, you can see the values that are impacting the calculation or the “metrics of association.”

Figure 10. Feature Associations Pairs

Step 5: Evaluate the Results of Automated Modeling 

After you start model building, all models being built or queued (to be built) appear on the right side of the screen in the Worker/Modeling Queue.

The Worker Queue is broken into two parts: models in progress (PROCESSING) and models in the queue (IN QUEUE). For each in-progress model, DataRobot displays a live report of CPU and RAM usage (Figure 11).


Figure 11. Worker/Modeling Queue

This is where the real DataRobot magic happens. In Figure 11, we see that two models are being fit in parallel, but DataRobot lets us crank this up depending on how many workers we have available (4 total workers in this case) so the modeling finishes a bit faster. As you increase the number of workers, you are literally multiplying your horsepower. Some of our customers put a limit on how many workers can be used in order to spread the horsepower across a team. If there are no limits on your use of workers, we suggest always dialing it up to the maximum number.

As the fitting is completed, the models are moved over to the Models tab (Figure 12).

Figure 12. Models Tab/Leaderboard

Here’s our model Leaderboard!

Each item on the Leaderboard represents a different modeling approach. DataRobot incorporates popular advanced machine learning techniques and open source tools such as Apache Spark, H2O, Scala, Python, R, TensorFlow, Facebook Prophet, Keras, DeepAR, Eureqa, XGBoost, and so on. During the automated modeling process, it analyzes the characteristics of the training data and the selected prediction target, and selects the most appropriate machine learning algorithms to apply. DataRobot optimizes data automatically for each algorithm, performing operations like one-hot encoding, missing value imputation, text mining, and standardization to transform features for optimal results. By “optimizing,” we mean that each algorithm wants the data in a different format (for example, it needs numerical information in a certain format). Performing the operations listed above is normally a manual, time-consuming process for data scientists, but DataRobot automatically does it for you. (See this community article for more information about DataRobot’s automated feature engineering techniques.)
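For comparison, here is roughly what the manual version of those preprocessing steps looks like when a data scientist wires them up by hand in scikit-learn. This is an illustrative sketch, not DataRobot's internal code; the column names come from the readmissions dataset, but the specific choices (median imputation, one particular model) are arbitrary.

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric = ["number_inpatient", "num_medications"]
    categorical = ["discharge_disposition_id", "admission_type_id"]

    preprocess = ColumnTransformer([
        # Fill missing numbers with the median, then standardize.
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        # One-hot encode categories, ignoring values unseen during training.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])

    model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
    # model.fit(X_train, y_train)   # X_train / y_train would come from your prepared dataset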

Advanced techniques applied automatically

DataRobot also streamlines model development by automatically ranking models (or blended models/ensembles of models) based on the techniques advanced data scientists use, including boosting, bagging, random forests, kernel-based methods, generalized linear models, deep learning, and many others. By cost-effectively evaluating a near-infinite combination of data transformations, features, algorithms, and tuning parameters in parallel across a large cluster of commodity servers, DataRobot delivers the best predictive model in the shortest amount of time. 

In short, the approach that DataRobot takes here is to gather the best open source algorithms and our own proprietary models, pair them with appropriate preprocessing steps based on the data, and then make them compete against each other. This helps eliminate the selection bias that occurs naturally with any experienced data scientist, who might be more comfortable with one type of model over another and therefore limit the potential solutions to only the models they can build themselves. DataRobot allows you to try out different kinds of approaches that you might not otherwise attempt.

On the Leaderboard, you will notice that DataRobot ranks these models from most to least accurate according to the default optimization metric, or accuracy measuring statistic, LogLoss. The key point to know about LogLoss is that it measures error, so the smaller the value, the better.

Best model? Survival of the fittest

The approach we are taking here at DataRobot is what we call the "survival of the fittest."

DataRobot starts by fitting those 30 to 40 modeling approaches on about 16% of the data. Once that round is complete and we have the initial performance of these models, DataRobot doubles the amount of data it uses and keeps only the half of the models that performed best, so only the best ones make it to the next round, and the process repeats. Once the 32% round ends, DataRobot again doubles the amount of data used and re-runs the top 10 models.

This process continues until DataRobot has found the best performing model in the quickest time possible. And of course, if there are any specific models that didn't make the next round but you want to run with more data, you can do so with a few clicks; even better, go to the Repository and run any model that is available for your dataset and the problem you are trying to solve.
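Conceptually, this tournament resembles a successive-halving search: train every candidate on a small sample, keep the better half, double the data, and repeat. The toy sketch below illustrates the idea only; it is not DataRobot's actual implementation, and the score_fn and data arguments are placeholders.

    # Toy illustration of a successive-halving style tournament (not DataRobot's actual code).
    def tournament(candidates, data, score_fn, start_fraction=0.16):
        survivors, fraction = list(candidates), start_fraction
        while len(survivors) > 1 and fraction < 1.0:
            sample = data.sample(frac=min(fraction, 1.0), random_state=0)
            # Score every surviving candidate on the current sample (lower = better, like LogLoss).
            ranked = sorted(survivors, key=lambda model: score_fn(model, sample))
            survivors = ranked[: max(1, len(ranked) // 2)]   # keep the better half
            fraction *= 2                                     # double the data next round
        return survivors[0]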

Blending models

DataRobot’s last step is to try some ensembling of models (what we call blenders) which can improve your accuracy even more. See the Blenders and Ensembles community article for more information.

If you look closely you will notice that DataRobot has also automated the feature selection process and has run different feature lists against the models. As DataRobot further understands which features are the most important based on their predictive power, it automatically tries to use subsets of the features to train new models. This can sometimes improve the performance of the models as well as increase their efficiency.

After automated modeling is complete, the next step is to understand how these models are actually treating your data, evaluate their performance, and take a deeper look at how the different features impact the models' predictions. To gain these and more insights, DataRobot provides various options for you. The model that is “Recommended for Deployment” is tagged so you can begin your analysis there.

Click on a model to find options to Evaluate, Understand, Describe, and Predict. (Note: There are also options for measuring models by Learning Curves, Speed vs Accuracy, and Comparisons. The interactive charts to evaluate models are very detailed, but don't require a background in data science in order to understand what they convey.)

Figure 13. Models Tab/Leaderboard

Prior to deploying or getting predictions from a model, we strongly recommend viewing these tabs for several models so you can see any similarities or differences in how each algorithm interprets the data.

In the next section we discuss how you can further evaluate and understand these models.

Step 6: Review How Your Chosen Model Works 

DataRobot offers superior transparency, interpretability, and explainability so you can easily understand: 1) how models were built; 2) how to evaluate a model’s performance; 3) what features are important to a model; 4) the effect of each feature on model predictions; 5) what is driving the model prediction for a particular instance; and 6) other insights such as Word Cloud. This level of transparency/interpretability/explainability ensures that the model you build and select incorporates industry best practices, is thoroughly evaluated, and has the capability to be shared with the users with understandable insights.

Tip: To gain insights, we recommend viewing the various tabs for a variety of the models so you can see any similarities or differences in how the algorithms interpret the data.

Blueprints: How models were built

What is it? 

A blueprint is a graphical representation of the steps taken to go from modeling data to model predictions.

Where is it? 

Click on any model on the Leaderboard, then click Describe.

A blueprint shows what DataRobot is doing behind the scenes: starting from a raw dataset uploaded to DataRobot, preparing and getting the data ready to be ingested by the algorithm, building the models, and making predictions (Figure 14).

Figure 14. Blueprint

Lift Chart: How to evaluate models’ performance

What is it? 

A Lift Chart measures: 1) how well a model fits the data by comparing predictions with actual target values; and 2) how effective the model is in terms of differentiating different instances, such as low risk vs high risk of being readmitted.

Where is it? 

Click on any model on the Leaderboard, then click Evaluate, then Lift Chart (the Lift Chart is typically the first screen you see when you click Evaluate; if you don't see it, simply click Lift Chart).

Figure 15 shows a typical Lift Chart. The chart sorts model predictions from low to high and creates a certain number of equal-sized bins (the graph in Figure 15 shows 10 bins). Then, for each bin, it calculates and plots the average predicted and actual target values. The blue curve represents the average predicted values, while the orange curve represents the average actual target values. The bins on the left are associated with low target values (or low risks) and those on the right are associated with high target values (or high risks). For a good classification model, you would see a “hockey stick shape” with both curves closely tracking each other (as shown in Figure 15).


Figure 15. Lift Chart
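The binning behind a lift chart is straightforward to reproduce once you have predictions and actual outcomes. The pandas sketch below follows the description above: sort by predicted value, cut the rows into 10 equal-sized bins, and average the predicted and actual values per bin (the arrays here are randomly generated stand-ins for a model's output).

    import numpy as np
    import pandas as pd

    # Stand-in predictions and actual outcomes; in practice these come from a trained model.
    rng = np.random.default_rng(0)
    predicted = rng.uniform(0, 1, 1000)
    actual = (rng.uniform(0, 1, 1000) < predicted).astype(int)

    df = pd.DataFrame({"predicted": predicted, "actual": actual}).sort_values("predicted")
    df["bin"] = pd.qcut(df["predicted"].rank(method="first"), q=10, labels=False)

    # Average predicted vs. average actual per bin -- the two curves on the lift chart.
    print(df.groupby("bin")[["predicted", "actual"]].mean())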

Profit Curve and Payoff Matrices

What is it?

Profit Curves allow you to quickly do a cost-benefit analysis for correct or incorrect classifications and immediately visualize the net profit versus prediction threshold. This allows you to easily set the prediction threshold based on the result.

Where is it? 

Select any model on the Leaderboard, then click Evaluate and Profit Curve.

In the hospital readmissions example, we are solving a binary classification problem. In such cases, it is important to note that DataRobot will actually output/predict probabilities (in our case, probability of a patient being readmitted). By default the prediction threshold is set to be 0.5, meaning that, in our case, if a patient has a probability of 50% or higher, DataRobot would label them as 1, meaning that the patient will be readmitted.

That being said, precisely tuning a prediction threshold for maximal business impact is an important task for binary classification problems. With the Profit Curve insight, it is possible to supply multiple Payoff Matrices representing costs and benefits for correct or incorrect classification to immediately visualize the net profit versus prediction threshold, and to easily set the prediction threshold based on the result. 

This essentially allows you to find the best prediction threshold so that you can maximize the impact to your business.

For a finished model on the Leaderboard, you can click on the model’s Evaluate tab, then Profit Curve. There you can add or edit the Payoff Matrix to control the Profit Curve visualization by assigning benefits or penalties to correct and incorrect predictions. Every Payoff Matrix is shared with all Leaderboard items within a project. As with other insights, the chart and its data can be exported as PNG or CSV.

Figure 16. Payoff Matrix
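To see how a payoff matrix becomes a profit curve, the sketch below sweeps the prediction threshold and, for each threshold, totals the payoffs of true and false positives and negatives. The payoff values and prediction arrays are hypothetical stand-ins, not figures from the readmissions study.

    import numpy as np

    # Hypothetical payoff matrix (e.g., dollars): benefit of catching a readmission early,
    # cost of an unnecessary intervention, and cost of a missed readmission.
    PAYOFF = {"TP": 500, "FP": -100, "TN": 0, "FN": -1500}

    rng = np.random.default_rng(0)
    probs = rng.uniform(0, 1, 1000)                    # predicted readmission probabilities
    actual = (rng.uniform(0, 1, 1000) < probs).astype(int)

    def net_payoff(threshold):
        pred = (probs >= threshold).astype(int)
        tp = np.sum((pred == 1) & (actual == 1))
        fp = np.sum((pred == 1) & (actual == 0))
        tn = np.sum((pred == 0) & (actual == 0))
        fn = np.sum((pred == 0) & (actual == 1))
        return tp * PAYOFF["TP"] + fp * PAYOFF["FP"] + tn * PAYOFF["TN"] + fn * PAYOFF["FN"]

    thresholds = np.linspace(0.05, 0.95, 19)
    best = max(thresholds, key=net_payoff)
    print(f"Best threshold: {best:.2f}, net payoff: {net_payoff(best)}")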

Feature Impact: What features are important to each model

What is it? 

Feature Impact measures how much each feature contributes to the overall accuracy of the model. It ranks the features from the most important to the least important. The horizontal bars show the relative importance of the features compared to the most important feature, which is set to 100%.

Where is it?  

Click on any model on the Leaderboard, then click Understand and Feature Impact.

Figure 17. Feature Impact

For most models on the Leaderboard, you have to click the Compute Feature Impact button (on the Feature Impact page) to get the feature impact for the model. Figure 17 shows a typical Feature Impact chart for one model. You can see that the feature that contributes the most to the model is number_inpatient (number of inpatient visits in the past year), followed by discharge_disposition_id (where the patient was discharged to: home, a nursing facility, or somewhere else). In other words, the likelihood of a patient being readmitted is highly correlated with the number of inpatient visits in the past year; where the patient was discharged to is also highly correlated with the likelihood of being readmitted, although not as important as the number of inpatient visits. These insights can be invaluable for a hospital to prioritize their focus for similar patients in the future.
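Feature Impact is computed for you inside DataRobot; conceptually it is close to permutation importance (shuffle one feature's values and measure how much model accuracy drops). The scikit-learn sketch below shows that general idea on a stand-in public dataset and model, not on the readmissions project.

    # Conceptual analogue of Feature Impact using permutation importance (stand-in data/model).
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

    # Rank features by how much shuffling them hurts accuracy, relative to the top feature.
    ranked = sorted(zip(X.columns, result.importances_mean), key=lambda pair: -pair[1])
    top_score = ranked[0][1]
    for name, score in ranked[:5]:
        print(f"{name:30s} {100 * score / top_score:5.1f}%")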

Feature Effects:  Effect of each feature on model predictions

What is it? 

Feature Effects shows how changes in the value of each feature affect the model's predictions. It depicts the relationship between each feature and the target.

Where is it? 

Click on any model on the Leaderboard, then click Understand and Feature Effects.

Different from Feature Impact (which shows the overall importance of each feature to the model), the Feature Effects chart shows how each feature drives the average predictions of the model. In the example in Figure 18, the X-axis shows number_inpatient, and the Y-axis shows readmission (the probability of readmission). Clearly, the likelihood of readmission increases as the number of inpatient visits increases: readmission probability is lowest for those with no inpatient visits in the past 12 months and rises significantly for patients with inpatient visits. This increase gradually slows down as the number of inpatient visits reaches four or five. The flat curve after four or five visits indicates that the likelihood of being readmitted is not very different for patients with more than five inpatient visits. To see Feature Effects for a model, you have to click the Compute Feature Effects button on the Feature Effects page, which will also trigger the calculation of Feature Impact if it has not run already.

Figure 18. Feature Effects
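Feature Effects is conceptually similar to a partial dependence plot: vary one feature across its range while averaging the model's predictions. The scikit-learn sketch below shows that general idea on a stand-in public dataset and model rather than the readmissions project.

    # Conceptual analogue of Feature Effects via partial dependence (stand-in data/model).
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import PartialDependenceDisplay

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = GradientBoostingRegressor(random_state=0).fit(X, y)

    # Average model prediction as the "bmi" feature varies across its range.
    PartialDependenceDisplay.from_estimator(model, X, features=["bmi"])
    plt.show()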

Prediction Explanations: What is driving the model prediction for a particular instance

What is it? 

Prediction Explanations help you understand why DataRobot generates a particular prediction for a specific instance. DataRobot provides up to 10 explanations (3 by default) for the 3 highest and lowest predictions.

Where is it? 

Click on any model on the Leaderboard, then click Understand and Prediction Explanations.

Prediction Explanations reveal the reasons why DataRobot generated a particular prediction for an instance. They provide a quantitative indicator of each variable's effect on individual predictions. These individual, prediction-level insights are extremely valuable for decision makers since they can back up their decisions with detailed reasoning. In the example shown in Figure 19, the prediction of 0.865 is driven by the patient's heavy weight (weight = [125-150) kg), more than 3 emergency room visits, and 25 different medications. You can click Compute & Download (orange button) to export the prediction explanations, as specified, to an Excel file.

Figure 19. Prediction Explanations

Word Cloud: Frequency and effect of each key word in the text features

What is it?

Word Cloud displays the most relevant words and short phrases in a word cloud format.

Where is it? 

Click on Insights from the top menu of the user interface, then click the Word Cloud tile. Or for specific text mining models, click on the model, then click Understand and Word Cloud.

Word Cloud provides a graphic of the most relevant words and short phrases in a word cloud format (Figure 20). The tab is only available for models trained with data that contains unstructured text. 

Text variables often contain words that are highly indicative of the target. In a word cloud, the size of a word represents its frequency in the data: words that appear more frequently are shown in a larger font size. Words are displayed in a color spectrum from blue to red, with blue indicating a negative effect (i.e., negatively correlated with the target) and red indicating a positive effect (i.e., positively correlated with the target).

In Figure 20, you can see that the word “acute” appears frequently in the data; however, it is often associated with low likelihood of being readmitted (blue color). The word “failure” is also frequently observed in the data, but the red color indicates it is associated with high probability of readmission.

Figure 20. Word Cloud
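The underlying calculation pairs each word's frequency (size in the cloud) with its association to the target (color). The rough pandas sketch below illustrates that idea with a few made-up diagnosis strings and labels; it is not DataRobot's actual text-mining pipeline.

    import pandas as pd

    # Made-up text feature and binary target (1 = readmitted within 30 days).
    notes = pd.Series(["acute bronchitis", "heart failure", "acute infection", "renal failure"])
    readmitted = pd.Series([0, 1, 0, 1])

    # Bag-of-words counts per row, then per-word frequency and correlation with the target.
    counts = notes.str.get_dummies(sep=" ")
    summary = pd.DataFrame({
        "frequency": counts.sum(),                           # drives word size in the cloud
        "target_correlation": counts.corrwith(readmitted),   # drives color (blue = negative, red = positive)
    })
    print(summary.sort_values("frequency", ascending=False))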

Besides the word cloud, the Insights tab also provides other graphical representations of your model. Depending on the business problem and type of data, under the Insights tab you may find tree-based variable rankings, hotspots, variable effects (illustrating the magnitude and direction of a feature's effect on a model's predictions), text mining charts, and anomaly detection.

Step 7: Generate Model Documentation 

For highly regulated industries such as insurance and banking, the ability to produce model documentation is critical to the ultimate success of a modeling project. In addition, you may be required to present your work in DataRobot to a data scientist or management. Preparing the documentation with the appropriate level of information and transparency can be very time consuming.

DataRobot makes this critical job easier by automatically generating an individualized document for each model on the Leaderboard. This document is not meant to be prescriptive in format or content; rather, it serves as a comprehensive guide for creating sufficiently rigorous documentation of model development, implementation, and use. The document contains an overview of the model development process, with full insight into the model assumptions, limitations, performance, and validation detail. To generate the customized documentation, click on the desired model on the Leaderboard, then click Compliance and Generate Compliance Document (see Figure 21). Once the document is generated and downloaded, you can modify and adjust the contents as needed.

Figure 21. Generate Compliance Document

Resources

Resources are linked throughout this guide.  Additional resources are available on DataRobot Community, which is constantly being updated with new content. A Learning Path specific to the Business Analyst DataRobot user is linked here.    

Contact Us

If you have suggestions for this guide or questions, please send them via private message to KarinAISD (me) in the community!


Glossary

Accuracy measuring statistic

The optimization metrics on the Leaderboard that measure and reflect the accuracy of the trained models. The most common metric a user will encounter in regression and classification cases is LogLoss. The key point to know about LogLoss is that it measures error, so the smaller, the better.

Automated Machine Learning (AutoML)

A software system that automates many of the tasks involved in preparing a dataset for modeling and performing a model selection process to determine the performance of each with the goal of identifying the best performing model for a specific use case.

Autopilot Mode

The DataRobot modeling process that automatically selects the best predictive models for the specified target feature and optimization metric.

Blender

A model that combines the predictions of two or more models, which can lead to better results than running the models individually.

Blueprint

A graphical representation of the many steps involved in transforming input predictors and targets into a model. A blueprint represents the high-level end-to-end procedure for fitting the model, including any pre-processing steps, algorithms, and post-processing. Each box in a blueprint may represent multiple steps. You can view a graphical representation of a blueprint by clicking on a model in the Leaderboard.

Classification model

A model that helps classify a particular instance into different categories. When there are two categories, we call this a binary classification model; if the number of categories is greater than two, it is often referred to as a multiclass classification model.

Clusters

Clusters, families of features denoted by color on the matrix, are features partitioned into groups based on their similarity. Clusters allow you to quickly understand the strength and nature of the associations and detect families of pairwise association clusters.

Confusion Matrix

A table that shows the number of true positives, false positives, true negatives, and false negatives a model has predicted.

Exploratory Data Analysis (EDA)

DataRobot’s approach to analyzing datasets and summarizing their main characteristics. Generally speaking, there are two stages of EDA—EDA1 and EDA2. EDA1 provides summary statistics based on a sample of your data. EDA2 is the step used for model building and uses the entire dataset, based on the options selected.

Feature Engineering

The addition and construction of additional variables, or features, to your dataset to improve machine learning model performance and accuracy. The most effective feature engineering is based on sound knowledge of the business problem and your available data sources. Feature engineering is an exercise in engagement with the meaning of the problem and the data. For example, you might improve a model used to estimate likely loan defaults by finding external sources of relevant data, such as local unemployment rates or housing price trends.

Features

Columns within a dataset uploaded to DataRobot which will be used as data values for modeling.

Leaderboard

In DataRobot, the list of trained blueprints (models) for a project, ranked according to a project metric.

Lift Chart

A chart that plots predicted versus actual values to help the user determine where a model is overfitting or underfitting. It also shows a model's ability to discriminate.

Machine Learning (ML)

An application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

Partition

The segments (splits) of the dataset. To maximize accuracy, DataRobot separates data into training, validation, and holdout data. 

Regression model

A model with a numerical (or continuous) target variable, such as sales volume for a particular product. A regression model predicts a quantity.

Receiver Operating Characteristic (ROC) Curve

The ROC Curve tab helps you explore classification performance and statistics related to a selected model at any point on the probability scale. It is a graphical plot that shows a precise measure of a model's ability to discriminate risk.

Target

A feature in the dataset that is to be predicted. This is typically the name of the column in a spreadsheet. Note: You may hear the terms target variable, dependent variable, response, and outcome. They are frequently used interchangeably and in this guide (and DataRobot in-app Platform Documentation) we generally use “target variable.”

Target leakage

Information that would probably not be available at the time of prediction. If DataRobot detects target leakage, the feature is marked with a warning flag in the Importance column. If the leaky feature is significantly correlated with the target, DataRobot will automatically omit it from the model building process. It also might flag features that suggest partial target leakage. For those, you should ask yourself whether the information would be available at the time of prediction and if not, exclude it so that you limit the risk of skewing your analysis.

Worker

The DataRobot platform performs data processing and model building as asynchronous processes. The computational nodes within the cluster that process these jobs are known as “workers.” The “worker cloud” refers to the group of compute servers available within a DataRobot cluster environment.
