FAQs: Evaluating Models


(Updated February 2021)

This section provides answers to frequently asked questions related to evaluating models. If you don't find an answer to your question, ask it using Post your Comment (below).

How can I compare the performance of my models?

There are many ways to compare model performance. The first place to look would be at the Leaderboard to compare model scores for the optimization metric you have used.

In addition, the DataRobot UI provides several displays for performing direct comparisons:

  • Learning Curves shows how a model's accuracy improves as it is trained on more data. This is useful for deciding which models may benefit from being trained into the validation or holdout data.
  • Speed vs Accuracy compares model accuracy against the speed at which each model makes predictions. For example, although blender models are often the most accurate, that accuracy comes at the cost of prediction speed. If prediction latency matters for your deployment, this display helps you find the most effective tradeoff.
  • Model Comparison lets you compare Lift Charts and ROC Curves between two different models. (Note that this is not available for multiclass prediction projects.)

lhaviland_35-1613696855220.png

lhaviland_36-1613696855331.png

lhaviland_37-1613696855294.png

lhaviland_38-1613696855307.png

More information for DataRobot users: search in-app Platform Documentation for Compare models.

Can I download charts from the UI?

Yes, it’s possible to download charts (Lift Chart, Feature Fit, Feature Effects, etc.) using the Export button. You can download the graph as a PNG image, and the data used to build the chart as a CSV file.

lhaviland_25-1613696855245.png

lhaviland_26-1613696855291.png

How does DataRobot decide which model is 'Recommended for Deployment'?

DataRobot identifies the most accurate non-blender model and prepares it for deployment with four steps:

  1. DataRobot calculates Feature Impact and uses this to create a reduced feature list.
  2. Then, DataRobot retrains the model on the reduced feature list and decides which of the two models (original or reduced feature list) should progress to the next stage.
  3. The selected model is then retrained at the up-to-holdout sample size (usually 80% of the data).
  4. Finally, for non-time-aware models, DataRobot retrains the model as a frozen run (hyperparameters frozen from the 80% run) on 100% of the data. For time-aware models, DataRobot retrains the model on the most recent data.

lhaviland_53-1613696855285.png

More information for DataRobot users: search in-app Platform Documentation for Leaderboard overview, then locate more information in the section “Understanding the model recommendation process.”

Why would I not just always use the most accurate model?

There could be several reasons, but the two most common are:

  • Prediction latency—This means the speed at which predictions are made. Some business applications of a model will require very fast predictions on new data. The most accurate models are often blender models which are usually slower at making predictions.
  • Organizational readiness—Some organizations favor linear models and/or decision trees for perceived interpretability reasons. Additionally, there may be compliance reasons for favoring certain types of models over others.

Can I tune model hyperparameters?

Yes, you can tune model hyperparameters in the Evaluate > Advanced Tuning tab for a particular model. That said, your time is often better spent on feature engineering than on tuning hyperparameters. (DataRobot’s Feature Discovery provides automated feature engineering.)

lhaviland_29-1613696855271.png

More information for DataRobot users: search in-app Platform Documentation for Advanced Tuning.

What data is used in the ROC Curve?

You can select the graph data source from a dropdown just above the ROC Curve. The options available—Validation, Cross-Validation, and Holdout—are dependent on whether you have run or enabled that set.

lhaviland_45-1613696855264.png

More information for DataRobot users: search in-app Platform Documentation for ROC Curve.

Can DataRobot show metrics for assessing binary classification models other than the ones listed on the ROC Curve tab? I am thinking of metrics such as Cohen's kappa.

There are many metrics for assessing binary classification models, but not all of them are available inside DataRobot. Often these can be calculated by downloading the data from DataRobot. For example, Cohen's kappa can be calculated using the exported ROC Curve data.
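Here is a minimal Python sketch of that calculation, assuming you have read the four confusion-matrix counts at your chosen threshold from the exported CSV; the counts below are made up for illustration.

```python
# Minimal sketch: Cohen's kappa from the four confusion-matrix cells
# taken from the exported ROC Curve data (counts below are made up).

def cohens_kappa(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    observed = (tp + tn) / total                        # observed agreement (accuracy)
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)   # chance agreement on the positive class
    p_neg = ((tn + fn) / total) * ((tn + fp) / total)   # chance agreement on the negative class
    expected = p_pos + p_neg
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(tp=120, fp=30, tn=800, fn=50), 3))
```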

lhaviland_1-1613696855252.png

lhaviland_2-1613696855303.png

More information for DataRobot users: search in-app Platform Documentation for ROC Curve.

Why are there two threshold settings in the Prediction Distribution graph on the ROC Curve tab?

You will see two different thresholds displayed on the ROC Curve tab: the Display Threshold [0-1], which allows you to experiment with different confusion matrices in the display, and the Prediction Threshold, which sets the final threshold used to decide the class assignment for a given prediction value. Note that changing the Display Threshold [0-1] does NOT change the threshold that will be used for scoring new data. By default, the Prediction Threshold is 0.5. You can set the Prediction Threshold in the ROC Curve tab or, if the current model is used for deployment, in the Deployments tab.

lhaviland_6-1613696855223.png

More information for DataRobot users: search in-app Platform Documentation for ROC Curve, then locate information in the “Threshold settings” section.

What is the relationship between the prediction distributions, the confusion matrix, and the two thresholds you set in the Prediction Distribution chart area of the ROC Curve?

lhaviland_10-1613696855310.png

(1) The prediction distributions form the foundation for the rest of the elements on the ROC Curve tab. The Prediction Distribution chart shows the distribution of the probabilities assigned to each prediction by the model, grouped by the actual class that each observation belonged to.

(2) The Confusion Matrix is a summary of the two distributions for a given probability threshold. It counts the number of positive and negative predictions and how many are correctly and incorrectly labeled, based on that probability threshold. As the probability threshold changes, the counts will change across the four quadrants.

(3) The ROC Curve is a plot of the true positive rate against the false positive rate, both calculated from the confusion matrix. As the probability threshold is reduced (and more records are classified as positive), the comparison point moves along the curve toward the right. (A small sketch of this calculation follows this list.)

(4) The Cumulative Gain chart indicates the true positive rate for all predictions above the threshold. As the probability threshold is reduced (moved from right to left), the comparison point on the graph moves toward the right. Essentially, it measures how well the model has concentrated the positive records at one end of the sort order, so it operates as a ranking measure similar to AUC or Gini.

(5) The Display Threshold [0-1] value can be moved and adjusted to evaluate the model and identify the best threshold (away from the preset value which maximizes the F1 score). The Prediction Threshold value will be used at model deployment and so should be changed to the appropriate value before deployment.
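As referenced in point (3), here is a minimal sketch of how the confusion matrix and the corresponding ROC point both derive from the same predicted probabilities at a chosen threshold. The y_true and y_prob arrays are simulated stand-ins, not DataRobot output.

```python
import numpy as np

# Sketch: the confusion matrix and the ROC point are both functions of the
# predicted probabilities and a chosen threshold. y_true / y_prob are
# simulated placeholders, not DataRobot output.

def roc_point(y_true, y_prob, threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tpr = tp / (tp + fn)          # true positive rate (y-axis of the ROC Curve)
    fpr = fp / (fp + tn)          # false positive rate (x-axis of the ROC Curve)
    return (tp, fp, tn, fn), tpr, fpr

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)

for t in (0.7, 0.5, 0.3):         # lowering the threshold moves the ROC point to the right
    _, tpr, fpr = roc_point(y_true, y_prob, t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```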

More information for DataRobot users: search in-app Platform Documentation for ROC Curve.

Can you explain the concept of model lift?

Technically "lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model." Lift is the ratio of points correctly classified as positive in our model versus the 45-degree line (or baseline model) as seen on the Cumulative Gains plot (Evaluate > ROC Curve tab).

The ratios of these points create the Cumulative Lift chart, where for a given % of top predictions we can measure how much more effective the model is at identifying the positive class than the baseline model.

In the images below, the Cumulative Gains chart shows a vertical orange line at 20% on the X-axis. In the baseline model this would correspond to 20% on the Y-axis, but the horizontal orange line shows that our model has captured roughly 30-35% of all positive-class records within that top 20% of predictions. The ratio of these two Y-axis percentages is shown in the Cumulative Lift chart and represents the lift of the model for the top 20% of predictions.
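A minimal sketch of that calculation, assuming you have arrays of actual labels and predicted probabilities (DataRobot computes these charts for you; the array names here are placeholders):

```python
import numpy as np

# Sketch of the cumulative gain and lift calculation described above.
# y_true (0/1 actuals as a numpy array) and y_prob (predicted probabilities)
# are placeholders.

def gain_and_lift(y_true, y_prob, top_fraction=0.20):
    order = np.argsort(-y_prob)                       # sort predictions from highest to lowest
    n_top = int(len(y_true) * top_fraction)
    captured = y_true[order][:n_top].sum()            # positives found in the top slice
    gain = captured / y_true.sum()                    # cumulative gain: % of all positives captured
    lift = gain / top_fraction                        # cumulative lift vs. the 45-degree baseline
    return gain, lift

# e.g. gain=0.33, lift=1.65 would mean the top 20% of predictions
# capture 33% of all positives -- 1.65x better than the baseline model.
```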

lhaviland_16-1613696855227.png

lhaviland_17-1613696855249.png

More information for DataRobot users: search in-app Platform Documentation for ROC Curve, then locate information in the section “Cumulative charts overview.”

How does DataRobot determine which threshold to use for a binary classification problem?

There are two thresholds on the Evaluate > ROC Curve tab:

  • Display Threshold [0-1]—This is interactive and, by default, set to the threshold that maximizes the F1 score (a sketch of that calculation follows this list). Note that this does not impact predictions; it is solely used for analysis in the GUI.
  • Prediction Threshold—This is set to 0.5 by default, and should be set by you. This is the threshold used when DataRobot makes predictions. (DataRobot predictions include both a probability and a y/n classification, and it is this classification that uses the threshold.)
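As noted in the first bullet, here is a minimal scikit-learn sketch (not DataRobot's implementation) of finding an F1-maximizing threshold and applying a prediction threshold; y_true and y_prob are placeholder arrays of actuals and predicted probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Sketch (not DataRobot's implementation): find the probability threshold
# that maximizes F1 -- the rule behind the default Display Threshold --
# then apply a prediction threshold to turn probabilities into classes.

def f1_maximizing_threshold(y_true, y_prob):
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]   # the final precision/recall pair has no threshold

def apply_prediction_threshold(y_prob, threshold=0.5):
    return (np.asarray(y_prob) >= threshold).astype(int)   # the y/n classification
```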

lhaviland_18-1613696855282.png

DataRobot provides some suggestions to help you set the prediction threshold:

lhaviland_19-1613696855289.png

More information for DataRobot users: search in-app Platform Documentation for ROC Curve, then locate information in the “Threshold settings” section.

What does the diagonal gray line in the ROC Curve represent?

This represents the theoretical result you'd see if your model was randomly guessing with each prediction.

lhaviland_22-1613696855278.png

What is the Matthews Correlation Coefficient? How do I find it in DataRobot?

The Matthews Correlation Coefficient (MCC) is a metric used for measuring the quality of a binary classification model. Unlike the F1 score, it incorporates all entries of the confusion matrix and so is more robust for data where the classes are of very different sizes (imbalanced).

The MCC score for a binary classification model can be found on the Evaluate > ROC Curve tab. It can also be used as an optimization metric.
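If you want to reproduce the value yourself, here is a minimal sketch of the MCC formula computed from the four confusion-matrix cells (the example counts are made up):

```python
import math

# Sketch: Matthews Correlation Coefficient from the four confusion-matrix
# cells shown on the ROC Curve tab (counts below are made up).

def mcc(tp, fp, tn, fn):
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

print(round(mcc(tp=120, fp=30, tn=800, fn=50), 3))
```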

lhaviland_23-1613696855231.png

lhaviland_24-1613696855287.png

More information for DataRobot users: search in-app Platform Documentation for Optimization metrics, then locate information for “Max MCC / Weighted Max MCC.”

How do I change the prediction threshold?

You will see two different thresholds displayed on the Evaluate > ROC Curve tab. You can change the Display Threshold [0-1] value to experiment and look at different confusion matrices on the ROC Curve tab, but doing so does NOT change the threshold used when predictions are made.

To change the prediction threshold, you need to change the Prediction Threshold value (also on this tab). Also, if desired, you can set this at the time of deployment.

lhaviland_41-1613696855260.png

lhaviland_42-1613696855266.png

More information for DataRobot users: search in-app Platform Documentation for ROC Curve, then locate information in the “Prediction threshold” section.

Does DataRobot provide a ROC Curve for all models?

Yes, the ROC Curve tab is available for all models built for classification problems.

lhaviland_32-1613696855217.png

More information for DataRobot users: search in-app Platform Documentation for ROC Curve.

What is the difference between density and frequency on the ROC Curve tab?

The density chart displays an equal area underneath both the positive and negative curves. The area underneath each frequency curve varies and is determined by the number of observations in each class.
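A small numpy sketch of the distinction, using simulated predicted probabilities for an imbalanced dataset (the numbers are illustrative only):

```python
import numpy as np

# Sketch of the density-vs-frequency distinction using numpy histograms.
# preds_pos / preds_neg are simulated predicted probabilities per class.

rng = np.random.default_rng(0)
preds_pos = rng.beta(5, 2, size=200)     # 200 actual positives
preds_neg = rng.beta(2, 5, size=1800)    # 1800 actual negatives (imbalanced)

bins = np.linspace(0, 1, 21)
freq_pos, _ = np.histogram(preds_pos, bins=bins)                 # raw counts: areas differ by class size
freq_neg, _ = np.histogram(preds_neg, bins=bins)
dens_pos, _ = np.histogram(preds_pos, bins=bins, density=True)   # each density curve integrates to 1
dens_neg, _ = np.histogram(preds_neg, bins=bins, density=True)

print(freq_pos.sum(), freq_neg.sum())                                       # 200 vs 1800
print((dens_pos * np.diff(bins)).sum(), (dens_neg * np.diff(bins)).sum())   # both 1.0
```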

lhaviland_48-1613696855339.png

More information for DataRobot users: search in-app Platform Documentation for ROC Curve.

What data is used to generate the Lift Chart?

You can select the graph data source from a dropdown just below the Lift Chart. The options available—Validation, Cross-Validation, and Holdout—are dependent on whether you have run or enabled that set.

lhaviland_46-1613696855315.png

More information for DataRobot users: search in-app Platform Documentation for Lift Charts.

Can I view the Lift Chart at a finer granularity than deciles?

Yes, it’s possible to view the Lift Chart with 10, 12, 15, 20, 30, or 60 bins. You can select these values using the ‘Number of Bins’ dropdown under the chart.

lhaviland_30-1613696855325.png

More information for DataRobot users: search in-app Platform Documentation for Lift Charts, then locate information in the section “Changing the display.”

Can I view the Lift Chart on training data?

The Lift Chart is available for the validation, cross-validation, or hold-out data—depending on how your model has been trained. But you won’t be able to view it for the data the models were actually trained on; you can only view it on the sets partitioned for testing model performance. Or, you can access and use external test datasets to better evaluate model performance.

lhaviland_31-1613696855329.png

More information for DataRobot users: search in-app Platform Documentation for Lift Charts, or for Make Predictions tab and locate information for “Making predictions on an external dataset.”

How can I see which features are most important?

To see which features are most strongly correlated with the target on a univariate (i.e., non-modeling) basis, look at the feature Importance. To see which features are most important according to a particular model, look at Feature Impact.

lhaviland_39-1613696855328.png

lhaviland_40-1613696855345.png

More information for DataRobot users: search in-app Platform Documentation for Feature Impact, and for Modeling process detail (then look for information about “importance” under the “Interpreting data summary information” section).

How is Feature Impact calculated?

There are three methodologies available for calculating Feature Impact—permutation, SHAP, and tree-based importance. By default, Feature Impact is calculated with a technique called "permutation importance." Calculated AFTER a model is built, this technique can be applied to any modeling algorithm. The idea is to take the dataset and "destroy the information" in each column (by randomly shuffling the contents of the feature across the dataset), one column at a time, then make predictions on all resulting records and calculate the overall model performance. The permuted variable that had the largest impact on model performance is the most impactful feature and is given an impact value of 100%.

Features can have a negative impact on the model (i.e., the model improves when the shuffling occurs). It is recommended that you remove these features.
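For intuition, here is a generic sketch of permutation importance; it is not DataRobot's exact implementation, and `model`, `X` (a pandas DataFrame), `y`, and `score_fn` (a higher-is-better metric) are all placeholders.

```python
import numpy as np

# Generic sketch of permutation importance (not DataRobot's exact code).
# model: any fitted estimator with .predict(); X: pandas DataFrame of features;
# y: actuals; score_fn: higher-is-better accuracy metric. All placeholders.

def permutation_importance(model, X, y, score_fn):
    baseline = score_fn(y, model.predict(X))
    drops = {}
    for col in X.columns:
        X_shuffled = X.copy()
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)   # destroy the column's information
        drops[col] = baseline - score_fn(y, model.predict(X_shuffled))    # drop in performance
    top = max(drops.values())
    # Normalize so the most impactful feature = 100%; negative values mean
    # the model improved when that feature was shuffled.
    return {col: 100 * drop / top for col, drop in drops.items()}
```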

lhaviland_20-1613696855244.png

More information for DataRobot users: search in-app Platform Documentation for Feature Impact.

Can I get feature impact for all features?

The graph shows the top 30 features, but the top 1000 are available via CSV export.

lhaviland_0-1613696855268.png

More information for DataRobot users: search in-app Platform Documentation for Feature Impact.

Why am I getting different feature impacts from different models in my project? How can I use this information to identify the features that have a real effect on the business?

It's important to remember that the real-world situation that you are modeling is infinitely complex, and any model DataRobot builds is an approximation to that complex system. Each model has its strengths and weaknesses, and different models are able to capture varying degrees of that underlying complexity. For example, a model that is not capable of detecting nonlinear relationships or interactions will use the variables one way, while a model that can detect these relationships will use the variables another way, and so you will get different feature impacts from different models. Feature impact shouldn't be drastically different, however, so while the exact ordering will change, the overall inference is often not impacted.

Collinearity can also impact this. If two variables are highly correlated, a regularized linear model will tend to use only one of them, while a tree-based method will tend to use both at different splits. So with the linear model, one of these variables will show up high in feature importance and the other will be low, while with the tree-based model, both will be closer to the middle.

More information for DataRobot users: search in-app Platform Documentation for Feature Impact.

What data partition is used to calculate feature impact?

A sample of training data is used to compute Feature Impact. For non-time-aware projects, users can choose the sample size up to a maximum of 100,000 rows; by default, the sample size is 2,500 rows. The sampling process follows one of the following approaches:

  • For balanced data, random sampling is used.
  • For imbalanced binary data, smart downsampling is used; DataRobot attempts to make the distribution for imbalanced binary targets closer to 50/50 and adjusts the sample weights used for scoring. (A rough sketch of this downsample-and-reweight idea follows this list.)
  • For zero-inflated regression data, smart downsampling is used; DataRobot groups the non-zero elements into “minority.”
  • For imbalanced multiclass data, random sampling is used. (Note that changes/improvements to this process are in progress.)
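As referenced in the second bullet, here is a rough illustration of the downsample-and-reweight idea for an imbalanced binary target. It is not DataRobot's actual smart downsampling logic, and `df` and `target` are placeholders.

```python
import pandas as pd

# Rough illustration only (not DataRobot's smart downsampling logic):
# downsample the majority class toward 50/50 and attach compensating
# weights so scoring still reflects the original class balance.

def downsample_with_weights(df: pd.DataFrame, target: str) -> pd.DataFrame:
    minority = df[df[target] == 1]
    majority = df[df[target] == 0]
    kept = majority.sample(n=len(minority), random_state=0)   # keep as many majority rows as minority rows
    weight = len(majority) / len(kept)                        # each kept majority row stands in for several
    return pd.concat([minority.assign(weight=1.0), kept.assign(weight=weight)])
```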

More information for DataRobot users: search in-app Platform Documentation for Feature Impact.

What is the difference between Feature Fit and Feature Effects?

The main difference between these two displays is that Feature Fit uses Feature Importance to identify the most important features, whereas Feature Effects uses Feature Impact. The important distinction between Feature Importance and Feature Impact is that Feature Importance is calculated at the general level (i.e., not model dependent), whereas Feature Impact is calculated by each model depending on how it utilizes that feature.

In addition, by default partial dependence is turned off in Feature Fit (though you can turn it on), while actual and predicted are turned off by default in Feature Effects (though you can turn this on also).

lhaviland_49-1613696855304.png

lhaviland_50-1613696855214.png

More information for DataRobot users: search in-app Platform Documentation for Feature Fit and Feature Effects.

Why are my text variables not showing up in feature fit (or feature effects)?

Because there are so many unique words and n-grams in freeform text, they cannot be shown in a graph the way other variables can. Even the top few words often show up in a very small percentage of the rows, so there would be very little data if we were to show the top few values the way we do with categorical features.

To analyze the impact of text, you can check the Word Cloud, which is available when you click on a feature name on the Data tab.

lhaviland_11-1613696855317.png

Why isn't variable x showing up on the Feature Fit display?

Feature Fit is computationally intensive, especially for datasets with many features. The Feature Fit display is populated with features in the order they appear on the Data tab, sorted by Importance. This measure of importance is calculated using a non-linear correlation metric called ACE (Alternating Conditional Expectations).

lhaviland_12-1613696855320.png

If your dataset has hundreds of columns and the feature you are interested in is close to the bottom of the Data tab (when sorted by Importance), you may need to wait for Feature Fit to calculate the ACE for that feature. Also, DataRobot caps the output at 500 features max, so if a feature is not in the top 500 by ACE score, it will never show up in Feature Fit. Text features and the target will not show up in Feature Fit either.

More information for DataRobot users: search in-app Platform Documentation for Feature Fit.

Can I get Feature Fit and Feature Effects for all features?

Feature Fit and Feature Effects are available for the top 500 features. For Feature Fit (Evaluate tab > Feature Fit) this is calculated based on the feature importance; for Feature Effects (Understand tab > Feature Effects) it is based on the feature impact. Text features will not appear in either of these sets, even if they have high feature importance or feature impact scores because there are too many possible values in a text feature.

This image (from the Data tab) shows that feature diag_3_desc is in the top six of feature importance.

lhaviland_13-1613696855219.png

However, in the Feature Fit page you see that this feature is not included in the feature list on the left of the chart.

lhaviland_14-1613696855255.png

More information for DataRobot users: search in-app Platform Documentation for Feature Fit.

My data has no missing values, so why does Feature Fit (and Feature Effects) show a missing category?

Likely you are getting missing values in scoring data at prediction time, and the effect of those missing values is shown in these displays, as explained in the next FAQ (below).

How are there partial dependence values for "missing" values when there are no "missing" values in my dataset?

You may not have missing values in your modeling dataset, but you may get missing values in scoring data at prediction time, and the effect of those missing values is shown. DataRobot applies the same process it uses when calculating any other value: set the feature to missing in every row and calculate the average prediction.
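A minimal sketch of that calculation; `model`, `X` (the sample used for the chart), and `feature` are placeholders.

```python
import numpy as np

# Sketch of the calculation described above: the partial dependence value
# for "missing" is the average prediction when the feature is set to
# missing for every row in the sample. model, X, and feature are placeholders.

def partial_dependence_for_missing(model, X, feature):
    X_mod = X.copy()
    X_mod[feature] = np.nan                 # treat the feature as missing in every row
    return np.mean(model.predict(X_mod))    # average prediction at "missing"
```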

lhaviland_3-1613696855296.png

What do the histograms in the Feature Fit (or Feature Effects) display represent?

These histograms represent a count of rows or a sum of exposures (if exposures were used in the project) across either the training, validation, or holdout partition, depending on what you select from the Data Selection dropdown list below the graph.

lhaviland_21-1613696855239.png

More information for DataRobot users: search in-app Platform Documentation for Feature Fit or Feature Effects.

Can I see the reasons why a model made a certain prediction?

After you build models, you can use Understand > Prediction Explanations tab to help you understand the reasons DataRobot generated individual predictions. Depending on how the project is configured, you can view XEMP- or SHAP-based Prediction Explanations.

lhaviland_27-1613696855293.png

lhaviland_28-1613696855262.png

More information for DataRobot users: search in-app Platform Documentation for Prediction Explanations, SHAP-based Prediction Explanations, or XEMP Prediction Explanations.

What does the “ID” represent on the (XEMP-based) Prediction Explanations tab?

The number in the ID column is the row number ID from the imported dataset.

lhaviland_47-1613696855326.png

How many explanations can I get for each prediction?

DataRobot gives you three explanations by default, but this can be extended up to ten by changing the value in the Get top [ ] explanations box above the chart.

lhaviland_43-1613696855302.png

More information for DataRobot users: search in-app Platform Documentation for SHAP-based Prediction Explanations or XEMP Prediction Explanations.

Records from what data partition are returned on the Prediction Explanations page?

Prediction Explanations are returned for data in the validation partition. It is also possible to calculate and download the XEMP-based Prediction Explanations for the training data by clicking the orange Compute & download button.

lhaviland_44-1613696855308.png

More information for DataRobot users: search in-app Platform Documentation for XEMP Prediction Explanations.

In the Speed vs Accuracy graph, what exactly does speed measure?

Speed shows the time, in milliseconds, it takes for the model to score 1,000 records. Most importantly, it does NOT measure the time of a round-trip API call (i.e., network latency). If that measurement is of interest, it must be tested in the actual system.

lhaviland_4-1613696855237.png

More information for DataRobot users: search in-app Platform Documentation for Compare models, then locate information for "Using the Speed vs Accuracy tab."

How can I determine how long a real-time prediction will take to score?

To answer this question, you need to account for both model speed and latency. 

lhaviland_15-1613696855272.png

You can find model speed under the Speed vs Accuracy tab, but the best way to account for latency is by testing it in the environment where the model is deployed.
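Here is a hedged sketch of timing the full round trip against your own deployment; the endpoint URL, headers, and payload are placeholders you would replace with your deployment's actual values.

```python
import time
import requests

# Sketch: time the full round trip (model speed + network + serialization)
# against your own deployment. The URL, headers, and payload are placeholders.

def mean_roundtrip_ms(url, headers, payload, n=20):
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        requests.post(url, headers=headers, json=payload, timeout=30)
        timings.append((time.perf_counter() - start) * 1000)   # milliseconds per request
    return sum(timings) / len(timings)
```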

Will my models improve if I add more observations to my training data?

The Learning Curves display is designed to answer this question. It shows how model performance improves as you add more data. Typically, as more observations are added, a model's performance improves initially and then begins to level off.
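For intuition, here is a generic scikit-learn sketch of the learning-curve idea on synthetic data; this is not DataRobot's Learning Curves implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Generic illustration (not DataRobot's implementation): score the same
# model at increasing training sizes and watch the validation score level off.

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="roc_auc")
for size, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{size} rows -> validation AUC {score:.3f}")
```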

lhaviland_7-1613696855312.png

More information for DataRobot users: search in-app Platform Documentation for Compare models, then locate information in the “Using Learning Curves” section.

Why don’t I see all models on the Learning Curves?

DataRobot groups models on the Leaderboard by the blueprint ID and feature list. For example, every Regularized Logistic Regression model built using the Informative Features feature list is a single model group, while a Regularized Logistic Regression model built using a different feature list is part of a different model group.

lhaviland_51-1613696855322.png

Learning Curves only shows the top 10 performing model groups, plus the highly performing blender models.

lhaviland_52-1613696855225.png

More information for DataRobot users: search in-app Platform Documentation for Compare models, then locate more information in the section “Learning Curves additional info.”

How can I change the metric used on the vertical axis of the learning curve?

The Learning Curves display is based on the validation score, using the currently selected metric. To change the metric, navigate to the Leaderboard and select a different metric,

lhaviland_33-1613696855247.png

then return to the Learning Curves display to see changes to the graph based on the newly selected metric:

lhaviland_34-1613696855313.png

More information for DataRobot users: search in-app Platform Documentation for Compare models, then locate information in the “Using Learning Curves” section.

What are rating tables and what types of models generate them?

Rating tables are generated by Generalized Additive Models (GAM). They provide information about the model in general as well as a list of features and coefficients used to make predictions, including any interactions of features the model has found.

The rating table can be downloaded as a CSV file in the Rating Table tab.

lhaviland_8-1613696855229.png

lhaviland_9-1613696855305.png

You can influence the predictions by updating values in the downloaded rating table and then uploading the table to create a new model.

More information for DataRobot users: search in-app Platform Documentation for Interpreting Generalized Additive Models (GA2M) output and Rating Tables.

How can I find models that produce rating tables?

Rating tables are generated by Generalized Additive Models (GAM). They look and feel very much like the output of a Generalized Linear Model (GLM): an intercept along with multiplicative coefficients. You can find GAMs on the Leaderboard by looking for models with the rating table icon, as shown in the image below.

lhaviland_5-1613696855218.png

More information for DataRobot users: search in-app Platform Documentation for Rating Tables.
