FAQs: Evaluating Models

This section provides answers to frequently asked questions related to evaluating models. If you don't find an answer to your question, you can ask it now; use Post your Comment (below) to get your question answered.

Can I get feature impact for all features?

The Feature Impact graph shows the top 30 features, but the top 1000 features are available via export as a CSV file.

Can I get feature impact for all features_a.png

More information for DataRobot users: search in-app documentation for Feature Impact.

Can DataRobot show metrics for assessing binary classification models other than the ones listed on the ROC Curve tab? I am thinking of metrics such as Cohen's kappa.

There are many metrics for assessing binary classification models, but not all of them are available inside DataRobot. Many of them can, however, be calculated from data you download from DataRobot. For example, Cohen's kappa can be calculated using the exported ROC Curve data.
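If you want to compute it yourself, the sketch below is a minimal example of Cohen's kappa calculated from confusion-matrix counts at a chosen threshold; the counts used here are placeholders for the values you would read from your exported ROC Curve data.

```python
# Minimal sketch: Cohen's kappa from confusion-matrix counts at a chosen threshold.
# The counts below are placeholders; substitute the values from your exported data.
def cohens_kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    observed = (tp + tn) / n                        # observed agreement
    expected = ((tp + fp) * (tp + fn) +             # chance agreement on the positive class
                (fn + tn) * (fp + tn)) / n ** 2     # plus chance agreement on the negative class
    return (observed - expected) / (1 - expected)

print(cohens_kappa(tp=120, fp=30, fn=40, tn=810))   # ~0.73 for these placeholder counts
```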

Can DataRobot show metrics for assessing binary classification models other than_a.png

Can DataRobot show metrics for assessing binary classification models other than_2a.png

More information for DataRobot users: search in-app documentation for ROC Curve.

Why are there partial dependence values for "missing" when there are no missing values in my dataset?

You may not have missing values in your modeling dataset, but you may get missing values in scoring data at prediction time, and the effect of those missing values is shown. DataRobot applies the same partial dependence process to "missing" as to every other value: set the feature to that value for all rows and calculate the average prediction.
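As a conceptual illustration only (not DataRobot's internal implementation), the sketch below computes the partial dependence at "missing" by setting a feature to NaN for every row and averaging the resulting predictions, using a scikit-learn model that accepts missing values.

```python
# Conceptual sketch of partial dependence at "missing": force one feature to NaN
# for every row, score the rows, and average the predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = HistGradientBoostingRegressor(random_state=0).fit(X, y)   # handles NaN natively

X_missing = X.copy()
X_missing[:, 2] = np.nan                     # treat feature 2 as "missing" everywhere
pd_at_missing = model.predict(X_missing).mean()
print(f"Partial dependence of feature 2 at 'missing': {pd_at_missing:.3f}")
```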

How are there partial dependence values for _missing_ values_a.png

Why am I getting different feature impacts from different models in my project? How can I use this information to identify the features that have a real effect on the business?

It's important to remember that the real-world situation you are modeling is infinitely complex, and any model DataRobot builds is an approximation of that complex system. Each model has its strengths and weaknesses, and different models are able to capture varying degrees of that underlying complexity. For example, a model that is not capable of detecting nonlinear relationships or interactions will use the variables one way, while a model that can detect these relationships will use the variables another way, so you will get different feature impacts from different models. Feature impact shouldn't be drastically different, however, so while the exact ordering will change, the overall inference is often not impacted.

Collinearity can also play a role. If two variables are highly correlated, a regularized linear model will tend to use only one of them, while a tree-based method will tend to use both at different splits. So with the linear model, one of these variables will show up high in feature importance and the other will be low, while with the tree-based model, both will be closer to the middle.

More information for DataRobot users: search in-app documentation for Feature Impact.

In the Speed vs Accuracy graph, what exactly does speed measure?

Speed shows the time, in milliseconds, that it takes the model to score 2000 records. Most importantly, it does NOT measure the time for a round-trip API call (i.e., network latency). If that measurement is of interest, it must be tested in the actual system.

In the speed vs accuracy chart_a.png

More information for DataRobot users: search in-app documentation for Compare models, then locate information for "Using the Speed vs Accuracy tab."

What data partition is used to calculate feature impact?

For non-time-aware projects, a sample of 2500 rows from the training data is used to compute Feature Impact. The sampling process follows one of these approaches:

  • For balanced data, random sampling is used.
  • For imbalanced binary data, smart downsampling is used; DataRobot attempts to bring the distribution of the imbalanced binary target closer to 50/50 and adjusts the sample weights used for scoring (a conceptual sketch follows this list).
  • For zero-inflated regression data, smart downsampling is used; DataRobot groups the non-zero elements into the “minority” class.
  • For imbalanced multiclass data, random sampling is used. (Note that changes/improvements to this process are in progress.)
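As a conceptual sketch of smart downsampling for an imbalanced binary target (not DataRobot's exact procedure), the example below keeps all minority-class rows, downsamples the majority class to match, and assigns weights so that scoring still reflects the original class balance.

```python
# Conceptual sketch of smart downsampling for an imbalanced binary target:
# keep all minority rows, downsample the majority class, and weight the kept
# majority rows so the original class ratio is preserved when scoring.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"target": rng.random(100_000) < 0.02})   # ~2% positives

minority = df[df["target"]]
majority = df[~df["target"]].sample(n=len(minority), random_state=0)

sample = pd.concat([minority, majority]).copy()
sample["weight"] = np.where(
    sample["target"], 1.0,
    (len(df) - len(minority)) / len(majority)   # up-weight the kept majority rows
)
print(sample["target"].mean(), sample["weight"].sum())  # ~0.5 positives, weights sum to len(df)
```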

More information for DataRobot users: search in-app documentation for Feature Impact.

How can I find models that produce rating tables?

Rating tables are generated by Generalized Additive Models (GAMs). They look and feel very much like the output of a Generalized Linear Model (GLM): an intercept along with multiplicative coefficients. You can find GAMs on the Leaderboard by looking for models with the rating table icon, as shown in the image below.
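As a simplified, hypothetical illustration of how an intercept and multiplicative coefficients combine into a prediction (the features, bins, and coefficient values below are invented; an actual rating table contains more detail):

```python
# Hypothetical, simplified rating-table lookup: an intercept multiplied by the
# coefficient for each feature's bin. All names and values are invented for illustration.
rating_table = {
    "intercept": 0.08,
    "age": {"18-25": 1.40, "26-60": 1.00, "61+": 1.15},
    "region": {"urban": 1.20, "rural": 0.90},
}

def score(record):
    prediction = rating_table["intercept"]
    for feature, bins in rating_table.items():
        if feature == "intercept":
            continue
        prediction *= bins[record[feature]]   # multiply in the coefficient for this record's bin
    return prediction

print(score({"age": "18-25", "region": "urban"}))  # 0.08 * 1.40 * 1.20 = 0.1344
```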

How can I find models that produce a Rating Table.png

More information for DataRobot users: search in-app documentation for Rating Tables.

Why are there two prediction thresholds (i.e., Threshold (0-1) and Threshold set for Prediction Output) in the ROC Curve tab in the Prediction Distribution graph?

You will see two different thresholds displayed on the ROC Curve tab: Threshold (0-1), which allows you to experiment with different confusion matrices on the ROC Curve tab, and the Threshold set for Prediction Output, which allows you to set the final threshold that will be used to decide the class assignment for a given prediction value. Note that changing the Threshold (0-1) does NOT change the threshold that will be used for scoring new data. By default, the Threshold set for Prediction Output is 0.5. You can choose to set the latter threshold in the ROC Curve tab or in the Deployments tab, if the current model is used for deployment.

Why are there two prediction thresholds_a.png

More information for DataRobot users: search in-app documentation for ROC Curve.

Will my models improve if I add more observations to my training data?

Learning curves are designed to answer this question. As more observations are added, a model's performance will improve initially and then begin to level off. (This is important for anyone who believes they have more data than DataRobot can handle.) It's also important to distinguish between adding more columns and adding more rows to your training dataset; we often get questions about "more data," but the answer very much depends on whether the additional data is features or observations.

Will my models improve if I add more observations to my training data.png

More information for DataRobot users: search in-app documentation for Compare models, then locate information for "Learning Curves."

What are rating tables and what types of models generate them?

Rating tables are generated by Generalized Additive Models (GAM). They provide information about the model in general as well as a list of features and coefficients used to make predictions, including any interactions of features the model has found.

The rating table can be downloaded as a CSV in the Rating Table tab.

What is a rating table and what type of models generate them_01.png

What is a rating table and what type of models generate them_02.png

You can influence the predictions by updating values in the downloaded rating table and then uploading the table to create a new model.

More information for DataRobot users: search in-app documentation for Interpreting Generalized Additive Models (GA2M) output and Rating Tables.

What is the relationship between the prediction distributions, the confusion matrix, and the two thresholds you set in the Prediction Distribution chart area of the ROC Curve?

What is the relationship between the prediction distributions, the confusion matrix, and the two thresholds you set in the “Prediction Distribution” interface of the ROC Curve.png

(1) The prediction distributions form the foundation for the rest of the elements on the ROC Curve page. The Prediction Distribution chart shows the distribution of the probabilities assigned to each prediction by the model, grouped by the actual class that each observation belonged to.

(2) The confusion matrix is a summary of the two distributions for a given probability threshold. It counts the number of positive and negative predictions and how many are correctly and incorrectly labeled, based on that probability threshold. As the probability threshold changes, the counts will change across the four quadrants.

(3) The ROC curve is a plot of the true positive rate against the false positive rate, both calculated from the confusion matrix. As the probability threshold is reduced (and more records are classified as positive), the comparison point will move along the graph towards the right.

(4) The Cumulative Gain chart indicates the true positive rate for all predictions above the threshold. As the probability threshold is reduced (moved from right to left), the comparison point on the graph will move toward the right. The chart essentially measures how well the model has concentrated the positive records at one end of the sort order, so it operates as a ranking measure similar to AUC or Gini.

(5) The Threshold (0-1) value can be moved and adjusted to evaluate the model and identify the best threshold (away from the preset value which maximizes the F1 score). The Threshold set for Prediction Output is the threshold value that will be used at model deployment and so should be changed to the appropriate value before deployment.
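To make these relationships concrete, here is a minimal sketch (with made-up labels and probabilities) of how a threshold turns predicted probabilities into a confusion matrix, and how that confusion matrix yields the true and false positive rates plotted on the ROC curve.

```python
# Minimal sketch: threshold -> confusion matrix -> true/false positive rates.
import numpy as np

actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
proba  = np.array([0.91, 0.40, 0.72, 0.55, 0.35, 0.62, 0.48, 0.10])

def rates_at(threshold):
    predicted = proba >= threshold
    tp = np.sum(predicted & (actual == 1))
    fp = np.sum(predicted & (actual == 0))
    fn = np.sum(~predicted & (actual == 1))
    tn = np.sum(~predicted & (actual == 0))
    tpr = tp / (tp + fn)          # true positive rate (y-axis of the ROC curve)
    fpr = fp / (fp + tn)          # false positive rate (x-axis of the ROC curve)
    return (tp, fp, fn, tn), tpr, fpr

# Lowering the threshold classifies more records as positive,
# moving the comparison point along the ROC curve.
for t in (0.7, 0.5, 0.3):
    print(t, rates_at(t))
```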

More information for DataRobot users: search in-app documentation for ROC Curve.

Why are my text variables not showing up in feature fit (or feature effects)?

Because there are so many unique words and n-grams in free-form text, they cannot be shown in a graph the way other variables can. Even the top few words often show up in a very small percentage of the rows, so there would be very little data if we were to show the top few words the way we do with categorical features.

Why isn't variable x showing up on the Feature Fit display?

Feature Fit is computationally intensive, especially for datasets with many features. The Feature Fit display is populated with features in the order they appear on the Data tab when sorted by importance. This measure of importance is calculated using a non-linear correlation metric called ACE (alternating conditional expectations).

If your dataset has hundreds of columns and the feature you are interested in is close to the bottom of the Data tab when sorted by importance, you may need to wait for Feature Fit to reach that feature. Also, DataRobot caps the output at a maximum of 500 features, so if a feature is not in the top 500 by ACE score, it will never show up in Feature Fit. Text features and the target will not show up in Feature Fit either.

More information for DataRobot users: search in-app documentation for Feature Fit.

How does DataRobot decide which model to recommend for deployment?

DataRobot first identifies the most accurate non-blender model and then prepares it for deployment; the resulting prepared model is labeled "Recommended for Deployment." The rationale for this is that non-blenders are faster to score than blenders. Whether or not prediction speed is important will depend on your specific use case for the model.

How does DR decide which model to recommend for deployment.png

More information for DataRobot users: search in-app documentation for Leaderboard overview, then locate information for the “Interpreting model badges" section.

Can I get Feature Fit and Feature Effects for all features?

Feature Fit and Feature Effects are available for the top 500 features. For Feature Fit, the feature set is selected based on feature importance; for Feature Effects, it is based on feature impact. Text features will not appear in either set, even if they have high feature importance or feature impact scores, because there are too many possible values in a text feature.

This image (from the Data page) shows that feature diag_3_desc is in the top five of feature importance.

Can I get feature fit (and feature effects) for all features_01.png

However, in the Feature Fit page you see that this feature is not included in the feature list on the left of the chart.

Can I get feature fit (and feature effects) for all features_02.png

More information for DataRobot users: search in-app documentation for Feature Fit.

How should I determine how long a real-time prediction will take to score?

The best way to determine this is to test it in the environment where the model is deployed.
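For example, a minimal sketch of timing round-trip latency against a deployed prediction endpoint; the URL, headers, and file name are placeholders to replace with the details of your own deployment.

```python
# Minimal sketch of measuring round-trip prediction latency.
# URL, headers, and payload file are placeholders for your own deployment.
import time
import requests

URL = "https://example.com/path/to/your/prediction/endpoint"               # placeholder
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "text/csv"}  # placeholder
payload = open("scoring_sample.csv", "rb").read()                          # placeholder scoring data

timings = []
for _ in range(20):
    start = time.perf_counter()
    requests.post(URL, headers=HEADERS, data=payload, timeout=30)
    timings.append((time.perf_counter() - start) * 1000)                   # milliseconds

print(f"median round-trip latency: {sorted(timings)[len(timings) // 2]:.1f} ms")
```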

Can you explain the concept of model lift?

Technically "lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model." Lift is the ratio of points correctly classified as positive in our model versus the 45-degree line (or baseline model) as seen on the Cumulative Gains plot.

The ratios of these points create the Cumulative Lift chart, where for a given % of top predictions we can measure how much more effective the model is at identifying the positive class than the baseline model.

In the images below we can see that the Cumulative Gains chart shows a vertical orange line at 20% on the X-axis. In the baseline model this would equal 20% on the Y-axis, but we can see from the horizontal orange line that our model has correctly classified between 30-35% of these points as the positive class. The ratio of these two Y-axis percentages are shown in the Cumulative Lift chart and represent the lift of the model for the top 20% of predictions.
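The sketch below reproduces this calculation on synthetic data: sort records by predicted probability, take the top 20%, compute the cumulative gain (the share of all positives captured), and divide by the baseline 20% to get the lift.

```python
# Minimal sketch of cumulative gain and lift at the top 20% of predictions,
# using synthetic probabilities and labels for illustration.
import numpy as np

rng = np.random.default_rng(0)
actual = rng.random(1000) < 0.3
proba = np.clip(actual * 0.3 + rng.random(1000) * 0.7, 0, 1)   # noisy but informative scores

order = np.argsort(-proba)                       # sort records by predicted probability, descending
top_20pct = order[: int(0.2 * len(order))]

gain = actual[top_20pct].sum() / actual.sum()    # share of all positives captured in the top 20%
lift = gain / 0.2                                # a random baseline would capture 20%
print(f"cumulative gain at 20%: {gain:.2f}, lift: {lift:.2f}")
```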

Can you explain the concept of model lift_01.png

Can you explain the concept of model lift_02.png

More information for DataRobot users: search in-app documentation for ROC Curve, then locate information in the section “Cumulative charts overview.”

How does DataRobot determine which threshold to use for a binary classification problem?

There are two thresholds on the ROC Curve tab:

  • Threshold—This is interactive and, by default, set to the threshold that maximizes the F1 score (a sketch of finding such a threshold follows this list). Note that this does not impact predictions; it is used solely for analysis in the GUI.
  • Threshold used for predictions—This is set to 0.5 by default and should be set by you. This is the threshold used when DataRobot makes predictions. (DataRobot predictions consist of both probabilities and a yes/no classification, and it is this classification that uses the threshold.)
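As an illustration of how the default value of the first threshold can be derived, here is a minimal sketch (on synthetic data) of finding the threshold that maximizes the F1 score with scikit-learn.

```python
# Minimal sketch of finding the F1-maximizing threshold on synthetic data.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
actual = rng.random(2000) < 0.2
proba = np.clip(actual * 0.4 + rng.random(2000) * 0.6, 0, 1)   # synthetic scores

precision, recall, thresholds = precision_recall_curve(actual, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]          # the last precision/recall pair has no threshold
print(f"F1-maximizing threshold: {best:.3f}")
```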

How does DataRobot determine which threshold to use for binary classification problem.png

More information for DataRobot users: search in-app documentation for ROC Curve, then locate information in the “Threshold settings” section.

How is Feature Impact calculated?

Feature Impact is calculated with a technique sometimes called "permutation importance." Calculated AFTER a model is built, this technique can be applied to any modeling algorithm. The idea is to take the dataset and 'destroy the information' in each column (by randomly shuffling the contents of the feature across the dataset), one column at a time, then make predictions on all the resulting records and calculate the overall model performance. The permuted variable that had the largest impact on model performance is the most impactful feature and is given an impact value of 100%.

Features can have a negative impact on the model (i.e., the model improves when the shuffling occurs). It is recommended that you remove these features.
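Here is a minimal sketch of the permutation approach described above, using scikit-learn on synthetic data; it illustrates the idea rather than reproducing DataRobot's exact calculation.

```python
# Minimal sketch of permutation importance: shuffle one column at a time,
# re-score, and measure how much model performance degrades.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

baseline = log_loss(y_test, model.predict_proba(X_test)[:, 1])
rng = np.random.default_rng(0)
drops = []
for col in range(X_test.shape[1]):
    shuffled = X_test.copy()
    rng.shuffle(shuffled[:, col])                       # destroy the information in this column
    drops.append(log_loss(y_test, model.predict_proba(shuffled)[:, 1]) - baseline)

impact = 100 * np.array(drops) / max(drops)             # most impactful feature = 100%
print(np.round(impact, 1))
```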

How is Feature Impact calculated.png

More information for DataRobot users: search in-app documentation for Feature Impact.

My data has no missing values, so why does Feature Fit (and Feature Effects) show a missing category?

When DataRobot calculates Feature Fit and Feature Effects it also calculates the partial dependence. For numeric variables DataRobot calculates the partial dependence score for missing, even if there are no missing values in the data. This is to help evaluate that feature if missing values were to appear in the future.

In the Feature Fit tab this may mean that the X-axis has a “missing” value, but without actuals or predictions. You can toggle the partial dependence option on to see the value identified as missing.

My data has no missing values, why does Feature Fit (and Feature Effects) show a missing category.png

More information for DataRobot users: search in-app documentation for Feature Fit or Feature Effects.

What do the histograms in the Feature Fit (or Feature Effects) exhibit represent the sum of?

These histograms represent a count of rows or a sum of exposures (if exposures were used in the project) across either the training, validation, or holdout partition, depending on what you select from the Data Selection dropdown list below the graph.

What do the histograms in the Feature Fit (or Feature Effects) exhibit represent the sum of.png

More information for DataRobot users: search in-app documentation for Feature Fit or Feature Effects.

What does the diagonal gray line in the ROC Curve represent?

This represents the theoretical result you'd see if your model was randomly guessing with each prediction.

What does the diagonal gray line in the ROC Curve represent.png

What is the Matthews Correlation Coefficient? How do I find it in DataRobot?

The Matthews Correlation Coefficient (MCC) is a metric used for measuring the quality of a binary classification model. Unlike the F1 score, it incorporates all entries of the confusion matrix and so is more robust for data where the classes are of very different sizes (imbalanced).

The MCC score for a binary classification model can be found on the ROC Curve tab. It can also be used as an optimization metric.
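For reference, here is a minimal sketch of the MCC calculation from confusion-matrix counts; the counts are placeholders for values you might read from the ROC Curve tab.

```python
# Minimal sketch of the MCC formula from confusion-matrix counts (placeholder values).
import math

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(mcc(tp=120, fp=30, fn=40, tn=810), 3))
```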

What is the Matthews Correlation Coefficient_ How to find it in DataRobot_01.png

What is the Matthews Correlation Coefficient_ How to find it in DataRobot_02.png

More information for DataRobot users: search in-app documentation for Optimization metrics, then locate information for “MCC.”

Why would I not just always use the most accurate model?

There could be several reasons, but the two most common are:

  • Prediction latency—This means the speed at which predictions are made. Some business applications of a model will require very fast predictions on new data. The most accurate models are often blender models which are usually slower at making predictions.
  • Organizational readiness—Some organizations favor linear models and/or decision trees for perceived interpretability reasons. Additionally, there may be compliance reasons for favoring certain types of models over others.

Can I download the Lift Chart via the GUI?

Yes, it’s possible to download the Lift Chart using the Export button. You can download the graph as a PNG image and you can download the data used to build the chart as a CSV file.

Can I download the Lift Chart via the GUI_01.png

Can I download the Lift Chart via the GUI_02.png

Can I export Feature Fit (or Feature Effects) via the GUI?

Yes, it’s possible to download both the Feature Fit and Feature Effects graphs using the Export button. You can download the graph as a PNG image and you can download the data used to build the chart as a CSV file.

Can I export Feature Fit (or Feature Effects) via the GUI_01.png

Can I export Feature Fit (or Feature Effects) via the GUI_02.png

Can I see the reasons why a model made a certain prediction?

After you build models, you can use Prediction Explanations to help you understand the reasons DataRobot generated individual predictions.

Can I see the reasons why a model made a certain prediction.png

More information for DataRobot users: search in-app documentation for Prediction Explanations or Prediction Explanation considerations.

Can I tune model hyperparameters?

Yes, you can tune model hyperparameters on the Advanced Tuning tab, which is found on the Evaluate menu for a particular model. Note, however, that it is often better to spend your time on feature engineering than on tuning hyperparameters.

Can I tune model hyperparameters.png

More information for DataRobot users: search in-app documentation for Advanced Tuning.

Can I view the Lift Chart in more granularity than deciles?

Yes, it’s possible to view the lift chart with 10, 12, 15, 20, 30, or 60 bins. You can select these values using the ‘Number of Bins’ dropdown under the chart.

Can I view the Lift Chart in more granularity than deciles.png

More information for DataRobot users: search in-app documentation for Lift Charts, then locate information in the section “Changing the display.”


Can I view the Lift Chart on training data?

The Lift Chart is available for the validation, cross-validation, or holdout data, depending on how your model has been trained. You won’t be able to view it for the data the models were actually trained on; you can only view it on the partitions set aside for testing model performance.

Can I view the Lift Chart on training data.png

Does DataRobot provide a ROC curve for all models?

No. ROC curves assess how well a model predicts which class an observation belongs to, so they are only available for classification problems.

Does DataRobot provide a ROC curve for all models.png

More information for DataRobot users: search in-app documentation for ROC Curve.


How can I change the metric used on the vertical axis of the learning curve?

The Learning Curves display is based on the validation score, using the currently selected metric. To change the metric, navigate to the Leaderboard and change it there, then return to the Learning Curves display.

How can I change the metric used on the vertical axis of the learning curve.png

More information for DataRobot users: search in-app documentation for Compare models, then locate information in the “Learning Curves” section.

How can I compare the performance of my models?

There are many ways to compare model performance. The first place to look would be at the Leaderboard to compare model scores for the optimization metric you have used.

In addition, the DataRobot GUI provides several displays for performing direct comparisons:

  • Learning Curves shows how effectively a model has improved in accuracy as it has seen more training data. This is good for deciding which models may benefit from being trained into the validation or hold-out sets.
  • Speed vs Accuracy compares model accuracy against the speed at which it can make predictions. So although blender models are often the most accurate, they come at the cost of prediction speed. If prediction latency is important for model deployment then this will help you find the most effective model.
  • Model Comparison lets you compare Lift Charts and ROC Curves between two different models. (Note that this is not available for multiclass prediction projects.)

How can I compare the performance of my models_01.png

How can I compare the performance of my models_02.png

How can I compare the performance of my models_03.png

How can I compare the performance of my models_04.png

More information for DataRobot users: search in-app documentation for Compare models.

How can I see which features are most important?

To see which features are most strongly correlated with the target on a univariate (i.e., non-modeling) basis, look at Feature Importance. To see which features are most important according to a particular model, look at Feature Impact.

How can I see which features are most important_01.png

How can I see which features are most important_02.png

More information for DataRobot users: search in-app documentation for Feature Impact.

How do I change the prediction threshold?

You will see two different thresholds displayed on the ROC Curve tab. You can change the ‘Threshold (0-1)’ value to experiment and look at different confusion matrices on the ROC Curve tab, but doing so does NOT change the threshold used when predictions are made.

To change the prediction threshold, you need to change the 'Threshold set for Prediction Output' section on the ROC Curve tab. Additionally, you can set this at deployment.

How do I change the prediction threshold_01.png

How do I change the prediction threshold_02.png

More information for DataRobot users: search in-app documentation for ROC Curve, then locate information in the “Prediction threshold” section.

How many explanations can I get for each prediction?

DataRobot gives you three explanations by default, but this can be extended to up to ten by changing the value in the Get top [] explanations box above the chart.

How many explanations can I get for each prediction.png

More information for DataRobot users: search in-app documentation for Prediction Explanations.

Records from what data partition are returned on the Prediction Explanations page?

Prediction Explanations are returned for data in the validation partition. It is also possible to calculate and download the Prediction Explanations for the training data by clicking the orange Compute & Download button.

Records from what data partition are returned on the _prediction explanations_ page.png

More information for DataRobot users: search in-app documentation for Prediction Explanations.

What data is used in the ROC Curve?

You can select the graph data source from a dropdown just above the ROC curve. The available options (Validation, Cross-Validation, and Holdout) depend on whether you have run or enabled that set.

What data is used in the ROC Curve.png

More information for DataRobot users: search in-app documentation for ROC Curve.

What data is used to generate the Lift Chart?

You can select the graph data source from a dropdown just below the Lift Chart. The available options (Validation, Cross-Validation, and Holdout) depend on whether you have run or enabled that set.

What data is used to generate the Lift Chart.png

More information for DataRobot users: search in-app documentation for Lift Charts.

What does the “ID” represent on the Prediction Explanations tab?

The number in the ID column is the row number ID from the imported dataset.

What does the “ID” represent on the Prediction Explanations page.png

What is the difference between density and frequency on the ROC Curve tab?

The density chart displays an equal area underneath both the positive and negative curves. The area underneath each frequency curve varies and is determined by the number of observations in each class.

What is the difference between density and frequency on the ROC Curve page.png

More information for DataRobot users: search in-app documentation for ROC Curve.

What is the difference between Feature Fit and Feature Effects?

The main difference between these two displays is that Feature Fit uses Feature Importance to identify the most important features, whereas Feature Effects uses Feature Impact. The important distinction between Feature Importance and Feature Impact is that Feature Importance is calculated at the general level (i.e., not model dependent), whereas Feature Impact is calculated by each model depending on how it utilizes that feature.

In addition, by default partial dependence is turned off in Feature Fit (though you can turn it on), while actual and predicted values are turned off by default in Feature Effects (though you can turn them on as well).

What is the difference between Feature Fit and Feature Effects_01.png

What is the difference between Feature Fit and Feature Effects_02.png

More information for DataRobot users: search in-app documentation for Feature Fit and Feature Effects.

Why don’t I see all models on the Learning Curves?

DataRobot groups models on the Leaderboard by the blueprint ID and Feature List. So, for example, every Regularized Logistic Regression model, built using the Informative Features feature list, is a single model group. A Regularized Logistic Regression model built using a different feature list is part of a different model group.

Learning Curves only shows the top 10 performing model groups, plus the top-performing blender models.

Why don’t I see all models on the Learning Curves_01.png

Why don’t I see all models on the Learning Curves_02.png

More information for DataRobot users: search in-app documentation for Compare models, then locate more information in the section “Learning Curves additional info.”


How does DataRobot decide which model is 'Recommended for Deployment'?

DataRobot identifies the most accurate non-blender model and prepares it for deployment with four steps:

  1. DataRobot calculates feature impact and uses this to create a reduced feature list.
  2. Then, DataRobot retrains the model on the reduced feature list and decides which of the two models (original or reduced feature list) should progress to the next stage.
  3. The selected model is then retrained at the up-to-holdout sample size (usually 80% of the data).
  4. Finally, for non-time-aware models, DataRobot retrains the model as a frozen run (hyperparameters frozen from the 80% run) on 100% of the data. For time-aware models, DataRobot retrains the model on the most recent data.

More information for DataRobot users: search in-app documentation for Leaderboard overview, then locate more information in the section “Understanding the model recommendation process.”
