FAQs: Building Models


(Updated February 2021)

This section provides answers to frequently asked questions related to building models. If you don't find an answer to your question, you can ask it now; use Post your Comment (below) to get your question answered.

How does DataRobot select the right blueprint for a given problem?

DataRobot doesn't "select" blueprints; it creates them dynamically based on the dataset that you give it and the target you specify.

Data scientists at DataRobot have spent over 8 years embedding their data science knowledge and best practices into this blueprint development process. Hundreds of blueprints are created for each project.

Most data science problems have multiple viable approaches, and you can't know in advance which ones will perform best. DataRobot uses the Leaderboard to compare blueprints and identify the best ones.


More information for DataRobot users: search in-app Platform documentation for Leaderboard overview.

Can I see the effect of some of the DataRobot-created features?

DataRobot shows the effect of features—both raw features imported with the dataset and engineered/generated features—through the Insights tab.


Here, both Tree-Based Variable Importance and Variable Effects show the variables used by the models.

It’s important to note that all features—those imported with the dataset (i.e., raw features) and those created by transformations—can be evaluated for predictive signal and interpreted for their relationship to the target variable. 

Determining which features are transformed (either manually or automatically)

See Data menu > Project Data tab. Features created by variable transformation are indicated with an info (i) icon.


In contrast, features that DataRobot generated as part of Time Series or Feature Discovery are not listed in the Project Data tab with the info (i) icon.

Instead, DataRobot creates new feature lists for those newly created time series features.


Features generated as part of Feature Discovery are listed in the Project Data table with names consisting of the dataset alias and the type of transformation. If you select one of these newly generated features, you can view the Feature Lineage, a visual description of how the feature was generated.


More information for DataRobot users: search in-app Platform documentation for DataRobot insights, and locate information in the sections “Using Tree-Based Variable Importance” and “Variable Effects.” You can also search for Time series modeling, Feature lists, Generated features, and Feature transformations.

Is it possible for users to change the preprocessing method?

Yes, you can change some of the preprocessing methods used within DataRobot on the Advanced Tuning tab of each model, and even perform a grid search to explore the best hyperparameters. If you have a custom preprocessing step that you want to include, you can perform it outside of DataRobot and load in the processed dataset. This becomes seamless when using the DataRobot R or Python clients.
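
For illustration, here is a minimal sketch of that workflow with the DataRobot Python client: preprocess the data with pandas, then upload the processed dataset as a new project. The endpoint, token, file name, and column names are placeholders, not part of the original answer.

    import datarobot as dr
    import pandas as pd

    # Connect to DataRobot (placeholder endpoint and API token).
    dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

    # A custom preprocessing step performed outside of DataRobot;
    # "income" is a hypothetical column in a hypothetical file.
    df = pd.read_csv("raw_data.csv")
    df["income_missing"] = df["income"].isna()
    df["income"] = df["income"].fillna(df["income"].median())

    # Load the processed dataset into DataRobot as a new project.
    project = dr.Project.create(sourcedata=df, project_name="Custom preprocessing")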


More information for DataRobot users: search in-app Platform documentation for Coefficients (and preprocessing details) and locate information for "Matrix of token occurrences."

What do asterisks on the Leaderboard metrics mean?

The asterisks indicate that the scores are computed from stacked predictions on the model's training data. Stacked predictions are DataRobot's method for ensuring that predictions made on training data don't have misleadingly high accuracy. (To see details, hover over an asterisk to display its tooltip.)


More information for DataRobot users: search in-app Platform Documentation for Leaderboard overview and locate information for “Understanding asterisked scores.”

What does a snowflake icon located near a model in the Leaderboard mean?

This indicates a frozen run. A frozen run of a model is a retrained version of another model, where hyperparameters from the other model are frozen and the model is simply retrained on more observations. The sample percentage used to obtain the parameters is displayed after the snowflake. 

(Another way to understand this is as parent and child models: the child model is the frozen run of the parent model. The hyperparameters from the parent model are frozen, and the child model is simply retrained on more observations.)

For example, a child model might be based on “frozen” parameter settings from the 80% sample size version of the parent model.


More information for DataRobot users: search in-app Platform Documentation for Frozen runs.

Why didn't cross-validation automatically run on my dataset?

Cross-validation has a hard cutoff at 50,000 rows; if the dataset is greater than or equal to 50,000 rows, DataRobot does not run cross-validation automatically. 

  • If you require automatic cross-validation, use a dataset with fewer than 50,000 rows.
  • If you need to get cross-validation values for a specific model, on the Leaderboard for that model click Run (under the Cross Validation column).  
    • If the dataset is larger than 800MB, DataRobot will use TVH by default. (Cross-validation is generally best practice for smaller datasets where you would not otherwise have enough useful data using TVH.)


 More information for DataRobot users: search in-app Platform Documentation for Data partitioning and validation.

How does DataRobot perform cross-validation?

By default, DataRobot will use 80% of a dataset split into 5 folds with stratified sampling (each fold preserves the original ratio of the target values). The remaining 20% of the dataset is reserved as a holdout. The sizes and number of folds can be adjusted via Advanced Settings.
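
As an illustration only (scikit-learn stand-ins, not DataRobot internals), this sketch reproduces that default scheme: a stratified 20% holdout, with the remaining 80% split into 5 stratified folds. The file and target column names are hypothetical.

    import pandas as pd
    from sklearn.model_selection import StratifiedKFold, train_test_split

    df = pd.read_csv("my_dataset.csv")  # hypothetical dataset
    X, y = df.drop(columns=["target"]), df["target"]

    # Reserve 20% as a holdout, preserving the target ratio (stratified).
    X_train, X_holdout, y_train, y_holdout = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Split the remaining 80% into 5 stratified cross-validation folds.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_idx, valid_idx) in enumerate(cv.split(X_train, y_train)):
        print(f"Fold {fold}: {len(train_idx)} train rows, {len(valid_idx)} validation rows")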

DataRobot automatically carries out cross-validation (CV) if the dataset has fewer than 50,000 rows; otherwise, it uses training, validation, and holdout (TVH) partitioning. (Cross-validation is generally useful for smaller datasets, where TVH would not otherwise leave enough useful data.) If the dataset is larger than 800MB, CV isn't allowed and TVH must be used.

To manually run CV (rather than TVH) for a specific model when its dataset contains over 50,000 rows and is under 800MB: locate that model on the Leaderboard and click the Run link in the model’s Cross Validation column. 


More information for DataRobot users:

  • search in-app Platform Documentation for Data partitioning and validation, then locate more information in the section "K-fold cross-validation (CV)." 
  • search in-app Platform Documentation for Partitioning and model validation, then locate more information in the section "Ratio-preserved partitioning (Stratified)." 

Is there documentation for the hyperparameters?

Yes. All hyperparameters for an algorithm are documented in the DataRobot Model Docs. To access hyperparameter documentation for a specific model, click the model box from that model's blueprint.


More information for DataRobot users: search in-app Platform Documentation for Advanced Tuning, and locate information for “parameters” and “hyperparameters.” Also search for Blueprints.

What does DataRobot mean by blueprint vs model? Is this an important distinction?

A modeling algorithm fits a model to data; that fitted model is just one component of a blueprint, whereas a blueprint is essentially the end-to-end pipeline or recipe.

A blueprint also includes data preprocessing. This is a vital difference, especially if you've found yourself saying, "It looks like I still have to prepare my data for modeling, but I spent 80% of my time doing that today and DataRobot automated only the other 20%."

It's important to explain that the 80% actually consists of two parts:

(a) a big SQL join to create the flat file, and

(b) making that flat file "model-ready," which may include encoding categoricals, transforming numerics, imputing missing values, parsing words/ngrams, etc.

The process of making the data model-ready also depends on which algorithm you're going to use. After you create the flat file for your data, DataRobot applies one or more different approaches for each algorithm to make it model-ready.
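
To make the distinction concrete, here is a rough scikit-learn analogue of a blueprint: preprocessing plus a modeling algorithm fit as one end-to-end pipeline. The column names are hypothetical, and the steps shown are only examples of "model-ready" transformations.

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_cols = ["age", "income"]          # hypothetical numeric features
    categorical_cols = ["state", "channel"]   # hypothetical categorical features

    preprocessing = ColumnTransformer([
        # Numerics: impute missing values with the median, then standardize.
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
        # Categoricals: one-hot encode, ignoring levels unseen during training.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

    # The full "blueprint": preprocessing + algorithm, fit as a single unit.
    blueprint = Pipeline([("prep", preprocessing),
                          ("model", LogisticRegression(max_iter=1000))])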


More information for DataRobot users: search in-app Platform Documentation for Blueprints.

What is the benefit of having many model workers?

With more modeling workers, you can build more models in parallel which directly equates to less time needed for a project to finish. Also, if you are building models in two or more projects, you can allocate workers between them. For example, if you have 10 modeling workers at your disposal and are working on two projects, you can increase the number of workers for both projects up to 10; however, now they are competing with each other for the next available worker. As an alternative, you could allocate 5 workers to one project and 5 to the other, thereby ensuring that each project has workers. (You can view and change worker allocations in the Worker Queue.)


More information for DataRobot users: search in-app Platform Documentation for Worker Queue.

When downloading batch predictions of the training data, how can I see the original correct labels in the prediction download (i.e., in the CSV file)?

You can add up to five features to the predictions when you download the CSV file. Use this option to add features such as reference IDs or the actual labels from the original training data.


More information for DataRobot users: search in-app Platform Documentation for Make Predictions tab.

Why are there models in the repository that didn't get run?

Autopilot runs a sample of the models that gives a good balance of accuracy and runtime. Models that offer the possibility of improved accuracy, but at the cost of potentially longer runtimes (e.g., Deep Learning Classifiers), are held in the Repository. A good practice is to run Autopilot, identify the algorithm that performed best on the data, and then run all variants of that algorithm from the Repository. You can also run Comprehensive mode, which runs all models from the Repository, but it may take a while to finish.


Can you explain the differences between the modeling modes, and suggest when to use each mode?

Modeling modes determine which blueprints are run and how much of the data is used. These are accessible from the dropdown menu that is located just below the Start button.

  • In full Autopilot mode, DataRobot selects and runs the best predictive blueprints given the distribution of the target variable and all the other variables in your data; it does this in a survival-of-the-fittest mode. Sample sizes are typically 16%, 32%, and 64%.
  • In Quick mode (the default), a subset of the Autopilot blueprints are run against, typically, 32% and 64% of the data.
  • In Manual mode, you choose which specific blueprints are used. 
  • Comprehensive mode runs all Repository blueprints on the maximum Autopilot sample size to ensure more accuracy for models. (Available for supervised AutoML projects only.)


In general, (full) Autopilot mode can run everything automatically with minimal need for user intervention other than to select a target and start the process. Quick mode is Autopilot but applied to a smaller subset of the blueprints to give you a base set of models and insights quickly. Comprehensive mode (also Autopilot) results in extended build times but can ensure more accuracy in models. (Comprehensive Autopilot mode is not available for time series or unsupervised projects.) Manual mode is exactly that; after DataRobot creates blueprints, you manually select which to run, which feature lists to use, what sample size to use, etc.

Modeling projects often require iteration. Comprehensive and Autopilot modes take longer to run but are the most powerful, while Quick and Manual modes take less time.

We recommend you start DataRobot with Autopilot modeling mode for the first iteration. You can make observations and get ideas to make improvements, e.g., feature engineering, joining new features, etc. For the next few iterations, we recommend running Manual mode and refitting only those Blueprints that performed the best on the initial Autopilot run. After a few iterations, after the dataset has been modified and enriched fairly significantly, we recommend re-running Autopilot, as other algorithms might now do better than they did initially. Repeat this process, alternating between Autopilot and Manual modes, to more quickly build out the most accurate model. Finally, for supervised AutoML projects, you can run Comprehensive mode to get the most accurate model for your use case.

More information for DataRobot users: for AutoML projects, search in-app Platform Documentation for Modeling workflow, then locate information for “Setting the modeling mode for AutoML projects.” For time-aware projects (time series or OTV), instead search for Autopilot in time-aware projects.

How does DataRobot determine which AutoML models to train on 16/32/64/80 percent of data?

These are the default sample sizes DataRobot uses for Autopilot.

  • For smaller datasets with fewer than 2000 rows, Autopilot uses a 64% sample size.
  • For datasets with between 2001 and 3999 rows, Autopilot uses both 32% and 64% sample sizes.

Above that size cutoff, DataRobot will train every model starting at 16%. After scoring the models, DataRobot will choose the top 16 to train at 32% in the next round. DataRobot will then train the top 8 models from the previous round at 64%. Finally, from the top 8 models, DataRobot chooses 1 for deployment and retrains it at 80%.

These values are adjustable. Once a model has been created you can retrain that model at any percentage of the dataset, either from the Leaderboard or from the Repository.


 More information for DataRobot users:

  • search in-app Platform Documentation for Modeling workflow then locate information in the section "Setting the modeling mode for AutoML projects."
  • search in-app Platform Documentation for Modeling process details then locate information in the section "Working with small datasets." 

Also, for time-aware projects, search in-app Platform Documentation for Autopilot in time-aware projects, then locate information in the section "Working with small datasets." 

How do the different blenders work?

A blender model is a model that increases accuracy by combining the predictions of two or more models. DataRobot supports several types of blenders including:

  • Average types that average together the outputs of the sub-models:
    • Average Blend (AVG)
    • Median Blend (MED)
    • Mean Absolute Error-Minimizing Weighted Average Blend (MAE)
    • Mean Absolute Error-Minimizing Weighted Average Blend with L1 Penalty (MAEL1)
    • Advanced Average Blend (Advanced AVG)
  • Model types that add a second layer of models on top of the submodels, using the submodel predictions as their predictors and the same target feature:
    • Partial Least Squares Blend (PLS)
    • Generalized Linear Model Blend (GLM)
    • Elastic Net Blend (ENET)
    • Random Forest Blend (RF)
    • TensorFlow Blend (TF)
    • LightGBM Blend (LGBM)
    • Advanced Generalized Linear Model Blend (Advanced GLM)
    • Advanced Elastic Net Blend (Advanced ENET)

For example, a GLM blender might blend a Random Forest classifier and two XGBoost classifiers.
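
As an illustration of the second-layer idea (not DataRobot's implementation), here is a sketch using scikit-learn's StackingClassifier as a stand-in for a GLM blend: a logistic regression is fit on the out-of-fold predictions of the sub-models. It assumes the xgboost package is installed.

    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier  # assumes xgboost is installed

    blender = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200)),
            ("xgb_shallow", XGBClassifier(max_depth=3)),
            ("xgb_deep", XGBClassifier(max_depth=6)),
        ],
        final_estimator=LogisticRegression(),  # the second-layer "GLM"
        cv=5,  # sub-model predictions are made out-of-fold, as in stacked predictions
    )
    # blender.fit(X_train, y_train)  # X_train/y_train supplied by the caller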


More information for DataRobot users:

  • search in-app Platform Documentation for Leaderboard overview, then locate more information in the section "Understanding blender models."
  • search in-app Platform Documentation for Add and delete models, then locate more information in the section "Creating a blended model." 

What are stacked predictions?

The stacked predictions technique leverages DataRobot's cross-validation functionality to build multiple models on different subsets of the training data—one model for each validation fold. Each model then makes predictions on the fold it wasn't trained on, so those predictions are effectively out-of-sample.

Without this technique, predictions made on training data (i.e., in-sample) would show misleadingly high accuracy because the model would be predicting answers it has already learned from the training data. Such predictions are overly optimistic and thus not useful.
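
The idea is easy to demonstrate with scikit-learn's cross_val_predict (an analogue, not DataRobot's implementation): every training row gets a prediction from a model that never saw that row during fitting.

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss
    from sklearn.model_selection import cross_val_predict

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)

    # In-sample predictions: misleadingly optimistic.
    in_sample = model.fit(X, y).predict_proba(X)
    # Stacked (out-of-sample) predictions: an honest estimate on training data.
    stacked = cross_val_predict(model, X, y, cv=5, method="predict_proba")

    print("In-sample LogLoss:", log_loss(y, in_sample))
    print("Stacked LogLoss:  ", log_loss(y, stacked))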

More information for DataRobot users: search in-app Platform Documentation for Make Predictions tab, then locate more information in the section "Understanding stacked predictions."

Can I rerun a model on a different feature list or sample size?

Yes, you can do so directly from the Leaderboard. At the top of the Leaderboard, click Add new model. You'll be able to select the model you want to re-run from the dropdown. You can also change the feature list, the sample size, and the number of cross-validation runs.


This can also be done from the Repository.
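
It can also be done programmatically; here is a hedged sketch with the DataRobot Python client (the project ID and feature list ID are placeholders).

    import datarobot as dr

    project = dr.Project.get("PROJECT_ID")   # placeholder project ID
    model = project.get_models()[0]          # e.g., the top Leaderboard model

    # Re-run the model's blueprint with a different feature list and sample size.
    job_id = model.train(sample_pct=80, featurelist_id="FEATURELIST_ID")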

More information for DataRobot users: search in-app Platform Documentation for Add and delete models, then locate more information for "Retraining a model." 

Do I have to run one at a time from the Repository, or can I run several at once?

You can select any number of blueprints from the Repository. You are also able to choose what feature list you want to use, the sample size, and how many cross-validation runs to perform. When you click Run Task, DataRobot takes care of the rest.


More information for DataRobot users: search in-app Platform Documentation for Model Repository.

How do I create my own blenders?

Creating a blender model in DataRobot is straightforward. On the Leaderboard, choose the models you would like to blend, and from the Menu choose the blending method. (Note that depending on the types of models you've chosen, some of the blending methods may be greyed out.)


DataRobot creates a new blender model and makes it available on the Leaderboard.


 More information for DataRobot users: search in-app Platform Documentation for Add and delete models, then locate more information under the process for "Creating a blended model." 

How can I filter the Leaderboard?

There are two methods to filter models on the Leaderboard: using stars or tags.

You can apply a star to any model, and to any number of models. Then, use the Starred Models filter to quickly list only the starred models on the Leaderboard.


The second filtering method is to click any tag or combination of tags (or badges) that DataRobot applies to the models. This automatically filters the Leaderboard to display only the models with those tags/badges.


You can also use these filters together to further narrow the results.


More information for DataRobot users: search in-app Platform Documentation for Leaderboard overview, then locate more information in the section "Understanding Leaderboard components."

Can you provide some examples of “guardrails” that DataRobot provides to guide the user and ensure the usability of the model produced?

DataRobot enforces guardrails to ensure machine learning best practices are followed and proactively protects against human error.

Some of the guardrails DataRobot applies:

  • Automated detection of target leakage (that is, the use of features during training that wouldn't be available at prediction time) and removal of leaking features.
  • Automated data partitioning to prevent model overfitting while ensuring the highest possible model accuracy.
  • Automated data drift tracking as part of managing models deployed to production, since data can change over time and affect the accuracy of model predictions.

More information for DataRobot users: search in-app Platform Documentation for Data Quality Assessment, Data partitioning and validation, Feature lists (and search for “Informative Features - Leakage Removed”), and Data Drift tab.

Why can’t I see the content of a given node in a Blueprint?

The ability to censor blueprints enables DataRobot to protect its intellectual property. By default, blueprints don't show the full extent of DataRobot's sophisticated data preprocessing and feature engineering. Instead, a box in the blueprint indicates the preprocessing step without exposing its details.

Uncensored blueprints can be enabled for DataRobot On-Premise AI Cluster, Private AI Cloud, and Hybrid AI Cloud customers to give them greater insight into the DataRobot modeling process. To have censoring disabled (and expose those hidden details), reach out to your DataRobot Account Team or Customer Support.

For example, a censored blueprint might indicate that preprocessing relevant to a tree-based algorithm was performed, without showing the precise steps of the process.


How does DataRobot handle missing data?

DataRobot handles missing data in different ways depending on the blueprint. Some models, such as XGBoost and LightGBM, natively handle missing values.

For numerical features, DataRobot handles missing data as follows (a scikit-learn sketch of these strategies appears after the lists):

Linear Models (e.g., Linear Regression, SVM)

  • Data missing-at-random
    • Impute missing numerical values using the median of the non-missing data
  • Data missing-not-at-random
    • Add a binary missing value flag for each feature with such values, allowing the model to recognize the structural pattern and learn from it

Tree-based Models (e.g., Random Forest)

  • Feature is missing 10% or more values
    • Impute an arbitrary value (-9999 by default) rather than the median, as model building will be faster without affecting accuracy.
    • This default value can be changed in Evaluate > Advanced Tuning
  • Feature is missing less than 10% of its values
    • Impute the median
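
Here is a hedged sketch of those numeric strategies using scikit-learn equivalents (illustrative stand-ins, not DataRobot internals):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan]])

    # Linear models: impute the median and add a binary missing-value flag,
    # so the model can learn from values that are missing-not-at-random.
    linear_imputer = SimpleImputer(strategy="median", add_indicator=True)
    print(linear_imputer.fit_transform(X))
    # [[1. 0.] [2. 0.] [2. 1.] [4. 0.] [2. 1.]]

    # Tree-based models with >=10% missing values: impute an arbitrary
    # sentinel (-9999 by default in DataRobot), which trees can split on.
    tree_imputer = SimpleImputer(strategy="constant", fill_value=-9999)
    print(tree_imputer.fit_transform(X))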

For categorical features in all models, DataRobot treats missing values as another level in the categories. The threshold for the minimum number of missing elements in a feature to trigger imputation can be changed in the Advanced Options after uploading a dataset. The default value is 10.

The imputed missing values of a model can be seen under the Missing Values tab on the Leaderboard. The values shown are calculated from the first cross-validation fold (CV1).


More information for DataRobot users: search in-app Platform Documentation for Data Quality Assessment (and locate information for "Disguised missing values") and Modeling process details (and locate information for "Handling missing values").

How does DataRobot decide which blenders to run?

DataRobot supports several ways of blending models and runs five blenders by default in any Autopilot run: three regular and two advanced. The blending methods chosen depend on dataset size, optimization metric, and project type (classification/regression vs. time series). Additionally, you can create new blenders by selecting specific models with the Menu and choosing a blending method to ensemble them.

The default blenders are:

  • Average Blend
  • GLM Blend
  • ENET Blend
  • Advanced Average Blend
  • Advanced GLM Blend

More information for DataRobot users: search in-app Platform Documentation for Leaderboard overview, then locate more information in the section "Understanding blender models."

Are observations dropped or shuffled during computation of Feature Impact?

Observations are not dropped. Rather, DataRobot uses up to 2500 rows of the training data to calculate Feature Impact. DataRobot will then apply Permutation Importance, which randomly shuffles a single feature (column) at a time while leaving the other features unchanged. The scores are normalized such that the most impactful feature has a value of 1.0.
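
For intuition, here is the same technique sketched with scikit-learn's permutation_importance (an analogue, not DataRobot's implementation); the final normalization mirrors the 1.0 top score described above.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    X, y = load_breast_cancer(return_X_y=True)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Shuffle one column at a time and measure the drop in model score.
    result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
    normalized = result.importances_mean / result.importances_mean.max()
    print(normalized)  # the most impactful feature scores 1.0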


More information for DataRobot users: search in-app Platform documentation for Feature Impact, then locate more information in the section "How Feature Impact is calculated."

Can you explain OTV?

OTV, or Out-of-Time Validation, is date/time partitioning. OTV splits data according to a date/time feature for training and validation. With OTV, you train on records from earlier in time and validate on records from later in time. This prevents using "future" observations to train a model, which can cause target leakage and lead to overly optimistic predictions.
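
Conceptually, OTV is a date-based split; here is a minimal pandas sketch (the file, column, and cutoff date are hypothetical):

    import pandas as pd

    df = pd.read_csv("transactions.csv", parse_dates=["date"])  # hypothetical file
    df = df.sort_values("date")

    cutoff = pd.Timestamp("2020-07-01")   # hypothetical cutoff date
    train = df[df["date"] < cutoff]       # earlier records: used for training
    valid = df[df["date"] >= cutoff]      # later records: used only for validation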


More information for DataRobot users: search in-app Platform Documentation for Date/time partitioning (and out-of-time validation).

I have a target variable made up of continuous values. It has a lot of zeros, too. Can I build a regression project with it and use smart downsampling?

Yes, Smart Downsampling can be used on regression problems where the target is zero-inflated; DataRobot handles this type of problem well, out-of-the-box. This type of target distribution is commonly modeled with a Tweedie distribution and presents itself as a large proportion of the records having a response value of zero. If the target is not zero-inflated, Smart Downsampling can't be used on the regression problem.


More information for DataRobot users: search in-app Platform documentation for Smart Downsampling.

Does DataRobot do any upsampling like SMOTE?

DataRobot models handle class imbalances without the need for upsampling. Class imbalance is an issue if you evaluate the models using simple metrics like accuracy. DataRobot directly optimizes the models for objectives that are both aligned with the project metric and robust to imbalanced targets (such as LogLoss). If the project metric is different, e.g., AUC, it is used afterwards to fine-tune the hyperparameters of the models. Upsampling introduces additional risk to model performance which is why DataRobot does not natively have this option.

More information for DataRobot users: search in-app Platform Documentation for Smart Downsampling.

For a categorical variable with N levels, how many indicator variables does DataRobot’s one-hot encoding create?

DataRobot encodes categorical variables using one-hot encoding, which produces N indicator (dummy) variables. For example, if a categorical variable has five levels, DataRobot produces five indicators.
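
You can verify the N-indicator behavior with a quick pandas check (an illustration, not DataRobot code):

    import pandas as pd

    s = pd.Series(["a", "b", "c", "d", "e"], name="level")  # 5-level categorical
    dummies = pd.get_dummies(s)
    print(dummies.shape[1])  # 5: one indicator column per level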

How are new models tested and evaluated?

At a high level, all potential new models to be added inside DataRobot are tested against many curated, well-defined datasets designed to fully stress-test any model.

The test has three broad components that must all be passed:

  1. The model must never crash.
  2. The model must never produce inconsistent predictions across different runs on the same dataset.
  3. The model must provide an improvement in accuracy, runtime, or RAM usage over existing models.

These tests are run with varying parameters in multiple environments (cloud, OS, Hadoop distributions, etc.) and compared against known baseline tests. The baselines are revalidated frequently (monthly to quarterly) to ensure they incorporate the latest improvements. Additional post-model training validation checks are also done, including Prediction Explanations, Prime models, model retraining into holdout, and prediction tests.

New algorithms undergo much more scrutiny and extensive testing to validate the value they can add to the DataRobot platform. Ultimate signoff depends on test results and the expertise of the DataRobot Data Science team.

How many models does DataRobot run? What kind of predictive algorithms does DataRobot run?

DataRobot can potentially create and run millions of different models; however, not all of these models will be listed in the Repository or be available to run with Autopilot. Which models get created and subsequently run depends on the dataset and the problem to be solved. Also, note that data preprocessing (one of the steps in a DataRobot blueprint) is considered by DataRobot to be a key piece of modeling. Multiple blueprints can all use the same machine learning algorithm but apply completely different preprocessing steps and parameters, and are therefore distinct modeling strategies.

DataRobot makes use of many types of machine learning algorithms, including GLMs, Support Vector Machines, Gaussian Processes, Tree-Based Models, Deep Learning Models (e.g., TensorFlow), Anomaly Detection, and Text Mining. Other supported algorithms include K-Nearest Neighbors, Generalized Additive Models (Rating Tables), and DataRobot Eureqa (a proprietary, patented genetic algorithm).
