(Updated February 2021)
This section provides answers to frequently asked questions related to building models.
DataRobot doesn't "select" blueprints; it creates them dynamically based on the dataset that you give it and the target you specify.
Data scientists at DataRobot have spent over 8 years embedding their data science knowledge and best practices into this blueprint development process. Hundreds of blueprints are created for each project.
For most data science problems, there are multiple viable approaches, and you can't know in advance which will perform best. DataRobot uses the Leaderboard to compare blueprints and identify the best ones.
More information for DataRobot users: search in-app Platform documentation for Leaderboard overview.
DataRobot shows the effect of features—both engineered (generated) and raw—through the Insights tab.
Here, both Tree-Based Variable Importance and Variable Effects show the variables used by the models.
It’s important to note that all features—those imported with the dataset (i.e., raw features) and those created by transformations—can be evaluated for predictive signal and interpreted for their relationship to the target variable.
See Data menu > Project Data tab. Features created by variable transformation are indicated with an info (i) icon, for example:
In contrast, features that DataRobot generated as part of Time Series or Feature Discovery are not listed in the Project Data tab with the info (i) icon.
Instead, DataRobot creates new feature lists for those newly created time series features:
Features generated as part of Feature Discovery are listed in the Project Data table with names consisting of the dataset alias and the type of transformation. If you select one of these newly generated features, you can view the Feature Lineage (a visual description of how the feature was generated):
More information for DataRobot users: search in-app Platform documentation for DataRobot insights, and locate information in the sections “Using Tree-Based Variable Importance” and “Variable Effects.” You can also search for Time series modeling, Feature lists, Generated features, and Feature transformations.
Yes, you can change some of the preprocessing methods used within DataRobot on the Advanced Tuning tab of each model and even perform a grid search of the best hyperparameters to explore. If you have a custom preprocessing step that you want to include, you could do it outside of DataRobot and load in the processed dataset. This becomes seamless when using the DataRobot R or Python clients.
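If you go the external-preprocessing route, the workflow can look like the following sketch. The column names (`income`, `churn`) and project name are hypothetical, and the `datarobot` client calls are shown commented out because they require an API token and endpoint:

```python
# Sketch: apply a custom preprocessing step with pandas, then load the
# processed dataset into DataRobot via the Python client.
import numpy as np
import pandas as pd

def custom_preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Example custom step: log-transform a skewed numeric feature."""
    out = df.copy()
    out["log_income"] = np.log1p(out["income"].clip(lower=0))
    return out

raw = pd.DataFrame({"income": [30_000, 55_000, 120_000], "churn": [0, 1, 0]})
processed = custom_preprocess(raw)

# Upload the processed dataset and start modeling (requires real credentials):
# import datarobot as dr
# dr.Client(endpoint="https://app.datarobot.com/api/v2", token="<API token>")
# project = dr.Project.create(sourcedata=processed, project_name="custom-prep")
# project.set_target(target="churn", mode=dr.AUTOPILOT_MODE.FULL_AUTO)
```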
More information for DataRobot users: search in-app Platform documentation for Coefficients (and preprocessing details) and locate information for "Matrix of token occurrences."
The asterisks indicate that the scores are computed from stacked predictions on the model's training data. Stacked predictions are DataRobot’s method for ensuring that predictions made from training data don’t have misleadingly high accuracy. (Hover over an asterisk to see its tooltip.)
More information for DataRobot users: search in-app Platform Documentation for Leaderboard overview and locate information for “Understanding asterisked scores.”
This indicates a frozen run. A frozen run of a model is a retrained version of another model, where hyperparameters from the other model are frozen and the model is simply retrained on more observations. The sample percentage used to obtain the parameters is displayed after the snowflake.
(Another way to understand this is as parent and child models. In this scenario, a child model is the frozen run of a retrained parent model. The hyperparameters from the parent model are frozen and the child model is simply retrained on more observations.)
For example, this model (child model) was based on “frozen” parameter settings from the 80% sample size version of the model (parent model):
More information for DataRobot users: search in-app Platform Documentation for Frozen runs.
Cross-validation has a hard cutoff at 50,000 rows; if the dataset is greater than or equal to 50,000 rows, DataRobot does not run cross-validation automatically.
More information for DataRobot users: search in-app Platform Documentation for Data partitioning and validation.
By default, DataRobot will use 80% of a dataset split into 5 folds with stratified sampling (each fold preserves the original ratio of the target values). The remaining 20% of the dataset is reserved as a holdout. The sizes and number of folds can be adjusted via Advanced Settings.
DataRobot will automatically carry out cross-validation (CV) if the dataset is fewer than 50,000 rows; otherwise, it will use training, validation, and holdout (TVH) partitioning. (Cross-validation is generally useful for smaller datasets, where TVH would not leave enough data for reliable validation.) If the dataset is larger than 800MB, CV won't be allowed and TVH has to be used.
To manually run CV (rather than TVH) for a specific model when its dataset contains over 50,000 rows and is under 800MB: locate that model on the Leaderboard and click the Run link in the model’s Cross Validation column.
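The default partitioning scheme described above can be illustrated with scikit-learn. This is a conceptual sketch of the scheme (80% training split into 5 stratified folds, 20% holdout), not DataRobot's internal implementation:

```python
# Illustrate 20% stratified holdout plus 5 stratified CV folds over the
# remaining 80%, mirroring DataRobot's default partitioning.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))          # synthetic features
y = rng.integers(0, 2, size=1000)       # synthetic binary target

# 20% holdout, stratified on the target
X_cv, X_holdout, y_cv, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5 stratified folds over the remaining 80%; each fold preserves the
# original ratio of the target values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in skf.split(X_cv, y_cv)]
```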
More information for DataRobot users:
Yes. All hyperparameters for an algorithm are documented in the DataRobot Model Docs. To access hyperparameter documentation for a specific model, click the model box from that model's blueprint.
More information for DataRobot users: search in-app Platform Documentation for Advanced Tuning, and locate information for “parameters” and “hyperparameters.” Also search for Blueprints.
A modeling algorithm fits a model to data, which is just one component of a blueprint, whereas a blueprint is essentially the end-to-end pipeline or recipe.
The blueprint also includes data preprocessing. This is a vital difference, especially if you've found yourself saying "It looks like I still have to prepare my data for modeling, but I spent 80% of my time doing that today and DataRobot automated only the other 20%."
It's important to explain that the 80% actually consists of two parts:
(a) a big SQL join to create the flat file, and
(b) making that flat file "model-ready," which may include encoding categoricals, transforming numerics, imputing missing values, parsing words/ngrams, etc.
The process of making the data model-ready also depends on which algorithm you're going to use. After you create the flat file for your data, DataRobot applies one or more different approaches for each algorithm to make it model-ready.
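A blueprint-style pipeline for a linear model can be sketched with scikit-learn; a tree-based blueprint would apply different preprocessing steps. The column names are hypothetical, and this is only an illustration of the idea, not DataRobot's actual preprocessing:

```python
# "Model-ready" preprocessing for a linear model: impute, scale, one-hot
# encode, then fit — the kind of end-to-end pipeline a blueprint represents.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric, categorical = ["age", "income"], ["state"]

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
blueprint = Pipeline([("prep", prep), ("model", LogisticRegression())])

df = pd.DataFrame({"age": [25, None, 40, 31],
                   "income": [50_000, 60_000, None, 45_000],
                   "state": ["CA", "NY", "CA", "TX"]})
y = [0, 1, 0, 1]
blueprint.fit(df, y)   # the flat file goes in raw; the pipeline does the rest
```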
More information for DataRobot users: search in-app Platform Documentation for Blueprints.
With more modeling workers, you can build more models in parallel which directly equates to less time needed for a project to finish. Also, if you are building models in two or more projects, you can allocate workers between them. For example, if you have 10 modeling workers at your disposal and are working on two projects, you can increase the number of workers for both projects up to 10; however, now they are competing with each other for the next available worker. As an alternative, you could allocate 5 workers to one project and 5 to the other, thereby ensuring that each project has workers. (You can view and change worker allocations in the Worker Queue.)
More information for DataRobot users: search in-app Platform Documentation for Worker Queue.
You can add up to five features to the predictions when you download the CSV file. Use this option to add features such as reference IDs or the actual labels from the original training data.
More information for DataRobot users: search in-app Platform Documentation for Make Predictions tab.
Autopilot runs a sample of the models that gives a good balance of accuracy and runtime. Models that offer possible accuracy improvements at the cost of longer runtimes (e.g., Deep Learning Classifiers) are held in the Repository. A good practice is to run Autopilot, identify the algorithm that performed best on the data, and then run all variants of that algorithm from the Repository. You can also run Comprehensive mode, which runs all models from the Repository, though it may take a while to finish.
Modeling modes determine which blueprints are run and how much of the data is used. These are accessible from the dropdown menu that is located just below the Start button.
In general, (full) Autopilot mode can run everything automatically with minimal need for user intervention other than to select a target and start the process. Quick mode is Autopilot but applied to a smaller subset of the blueprints to give you a base set of models and insights quickly. Comprehensive mode (also Autopilot) results in extended build times but can ensure more accuracy in models. (Comprehensive Autopilot mode is not available for time series or unsupervised projects.) Manual mode is exactly that; after DataRobot creates blueprints, you manually select which to run, which feature lists to use, what sample size to use, etc.
Modeling projects often require iteration. Comprehensive and Autopilot modes take longer to run but are the most powerful, while Quick and Manual modes take less time.
We recommend you start DataRobot with Autopilot modeling mode for the first iteration. You can make observations and get ideas to make improvements, e.g., feature engineering, joining new features, etc. For the next few iterations, we recommend running Manual mode and refitting only those Blueprints that performed the best on the initial Autopilot run. After a few iterations, after the dataset has been modified and enriched fairly significantly, we recommend re-running Autopilot, as other algorithms might now do better than they did initially. Repeat this process, alternating between Autopilot and Manual modes, to more quickly build out the most accurate model. Finally, for supervised AutoML projects, you can run Comprehensive mode to get the most accurate model for your use case.
More information for DataRobot users: for AutoML projects, search in-app Platform Documentation for Modeling workflow, then locate information for “Setting the modeling mode for AutoML projects.” For time-aware projects (time series or OTV), instead search for Autopilot in time-aware projects.
These are the default sample sizes DataRobot uses for Autopilot.
Above that size cutoff, DataRobot will train every model starting at 16%. After scoring the models, DataRobot will choose the top 16 to train at 32% in the next round. DataRobot will then train the top 8 models from the previous round at 64%. Finally, from the top 8 models, DataRobot chooses 1 for deployment and retrains it at 80%.
These values are adjustable. Once a model has been created you can retrain that model at any percentage of the dataset, either from the Leaderboard or from the Repository.
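The round structure described above resembles successive halving and can be sketched in a few lines of Python. The blueprint pool size and the scoring function are stand-ins, not DataRobot internals:

```python
# Simulate Autopilot's sampling rounds: score everything at 16%, keep the
# top 16 for 32%, the top 8 for 64%, then one survivor is retrained at 80%.
import random

random.seed(0)
blueprints = [f"bp_{i}" for i in range(40)]   # hypothetical blueprint pool

def train_and_score(blueprint, sample_pct):
    # Stand-in for fitting a model and scoring it on validation data.
    return random.random()

survivors = blueprints
for sample_pct, keep in [(16, 16), (32, 8), (64, 1)]:
    ranked = sorted(survivors,
                    key=lambda bp: train_and_score(bp, sample_pct),
                    reverse=True)
    survivors = ranked[:keep]

best = survivors[0]   # retrained at 80% of the data for deployment
```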
More information for DataRobot users:
Also, for time-aware projects, search in-app Platform Documentation for Autopilot in time-aware projects, then locate information in the section "Working with small datasets."
A blender model is a model that increases accuracy by combining the predictions of two or more models. DataRobot supports several types of blenders including:
The following is an example of a GLM blender that blends a Random Forest Classifier and two XGBoost Classifiers:
More information for DataRobot users:
The stacked predictions technique leverages DataRobot's cross-validation functionality to build multiple models on different subsets of the training data—a model for each of the validation folds. It guarantees the models will use a fold of data they weren't trained on to make predictions, so they are effectively out-of-sample.
Without this technique, predictions made on training data (i.e., in-sample) would show misleadingly high accuracy, because the model would be predicting answers it has already learned from the training data. Such predictions are overly optimistic and thus not useful.
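The out-of-fold idea behind stacked predictions can be demonstrated with scikit-learn's `cross_val_predict`: each row is predicted by a model that never saw that row during training. This illustrates the technique in general, not DataRobot's specific implementation:

```python
# In-sample vs. out-of-fold ("stacked") predictions on the same training data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# In-sample: fit on all data, then predict the same data (optimistic)
in_sample = model.fit(X, y).predict(X)

# Stacked: 5-fold out-of-sample predictions for every training row
stacked = cross_val_predict(model, X, y, cv=5)

print("in-sample accuracy:", accuracy_score(y, in_sample))
print("stacked accuracy:  ", accuracy_score(y, stacked))
```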
More information for DataRobot users: search in-app Platform Documentation for Make Predictions tab, then locate more information in the section "Understanding stacked predictions."
Yes, you can do so directly from the Leaderboard. At the top of the Leaderboard, click Add new model. You'll be able to select the model you want to re-run from the dropdown. You can also change the feature list, the sample size, and the number of cross-validation runs.
This can also be done from the Repository.
More information for DataRobot users: search in-app Platform Documentation for Add and delete models, then locate more information for "Retraining a model."
You can select any number of blueprints from the Repository. You are also able to choose what feature list you want to use, the sample size, and how many cross-validation runs to perform. When you click Run Task, DataRobot takes care of the rest.
More information for DataRobot users: search in-app Platform Documentation for Model Repository.
Creating a blender model in DataRobot is very straightforward. On the Leaderboard, choose the models you would like to blend and from the Menu choose the blending method. (Note that depending on the types of models you've chosen, some of the blending methods may be greyed out.)
DataRobot creates a new blender model and makes it available on the Leaderboard.
More information for DataRobot users: search in-app Platform Documentation for Add and delete models, then locate more information under the process for "Creating a blended model."
There are two methods to filter models on the Leaderboard: using stars or tags.
You can apply a star to any model and any number of models. Then, use the Starred Models filter to quickly list only the starred models on the Leaderboard.
The second filtering method is clicking any tag, or combination of tags (or badges), that DataRobot applies to the models. This automatically filters the Leaderboard to display only the models with those tags/badges.
And you can use these filters together to further narrow the results:
More information for DataRobot users: search in-app Platform Documentation for Leaderboard overview, then locate more information in the section "Understanding Leaderboard components."
DataRobot enforces guardrails to ensure machine learning best practices are followed and proactively protects against human error.
Some of the guardrails DataRobot applies:
More information for DataRobot users: search in-app Platform Documentation for Data Quality Assessment, Data partitioning and validation, Feature lists (and search for “Informative Features - Leakage Removed”), and Data Drift tab.
The ability to censor blueprints enables DataRobot to protect its intellectual property. Blueprints won't show the full extent of DataRobot’s sophisticated data preprocessing and feature engineering by default. Instead, those details are replaced by a box in the blueprint to indicate the preprocessing step without exposing details.
Uncensored blueprints can be enabled for DataRobot On-Premise AI Cluster, Private AI Cloud, and Hybrid AI Cloud customers to give them greater insight into the DataRobot modeling process. To have censoring disabled (and expose those hidden details), reach out to your DataRobot Account Team or Customer Support.
For example, the following censored blueprint indicates that preprocessing relevant to a tree-based algorithm was done without showing the precise steps of the process.
DataRobot handles missing data in different ways depending on the blueprint. Some models, such as XGBoost and LightGBM, natively handle missing values.
For numerical features, DataRobot handles missing data as follows:
Linear Models (e.g., Linear Regression, SVM)
Tree-based Models (e.g., Random Forest)
For categorical features in all models, DataRobot treats missing values as another level in the categories. The threshold for the minimum number of missing elements in a feature to trigger imputation can be changed in the Advanced Options after uploading a dataset. The default value is 10.
The imputed missing values of a model can be seen under the Missing Values tab on the Leaderboard. The values shown are calculated from the first cross-validation fold (CV1).
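Two common strategies of the kind described above can be sketched with pandas. This is illustrative only (column names and the missing-level label are hypothetical); DataRobot's exact steps vary by blueprint:

```python
# Numeric: median imputation plus a missing-value indicator column.
# Categorical: treat "missing" as another level of the category.
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, None, 72_000, None],   # numeric with missing values
    "state": ["CA", None, "NY", "CA"],        # categorical with missing values
})

df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

df["state"] = df["state"].fillna("==Missing==")  # missing as its own level
```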
More information for DataRobot users: search in-app Platform Documentation for Data Quality Assessment (and locate information for "Disguised missing values") and Modeling process details (and locate information for "Handling missing values.”)
DataRobot supports several ways of blending models and runs five blenders by default in any Autopilot run: three regular and two advanced. The blending methods chosen depend on dataset size, optimization metric, and project type (classification/regression vs. time series). Additionally, you can create new blenders by selecting specific models with the Menu and choosing a blending method to ensemble them.
The default blenders are:
More information for DataRobot users: search in-app Platform Documentation for Leaderboard overview, then locate more information in the section "Understanding blender models."
Observations are not dropped. Rather, DataRobot uses up to 2500 rows of the training data to calculate Feature Impact. DataRobot will then apply Permutation Importance, which randomly shuffles a single feature (column) at a time while leaving the other features unchanged. The scores are normalized such that the most impactful feature has a value of 1.0.
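Permutation importance, as named above, can be sketched in a few lines: shuffle one column at a time, measure the change in the loss, and normalize so the most impactful feature scores 1.0. A minimal sketch on synthetic data, not DataRobot's implementation:

```python
# Permutation importance: per-feature loss increase when that feature
# (and only that feature) is randomly shuffled, normalized to max = 1.0.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
base = log_loss(y, model.predict_proba(X))

drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])      # shuffle a single feature
    drops.append(log_loss(y, model.predict_proba(Xp)) - base)

impact = np.array(drops) / max(drops)          # most impactful feature = 1.0
```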
More information for DataRobot users: search in-app Platform documentation for Feature Impact, then locate more information in the section "How Feature Impact is calculated."
OTV, or Out-of-Time Validation, is date/time partitioning. OTV splits data according to a date/time feature for training and validation. With OTV, you train on records from earlier in time and validate on records from later in time. This prevents using "future" observations to train a model, which can cause target leakage and lead to overly optimistic predictions.
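An out-of-time split can be sketched in pandas (column names hypothetical): train on earlier records, validate on later ones, so no "future" observation leaks into training:

```python
# Split a time-indexed dataset at a cutoff date: earlier rows train,
# later rows validate.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "y": range(10),
})
cutoff = pd.Timestamp("2020-01-08")
train = df[df["date"] < cutoff]    # records earlier in time
valid = df[df["date"] >= cutoff]   # records later in time
```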
More information for DataRobot users: search in-app Platform Documentation for Date/time partitioning (and out-of-time validation).
Yes, Smart Downsampling can be used on regression problems where the target is zero-inflated, that is, where a large proportion of records have a response value of zero (this type of distribution is commonly modeled with the Tweedie distribution). DataRobot handles this type of problem well out of the box. If the target is not zero-inflated, Smart Downsampling can't be used on the regression problem.
More information for DataRobot users: search in-app Platform documentation for Smart Downsampling.
DataRobot models handle class imbalances without the need for upsampling. Class imbalance is an issue if you evaluate the models using simple metrics like accuracy. DataRobot directly optimizes the models for objectives that are both aligned with the project metric and robust to imbalanced targets (such as LogLoss). If the project metric is different, e.g., AUC, it is used afterwards to fine-tune the hyperparameters of the models. Upsampling introduces additional risk to model performance which is why DataRobot does not natively have this option.
More information for DataRobot users: search in-app Platform Documentation for Smart Downsampling.
DataRobot encodes categorical variables using one-hot encoding, which produces one indicator (dummy) variable per level. For example, if a categorical variable has five levels, DataRobot produces five indicators.
At a high level, all potential new models to be added inside DataRobot are tested against many curated, well-defined datasets designed to fully stress-test any model.
The test has three broad components that must all be passed:
These tests are run with varying parameters in multiple environments (cloud, OS, Hadoop distributions, etc.) and compared against known baseline tests. The baselines are revalidated frequently (monthly to quarterly) to ensure they incorporate the latest improvements. Additional post-model training validation checks are also done, including Prediction Explanations, Prime models, model retraining into holdout, and prediction tests.
New algorithms undergo much more scrutiny and extensive testing to validate the value they can add to the DataRobot platform. Ultimate signoff depends on test results and the expertise of the DataRobot Data Science team.
DataRobot can potentially create and run millions of different models; however, not all of these models will be listed in the Repository or be available to run with Autopilot. Which models get created and subsequently run depends on the dataset and the problem to be solved. Also, note that data preprocessing (one of the steps in the DataRobot blueprint) is considered by DataRobot to be a key piece of modeling. Multiple blueprints can all use the same machine learning algorithm but use completely different preprocessing steps and parameters, and they are therefore distinct modeling strategies.
DataRobot makes use of many types of machine learning algorithms, including GLMs, Support Vector Machines, Gaussian Processes, tree-based models, deep learning models (e.g., TensorFlow), anomaly detection, and text mining. Other supported algorithms include K-Nearest Neighbors, Generalized Additive Models (Rating Tables), and DataRobot Eureqa (a proprietary, patented genetic algorithm).