This article covers the options you have when creating a project in DataRobot. The Advanced Options section has several tabs that let you customize modeling.
After you upload your data and select a target variable, DataRobot automatically chooses default settings suited to your dataset. To adjust these settings, click the Show Advanced Options link at the bottom of the page.
When you select the Data tab, the Partitioning tab opens. Partitioning is important because it determines which rows of data are used in each validation approach. For example, you wouldn’t want all of the rows with the positive target class in the same partition, because then that class wouldn’t be available for training and validation in the other partitions.
DataRobot supports the following partitioning methods: random, stratified, partition feature, and group.
You can also use OTV (out-of-time validation) mode for date/time partitioning. This allows you to train on earlier data and then test on later data.
Figure 1. Partitioning
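The date/time idea described above (train on earlier data, test on later data) can be illustrated with a minimal Python sketch. This is not DataRobot's implementation; the function name `split_by_date`, the `day` column, and the cutoff date are all hypothetical.

```python
from datetime import date

def split_by_date(rows, date_key, cutoff):
    """Put rows dated before `cutoff` in training and the rest in test."""
    train = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    return train, test

# Hypothetical example data: one row per day with a target value.
rows = [
    {"day": date(2023, 1, 5), "target": 0},
    {"day": date(2023, 2, 10), "target": 1},
    {"day": date(2023, 3, 15), "target": 0},
    {"day": date(2023, 4, 20), "target": 1},
]
train, test = split_by_date(rows, "day", cutoff=date(2023, 3, 1))
```

Every training row is older than every test row, so the model is never evaluated on data that precedes what it was trained on.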
With random partitioning, DataRobot randomly assigns observations (rows) to the training, validation, and holdout sets.
With stratified partitioning, rows are assigned in a way that ensures similar target distribution across each partition. This is important if you have an imbalanced target feature.
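To make the idea of a similar target distribution concrete, here is a small sketch of one way to stratify rows into folds. It illustrates the concept only and does not show how DataRobot assigns rows; `stratified_folds` and the toy target are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(targets, n_folds=5, seed=0):
    """Assign each row to a fold so every fold gets a similar target mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(targets):
        by_class[y].append(idx)
    fold_of = [None] * len(targets)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the rows of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of

# Imbalanced toy target: 10 positives among 100 rows.
targets = [1] * 10 + [0] * 90
folds = stratified_folds(targets, n_folds=5)
positives_per_fold = [
    sum(1 for i, f in enumerate(folds) if f == k and targets[i] == 1)
    for k in range(5)
]
```

With purely random assignment, a fold could by chance end up with very few positives; dealing each class out separately avoids that.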
Partition Feature partitioning
With the partition feature option, you can choose a partition feature and DataRobot will create a distinct partition for each unique value of that feature. This is useful when you want DataRobot to respect some partitioning you made outside of DataRobot.
With group partitioning, you choose a group feature and DataRobot ensures that all of the observations with the same value are in the same partition. This sounds similar to partition feature partitioning, but the difference is that with group partitioning, a single partition can contain multiple values. The same value will never appear in two partitions, though.
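The group guarantee can be sketched in a few lines of Python. This is an illustrative greedy assignment, not DataRobot's algorithm; `group_folds` and the customer IDs are hypothetical.

```python
def group_folds(group_values, n_folds=5):
    """Map rows to folds so rows sharing a group value stay together."""
    counts = {}
    for g in group_values:
        counts[g] = counts.get(g, 0) + 1
    fold_of_group = {}
    fold_sizes = [0] * n_folds
    # Place the largest groups first, each into the currently smallest fold.
    for g, size in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        smallest = fold_sizes.index(min(fold_sizes))
        fold_of_group[g] = smallest
        fold_sizes[smallest] += size
    return [fold_of_group[g] for g in group_values]

# Hypothetical group feature: one customer ID per row.
customers = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
folds = group_folds(customers, n_folds=3)
```

Six customers are spread over three folds, so a fold holds more than one customer, but no customer is ever split across folds.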
Smart downsampling is used when you have a large, imbalanced dataset. For example, if you have a large dataset and the minority class makes up only 1% of your target, then it may make sense to reduce the overall sample size in a way that also reduces the class imbalance.
Simply toggle Downsample Data and adjust the slider to downsample the majority class.
Figure 2. Smart Downsampling
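As a rough illustration of downsampling the majority class, the sketch below keeps all minority rows and a fraction of the majority rows. Giving the kept majority rows a compensating weight is an assumption made here for illustration, as are the names `smart_downsample`, `churned`, and `keep_fraction`; none of this is DataRobot code.

```python
import random

def smart_downsample(rows, target_key, majority_value, keep_fraction, seed=0):
    """Keep all minority rows and a weighted sample of the majority rows."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[target_key] == majority_value]
    minority = [r for r in rows if r[target_key] != majority_value]
    sampled = [{**r, "weight": 1.0} for r in minority]
    if not majority:
        return sampled
    keep_n = max(1, round(len(majority) * keep_fraction))
    kept = rng.sample(majority, keep_n)
    # Each kept majority row stands in for several original rows.
    majority_weight = len(majority) / keep_n
    sampled += [{**r, "weight": majority_weight} for r in kept]
    return sampled

# Toy data: 1% of rows belong to the minority class.
rows = [{"id": i, "churned": 1 if i < 10 else 0} for i in range(1000)]
sampled = smart_downsample(rows, "churned", majority_value=0, keep_fraction=0.1)
minority_share = sum(r["churned"] for r in sampled) / len(sampled)
```

The sample shrinks from 1,000 rows to 109, the minority share rises from 1% to roughly 9%, and the weights still sum to the original row count.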
Feature Discovery is a supervised approach to reducing the training dataset to only informative features. It is enabled by default and removes fields that are unlikely to contribute to the model. This shortens processing time and makes the model easier to interpret while keeping accuracy about the same. For example, if a feature has the same value in every row, there is no pattern to learn, so DataRobot omits that feature from modeling. You can turn off Feature Discovery if you prefer.
Figure 3. Feature Discovery
Feature Constraints lets you enforce monotonicity in your models. You might do this if you know the direction of the relationship between a feature and the target. For example, a higher-valued home or car should always lead to higher insurance rates.
To use this option, create a feature list that contains only numerical features. Indicate features with a positive relationship to the target as Monotonic Increasing and those with an inverse relationship as Monotonic Decreasing.
Figure 4. Feature Constraints
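To clarify what a monotonic increasing relationship means in practice, the following sketch checks whether a prediction function never decreases as a feature grows. The function `is_monotonic_increasing` and both toy prediction functions are hypothetical and only illustrate the property, not how DataRobot enforces it.

```python
def is_monotonic_increasing(predict, feature_values):
    """Return True if predictions never drop as the feature value rises."""
    ordered = sorted(feature_values)
    predictions = [predict(v) for v in ordered]
    return all(a <= b for a, b in zip(predictions, predictions[1:]))

# Hypothetical premium that always grows with home value.
def constrained_premium(home_value):
    return 200 + 0.002 * home_value

# Hypothetical premium that dips and then rises again.
def unconstrained_premium(home_value):
    return 500 + (home_value / 100_000 - 3) ** 2

home_values = [100_000, 200_000, 300_000, 400_000, 500_000]
constrained_ok = is_monotonic_increasing(constrained_premium, home_values)
unconstrained_ok = is_monotonic_increasing(unconstrained_premium, home_values)
```

The first function satisfies the constraint; the second does not, because its predictions fall between 100,000 and 300,000 before rising again.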
There are a number of customizations you can carry out under the Additional tab.
You can change the optimization metric used in your project to validate the models and tune the hyperparameters.
Figure 5. Optimization Metric
Automation Settings allows you to customize some of the processes that take place during Autopilot.
These include options around:
Searching for interactions.
Including only blueprints with Scoring Code.
Automatically creating Blenders.
Including only models with SHAP feature impact and prediction explanations rather than permutation importance and XEMP.
Recommending and preparing the top model for Deployment (train on 100% of the data).
Including Blenders in the recommendation.
Using Accuracy Optimized blueprints, which can be much slower but are very accurate.
Including Scaleout models.
Using the informative features-leakage removed feature list by default.
Figure 6. Automation Settings
If your dataset has more than 50,000 rows, DataRobot automatically determines how many models to run cross-validation on. You can override this here. If cross-validation was not run automatically on a model, you can run it manually from the Leaderboard.
Figure 7. Cross-Validation
Upper Bound Running Time
You can set a limit for how long a single model can take to run (in hours).
Figure 8. Upper Bound
You can limit the value of the response (target) to a percentile of the original values (between 0.5 and 1).
Figure 9. Response Cap
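Capping the response at a percentile can be pictured with the sketch below. It uses a simple nearest-rank percentile, which is an assumption; the exact percentile calculation DataRobot uses is not described here, and `cap_response` and the claim amounts are hypothetical.

```python
import math

def cap_response(values, percentile):
    """Clip values at the given percentile (nearest-rank method)."""
    if not 0.5 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0.5 and 1")
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered)))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values], cap

# Toy target with one extreme value.
claims = [100, 200, 300, 400, 500, 600, 700, 5000]
capped, cap = cap_response(claims, percentile=0.75)
```

Here the cap is 600, so the two largest values (700 and 5000) are both reduced to 600 while the rest are unchanged.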
Random Seed & Positive Class Assignment
You can set the random seed and indicate the positive class.
Figure 10. Random Seed
You can indicate a feature to be used as weights for the different rows. This is especially useful if you are doing smart downsampling.
Figure 11. Weights
Exposure, Count of Events, and Offset
You can set the Exposure, which transforms the feature to add value to predictions. The target must be a positive numeric with a cardinality greater than 2.
You can also specify the Count of Events. This contains the frequency of non-zero events that contributed to the target value. This is used in frequency-severity blueprints only when the target is zero-inflated.
The Offset setting allows you to add the value listed for each feature to the prediction prior to applying the link function.
Figure 12. Exposure, Offset, and Events
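One common way to think about exposure and offset is through a model with a log link, where the offset is added to the linear score and exposure enters as the log of its value. The sketch below assumes that setup; it is a generic illustration rather than a description of any specific DataRobot blueprint, and `predict_with_offset` is a hypothetical name.

```python
import math

def predict_with_offset(linear_score, offset=0.0, exposure=1.0):
    """Log-link prediction: exp(linear_score + offset + log(exposure))."""
    return math.exp(linear_score + offset + math.log(exposure))

# A policy with an expected 0.1 claims per year of coverage.
annual_rate_score = math.log(0.1)
two_years = predict_with_offset(annual_rate_score, exposure=2.0)
half_year = predict_with_offset(annual_rate_score, exposure=0.5)
```

Under this assumption the prediction scales in strict proportion to exposure: two years of coverage doubles the expected count, and half a year halves it.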
There are three modeling modes to choose from: Autopilot, Quick, and Manual.
Figure 13. Modeling Modes
Autopilot mode is the most computationally thorough of the three. In this mode DataRobot will start with a small percentage of the data and apply a variety of different algorithms to the problem. The algorithms that do well will survive to the next round of modeling and will be provided more data. The models from that round that perform the best will get even more data, and so on. This mode is useful when you want to know what the best approach is to solve your specific modeling problem.
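The survival-of-the-fittest idea behind Autopilot can be sketched as a loop over rounds. The sample fractions, the keep ratio, the model names, and the scoring function below are all hypothetical placeholders, not DataRobot's actual settings.

```python
def run_rounds(candidates, score, sample_fractions=(0.16, 0.32, 0.64), keep_ratio=0.5):
    """Score candidates on growing samples, keeping the top share each round."""
    survivors = list(candidates)
    history = []
    for fraction in sample_fractions:
        ranked = sorted(survivors, key=lambda name: score(name, fraction), reverse=True)
        keep = max(1, int(len(ranked) * keep_ratio))
        survivors = ranked[:keep]
        history.append((fraction, survivors))
    return survivors, history

# Toy scores: each model has a base accuracy that improves with more data.
base_accuracy = {"gbm": 0.80, "glm": 0.70, "rf": 0.78, "knn": 0.60}

def toy_score(name, fraction):
    return base_accuracy[name] + 0.05 * fraction

winners, history = run_rounds(list(base_accuracy), toy_score)
```

In this toy run, four candidates are cut to two after the first round and to one after the second, so only the strongest approach is trained on the largest sample.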
Quick mode is a narrower run of the Autopilot mode. This mode (the default mode) starts out using a larger percentage of the data and focuses only on those algorithms that tend to rise to the top of the Leaderboard.
Manual mode allows you to select individual models from the Repository to run. This is useful if you already know which kind of modeling approach you want to use.
If you’re a licensed DataRobot customer, search the in-app documentation for Show Advanced Options link.