This article describes different ways to handle multicollinearity within DataRobot.
The concept of multicollinearity comes from traditional linear models such as linear regression. Multicollinearity occurs when there is a linear relationship among several explanatory variables. A special case of multicollinearity is collinearity, in which exactly two explanatory variables have a linear relationship.
Multicollinearity tends to cause problems when you examine the coefficients of your model. Because some features are related, the coefficient of each one no longer reliably reflects its individual effect on the outcome of the model.
Having multicollinearity in your model does not affect accuracy; the validation, cross-validation, and holdout scores in DataRobot remain valid. Multicollinearity is primarily a redundancy and interpretability problem, which you can address by reducing redundancy in the feature list. DataRobot provides several ways to do this.
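The point that accuracy survives while individual coefficients do not can be seen in a small experiment. The sketch below is plain NumPy, not DataRobot code, and every variable name in it is made up for illustration; it fits a least-squares model on two nearly collinear features:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# x2 is (almost) an exact linear function of x1 -> multicollinearity
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=1e-6, size=n)
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Accuracy is unaffected: R^2 stays close to 1 ...
r2 = 1 - np.sum((y - X @ coef) ** 2) / np.sum((y - y.mean()) ** 2)

# ... but the individual coefficients are arbitrary: any split with
# coef[0] + 2 * coef[1] close to 3 predicts (almost) equally well, so the
# fitted values of coef[0] and coef[1] can be large and opposite in sign.
```

The model's fit is essentially perfect either way; only the story the coefficients tell becomes unreliable.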
You could use the Feature Association matrix to identify highly associated variables. Then, create a new feature list that removes all but one of the highly associated features. The Feature Association matrix lists up to 50 features from your dataset along both the X-axis and Y-axis (Figure 1). By default, those features are taken from the Informative feature list, sorted by the cluster they belong to, and with mutual information as the association score. Each of these choices can be replaced with other options.
Figure 1. Feature Association matrix
The panel to the right of the Feature Association matrix provides a better view of both the clusters that DataRobot found and the association score for each pair of features (Figure 2).
Figure 2. The association scores for each pair of features
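DataRobot computes these association scores for you, but the intuition behind mutual information as an association measure can be sketched in a few lines. The following is a conceptual NumPy illustration using simple histogram binning — not DataRobot's actual implementation, and the feature names are invented:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram-based mutual information between two numeric features."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint distribution estimate
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y
    mask = pxy > 0                            # avoid log(0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.1, size=1000)   # strongly associated with a
c = rng.normal(size=1000)                  # independent of a

# mutual_information(a, b) is large; mutual_information(a, c) is near zero,
# which is what lets an association matrix surface redundant feature pairs.
```

A high score for a pair like `(a, b)` is the signal that one of the two could be dropped from the feature list.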
You could use the Feature Impact tab (under the Understand tab) to identify redundant features and create a new feature list that excludes them.
In the Feature Impact tab, DataRobot flags every variable it considers redundant with an alert icon (Figure 3). Each variable in this tab has an associated Feature Impact score, computed with a permutation method that measures how much each variable contributes to the performance of a given model.
Figure 3. Feature Impact tab showing two redundant features
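DataRobot's exact permutation procedure isn't reproduced here, but the general idea — shuffle one column at a time and measure how much the model's score drops — can be sketched as follows (plain NumPy, with illustrative feature names only):

```python
import numpy as np

def permutation_impact(predict, X, y, rng):
    """Drop in R^2 when each column is shuffled (higher = more impact)."""
    def r2(Xm):
        pred = predict(Xm)
        return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    base = r2(X)
    impacts = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # break column j's link to y
        impacts.append(base - r2(Xp))
    return impacts

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise_feat = rng.normal(size=n)  # unrelated to the target
y = 2 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2, noise_feat])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
impacts = permutation_impact(lambda Xm: Xm @ coef, X, y, rng)
# impacts: largest for x1, smaller for x2, near zero for noise_feat
```

A feature whose shuffled score barely moves the model is a candidate for removal — which is the logic behind excluding low-impact, redundant features from a new feature list.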
To remove redundant features, click Create feature list. Then enter the feature list name and the number of features you want to retain in the new feature list. Select the Exclude redundant features check box to keep them out of the new feature list (Figure 4).
Figure 4. Creating a new feature list that excludes redundant features
Lastly, you could simply let DataRobot handle the multicollinearity for you. Many of DataRobot's modeling algorithms take care of multicollinearity automatically. For instance, some regularized linear models include an L1 regularization (or lasso penalty) step that shrinks the coefficients of redundant features to exactly zero (Figure 5), while others include an L2 regularizer (or ridge penalty) that shrinks those coefficients toward small values (Figure 6).
Figure 5. Blueprints with an L1 regularization step
Figure 6. Blueprints with an L2 regularization step
Both regularization approaches limit the effect of redundant features on model interpretability and prediction speed. The L1 penalty does so by effectively removing a redundant feature (its coefficient becomes exactly zero), while the L2 penalty leaves the redundant feature with only a minimal contribution to the model's predictions.
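To see both penalties in action, the sketch below fits a small lasso (via coordinate descent with soft-thresholding) and a closed-form ridge on two nearly duplicate features. This is an illustrative NumPy implementation, not the code inside any DataRobot blueprint, and the penalty strengths are arbitrary:

```python
import numpy as np

def ridge(X, y, alpha):
    """Closed-form L2 (ridge) solution."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def lasso(X, y, alpha, iters=500):
    """L1 (lasso) via coordinate descent with soft-thresholding."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            rho = X[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0) / col_sq[j]
    return w

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # near-duplicate of x1
y = 3 * x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

w_l1 = lasso(X, y, alpha=50.0)   # one coefficient driven to (near) zero
w_l2 = ridge(X, y, alpha=10.0)   # weight shared between the duplicates
```

The contrast mirrors the article's point: L1 effectively deletes the redundant copy, while L2 keeps both with small, similar weights.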
In addition, DataRobot offers a variety of tree-based approaches that are also robust to multicollinearity (Figure 7).
Figure 7. A sample of tree-based approaches
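Why trees shrug off redundant features can be illustrated with a toy single-split regression tree (a "stump"): a duplicated column simply ties with the original, the tree picks one, and the fit is unchanged. This toy is in no way DataRobot's tree implementation — it only demonstrates the robustness property:

```python
import numpy as np

def best_stump(X, y):
    """Single-split regression tree: pick the (feature, threshold) pair
    that minimizes squared error. Ties between duplicate features are
    harmless -- the first column found simply wins."""
    best = (np.inf, None, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = (((left - left.mean()) ** 2).sum()
                   + ((right - right.mean()) ** 2).sum())
            if sse < best[0]:
                best = (sse, j, t)
    return best

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X_dup = np.column_stack([x, x.copy()])   # perfectly collinear pair
y = (x > 0).astype(float) + rng.normal(scale=0.05, size=200)

sse, j, t = best_stump(X_dup, y)                  # with the duplicate
sse_single, _, _ = best_stump(x.reshape(-1, 1), y)  # without it
# The split lands near the true boundary (x = 0) and the error is
# identical with or without the redundant column.
```

The same intuition carries over to full tree ensembles: each split uses one feature at a time, so a redundant copy changes which column gets picked, not what the model predicts.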
If you’re a licensed DataRobot customer, search the in-app Platform Documentation for Feature Association tab and Feature Impact.