Remove multicollinearity in data

Data Scientist

Hi team, it is unclear how DR handles datasets where multicollinearity is present. Can you share more details? Also, what tools and capabilities are available to help users remove multicollinearity from the data? Thank you!

2 Replies
DataRobot Alumni

  1. You could use the Feature Association matrix to identify highly associated variables. Then, create a new feature list that keeps only one feature from each highly associated group. The Feature Association matrix lists up to 50 features from your dataset along both the X-axis and Y-axis. By default, those features are taken from the Informative feature list, sorted by the cluster they belong to, with mutual information as the association score. Each of these defaults can be changed.
  2. You could use the Feature Impact tab (below the Understand tab) to identify redundant features and create a new feature list that excludes them. In the Feature Impact tab, DataRobot marks every variable it considers redundant with an alert icon. Each variable in this tab has a Feature Impact score, computed with a permutation method that measures how much the variable affects the performance of a given model. To remove redundant features, click Create feature list. Then enter the feature list name and the number of features you want to retain in the new feature list. Select the Exclude redundant features check box to keep them out of the new feature list.
  3. Lastly, you could just let DataRobot handle your multicollinearity problem automatically. DataRobot has a variety of modeling algorithms that take care of multicollinearity on their own. For instance, some modeling approaches, such as regularized linear regression, have an L1 regularization (or Lasso penalty) step that shrinks the coefficients of redundant features to zero, while others have an L2 regularizer (or ridge penalty) that shrinks those coefficients toward zero without removing them entirely.
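
To make the L1 versus L2 difference in point 3 concrete, here is a minimal pure-Python sketch. It is not DataRobot code: the `coordinate_descent` and `soft_threshold` helpers, the toy data, and the penalty strength of 0.1 are all assumptions made for illustration. It fits a linear model on two identical feature columns, the extreme case of multicollinearity.

```python
import random

def soft_threshold(rho, lam):
    """Soft-thresholding operator used by the Lasso (L1) update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def coordinate_descent(X, y, lam, penalty, sweeps=500):
    """Fit a no-intercept linear model with an L1 or L2 penalty.

    Minimizes (1/2n) * ||y - Xb||^2 + lam * ||b||_1        when penalty == "l1"
           or (1/2n) * ||y - Xb||^2 + (lam/2) * ||b||_2^2  when penalty == "l2".
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if penalty == "l1":
                b[j] = soft_threshold(rho, lam) / z
            else:
                b[j] = rho / (z + lam)
    return b

# Hypothetical toy data: the second column is an exact copy of the first.
random.seed(0)
signal = [random.gauss(0, 1) for _ in range(50)]
X = [[s, s] for s in signal]
y = [2.0 * s for s in signal]

lasso_coefs = coordinate_descent(X, y, lam=0.1, penalty="l1")
ridge_coefs = coordinate_descent(X, y, lam=0.1, penalty="l2")
print("L1 (Lasso) coefficients:", lasso_coefs)
print("L2 (ridge) coefficients:", ridge_coefs)
```

In this setup the L1 fit puts the weight on one copy and leaves the redundant copy at (numerically) zero, while the L2 fit keeps both copies and splits a shrunken weight evenly between them.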

For more detail, see the DataRobot documentation on creating feature lists and on feature associations.
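
If you want to prototype the "keep one feature per highly associated group" idea outside the platform, a rough sketch could look like the following. This is not the DataRobot API: the `reduced_feature_list` helper, the 0.9 threshold, and the toy columns are hypothetical, and it uses absolute Pearson correlation as a stand-in for the mutual information score the Feature Association matrix uses by default.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (assumes neither is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def reduced_feature_list(columns, threshold=0.9):
    """Keep a feature only if it is not highly associated with a feature already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: "temp_f" is a linear transform of "temp_c".
columns = {
    "temp_c": [10.0, 15.0, 20.0, 25.0, 30.0],
    "temp_f": [50.0, 59.0, 68.0, 77.0, 86.0],
    "humidity": [30.0, 80.0, 45.0, 60.0, 20.0],
}
print(reduced_feature_list(columns))  # ['temp_c', 'humidity']
```

The greedy pass keeps whichever feature of an associated pair comes first, so the column order decides which one survives. In practice you would probably order the features by importance before filtering.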

DataRobot Alumni

If you want to go further, you can also open a given model, go to Advanced settings under Evaluation, and set the L1 and L2 regularization.



For instance, for an XGBoost model, under Advanced settings you have reg_alpha and reg_lambda (L1 and L2, respectively). You can provide multiple values and let DataRobot find the best one to reduce or remove the influence of redundant features.
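
For reference, this is roughly how the two penalties enter XGBoost's training objective, as described in the XGBoost documentation (its L1 parameter is reg_alpha and its L2 parameter is reg_lambda). Each tree $f$ with $T$ leaves and leaf weights $w_j$ contributes a complexity term:

```latex
\Omega(f) = \gamma T
          + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
          + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
% \alpha  = reg_alpha  (L1 penalty on leaf weights)
% \lambda = reg_lambda (L2 penalty on leaf weights)
% \gamma  = gamma      (minimum loss reduction required to make a split)
```

Note that for tree models the penalties act on leaf weights rather than on per-feature coefficients, so their effect on redundant features is indirect, unlike in the linear case.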
