
Multicollinearity dataset in Regression

Blue LED

Hi,

Can DataRobot use the input file as-is when running regression on the current dataset below?

ex)

The current dataset (I'm worried it has multicollinearity issues)

id  color_yellow  color_blue  color_black  target
1   1             0           0            0
2   0             1           0            0
3   0             1           0            1
4   0             0           1            1
5   0             0           1            1

 

Original dataset

id  color   target
1   yellow  0
2   blue    0
3   blue    1
4   black   1
5   black   1

 

Thanks.

Data Scientist

Hello Ippo, 

 

There are a number of ways that our blueprints intrinsically deal with correlated features. 

Some examples: 

 

  1. If features are correlated, some blueprints will use PCA to de-correlate and standardize the features before modeling. 
  2. Other models use L1 (lasso) or L2 (ridge) regularization, which naturally handles correlated features by penalizing large coefficients. 
  3. Tree-based models are generally robust to correlated features as well. 

 

However, if you have domain expertise telling you that two variables are correlated before you begin modeling, it may make sense to remove one of them manually before you start. 
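As a rough sketch of that manual check (my own illustration, not a DataRobot feature), you could flag a redundant pair of numeric features with a Pearson correlation threshold and drop one column of any flagged pair:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: feature_b is an exact multiple of feature_a.
feature_a = [1, 2, 3, 4, 5]
feature_b = [2, 4, 6, 8, 10]

r = pearson(feature_a, feature_b)
if abs(r) > 0.95:  # the threshold is a judgment call
    print("consider dropping one of the pair; r =", r)
```

The 0.95 cutoff here is arbitrary; with real data you would pick it based on how aggressive you want the pruning to be.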

 

The example you showed doesn't concern me, though: the first dataset is just the result of a process called one-hot encoding. 

In your current dataset, a single categorical column has been expanded into one indicator column per category. If you use the bottom (original) data example instead, DataRobot will do the one-hot encoding for you in the blueprints. Feel free to just use the single column that includes all of the colors; the platform should handle that well. 
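As a small illustration, here is a pure-Python sketch of that mapping (DataRobot does this internally, so this is only to show what the transformation looks like):

```python
# The "original dataset" from the question, as a list of records.
rows = [
    {"id": 1, "color": "yellow", "target": 0},
    {"id": 2, "color": "blue",   "target": 0},
    {"id": 3, "color": "blue",   "target": 1},
    {"id": 4, "color": "black",  "target": 1},
    {"id": 5, "color": "black",  "target": 1},
]

categories = sorted({r["color"] for r in rows})  # ['black', 'blue', 'yellow']

def one_hot(row):
    # One indicator column per category: 1 where the color matches, else 0.
    encoded = {"id": row["id"], "target": row["target"]}
    for c in categories:
        encoded[f"color_{c}"] = 1 if row["color"] == c else 0
    return encoded

encoded_rows = [one_hot(r) for r in rows]
# encoded_rows now matches the "current dataset" table in the question.
```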

 

Here is a visual I use sometimes for explaining one-hot-encoding: 

 

[Image: onehotencoding.png]

 
 

I hope this helps, 

 

Emily

Data Scientist

Hi Ippo,

Following up on Emily's response, I think the question you're also asking is: "If I do one-hot encoding, should I drop one of the categories, since keeping them all creates multicollinearity?"

That would be true if you were using the OLS approach. However, as Emily pointed out in her reply, when you use regularization techniques, which are very common in ML, you don't need to drop one-hot-encoded columns. Also, since DataRobot already handles categorical variables using multiple transformation techniques beyond one-hot encoding, you will probably save yourself a lot of time by letting DataRobot do the transformations for you and seeing what works. Thanks for posting your question!
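To make the multicollinearity concrete, here is a minimal sketch of why the full one-hot encoding in the question is collinear with an intercept (the classic dummy variable trap):

```python
# The three indicator columns from the question's "current dataset".
encoded = [
    # color_yellow, color_blue, color_black
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]

# The indicators for one categorical feature always sum to exactly 1 in
# every row, so together they duplicate the intercept column.
row_sums = [sum(row) for row in encoded]
print(row_sums)  # [1, 1, 1, 1, 1]

# Plain OLS has no unique solution in this situation; an L1/L2 penalty
# restores a unique minimum, which is why regularized models can keep all
# the columns without dropping one.
```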

Blue LED

Thanks Emily/yong

 

You are right.

My question is, "If I use one-hot encoding, should I drop one of the categories to remove multicollinearity?"

In conclusion, I understand it as follows. Is this correct?

 

DataRobot input data

The current dataset (one-hot encoded):
- OLS approach: drop one of the categories
- Regularization techniques: no need to drop one of the categories

Original dataset:
- OLS approach: no need to drop one of the categories
- Regularization techniques: no need to drop one of the categories

 

Regards,

Data Scientist

One clarification about the original dataset. If you are not using DataRobot, you will probably have to encode categorical features manually, depending on the application you use. If you end up doing one-hot encoding manually, you should drop one of the categories when you are using OLS. With DataRobot, since it applies a number of different transformations (including one-hot encoding) depending on the type of algorithm, you don't need to do any manual encoding; our techniques and training methodology can handle multicollinearity. 
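As a minimal sketch of that manual route (my own illustration, not any particular library's API), drop-first dummy coding for OLS looks like this: one category is dropped so the remaining indicators are no longer collinear with the intercept, and the dropped category becomes the baseline.

```python
# Colors from the question's original dataset.
colors = ["yellow", "blue", "blue", "black", "black"]

categories = sorted(set(colors))                 # ['black', 'blue', 'yellow']
baseline, kept = categories[0], categories[1:]   # drop 'black' as the baseline

def encode(color):
    # Indicator columns only for the kept categories.
    return [1 if color == c else 0 for c in kept]

X = [encode(c) for c in colors]
# A row of all zeros now means "baseline" (black), not a missing value.
```

Libraries typically expose the same idea as an option, e.g. a "drop first" flag on their one-hot encoders.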

At a higher level, in modern machine learning datasets are generally large with complex non-linear relationships, and the goal is to find a model that generalizes well to unseen data, so regularized, iterative approaches are preferred over the plain OLS method.