Solved: Multicollinearity dataset in Regression - DataRobot Community

ippo · ‎03-16-2020

Hi,

Can DataRobot use the input file as is when I run regression on the current dataset below?

Example:

The current dataset (I'm worried about multicollinearity issues in the current dataset)

id	color_yellow	color_blue	color_black	target
1	1	0	0	0
2	0	1	0	0
3	0	1	0	1
4	0	0	1	1
5	0	0	1	1

Original dataset

id	color	target
1	yellow	0
2	blue	0
3	blue	1
4	black	1
5	black	1

Thanks.

emily · ‎03-16-2020

Hello Ippo,

There are a number of ways that our blueprints intrinsically deal with correlated features.

Some examples:

If features are correlated, some blueprints will use PCA to decorrelate and standardize the features before modeling.
Other models use L1 (lasso) or L2 (ridge) regularization, which handles correlated features by penalizing large coefficients.
Tree based models are generally robust to correlated features as well.
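
As a rough illustration of the PCA point above, here is a minimal sketch using NumPy on made-up data. The variable names and the synthetic features are assumptions for illustration only; this is textbook PCA, not a description of what any DataRobot blueprint does internally.

```python
import numpy as np

# Synthetic data (an assumption for illustration): x2 is almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Standardize each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of vt are the principal directions.
_, _, vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ vt.T  # data expressed in principal-component coordinates

corr_before = np.corrcoef(Z, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]

print("correlation before PCA:", round(float(corr_before), 3))
print("correlation after PCA: ", round(float(abs(corr_after)), 3))
```

The two original features are strongly correlated, while the principal-component scores are uncorrelated up to floating-point error.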

However, if you have domain expertise that tells you two variables are correlated before modeling, then it may make sense to manually remove one of them before you start.

The example you showed below doesn't concern me though - this appears to be a process called one-hot encoding.

In this case, a single categorical column (color) has been expanded into one indicator column per category. If you use the bottom data example, then DataRobot will do the one-hot encoding for you in the blueprints. Feel free to use just the single column that includes all of the colors, and the platform should handle that well.
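
For anyone who wants to see the relationship between the two datasets outside of DataRobot, here is a small sketch using pandas. It rebuilds the original dataset from the question and one-hot encodes it with `pd.get_dummies`; the variable names `full` and `reduced` are just illustrative, and this is not how DataRobot performs the encoding internally.

```python
import pandas as pd

# The "original dataset" from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "color": ["yellow", "blue", "blue", "black", "black"],
    "target": [0, 0, 1, 1, 1],
})

# Full one-hot encoding: one indicator column per color.
full = pd.get_dummies(df, columns=["color"], dtype=int)

# Same encoding with the first category (alphabetically, "black") dropped,
# which is what you would do by hand before fitting plain OLS.
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In `full`, the three color columns always sum to 1 in every row, which is the source of the multicollinearity concern. In `reduced`, the dropped category becomes the reference level.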

Here is a visual I use sometimes for explaining one-hot-encoding:

I hope this helps,

Emily

yong · ‎03-18-2020

Hi Ippo,

Following up on Emily's response, I think the question you're also asking is, "if you do one hot encoding, should you drop one of the categories, since it would create multicollinearity?"

This would be true if you were using the OLS approach: with an intercept in the model, the one-hot columns always sum to 1, so they are perfectly collinear with it. However, as Emily pointed out in her reply, when you use regularization techniques, which are very common in ML, you don't need to drop any one-hot encoded columns. Also as Emily pointed out, since DataRobot already handles categorical variables using multiple transformation techniques beyond one-hot encoding, you will probably save yourself a lot of time by letting DataRobot do the transformations for you and seeing what works. Thanks for posting your question!
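
To make the OLS versus regularization distinction concrete, here is a minimal NumPy sketch on the five rows from the question. It is a textbook closed-form illustration, not DataRobot's training procedure; the penalty value `lam = 0.1` is an arbitrary assumption, and for simplicity the ridge penalty here is applied to the intercept as well.

```python
import numpy as np

# Indicator columns from the question: color_yellow, color_blue, color_black.
dummies = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)
y = np.array([0, 0, 1, 1, 1], dtype=float)
intercept = np.ones((5, 1))

# Intercept plus all three dummies: 4 columns, but the dummies sum to the intercept.
X_full = np.hstack([intercept, dummies])
# Intercept plus two dummies (yellow dropped as the reference category).
X_drop = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)  # less than the number of columns
rank_drop = np.linalg.matrix_rank(X_drop)  # equal to the number of columns

# OLS via the normal equations has a unique solution only for the full-rank design.
beta_ols = np.linalg.solve(X_drop.T @ X_drop, X_drop.T @ y)

# Ridge adds lam * I, which makes the system solvable even with all dummies kept.
lam = 0.1  # arbitrary penalty strength, chosen for illustration
beta_ridge = np.linalg.solve(X_full.T @ X_full + lam * np.eye(4), X_full.T @ y)

print("rank with all dummies:", rank_full, "of", X_full.shape[1], "columns")
print("rank with one dropped:", rank_drop, "of", X_drop.shape[1], "columns")
print("OLS coefficients (intercept, blue, black):", beta_ols)
print("ridge coefficients (intercept, yellow, blue, black):", beta_ridge)
```

With all dummies kept, the design matrix is rank deficient, so the unpenalized normal equations have no unique solution. The ridge system stays well defined without dropping a column.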

ippo · ‎03-19-2020

Thanks Emily/yong

You are right.

My question is, "If I have one hot encoding, should I drop one of the categories to remove multicollinearity?"

In conclusion, my understanding is as follows. Is it correct?

DataRobot input data

The current dataset (one-hot encoded)

OLS approach : drop one of the categories

regularization techniques : Don't need to drop one of the categories

Original dataset

OLS approach : don't need to drop one of the categories

regularization techniques : don't need to drop one of the categories

Regards,

yong · ‎03-20-2020

One clarification about the original dataset. If you are not using DataRobot, you will probably have to encode categorical features manually, depending on what application you use. If you end up doing one-hot encoding manually and you are using OLS, you should drop one of the categories. With DataRobot, it is not necessary to do any manual encoding: it applies a number of different transformations, including one-hot encoding, depending on the type of algorithm, and we have techniques and training methodology that can handle multicollinearity.

At a higher level, in modern machine learning, datasets are generally large with complex non-linear relationships, and the goal is to find a model that generalizes well to unseen data, so an iterative approach is preferred over the OLS method.
