Encoding of categorical variables and imputing

Jayant · 08-04-2021

Hello,

We have built a binary classification model (a 0/1 problem). While going through the Data Quality Handling report, I could see that some categorical variables have been treated in a particular way. Please see the screenshot.

I have two questions here:

First, regarding imputation: why are we seeing an imputed value of -2 here? I thought the value used to impute here would be missing.
Second, we can see that ordinal encoding has been applied here. Is that the preferred way for models to encode categorical variables on this platform? Is there a way we can choose to perform one-hot encoding instead? Thanks.

Linda · 08-05-2021

Hi @Jayant

Have you had a look at the product documentation? Maybe you can get some answers there, at least before the community reaches out with help.

Here's doc information for the report and imputation

and for one-hot encoding.

Also this Data Quality article.

- linda

There's also a

Jameson · 08-06-2021

Hi @Jayant,

(1) I can't tell from the screenshot what type of algorithm is being run, but if it's using Ordinal Encoding, I believe it's likely a tree-based model. Regardless, when using Ordinal Encoding, missing values can be mapped to an arbitrary level of their own, which lets the tree treat missingness as potentially useful information. If we simply imputed the median/mean value, the fact that the value was missing could be lost. So in our experience it's common to impute values like -2 or -9999 for tree-based models with Ordinal Encoding.

(2) That said, with Composable ML (https://www.datarobot.com/platform/composable-ml/) you have the ability to change this behavior and swap out one encoding method for another. So you could change Ordinal Encoding to one-hot encoding (OHE) and see how it performs.
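
To make the difference between the two encodings concrete, here is a minimal pure-Python sketch. It is only an illustration of the general idea, not how the platform implements it: the function names, the `colors` sample data, and the choice of -2 as the sentinel (taken from the value in the report) are all assumptions.

```python
MISSING_CODE = -2  # assumed sentinel, matching the value shown in the report


def fit_ordinal(values):
    """Map each observed category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v is not None and v not in mapping:
            mapping[v] = len(mapping)
    return mapping


def ordinal_encode(values, mapping, missing_code=MISSING_CODE):
    """Encode categories as integers; missing or unseen values get the sentinel."""
    return [mapping.get(v, missing_code) for v in values]


def one_hot_encode(values, mapping):
    """One indicator column per known category; missing rows are all zeros."""
    columns = sorted(mapping, key=mapping.get)
    return [[1 if v == c else 0 for c in columns] for v in values]


# Hypothetical categorical feature with two missing entries (None).
colors = ["red", None, "blue", "red", None, "green"]

mapping = fit_ordinal(colors)
ordinal = ordinal_encode(colors, mapping)
one_hot = one_hot_encode(colors, mapping)

print(ordinal)      # one integer column; missing rows carry the sentinel
print(one_hot[:3])  # one column per category; the missing row is all zeros
```

With ordinal encoding a tree can split on "value < 0" to isolate the missing rows, while one-hot encoding spreads the feature over one column per category.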

Hope this helps! Let us know if you have any follow-up thoughts/questions.

Jameson

Jameson · 08-06-2021

P.S. I can't seem to edit my reply, so just adding that for categorical features another route would likely be imputing the mode (rather than the mean/median). Still, for Ordinal Encoding with a feature that has a lot of missing values, it's common to make missing its own level.
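
A rough sketch of the two routes, in plain Python. The helper names, the "MISSING" label, and the sample data are made up for illustration and are not taken from the platform.

```python
from collections import Counter


def impute_mode(values):
    """Replace missing entries (None) with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def missing_as_level(values, label="MISSING"):
    """Keep missingness visible by giving it a category of its own."""
    return [label if v is None else v for v in values]


# Hypothetical categorical feature with two missing entries (None).
colors = ["red", None, "blue", "red", None, "green"]

imputed = impute_mode(colors)
own_level = missing_as_level(colors)

print(Counter(imputed))    # missing rows are now indistinguishable from the mode
print(Counter(own_level))  # missing rows remain a separate, countable level
```

After mode imputation the model can no longer tell which rows were originally missing, which is why a separate level tends to be preferred when many values are missing.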
