cancel
Showing results for 
Search instead for 
Did you mean: 

Encoding of categorical variables and imputing

Encoding of categorical variables and imputing

Hello,

We have built a binary classification model(0/1 problem). While going through the Data quality handling report, I could see that some categorical variables have been treated in a particular way. Plz see screenshot.

 

Screen Shot 2021-08-04 at 4.18.03 PM.png

 

I have two questions here;

 

  • First, regarding imputation. Why are we seeing imputed value:-2 here. I thought that here the value used to impute is missing.
  • Second, we can see an ordinal encoding done here. Is that the preferred way the models encode on this platform? Is there a way we can choose to perform one-hot encoding here? thanks 
3 Replies
Linda
DataRobot Alumni

Hi @Jayant 

Have you had a look at the product documentation? Maybe you can get some some answers there, at least before the community reaches out with help.

 

Here's doc information for the report and imputation

and for one-hot encoding.

Also this Data Quality article.

- linda

 

There's also a 

Jameson
DataRobot Alumni

Hi @Jayant ,

 

(1) I can't see what type of algorithm is being run from the screenshot, but if it's using Ordinal Encoding I believe it's likely a tree base model. Regardless when using Ordinal Encoding the missing values can be mapped to an arbitrary level which lets the tree see this as potentially useful information. If we simply impute the median/mean value then the fact that it was missing could be lost. So it's common in our experience to impute values like -2 or -9999 for tree based models with Ordinal Encoding.

 

(2) That said, with Composable ML (https://www.datarobot.com/platform/composable-ml/) you have the ability to change this behavior and swap out one type of encoding method for another. So you could change Ordinal Encoding to OHE and see how it performs.

 

Hope this helps! Let us know if you have any follow up thoughts/questions.

 

Jameson

P.S. Can't seem to edit my reply so just adding that for categorical features another route would likely be imputing the mode (mean/median). Still for Ordinal Encoding with a feature that has a lot of missing values, it's common to make this its own level. 

0 Kudos