We have built a binary classification model(0/1 problem). While going through the Data quality handling report, I could see that some categorical variables have been treated in a particular way. Plz see screenshot.
I have two questions here;
First, regarding imputation. Why are we seeing imputed value:-2 here. I thought that here the value used to impute is missing.
Second, we can see an ordinal encoding done here. Is that the preferred way the models encode on this platform? Is there a way we can choose to perform one-hot encoding here? thanks
(1) I can't see what type of algorithm is being run from the screenshot, but if it's using Ordinal Encoding I believe it's likely a tree base model. Regardless when using Ordinal Encoding the missing values can be mapped to an arbitrary level which lets the tree see this as potentially useful information. If we simply impute the median/mean value then the fact that it was missing could be lost. So it's common in our experience to impute values like -2 or -9999 for tree based models with Ordinal Encoding.
(2) That said, with Composable ML (https://www.datarobot.com/platform/composable-ml/) you have the ability to change this behavior and swap out one type of encoding method for another. So you could change Ordinal Encoding to OHE and see how it performs.
Hope this helps! Let us know if you have any follow up thoughts/questions.
P.S. Can't seem to edit my reply so just adding that for categorical features another route would likely be imputing the mode (mean/median). Still for Ordinal Encoding with a feature that has a lot of missing values, it's common to make this its own level.