High cardinality categorical variables

duncanrenfrow · ‎04-27-2020

I received this question from a DataRobot user, and it was such a good one that I asked for, and received, permission to post it here along with my initial thoughts. I'd love for the community to chime in as well!

"Say I have a categorical/integer variable that has a large cardinality, say 100 or 150, and if I could group them into say 5 to 10 buckets without too much loss in signals, from the platform perspective would I be better off grouping them into the smaller cardinality set vs leaving them as they are?"

duncanrenfrow · ‎04-27-2020

Here is my initial response:

"

My first question would be: do you expect to see all of those categories in future data? If not, you can probably eliminate or group the ones that are unlikely to occur again, which might improve generalization.
In general, our approach at DataRobot is to first let the data speak for itself. Tree-based models, e.g. XGBoost, are extremely good at forming their own buckets.

The nice thing about DataRobot is that we employ several different categorical encoding strategies, which lets the trees try breaking the values into different buckets. E.g. we employ a derivation of "leave one out" encoding, which essentially replaces each category with its target frequency. This allows the trees to bucket the categorical values based on how often each one occurs with the target.

However, another DataRobot philosophy is to try out all approaches to see what works best!
Another question I'd ask is whether the categories form a natural hierarchy (e.g. grouping Ford F-150, Toyota Tundra, etc. into the higher-level category "light trucks").
Another thing to consider is the distribution across the different categorical values. If 90% of the samples are concentrated in 10-20 of the values, then it might make sense to merge the remaining sparse values together into a higher-level group.
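
For anyone who wants to try the last point by hand before uploading data, here is a minimal sketch of merging sparse values. It is an illustration only, not something DataRobot does internally; the function name group_rare_levels, the 0.9 coverage default, and the "OTHER" label are all made up for this example.

```python
from collections import Counter

def group_rare_levels(values, coverage=0.9, other_label="OTHER"):
    """Keep the most frequent levels until they cover `coverage` of the rows;
    map every remaining (sparse) level to `other_label`.
    Hypothetical helper for illustration only."""
    counts = Counter(values)
    total = len(values)
    kept, covered = set(), 0
    for level, n in counts.most_common():
        if covered / total >= coverage:
            break  # the kept levels already cover enough rows
        kept.add(level)
        covered += n
    return [v if v in kept else other_label for v in values]

# Toy data: two dominant levels plus a long tail of rare ones.
makes = ["ford"] * 50 + ["toyota"] * 40 + ["saab", "lada", "yugo", "fiat", "opel"] * 2
grouped = group_rare_levels(makes, coverage=0.9)
print(Counter(grouped))
```

It is probably worth modeling both the grouped and the original column and comparing the results, in the spirit of trying out all approaches.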

This question led me to do some research, and eventually I stumbled upon this thorough article discussing the topic: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-o... . "
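
To make the "leave one out" idea from the response concrete, here is a rough sketch of the textbook version of the encoding: each row's category is replaced by the mean target of the other rows that share it. This is not DataRobot's derivation, which is not described in this thread; the function name and the global-mean fallback for categories that appear only once are assumptions for the example.

```python
from collections import defaultdict

def leave_one_out_encode(categories, targets):
    """Replace each row's category with the mean target of the *other* rows
    in the same category. Categories seen only once fall back to the global
    mean (an assumption made for this sketch)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            # Exclude the current row's own target to limit leakage.
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            encoded.append(global_mean)
    return encoded

cats = ["a", "a", "a", "b", "b", "c"]
ys = [1, 0, 1, 0, 0, 1]
enc = leave_one_out_encode(cats, ys)
print(enc)
```

Because the encoded column is numeric, a tree can split on it with simple thresholds, which is how the categories end up bucketed by their relationship with the target.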

oskareriksson · ‎04-28-2020

Great and enlightening article and response! If you use R, have a look at this too: https://cran.r-project.org/web/packages/greenclust/index.html I don't know if Greenacre's method is consistently any better than any other, but nevertheless for nostalgic reasons I have a look at it every now and again 🙂
