High cardinality categorical variables

Data Scientist

I received this question from a DataRobot user, and it was such a good one that I asked for and received permission to post it here along with my initial thoughts. I'd love for the community to chime in as well!

 

"Say I have a categorical/integer variable that has a large cardinality, say 100 or 150, and if I could group them into say 5 to 10 buckets without too much loss in signals, from the platform perspective would I be better off grouping them into the smaller cardinality set vs leaving them as they are?"

2 Replies
Data Scientist

Here is my initial response:

 

"

  1. My first question would be: do you expect to see all of those categorical values in future data? If not, then you can probably eliminate or group the ones that are unlikely to occur in the future, which might improve generalization.
  2. In general, our approach at DataRobot is to first let the data speak for itself. Tree-based models, e.g. XGBoost, are extremely good at forming their own buckets.
    • The nice thing about DataRobot is that we employ several different categorical encoding strategies, which lets the trees try breaking the values into different buckets. E.g. we employ a derivation of "leave one out" encoding, which essentially transforms each categorical value into its target frequency. This allows the trees to bucket the categorical values based on how frequently each one occurs with the target.
  3. However, another DataRobot philosophy is to try out all approaches to see what works best!
  4. Another question I'd ask is whether the categorical values form a natural hierarchy (e.g. grouping Ford F-150, Toyota Tundra, etc. into the higher-level category "light trucks")?
  5. Another thing to consider is the distribution across the different categorical values. If 90% of the samples are concentrated in 10-20 of the categorical values, then it might make sense to merge the remaining sparse values together into a higher-level group.
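
To make point 5 concrete, here is a minimal pandas sketch of merging sparse values into one group. The function name `merge_rare_categories`, the `min_share` threshold, and the "OTHER" label are my own illustrative assumptions, not anything DataRobot does internally, and the toy data is made up.

```python
import pandas as pd

def merge_rare_categories(values: pd.Series,
                          min_share: float = 0.01,
                          other_label: str = "OTHER") -> pd.Series:
    """Replace categories covering less than min_share of rows with other_label.

    Hypothetical helper for illustration; assumes a string/object column.
    """
    shares = values.value_counts(normalize=True)   # fraction of rows per category
    rare = shares[shares < min_share].index        # categories below the threshold
    return values.where(~values.isin(rare), other_label)

# Toy data: two frequent levels and three sparse ones.
s = pd.Series(["a"] * 50 + ["b"] * 45 + ["c"] * 2 + ["d"] * 2 + ["e"] * 1)
merged = merge_rare_categories(s, min_share=0.05)
print(merged.value_counts().to_dict())
```

The threshold should be learned on the training data only and the same list of rare values reused at prediction time, so that unseen or sparse values land in the same bucket.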
This question led me to do some research, and eventually I stumbled upon this thorough article discussing the topic: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-o... . "
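
For anyone curious what the "leave one out" idea from point 2 looks like, here is a rough sketch of generic leave-one-out target encoding in pandas. It is not DataRobot's derivation, which I can't reproduce here; the function name, the fallback to the global mean for values seen only once, and the toy data are all assumptions for illustration.

```python
import pandas as pd

def leave_one_out_encode(categories: pd.Series, target: pd.Series) -> pd.Series:
    """Encode each row as the mean target of the OTHER rows sharing its category.

    Hypothetical helper for illustration; training-time encoding only.
    """
    grouped = target.groupby(categories)
    group_sum = grouped.transform("sum")      # per-row sum of target within its category
    group_count = grouped.transform("count")  # per-row size of its category
    denom = group_count - 1                   # exclude the row itself
    # Categories seen only once have no "other" rows, so the division yields NaN
    # and we fall back to the global target mean (an assumption of this sketch).
    loo = (group_sum - target) / denom.where(denom > 0)
    return loo.fillna(target.mean())

# Made-up toy data.
df = pd.DataFrame({
    "make": ["ford", "ford", "ford", "toyota", "toyota", "kia"],
    "y":    [1, 0, 1, 1, 0, 1],
})
df["make_loo"] = leave_one_out_encode(df["make"], df["y"])
print(df)
```

Leaving the row's own target out is what keeps the encoded feature from leaking the label directly; at prediction time you would typically use the plain per-category mean computed from the training data instead.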
 
 
 
Data Scientist
Great and enlightening article and response! If you use R, have a look at this too: https://cran.r-project.org/web/packages/greenclust/index.html I don't know whether Greenacre's method is consistently better than any other, but for nostalgic reasons I have a look at it every now and again.