I received this question from a DataRobot user, and it was such a good one that I asked for, and received, permission to post it here along with my initial thoughts. I'd love for the community to chime in as well!
"Say I have a categorical/integer variable that has a large cardinality, say 100 or 150, and if I could group them into say 5 to 10 buckets without too much loss in signals, from the platform perspective would I be better off grouping them into the smaller cardinality set vs leaving them as they are?"
My first question would be: do you expect to see all of those categories in future data? If not, you can probably eliminate or group the ones that are unlikely to occur in the future, which might improve generalization.
In general, our approach at DataRobot is to first let the data speak for itself. Tree-based models, e.g. XGBoost, are extremely good at forming their own buckets.
The nice thing about DataRobot is that we employ several different categorical encoding strategies, which lets the trees try breaking the values into different buckets. E.g. we employ a derivation of "leave one out" encoding, which essentially transforms each categorical value into its target frequency. This allows the trees to bucket the categorical values based on how frequently each one occurs with the target.
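To make the "leave one out" idea concrete, here is a minimal sketch of the plain textbook version, not DataRobot's actual derivation (which isn't described here). Each row's category is replaced by the mean target of the *other* rows sharing that category. The function name, the toy data, and the fallback to the global mean for categories seen only once are all assumptions made for illustration.

```python
from collections import defaultdict

def leave_one_out_encode(categories, targets):
    """Encode each row's category as the mean target of the other rows
    with the same category (textbook leave-one-out, illustrative only)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    # Assumption: a category seen only once has no "other rows",
    # so fall back to the overall target mean.
    global_mean = sum(targets) / len(targets)

    encoded = []
    for cat, y in zip(categories, targets):
        n = counts[cat]
        if n > 1:
            # Exclude the current row's own target to limit leakage.
            encoded.append((sums[cat] - y) / (n - 1))
        else:
            encoded.append(global_mean)
    return encoded

# Toy data (made up): six rows, three categories, binary target.
cats = ["a", "a", "a", "b", "b", "c"]
ys = [1, 0, 1, 0, 0, 1]
loo_encoded = leave_one_out_encode(cats, ys)
print(loo_encoded)
```

Once the column is numeric like this, a tree can split on it at any threshold, which is effectively the model choosing its own buckets.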
However, another DataRobot philosophy is to try out all approaches to see what works best!
Another question I'd ask is whether the categories form a natural hierarchy (e.g. grouping Ford F-150, Toyota Tundra, etc. into the higher-level category "light trucks")?
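If such a hierarchy exists, rolling values up to it can be as simple as a lookup table. A rough sketch follows; the mapping, the extra vehicle models beyond the two mentioned above, and the "other" fallback label are hypothetical and would come from your own domain knowledge.

```python
# Hypothetical mapping from a detailed category to its parent category.
VEHICLE_CLASS = {
    "Ford F-150": "light trucks",
    "Toyota Tundra": "light trucks",
    "Honda Civic": "compact cars",     # made-up example entry
    "Toyota Corolla": "compact cars",  # made-up example entry
}

def to_parent_category(value, mapping, default="other"):
    """Return the higher-level group for a value, or a fallback label
    for values that are missing from the mapping."""
    return mapping.get(value, default)

models = ["Ford F-150", "Honda Civic", "Toyota Tundra", "Tesla Semi"]
vehicle_classes = [to_parent_category(m, VEHICLE_CLASS) for m in models]
print(vehicle_classes)
```

One option is to add the grouped column alongside the original rather than replacing it, so you can compare models built with either or both.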
Another thing to consider is the distribution across the different categorical values. If 90% of the samples are concentrated in 10-20 of the categorical values, then it might make sense to merge the remaining sparse values together into a higher-level group.
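As a minimal sketch of that kind of merge: keep the most frequent values until they cover a chosen share of the rows, and relabel everything else. The 0.9 coverage threshold, the "OTHER" label, the function name, and the toy data are assumptions for illustration, not a recommended setting.

```python
from collections import Counter

def merge_rare_categories(values, coverage=0.9, other_label="OTHER"):
    """Keep the most frequent categories until they cover `coverage`
    of the rows; relabel all remaining categories as `other_label`."""
    counts = Counter(values)
    total = len(values)

    keep = set()
    covered = 0
    for cat, n in counts.most_common():
        if covered / total >= coverage:
            break  # the kept categories already cover enough rows
        keep.add(cat)
        covered += n

    return [v if v in keep else other_label for v in values]

# Toy data (made up): two dominant values and three sparse ones.
values = ["a"] * 50 + ["b"] * 40 + ["c"] * 4 + ["d"] * 3 + ["e"] * 3
merged = merge_rare_categories(values)
print(Counter(merged))
```

The kept set should be computed on training data only and then reused as-is on new data, so that values never seen in training also land in the merged group.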