Why isn’t all my data used to determine top features?

I'm assuming this is the right place for this question. The DataRobot UI help doc says Feature Impact runs on a sample of 2,500 rows instead of the whole set of training data, and I see an on-screen message saying a custom sample size of 2,500 rows was used. Why doesn’t it run on all of my training data when determining the top features? I believe I want the top features to be based on all of my data, but I would like to hear the reason it isn’t.

1 Solution

Accepted Solutions

My guess would be that the default 2,500-row limit is solely for reducing computation time when training multiple models. After narrowing down to the most promising models, I usually recalculate Feature Impact on a larger sample; this can be done with "Adjust Sample Size" at the bottom of the Feature Impact window.
Note that the 2,500-row limit only affects the Feature Impact calculation. It has no effect on the performance of the models or on how they rank against each other.
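
To illustrate why a sample is often enough, here is a minimal sketch in plain Python. It is not DataRobot's implementation or API. It assumes Feature Impact behaves like a permutation-style importance (shuffle one feature column and measure how much the error grows), and the data, the `predict` stand-in model, and the helper names are all hypothetical.

```python
import random


def permutation_impact(rows, targets, predict, n_features, rng):
    """Return the increase in mean squared error when each feature is shuffled."""
    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(data)

    baseline = mse(rows)
    impacts = []
    for j in range(n_features):
        # Shuffle only column j, leaving every other column untouched.
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        impacts.append(mse(shuffled) - baseline)
    return impacts


def ranking(impacts):
    """Feature indices ordered from most to least impactful."""
    return sorted(range(len(impacts)), key=lambda j: impacts[j], reverse=True)


rng = random.Random(0)

# Hypothetical data: feature 0 matters most, feature 1 a little, feature 2 not at all.
full_rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(20_000)]
full_targets = [3 * r[0] + r[1] + rng.gauss(0, 0.5) for r in full_rows]


def predict(row):
    """Stand-in for an already trained model (an assumption for this sketch)."""
    return 3 * row[0] + row[1]


# Draw a 2,500-row sample, mirroring the sample size discussed in this thread.
sample_idx = rng.sample(range(len(full_rows)), 2_500)
sample_rows = [full_rows[i] for i in sample_idx]
sample_targets = [full_targets[i] for i in sample_idx]

full_rank = ranking(permutation_impact(full_rows, full_targets, predict, 3, rng))
sample_rank = ranking(permutation_impact(sample_rows, sample_targets, predict, 3, rng))
print("full data ranking:  ", full_rank)
print("2,500-row ranking:  ", sample_rank)
```

In this toy setup the sample produces the same feature ordering as the full data at a fraction of the cost. With real data, features of similar importance can swap places between runs, which is a good reason to recalculate on a larger sample for the final candidate models.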

2 Replies

Really appreciate the help, @Trygvi. This approach makes complete sense.