Solved: Re: Why isn’t all my data used to determine top fe... - DataRobot Community

sam632 · ‎11-22-2020

I'm assuming this is the right place for this question. The DataRobot UI help doc says Feature Impact runs on a sample of 2,500 rows instead of the whole set of learning data, and I see an on-screen message saying a custom sample size of 2,500 rows was used. Why doesn’t it run on all of my learning data when determining the top features? I believe I want it to get the top features by looking at all my data, but would like to hear the reason it doesn’t.

Trygvi · ‎11-23-2020

My guess would be that the default 2,500-row limit is there solely to reduce computation time when training multiple models. After narrowing down to the most promising models, I usually recalculate Feature Impact on a larger sample. This can be done with "Adjust Sample Size" at the bottom of the Feature Impact window.
Note that the 2,500-row limit applies only to the Feature Impact calculation. It does not affect the performance of the models or how they rank against each other.
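
To illustrate why a sample is often enough, here is a minimal sketch in plain Python. It assumes a permutation-style impact calculation (shuffle one feature, measure how much the error grows), which is an assumption for illustration and not DataRobot's actual implementation. The data, the `predict` stand-in model, and all function names are hypothetical.

```python
import random

def make_rows(n, rng):
    # Synthetic data: feature 0 matters most, feature 1 a little, feature 2 not at all.
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        y = 3.0 * x[0] + 1.0 * x[1] + rng.gauss(0, 0.5)
        rows.append((x, y))
    return rows

def predict(x):
    # Hypothetical stand-in for an already trained model.
    return 3.0 * x[0] + 1.0 * x[1]

def mse(rows):
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)

def permutation_impact(rows, sample_size, rng):
    # Draw a random sample, then measure the error increase when each
    # feature column is shuffled while everything else stays fixed.
    sample = rng.sample(rows, min(sample_size, len(rows)))
    baseline = mse(sample)
    impacts = []
    for j in range(3):
        column = [x[j] for x, _ in sample]
        rng.shuffle(column)
        permuted = [(x[:j] + [v] + x[j + 1:], y)
                    for (x, y), v in zip(sample, column)]
        impacts.append(mse(permuted) - baseline)
    return impacts

def ranking(impacts):
    # Feature indices ordered from most to least impactful.
    return sorted(range(len(impacts)), key=lambda j: impacts[j], reverse=True)

rng = random.Random(0)
rows = make_rows(20000, rng)
small = permutation_impact(rows, 2500, rng)    # sampled calculation
large = permutation_impact(rows, 20000, rng)   # all rows
print(ranking(small), ranking(large))
```

In this toy setup the feature ordering from the 2,500-row sample matches the ordering from all 20,000 rows, while the sampled run does roughly an eighth of the work. Real data with many weak or correlated features can be noisier, which is when bumping the sample size for the final models is worth it.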

sam632 · ‎11-24-2020

Really appreciate the help, @Trygvi - this approach makes complete sense
