I'm developing a script that cycles through 7 x 5 x b model creations (where b is the number of blueprints), evaluates them in terms of our metric as well as which features have high impact, and then retrains the best ones on a larger sample size, etc. It runs very slowly given how often I need to test while building out its complexity.
I'm looking for ways to streamline this just for the testing/debugging phase. I've already whittled it down to only 100 rows of data, only 3-5 features, a 32 percent sample, and the fastest-training model types I could find so far (XGBoost and Naive Bayes). Is there anything else I can do to speed things up, keeping in mind that at this point I don't care at all how well the models perform? I'm just getting the plumbing in place.
Here is some advice:
1) Use AI Catalog to reduce the number of uploads and the risk of messing up the dataset.
2) Reuse the same project as much as possible. For example, to check the performance of another feature list, you can create it in the same project and rerun Autopilot on the new feature list with limits for partitions.
3) Name your projects and feature lists descriptively and add versions to their names so you can track changes.
4) DataRobot doesn't guarantee the same validation split for different datasets, so preferably you should provide a consistent one through a partition feature.
5) To track your performance in notebooks, you could try the papermill and MLflow Python libraries.
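To illustrate point 4), here is a minimal sketch of one way to build a consistent partition column yourself before uploading. It assumes each row has a stable identifier; the function name `assign_partition`, the label names, and the 20 percent holdout share are illustrative choices, not anything DataRobot prescribes.

```python
import hashlib

def assign_partition(row_id, n_folds=5, holdout_pct=20):
    """Map a stable row identifier to a partition label.

    The label depends only on row_id, so the same row lands in the
    same partition in every dataset version and every project.
    """
    # md5 is used only as a stable hash here, not for security.
    h = int(hashlib.md5(str(row_id).encode("utf-8")).hexdigest(), 16)
    if h % 100 < holdout_pct:
        return "holdout"
    # Use different digits of the hash to pick the CV fold.
    return f"fold_{(h // 100) % n_folds + 1}"

# Hypothetical row identifiers, standing in for a real ID column.
row_ids = [f"customer_{i}" for i in range(100)]
partitions = {rid: assign_partition(rid) for rid in row_ids}
```

The resulting column can be added to the dataset and selected as the partition feature, so every project built from it shares the same splits.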
Thanks for your response Bogdan,
Actually I'm already doing (2) and (3). I'm not seeing upload time as a bottleneck, although this could become a factor; maybe for that I'll consider (1). In regard to (4), I'm using an outside holdout to measure all project models against. Within the projects, I'm thinking that CV within each project ensures that different splits in different projects won't matter much, but I'm looking at ways this partition feature may help. In regard to (5), I'll look at papermill and MLflow, but could you elaborate on the specific application of these to my question?
I recommend 4) because DataRobot already provides many excellent tools for understanding model performance, so you don't have to spend time recreating them on your side. You can still download the holdout predictions and apply any custom metric you are interested in.
Use 1) because you will always have a record of your datasets. It is common for something to change locally and throw off the version control of your datasets.
5) is a set of handy tools for keeping the results of your notebooks reproducible. If a change makes the result worse than before, it will be easier to revert it.
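As a rough sketch of how 5) could apply here: record the settings and score of each pipeline iteration as an MLflow run, so a regression can be traced back to the change that caused it. The helper names, the parameter names, and the experiment name below are hypothetical; the `mlflow` calls are the standard tracking functions and need `pip install mlflow`.

```python
def run_record(model_type, feature_list, sample_pct, score):
    """Collect the settings and result of one pipeline iteration."""
    params = {
        "model_type": model_type,
        "feature_list": feature_list,
        "sample_pct": sample_pct,
    }
    metrics = {"holdout_score": float(score)}
    return params, metrics

def log_to_mlflow(params, metrics, experiment="plumbing-tests"):
    """Store one iteration as an MLflow run (experiment name is made up)."""
    import mlflow  # imported lazily so the helper above works without it

    mlflow.set_experiment(experiment)
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)

# Example values only; the score is a placeholder, not a real result.
params, metrics = run_record("xgboost", "features_v2", 32, 0.5)
# log_to_mlflow(params, metrics)  # uncomment once mlflow is installed
```

papermill plays a complementary role: it executes a notebook with injected parameters and saves the executed copy, so each run leaves behind a notebook showing exactly what was executed.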
Although your workers are limited, you can still start multiple projects simultaneously. This gives you the option of running many tests overnight and using your workers to the maximum.
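A minimal sketch of launching several runs at once from a script, assuming each run is independent. `run_project` is a hypothetical placeholder for whatever creates a project and starts modeling in your pipeline, and the target and feature list names are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_project(config):
    """Hypothetical placeholder: create one project and start modeling."""
    target, feature_list = config
    # Replace this body with the real project creation and training calls.
    return {"target": target, "feature_list": feature_list, "status": "submitted"}

def run_all(configs, max_parallel=4):
    """Submit all configurations, running up to max_parallel at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map returns results in the same order as configs.
        return list(pool.map(run_project, configs))

# Example grid: every combination of target and feature list.
configs = list(product(["target_a", "target_b"], ["fl_v1", "fl_v2", "fl_v3"]))
results = run_all(configs)
```

Threads are a reasonable fit when the script mostly waits on a remote service. The number of runs in flight is capped by `max_parallel`, which you would tune to your worker allowance.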