During project initialization I would like to provide `datarobot.Project.set_label()` with a partition that is both grouped and stratified, essentially `sklearn.model_selection.StratifiedGroupKFold`.
Does anyone know whether it is possible to write a custom partition object, or how (if at all) `datarobot.StratifiedCV` can be used with sample grouping, or `datarobot.GroupCV` with class stratification?
datarobot.UserTVH is not powerful enough, as it seems to not allow for stratification or grouping within the training CVs, that is if you still get CV during training. I cannot really tell from the documentation if there is still CV during the training if you use datarobot.UserTVH, or if the hyperparams selected are simply those yielding the highest score on the validation set when the model is trained once on the designated set. Either way, it doesn't do what I need.
In the simplest and most common case, where you only want to stratify by your (binary) target, that works out of the box.
If you use Python, you can find the relevant info here: https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.28.0/autodoc/api_reference.html?hig...
Now if you want something more customised, then the easiest solution would be to add your partitions as just another column to your data, and then use https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.28.0/autodoc/api_reference.html?hig... to partition the data.
To answer your last question: DataRobot does indeed use an internal split to tune hyperparams; it does not use the validation data for that.
More information about partitioning in DataRobot from the docs: https://docs.datarobot.com/en/docs/modeling/reference/model-detail/data-partitioning.html
Hope that helps
So in `datarobot.UserCV`, what is the `seed` parameter for? As I understand it, `user_partition_col` is a vector generated before or at the time of project creation/data upload containing the fold assignments (presumably as zero-indexed integers, but the docs do not say), where the holdout set is designated according to a value provided to `cv_holdout_level`. Because `user_partition_col` exists, we do not need to do any data splitting on DR's servers, so what could the `seed` argument be doing other than existing for compatibility reasons?
If, then, the only randomization taking place when `datarobot.UserCV` is used is what the user generates and inputs as `user_partition_col`, doesn't this knee-cap a bunch of DR features? I thought changing the random seed on things like CV folds and examining the difference between the models generated by each seed was a pretty prime use case for DR.
To your point about the internal CV, are you sure it always happens? If some sort of internal CV is done, datarobot.UserCV would generate the wrong results, as the splitter would have no idea how to properly stratify and group.
My data is like the table below. Each CV fold needs to have the same (or a very similar) proportion of `Class` while also keeping each subject's replicates together; that way the model is more likely to learn something about the class rather than matching patient replicates to each other. The data below is very simple. In the actual data, each person's 'test replicates' are more like a reference result contaminated with some complex noise, or versions of an image generated during image augmentation as part of an image recognition model's training loop. Does it make more sense now why the DR solutions do not seem to address what I need?
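To make the two requirements concrete, here is a minimal standard-library sketch that checks any candidate fold column for subjects that leak across folds and reports the class share per fold. The function name `check_partition` and the toy rows are invented for illustration and are not the real data.

```python
from collections import Counter, defaultdict


def check_partition(classes, subjects, folds):
    """Return (leaky_subjects, class_share_per_fold) for a fold assignment."""
    folds_per_subject = defaultdict(set)
    class_counts = defaultdict(Counter)
    for cls, subj, fold in zip(classes, subjects, folds):
        folds_per_subject[subj].add(fold)
        class_counts[fold][cls] += 1
    # A subject is "leaky" if its replicates land in more than one fold.
    leaky = sorted(s for s, f in folds_per_subject.items() if len(f) > 1)
    shares = {
        fold: {cls: n / sum(counts.values()) for cls, n in counts.items()}
        for fold, counts in class_counts.items()
    }
    return leaky, shares


# Toy rows (made up): four subjects with two replicates each.
classes = ["pos", "pos", "neg", "neg", "pos", "pos", "neg", "neg"]
subjects = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]
good = [0, 0, 0, 0, 1, 1, 1, 1]  # grouped and stratified
bad = [0, 1, 0, 1, 0, 1, 0, 1]   # stratified, but splits every subject

leaky_good, shares_good = check_partition(classes, subjects, good)
leaky_bad, shares_bad = check_partition(classes, subjects, bad)
```

A purely stratified split such as `bad` passes the class-share check but fails the grouping check, which is exactly the failure mode described above.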
| Class | Subject | test replicate | Covid Test subresult#1 | Covid Test subresult#2 |