Custom Data split

Hi everyone,

Is it possible to pass custom train, test, and validation datasets to DataRobot?

I have already made the data split, so my train, test, and validation datasets are available, and I would like to train ML algorithms in DataRobot using only my custom training data.

Could you please let me know how to accomplish this using the Python API?

 


Accepted Solutions

Hi DRsomesh,

 

Yes, you can absolutely do this within DataRobot. If you are using the UI, you can provide a Partition Feature to DataRobot. That's effectively just an additional column in your data that labels each row with the train, test, or validation partition you want it to belong to. When you tell DataRobot to use that feature for partitioning, it will set up the TVH (train-validation-holdout) structure according to the partitions you've provided. There is more documentation on that here: Partitioning Options via the UI.
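The partition feature is just a column you add to your data before uploading it. As a hypothetical sketch (the column name "CustomPartition" and the 70/15/15 split are illustrative assumptions, not anything DataRobot requires), here is one way to append such a column to a CSV using only the Python standard library:

```python
import csv
import random

random.seed(0)  # reproducible assignment for this sketch

def label_row():
    # Illustrative split: roughly 70% Train, 15% Validation, 15% Test (holdout)
    r = random.random()
    if r < 0.70:
        return "Train"
    elif r < 0.85:
        return "Validation"
    return "Test"

def add_partition_column(in_path, out_path):
    # Copy the CSV, appending a "CustomPartition" column whose labels must
    # match the training_level / validation_level / holdout_level values
    # you later pass to dr.UserTVH.
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header + ["CustomPartition"])
        for row in reader:
            writer.writerow(row + [label_row()])
```

If your split already exists as three separate files, you would instead label each file's rows with a fixed value and concatenate them before upload.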

 

Since you want to use the API, you can do the same thing by passing the "partitioning_method" parameter to project.analyze_and_model() or project.start(). API docs on that here: Partitioning Method API Docs.

 

I've provided some example code below. Note the partitioning_method parameter passed to project.analyze_and_model(): it specifies the column in which I've defined my custom partitions, as well as which value in that column maps to each of the Train, Validation, and Holdout categories DataRobot expects. If you don't want a holdout, pass None for holdout_level.

import datarobot as dr
from datarobot import Project
from datarobot.enums import ACCURACY_METRIC

# Create and start the project
# (`data` is your training dataset: a pandas DataFrame or a file path)
project_name = "Customer Churn with Custom Partitions"
project = Project.create(sourcedata=data, project_name=project_name)

# Set the target to the name of the feature you want to predict
target = "Churn"
project.analyze_and_model(
    target=target,
    metric=ACCURACY_METRIC.AUC,
    worker_count=-1,
    partitioning_method=dr.UserTVH(
        user_partition_col="CustomPartition",
        training_level="Train",
        validation_level="Validation",
        holdout_level="Test",
    ),
)


 

Thanks, @corykind, for your reply.
