Solved: Re: Ignoring columns - DataRobot Community

Bruce · ‎11-14-2021

I realize in writing this that is really several questions -

but they are related so I will make it one post.

Everything in the post is asking about the Python API to Data Robot.

1. Can I get Data Robot to ignore columns in the provided data in training the model.

The answer seems to be feature lists, but I am a bit fuzzy on how to use them. If I start a project, I have to give a target - but oddly, I cannot give a feature list. The only way I know how to hand over a feature list to a project is using set_target, which complains if I specify the same target that I did when I created the project. Neither allow me to specify a blank target.

Addendum: part of the answer seems to be create project, which does not require a target, set target, which requires a target and you can give features, start project, which requires a target - but will accept the previously supplied target specified again. That feels like - you can specify multiple targets but when you start a project it will create it if it does not exist. This was probably intended as a good thing. [later] This does not entirely work, as it appears to create two projects when I do it. Trying to create a project with auto-pilot turned off is part of the issue.

2. Can I get Data Robot to not ignore a column, even if it thinks it is target leakage.

3. Can I get Data Robot to include the independent variables in the output prediction table - so that I do not have to stitch them back together again, at the risk of introducing error or at least doubt into the situation?

IraWatt · ‎11-15-2021

Hey @Bruce ,

Looking at your first question, one approach may be to start a project with a dataset. Then use the 'create_featurelist' function to create a feature list and pass its id when starting the modelling process.

import datarobot as dr

dataset = Dataset.create_from_file(file_path='/home/user/data/last_week_data.csv')
project = dr.Project.create_from_dataset(dataset.id, project_name='New Project')
project.set_target(target='feature 1')

featurelist = project.create_featurelist('test', ['feature 1', 'feature 2'])
project.start_autopilot(featurelist.id)

View solution in original post

Eu Jin · ‎11-15-2021

Hey @Bruce

First of all thanks @IraWatt for the quick response to the questions! Looks like you've covered the feature list question and the ignore target leakage which is awesome! I'll add another variant to creating the feature list:

featurelist = ['Type', 'Price', 'Distance', 'Bedroom2', 'Bathroom', 'Car']
featurelist = project.create_featurelist('EJL features', list(featurelist))

project.set_target(
    target = 'Price'
    , featurelist_id = featurelist.id
    , mode = dr.AUTOPILOT_MODE.QUICK
    , worker_count = -1
    )

On the last one @Bruce you can get DataRobot to pass back all the independent features (or even features that are not used at all) but only if you have deployed the model in MLOps. Currently the only way to do it via the Modelling workers is through the GUI. There's no support for the API yet. Here's a very similar question that was asked a few months back here

Eu Jin

View solution in original post

IraWatt · ‎11-15-2021

Hey @Bruce ,

Looking at your first question, one approach may be to start a project with a dataset. Then use the 'create_featurelist' function to create a feature list and pass its id when starting the modelling process.

import datarobot as dr

dataset = Dataset.create_from_file(file_path='/home/user/data/last_week_data.csv')
project = dr.Project.create_from_dataset(dataset.id, project_name='New Project')
project.set_target(target='feature 1')

featurelist = project.create_featurelist('test', ['feature 1', 'feature 2'])
project.start_autopilot(featurelist.id)

IraWatt · ‎11-15-2021

On Target leakage the 'Run_leakage_removed_feature_list' is a parameter in DataRobots advanced options which would allow you to ignore any target leakage detected and just run the feature lists you pass over for modelling.

import datarobot as dr
advanced_options = dr.AdvancedOptions(run_leakage_removed_feature_list=False)

If you wanted to do this at a feature level you can edit this in the Feature method using the target_leakage parameter.

Bruce · ‎11-15-2021

Hi @IraWatt, thanks.

The basic flow sounds more like what I was hoping for.

Just one question - what about starting it with autopilot off? just project.start() ?

Bruce.

IraWatt · ‎11-15-2021

Good question had a quick check of the API Docs project.start doesn't mention any parameters you could use for this that I can see.

IraWatt · ‎11-15-2021

I'm not sure how you are doing your predictions but the BatchPredictionJob function has a 'passthrough_columns' parameter which may be helpful. I have not used this parameter so I'll give it a test.

Eu Jin · ‎11-15-2021

Hey @Bruce

First of all thanks @IraWatt for the quick response to the questions! Looks like you've covered the feature list question and the ignore target leakage which is awesome! I'll add another variant to creating the feature list:

featurelist = ['Type', 'Price', 'Distance', 'Bedroom2', 'Bathroom', 'Car']
featurelist = project.create_featurelist('EJL features', list(featurelist))

project.set_target(
    target = 'Price'
    , featurelist_id = featurelist.id
    , mode = dr.AUTOPILOT_MODE.QUICK
    , worker_count = -1
    )

On the last one @Bruce you can get DataRobot to pass back all the independent features (or even features that are not used at all) but only if you have deployed the model in MLOps. Currently the only way to do it via the Modelling workers is through the GUI. There's no support for the API yet. Here's a very similar question that was asked a few months back here

Eu Jin

Bruce · ‎11-15-2021

Thanks @IraWatt -- that gave me a collection of ideas. I don't know yet what road-blocks I will run into with the batch predictions, but your suggestions have given me a path to explore.

Bruce · ‎11-15-2021

@Eu Jin Will this be available through the Python API soon?

Bruce · ‎11-15-2021

@IraWatt the start function has "autopilot_on" which can be set to false.

Ignoring columns

Ignoring columns

Paxata Cache Folder

how to transform the var type in workbench

Understanding Model

Time Series Modelling

Trial Walkthrough Issue