Solved: Re: Ignoring columns - Page 2 - DataRobot Community

Bruce · ‎11-14-2021

I realize in writing this that it is really several questions, but they are related, so I will make it one post.

Everything in this post is about the Python API for DataRobot.

1. Can I get DataRobot to ignore columns in the provided data when training the model?

The answer seems to be feature lists, but I am a bit fuzzy on how to use them. If I start a project, I have to give a target, but oddly, I cannot give a feature list. The only way I know to hand a feature list to a project is set_target, which complains if I specify the same target that I gave when I created the project. Neither call allows me to specify a blank target.

Addendum: part of the answer seems to be a three-step sequence. Create project does not require a target; set target requires a target and also accepts a feature list; start project requires a target, but will accept the previously supplied target specified again. That feels like you can specify the target more than once, and that starting a project will create it if it does not exist, which was probably intended as a good thing. [later] This does not entirely work, as it appears to create two projects when I do it. Trying to create a project with Autopilot turned off is part of the issue.
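
For readers following along, the create-then-set-target route described above might look roughly like the sketch below. It assumes the `datarobot` Python client as it stood around the time of this thread (`Project.create`, `Project.create_featurelist`, and `Project.set_target` with a `featurelist_id` argument) and a client already configured with credentials. The helper names `features_to_keep` and `create_project_with_featurelist`, the project name, and the feature list name are made up for illustration.

```python
def features_to_keep(columns, ignore):
    """Return `columns` minus the ignored names, preserving the original order."""
    ignore = set(ignore)
    unknown = ignore - set(columns)
    if unknown:
        # Fail early rather than silently ignoring a misspelled column name.
        raise ValueError(f"cannot ignore unknown columns: {sorted(unknown)}")
    return [c for c in columns if c not in ignore]


def create_project_with_featurelist(csv_path, target, columns, ignore):
    """Upload the data, register a feature list, then set the target once.

    Hypothetical helper: needs the `datarobot` package and a configured client.
    """
    import datarobot as dr  # imported lazily so the pure helper above works without it

    # Project.create only uploads the data, so no target is needed at this point.
    project = dr.Project.create(sourcedata=csv_path, project_name="ignore-columns sketch")
    flist = project.create_featurelist(
        name="without ignored columns",
        features=features_to_keep(columns, ignore),
    )
    # The target is supplied exactly once, together with the feature list.
    project.set_target(target=target, featurelist_id=flist.id)
    return project
```

Because the target is only ever passed to `set_target`, this avoids calling `Project.start` on top of `Project.create`, which is the combination that appeared to produce two projects.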

2. Can I get DataRobot to not ignore a column, even if it thinks the column is target leakage?

3. Can I get DataRobot to include the independent variables in the output prediction table, so that I do not have to stitch them back together again at the risk of introducing error, or at least doubt, into the situation?

IraWatt · ‎11-16-2021

@Bruce Great point, I didn't think of that! Setting autopilot_on to False would let you set up a feature list as I did above and then begin modelling.
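
A minimal sketch of that second half, assuming the project was started with `dr.Project.start(..., autopilot_on=False)` and that the client exposes `Project.create_featurelist` and `Project.start_autopilot` as it did at the time. The function name `model_on_subset` and the names in the commented usage are illustrative only.

```python
def model_on_subset(project, name, features):
    """Create a feature list on an existing project and run Autopilot on it.

    `project` is expected to behave like a datarobot.Project whose target is
    already set and whose Autopilot has not been started.
    """
    flist = project.create_featurelist(name=name, features=list(features))
    # Autopilot is started explicitly, on the custom feature list only.
    project.start_autopilot(flist.id)
    return flist


# Usage against a real account (not run here; needs credentials):
#   import datarobot as dr
#   project = dr.Project.start(sourcedata="train.csv", target="churn",
#                              project_name="manual start", autopilot_on=False)
#   model_on_subset(project, "no leaky columns", ["age", "income", "churn"])
```

Taking the project as an argument keeps the sketch independent of how the project was created, so it should apply equally to a project made with `Project.create` followed by a manual-mode `set_target`.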

IraWatt · ‎11-16-2021

Glad you found some of the information useful. There are a few good articles on the community about Batch Prediction and about uploading actual results to measure deployment accuracy, both worth a look if you haven't seen them. Also, just to let you know, the DR community lets you accept multiple answers. I think @Eu Jin's answer was more complete than mine, so feel free to tick it as well 😄.

Eu Jin · ‎11-16-2021

Not any time soon, as we try to get customers to go through the deployment section and use the BatchPrediction capability, as @IraWatt mentioned.

Bruce · ‎11-17-2021

My overall conclusion, fwiw, is that it is not worth it. I have changed the code so that the data is conditioned to contain only the target and the columns intended for training. I also add a row number to a static copy of the table so that I can join it, on that column, with the predictions returned by DataRobot.
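
The workaround just described can be sketched without any DataRobot calls. The version below uses plain lists of dicts; with DataFrames the same join would be a pandas `merge` on the row number. It assumes the predictions come back carrying a row identifier that matches each row's position in the uploaded data. The column names `row_id` and `prediction` and all function names are placeholders.

```python
def add_row_id(rows, id_column="row_id"):
    """Return a static copy of the table with a 0-based row number added."""
    return [{id_column: i, **row} for i, row in enumerate(rows)]


def training_view(rows, keep):
    """Keep only the target and the columns intended for training."""
    return [{k: row[k] for k in keep} for row in rows]


def join_predictions(rows, predictions, id_column="row_id", pred_column="prediction"):
    """Attach each prediction to its original row by row number."""
    by_id = {p[id_column]: p[pred_column] for p in predictions}
    missing = [r[id_column] for r in rows if r[id_column] not in by_id]
    if missing:
        # A missing row number means the join would silently drop or misalign data.
        raise KeyError(f"no prediction for rows: {missing}")
    return [{**row, pred_column: by_id[row[id_column]]} for row in rows]


table = add_row_id([
    {"age": 31, "notes": "call back", "churn": 0},
    {"age": 58, "notes": "vip", "churn": 1},
])
upload = training_view(table, ["age", "churn"])  # what would be sent for training
```

Joining on an explicit key rather than on row order means a reordered or partial prediction file raises an error instead of quietly misaligning rows, which speaks to the "error or at least doubt" concern in question 3.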

phi · ‎12-17-2021

For a more concise response to (1) and (2):

Projects are defined in the context of a target variable;
Blueprints, and by extension Autopilot, are defined in the context of a featurelist.

I would caveat your second question by saying that if you want to force the feature to be considered at the algorithm level (e.g. you don't want the feature to be lasso-ed away), I am not aware of a way to do so. If that is the goal, you're probably better off blending with a model built with only that feature.

For (3), a slightly cumbersome option perhaps: download the JAR Scoring Code, which has a passthrough_columns parameter. Write your table to a temporary CSV, call Java from the command line, and get back your original columns plus predictions in the resulting CSV output. Not the best of options I suppose, but it works and avoids cluttering the server with prediction datasets.
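
A rough sketch of driving that from Python follows. The command-line shape (`csv` subcommand, `--input`, `--output`, `--passthrough_columns`) is written from memory of the Scoring Code documentation and should be checked against the JAR's own help output. The file names and the helper names `scoring_command` and `score` are placeholders.

```python
import subprocess


def scoring_command(jar_path, input_csv, output_csv, passthrough="All"):
    """Build the java command line for scoring a CSV with passthrough columns.

    `passthrough` is either a single string such as "All" (assumed here to
    mean every input column) or an iterable of column names.
    """
    if not isinstance(passthrough, str):
        # Column names are joined with commas and no spaces.
        passthrough = ",".join(passthrough)
    return [
        "java", "-jar", jar_path, "csv",
        f"--input={input_csv}",
        f"--output={output_csv}",
        f"--passthrough_columns={passthrough}",
    ]


def score(jar_path, input_csv, output_csv, passthrough="All"):
    """Run the scoring JAR; raises CalledProcessError if java exits non-zero."""
    subprocess.run(scoring_command(jar_path, input_csv, output_csv, passthrough), check=True)
```

Building the argument list separately from running it makes the command easy to print and inspect before anything is executed.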
