Ignoring columns

I realize in writing this that it is really several questions, but they are related, so I will make it one post.

 

Everything in this post is about the Python API to Data Robot.

 

1. Can I get Data Robot to ignore columns in the provided data when training the model?

 

The answer seems to be feature lists, but I am a bit fuzzy on how to use them. If I start a project, I have to give a target, but oddly I cannot give a feature list. The only way I know to hand a feature list to a project is set_target, which complains if I specify the same target that I gave when I created the project. Neither call allows me to specify a blank target.

 

Addendum: part of the answer seems to be a sequence of three calls: create project, which does not require a target; set target, which requires a target and lets you give features; and start project, which requires a target but will accept the previously supplied target specified again. That feels like you can specify multiple targets, but when you start a project it will create the project if it does not exist. This was probably intended as a good thing. [later] This does not entirely work, as it appears to create two projects when I do it. Trying to create a project with Autopilot turned off is part of the issue.
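
A minimal sketch of the create project, feature list, set target sequence described above, written so that only one project is created (it never calls Project.start). It assumes the `datarobot` client is installed and credentials are already configured; the helper names `select_features` and `start_manual_project`, the feature list name, and the project name are made up for illustration. Depending on the client version, `set_target` may be superseded by a newer call, so treat this as a starting point rather than the definitive recipe.

```python
def select_features(all_features, ignore):
    """Return the feature names to train on, in their original order."""
    dropped = set(ignore)
    return [name for name in all_features if name not in dropped]


def start_manual_project(csv_path, target, ignore, project_name="ignore-columns-demo"):
    """Create one project, attach a custom feature list, then set the target."""
    # Imported here so select_features can be used without the client installed.
    # Assumption: credentials are configured (for example via drconfig.yaml).
    import datarobot as dr

    # Project.create uploads the data and does not need a target.
    project = dr.Project.create(sourcedata=csv_path, project_name=project_name)

    # Build a feature list from every column except the ones to ignore.
    all_features = [feature.name for feature in project.get_features()]
    keep = select_features(all_features, ignore)
    featurelist = project.create_featurelist(
        name="without ignored columns",  # hypothetical name
        features=keep,
    )

    # Set the target once, with the feature list attached, in manual mode
    # so that Autopilot does not start on its own.
    project.set_target(
        target=target,
        featurelist_id=featurelist.id,
        mode=dr.AUTOPILOT_MODE.MANUAL,
    )
    return project, featurelist
```

With made-up file and column names, a call such as `start_manual_project("train.csv", "churn", ignore=["customer_id"])` would return the project and the feature list that later training calls can refer to.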

 

2. Can I get Data Robot to not ignore a column, even if it thinks it is target leakage?

 

3. Can I get Data Robot to include the independent variables in the output prediction table - so that I do not have to stitch them back together again, at the risk of introducing error or at least doubt into the situation?

14 Replies

@Bruce Great point, I didn't think of that! Setting autopilot_on to False would allow you to set up a feature list like I did above and then begin modelling.

Glad you found some of the information useful. There are a few good articles on the community on Batch Prediction and also on uploading actual results to measure deployment accuracy, both worth a look if you haven't seen them. Also, just to let you know, the DR community helpfully lets you accept multiple answers. I think @Eu Jin's answer was more complete than mine, so feel free to tick it also 😄.

Not any time soon as we try to get customers to go through the deployment section and use the BatchPrediction capability like @IraWatt mentioned. 

My overall conclusion, fwiw, is that it is not worth it. I have changed my code to condition the data so that it contains only the target and the columns intended to be used in training. I also add a row number to a static copy of the table, so that I can join that copy on the row-number column with the predictions returned by Data Robot.
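
A minimal sketch of that row-number join in plain Python, with rows held as lists of dicts. The function names and the `row_id` key are illustrative assumptions, not part of the Data Robot client; the point is only that the join checks the ids on both sides match before stitching, which removes most of the doubt.

```python
def add_row_id(rows, key="row_id"):
    """Return copies of the rows with a 0-based row number stored under `key`."""
    return [{key: i, **row} for i, row in enumerate(rows)]


def join_predictions(rows, predictions, key="row_id"):
    """Join predictions back onto the rows by `key`, failing loudly on a mismatch."""
    by_id = {p[key]: p for p in predictions}
    if len(by_id) != len(predictions):
        raise ValueError("duplicate row ids in predictions")
    if set(by_id) != {r[key] for r in rows}:
        raise ValueError("row ids in predictions do not match the source rows")
    # Keep the order of the static copy; predictions may come back in any order.
    return [{**r, **by_id[r[key]]} for r in rows]
```

The same idea carries over to a dataframe library as a merge on the row-number column, with a check that no rows were lost or duplicated.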

For a more concise response to (1) and (2):

  1. Projects are defined in the context of a target variable;
  2. Blueprints, and by extension Autopilot, are defined in the context of a featurelist.

I would caveat your second question by saying that if you want to force the feature to be considered at the algorithm-level (e.g. you don't want the feature to be lasso-ed away), I am not aware of a way to do so. If that is the goal, you're probably better off blending with a model built with only that feature.

For (3), a slightly cumbersome option perhaps: download the JAR Scoring Code, which has a passthrough_columns parameter. Write your table to a temporary CSV, call Java from the command line, and get back your original columns plus predictions in the resulting CSV output. Not the best of options I suppose, but it works and avoids cluttering the server with prediction datasets.
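
A rough sketch of driving that from Python. Only the passthrough_columns parameter comes from the description above; the `csv` subcommand, the `--input` and `--output` flags, and the `All` value are assumptions about the Scoring Code command line and should be checked against the help output of the JAR you download. The function names and file names are hypothetical.

```python
import subprocess


def scoring_command(jar_path, input_csv, output_csv, passthrough="All"):
    """Build the java command line for scoring a CSV with a Scoring Code JAR."""
    return [
        "java", "-jar", jar_path, "csv",  # assumed subcommand
        f"--input={input_csv}",  # assumed flag
        f"--output={output_csv}",  # assumed flag
        f"--passthrough_columns={passthrough}",
    ]


def score_csv(jar_path, input_csv, output_csv, passthrough="All"):
    """Run the JAR; raises CalledProcessError if Java exits with a non-zero status."""
    command = scoring_command(jar_path, input_csv, output_csv, passthrough)
    subprocess.run(command, check=True)
```

Calling `score_csv("model.jar", "table.csv", "scored.csv")` would then write the scored file locally, assuming Java is on the PATH.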
