Id Column of the Iris Dataset Not Automatically Removed

While learning how model evaluation works, I loaded the Iris dataset into the AI Catalog. I expected DataRobot to remove the Id column before running the models, but it did not. To my surprise, Id was identified as the MOST IMPORTANT feature for the prediction!

I am about to rerun it with the Id column removed. Without understanding the logic of the automation, I feel very uncomfortable about how far my pre-processing should go before handing the rest of the work to DataRobot. I am a newcomer, so I might be asking something that everyone here already knows.


Accepted Solutions

I tried this myself with the Kaggle dataset. I think a combination of reasons is at play. DataRobot typically identifies things like surrogate keys and primary-key identifiers and excludes them from the feature lists used for training models. In this case, I think the dataset being both quite short (150 rows) and quite narrow (6 columns, one being the ID and one being the target) led the default logic to include it. As for it being a strong signal in the model: the Kaggle data appears to have been sorted by target label before the ID was assigned. So even a casual glance at this dataset would lead one to create the rule "if ID <= 50, Iris-setosa"!
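To make the leakage concrete, here is a minimal sketch in plain Python. It assumes 50 rows per species (the class layout implied by the "ID <= 50" rule above) and compares an Id-only rule on rows sorted by species against the same rule after the rows are shuffled before Ids are assigned. The function names and the fixed shuffle seed are illustrative, not anything DataRobot uses.

```python
import random

SPECIES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


def rule_from_id(row_id):
    """Predict the species from the row Id alone (no measurements used)."""
    if row_id <= 50:
        return SPECIES[0]
    if row_id <= 100:
        return SPECIES[1]
    return SPECIES[2]


def accuracy(labels):
    """Assign Ids 1..N in the given row order and score the Id-only rule."""
    hits = sum(rule_from_id(i) == label for i, label in enumerate(labels, start=1))
    return hits / len(labels)


# Assumption: 150 rows, 50 per species, sorted by species before Ids are assigned.
sorted_labels = [s for s in SPECIES for _ in range(50)]

# Same labels, but shuffled before Ids are assigned (fixed seed for repeatability).
shuffled_labels = sorted_labels[:]
random.Random(0).shuffle(shuffled_labels)

sorted_acc = accuracy(sorted_labels)
shuffled_acc = accuracy(shuffled_labels)

print(f"Id-only rule, sorted rows:   {sorted_acc:.2f}")
print(f"Id-only rule, shuffled rows: {shuffled_acc:.2f}")
```

On the sorted rows the Id-only rule is perfect, which is why a model would rank Id so highly. Once the row order carries no information about the target, the same rule is no better than guessing, which is what would happen on new data.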

On the Data tab, you can create your own feature list in an existing project. Select the 4 actual features, then click +Create feature list. You could also do this at the start of a new project: choose Species as the target, create the new custom feature list, then kick off Autopilot.


@danielkcchan - I see @doyouevendata 's reply.

Here's the UI for creating a new feature list. The new list you create will contain only the features you've selected.

[Screenshot: UI for creating a new feature list]

Here's a community article that explains feature lists and creating new ones:
https://community.datarobot.com/t5/resources/feature-lists/ta-p/1825


4 Replies

Actually, I already did that yesterday.  Thanks for the input.

Yes, I observed the same pattern in the dataset. I believe the pattern favours model training but would be a disaster when it comes to making predictions. This default logic is very worrying to me. I would much prefer this to be flagged during EDA1 rather than discovering it myself after running the models and examining the outcomes. In any case, thanks for taking the trouble to try out the dataset; I really appreciate it.
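Until something like this is flagged automatically, a quick check before uploading can catch the obvious cases. The sketch below is a simple heuristic of my own, not DataRobot's detection logic: it flags any column whose values are consecutive integers in row order. The helper names and the three sample rows are made up for illustration.

```python
def looks_like_row_id(values):
    """Heuristic: True if the values are consecutive integers in row order."""
    try:
        # Go through str() so that a float such as 5.1 is rejected, not truncated.
        ints = [int(str(v).strip()) for v in values]
    except ValueError:
        return False
    if len(ints) < 2:
        return False
    return all(b - a == 1 for a, b in zip(ints, ints[1:]))


def flag_id_columns(rows):
    """Return the names of columns that look like row identifiers.

    `rows` is a list of dicts, e.g. as produced by csv.DictReader.
    """
    if not rows:
        return []
    return [
        name
        for name in rows[0]
        if looks_like_row_id([row[name] for row in rows])
    ]


# Made-up sample rows for illustration only.
rows = [
    {"Id": "1", "SepalLengthCm": "5.1", "Species": "Iris-setosa"},
    {"Id": "2", "SepalLengthCm": "4.9", "Species": "Iris-setosa"},
    {"Id": "3", "SepalLengthCm": "7.0", "Species": "Iris-versicolor"},
]

flagged = flag_id_columns(rows)
print("Columns that look like row Ids:", flagged)
```

This only catches sequential integer keys. Other kinds of identifiers (UUIDs, hashed keys, timestamps used as keys) would need their own checks, so treat it as a first pass rather than a guarantee.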

 
