
Does Data Robot actually ignore unknown values?

Bruce
Micro Servo

Several recent issues I have been having with building a model through Data Robot have been related to uncertainty in my mind as to how Data Robot handles missing values. At first I thought it simply did not. But, I am told it does. But, now I am unclear about what exactly it does do - and it makes a difference to how I should interpret the results.

 

I have data about loans. Some of these loans have a property, call it X, with a numerical value - and others do not. When the loan does have the property X, I would like DR to use it in the prediction. But, when the loan does not have property X, I don't want it to impute some value - I want it simply not to use it in making any prediction. So, for example, it should never turn up in an explanation. The data X is not missing in the sense of bad recording, it simply does not apply. 

 

My investigation so far suggests it will replace it with -9999 and then try to do numerical partitions on it, including that value. And that if I include a column (or even if DR makes such a column) indicating that the value is invalid, DR will not actually recognize the idea that the value X should not be used when the X-is-valid column is false. 

 

This seems to suggest that my best response is to insert a value that is less than the most negative number in the column, myself, so that partitions constructed by Data Robot will have the option to exclude that value using an inequality. And then to ignore entries of this nature when they appear in the explanations.
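
A minimal sketch of the idea above, in plain Python rather than anything DataRobot-specific. The helper name `fill_below_min` and the `margin` parameter are hypothetical, and the approach only makes sense for models that split on thresholds (such as trees); in models that fit a slope, an extreme fill value can distort the result.

```python
def fill_below_min(values, margin=1.0):
    """Replace None with a sentinel strictly below the smallest observed value."""
    observed = [v for v in values if v is not None]
    sentinel = min(observed) - margin
    filled = [sentinel if v is None else v for v in values]
    return filled, sentinel

# Hypothetical column X; None means "does not apply".
x = [-3.5, None, 12.0, 0.0, None]
filled, sentinel = fill_below_min(x)

# A single split of the form `x <= threshold` can now isolate the sentinel rows.
threshold = (sentinel + min(v for v in x if v is not None)) / 2
isolated = [v <= threshold for v in filled]
```

The sentinel would need to be computed once on the training data and reused unchanged at prediction time, otherwise the two datasets would encode "does not apply" differently.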

 

Am I on the right track here?

 

In general, I get the feeling that DR does not actually handle very well columns that indicate a change in the significance of another column. For example a physical length recorded as a numerical value and a unit - DR will not realize that the numerical values can have different scales.

 

4 Replies
Linda
DataRobot Alumni

Hey @Bruce - Thanks for your question. While you wait for a more complete response, I wanted to drop a link to the documentation - just in case you haven't seen this already.  https://docs.datarobot.com/en/docs/data/analyze-data/data-quality.html#disguised-missing-values

Maybe there's something in there that gives you some insight? 

The rest of the help will be up to the community!

- Linda

mpkrass7
Data Scientist

Hi Bruce,

 

There are a couple of points I want to address and some points where I could use some clarification.

 

Does DataRobot actually ignore unknown values?

No. The only time DataRobot will ignore a missing value is at training time if the target is missing. In that case, DataRobot will drop the record altogether and not use it in training the model. Otherwise, at training time, DataRobot will impute missing values. There are a number of ways that it can do this (it's actually a customizable feature within our blueprints). I often see the platform do two things:

  1. Impute using the median value from the training data
  2. Create a new column that flags a variable as 'imputed'

At prediction time, DataRobot will also do this, and if the fact that a value is missing carries signal, it could very well show up in the prediction explanations.
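
The two steps described above (median imputation plus an 'imputed' flag) can be sketched roughly as follows. This is an illustration in plain Python, not DataRobot's actual implementation; the helper names and the sample values are hypothetical.

```python
from statistics import median

def fit_median(values):
    """Return the median of the non-missing training values."""
    observed = [v for v in values if v is not None]
    return median(observed)

def impute_with_flag(values, fill):
    """Replace None with `fill`; also return a 0/1 flag marking imputed rows."""
    imputed = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags

# Hypothetical training column for property X; None means the value is missing.
train_x = [3.0, None, 7.0, 5.0, None]
fill = fit_median(train_x)  # learned from the training data only
train_imputed, train_flags = impute_with_flag(train_x, fill)

# At prediction time the same training median is reused.
score_x = [None, 10.0]
score_imputed, score_flags = impute_with_flag(score_x, fill)
```

Because the flag is an ordinary feature, a model can learn from it, which is why "X was missing" can appear in an explanation.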

 

When the loan does not have property X, I don't want it to impute some value - I want it simply not to use it in making any prediction. So, for example, it should never turn up in an explanation. The data X is not missing in the sense of bad recording, it simply does not apply. 

Could you clarify a little bit more on how being missing does not apply to the output? If you had missing values in your training data for this dataset, DataRobot is likely picking up signal from values being missing in some cases and the model should use them just like any other feature in your dataset.

 

If you did not have any missing values in your training data but you do have missing values when you make predictions, that suggests that your training data is not representative of your scoring data. In that case, you should not be making predictions at all on data with missing values. Instead, why not just drop rows where X is NA? DataRobot will let you make predictions on them, and impute values for that feature using the median (the best it can do is take a guess at what the value should be), but better practice would be to make sure your training data is representative of your scoring data.

 

This seems to suggest that my best response is to insert a value that is less than the most negative number in the column, myself, so that partitions constructed by Data Robot will have the option to exclude that value using an inequality. And then to ignore entries of this nature when they appear in the explanations.

 

If you are dead set on not using these values in predictions when they are missing, your best response is not to impute a value yourself. In many model types a small value could have as much significance as a large value. Instead, you could make two models: one trained on data where feature X is not missing (model A) and one trained on data that does not include feature X at all (model B). Then, at prediction time, if feature X is present you would request predictions from model A. Otherwise, you would request predictions from model B. I'm giving this as a practical option for your question but I do not suggest you do this. A model that is trained on a feature always uses the feature in some way at prediction time unless it is dropped entirely in the feature reduction process.
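
A rough sketch of the two-model routing described above. The functions `model_a` and `model_b` are hypothetical stand-ins for two separately trained models, and `route_prediction` is an illustrative helper, not a DataRobot API.

```python
def route_prediction(row, model_with_x, model_without_x, feature="X"):
    """Send the row to model A if the feature is present, otherwise to model B."""
    if row.get(feature) is not None:
        return "A", model_with_x(row)
    # Drop the feature entirely so model B never sees it.
    reduced = {k: v for k, v in row.items() if k != feature}
    return "B", model_without_x(reduced)

# Toy stand-ins; real models would be trained on the two training subsets.
def model_a(row):
    return 0.5 * row["X"] + row["amount"] / 100

def model_b(row):
    return row["amount"] / 100

rows = [{"amount": 100, "X": 2.0}, {"amount": 100, "X": None}]
results = [route_prediction(r, model_a, model_b) for r in rows]
```

With this arrangement, explanations for rows routed to model B cannot mention X, because X was never an input to that model.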


In general, I get the feeling that DR does not actually handle very well columns that indicate a change in the significance of another column. For example a physical length recorded as a numerical value and a unit - DR will not realize that the numerical values can have different scales.

 

DataRobot actually can be pretty good at picking up relationships between two columns, especially with tree-based models and models that look for interaction terms. It will likely perform much better, however, if a numerical column is always represented in the same unit, since it won't have to find those relationships on its own. 
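
As an illustration of putting a numerical column into a single unit before modeling, here is a minimal sketch in plain Python. The unit table and the `to_metres` helper are assumptions made for the example; the units in a real dataset may differ.

```python
# Assumed number of each unit per metre; extend for the units in your data.
UNITS_PER_METRE = {"mm": 1000, "cm": 100, "m": 1}

def to_metres(value, unit):
    """Convert a length to metres; raise on units that are not in the table."""
    if unit not in UNITS_PER_METRE:
        raise ValueError(f"unknown unit: {unit!r}")
    return value / UNITS_PER_METRE[unit]

# Hypothetical (value, unit) pairs that all describe the same physical length.
records = [(250.0, "cm"), (2.5, "m"), (2500.0, "mm")]
lengths_m = [to_metres(v, u) for v, u in records]
```

After the conversion the model sees one numeric column on one scale and no longer has to infer the scale from a separate unit column.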

 

Hope that helps!

Hi @mpkrass7 

 

Thanks, that was quite informative and I feel that it qualifies as an answer.

I remain dubious about Data Robot finding this kind of relation between columns; my experience indicates that DR does not. It might depend on the type of data involved.

I will wait before coming to any definite conclusion.

The field, when invalid, originally had random data in it. I have changed that. But DR seems to be more affected by the field having only the single value than by there being another column indicating that the original field is not useful.

I will, when I get a chance, run two projects, one for the data with the value and one for the data without. At least then I can be certain that DR is not using the information. 

I think that this will require me to think a lot more about exactly what one is supposed to do with missing data anyway. There is something here that I do not feel comfortable with.

------------

On the please clarify issue:

The data being used to train is identical in nature to the data being used to predict plus the target column. Otherwise, as you said - I would not be trying to do predictions in this manner.

Consider a database of songs in which there is a field: how popular was this song on the radio last week. Floating point numerical score from -1 to 1. 1 meaning it was loved and -1 meaning it was hated.  If the song was never played last week - then any numerical score is misleading. So, in terms of predicting how popular it will be on the radio next week, what we really want to do is to ignore that column when it was not played. 

Of course, if machine learning engines were in a perfect world, then you could give a score of NA and the engine would sort it out. But throwing out any NAs in this case is not the right answer either. We do not live in a perfect world.

 

 

 

 


@Linda 

Note: I tried to respond to your message - but the post button did not present itself. So, here is my response, thanks.

 

Thanks, Linda - it has been worthwhile for me to be involved in this question and answer process on the community. I seem to have a different background to most of those responding to my questions. And it is often hard to use their suggestions directly. But, the process of reading and writing has often clarified in my own mind what is going on - and especially what is assumed and is not assumed in this profession. Regards, Bruce.

 

[For example, in the accepted answer the term "relation" between the columns was misunderstood. Interaction terms are not the type of relation I am talking about. Interaction terms, as I understand them anyway, are just automatically generated product terms of multiple columns].