Solved: Does Data Robot actually ignore unknown values? - DataRobot Community

Bruce · ‎02-23-2022

Several recent issues I have had building a model in DataRobot come down to my uncertainty about how DataRobot handles missing values. At first I thought it simply did not handle them. I am told it does, but now I am unclear about what exactly it does, and that makes a difference to how I should interpret the results.

I have data about loans. Some of these loans have a property, call it X, with a numerical value - and others do not. When the loan does have the property X, I would like DR to use it in the prediction. But, when the loan does not have property X, I don't want it to impute some value - I want it simply not to use it in making any prediction. So, for example, it should never turn up in an explanation. The data X is not missing in the sense of bad recording, it simply does not apply.

My investigation so far suggests DR will replace the missing value with -9999 and then build numerical partitions on the column, including that value. It also suggests that if I include a column (or even if DR creates such a column) indicating that the value is invalid, DR will not actually recognize that the value X should not be used when the X-is-valid column is false.

This seems to suggest that my best response is to insert a value myself that is less than the most negative number in the column, so that partitions constructed by DataRobot will have the option to exclude that value using an inequality, and then to ignore entries of this nature when they appear in the explanations.

Am I on the right track here?

In general, I get the feeling that DR does not handle columns that change the significance of another column very well. For example, if a physical length is recorded as a numerical value and a unit, DR will not realize that the numerical values can have different scales.

mpkrass7 · ‎02-24-2022

Hi Bruce,

There are a couple of points I want to address and some points where I could use some clarification.

"Does DataRobot actually ignore unknown values?"

No. The only time DataRobot will ignore a missing value is at training time, if the target is missing. In that case, DataRobot will drop the record altogether and not use it in training the model. Otherwise, at training time, DataRobot will impute missing values. There are a number of ways it can do this (it's actually a customizable feature within our blueprints). I often see the platform do two things:

Impute using the median value from the training data
Create a new column that flags a variable as 'imputed'
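
As a rough illustration of those two steps, here is a minimal Python sketch. It is not DataRobot's implementation (the thread does not show that); the function names and sample values are made up for the example, and it only assumes that the median is learned from training data and reused at prediction time, as described above.

```python
from statistics import median


def fit_median(values):
    """Median of the non-missing training values (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return median(present)


def impute_with_flag(values, fill):
    """Return (imputed values, 0/1 flags marking which entries were imputed)."""
    imputed = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags


# Hypothetical training column X with two missing entries.
train_x = [12.0, None, 7.5, 30.0, None]
fill = fit_median(train_x)  # learned from the training data only
x_imputed, x_was_missing = impute_with_flag(train_x, fill)

# At prediction time the same training median is reused, not recomputed.
score_x, score_flag = impute_with_flag([None, 4.0], fill)
```

The flag column is what lets a model pick up signal from the value being missing, separately from whatever number was filled in.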

At prediction time, DataRobot does the same, and if the fact that a value is missing carries signal, it could very well show up in the prediction explanations.

"When the loan does not have property X, I don't want it to impute some value - I want it simply not to use it in making any prediction. So, for example, it should never turn up in an explanation. The data X is not missing in the sense of bad recording, it simply does not apply."

Could you clarify a little more how X being missing is irrelevant to the output? If you had missing values for this feature in your training data, DataRobot is likely picking up signal from values being missing in some cases, and the model should use that signal just like any other feature in your dataset.

If you did not have any missing values in your training data but you do have missing values when you make predictions, that suggests that your training data is not representative of your scoring data. In that case, you should not be making predictions at all on data with missing values. Instead, why not just drop rows where X is NA? DataRobot will let you make predictions on them, and impute values for that feature using the median (the best it can do is take a guess at what the value should be), but better practice would be to make sure your training data is representative of your scoring data.
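
A minimal sketch of the "drop rows where X is NA" suggestion, assuming the scoring data is a list of dictionaries and that a missing X is stored as None. The field names and the helper are hypothetical and not part of any DataRobot API.

```python
def split_by_missing(rows, feature):
    """Separate rows that have a value for `feature` from rows that do not."""
    scorable = [r for r in rows if r.get(feature) is not None]
    skipped = [r for r in rows if r.get(feature) is None]
    return scorable, skipped


# Hypothetical scoring data; loan 2 has no value for X.
scoring_rows = [
    {"loan_id": 1, "X": 3.2},
    {"loan_id": 2, "X": None},
    {"loan_id": 3, "X": 0.0},  # zero is a real value, not a missing one
]

scorable, skipped = split_by_missing(scoring_rows, "X")
scorable_ids = [r["loan_id"] for r in scorable]
skipped_ids = [r["loan_id"] for r in skipped]
```

Only the `scorable` rows would be sent for predictions. Keeping the skipped rows in their own list makes it easy to report how much of the scoring data the model could not cover.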

"This seems to suggest that my best response is to insert a value myself that is less than the most negative number in the column, so that partitions constructed by DataRobot will have the option to exclude that value using an inequality, and then to ignore entries of this nature when they appear in the explanations."

If you are dead set on not using these values in predictions when they are missing, your best response is not to impute a value yourself. In many model types a small value could have as much significance as a large value. Instead, you could build two models: one trained on data where feature X is not missing (Model A), and one trained on data that does not include feature X at all (Model B). Then, at prediction time, if feature X is present you would request predictions from Model A. Otherwise, you would request predictions from Model B. I'm giving this as a practical option for your question, but I do not suggest you do this. A model that is trained on a feature always uses the feature in some way at prediction time unless it is dropped entirely in the feature reduction process.
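
The routing between Model A and Model B could look roughly like the sketch below. The two predict functions are placeholders standing in for whatever call returns a prediction from each deployed model; their names and return values are invented for the example, and the reply above explicitly does not recommend this setup.

```python
def predict_model_a(row):
    """Placeholder for Model A, trained only on rows where X is present."""
    return 0.5 * row["X"]  # arbitrary stand-in for a real prediction


def predict_model_b(row):
    """Placeholder for Model B, trained on data without feature X."""
    assert "X" not in row, "Model B must never see feature X"
    return 1.0  # arbitrary stand-in for a real prediction


def route_prediction(row):
    """Send the row to Model A if X is present, otherwise to Model B."""
    if row.get("X") is not None:
        return "A", predict_model_a(row)
    without_x = {k: v for k, v in row.items() if k != "X"}
    return "B", predict_model_b(without_x)


with_x = route_prediction({"loan_id": 1, "X": 10.0})
without_x = route_prediction({"loan_id": 2, "X": None})
```

Because Model B never receives X, that feature cannot appear in its explanations, which is the behavior the question asks for. The cost is two models to train, monitor, and keep consistent.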

"In general, I get the feeling that DR does not handle columns that change the significance of another column very well. For example, if a physical length is recorded as a numerical value and a unit, DR will not realize that the numerical values can have different scales."

DataRobot can actually be pretty good at picking up relationships between two columns, especially with tree-based models and models that look for interaction terms. It will likely perform much better, however, if a numerical column is always represented in the same unit, since it won't have to find those relationships on its own.
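
Putting a length column into a single unit before upload might look like this sketch. The unit codes and the conversion table are assumptions for the example; real data would need its own mapping.

```python
# Hypothetical unit codes: how many of each unit make up one meter.
UNITS_PER_METER = {"mm": 1000, "cm": 100, "m": 1}


def to_meters(value, unit):
    """Convert a (value, unit) pair to meters; reject units we do not know."""
    if unit not in UNITS_PER_METER:
        raise ValueError(f"unknown unit: {unit!r}")
    return value / UNITS_PER_METER[unit]


# Hypothetical records: the numeric column and the unit column side by side.
records = [(250.0, "cm"), (3.0, "m"), (1500.0, "mm")]
lengths_m = [to_meters(value, unit) for value, unit in records]
```

After this step the model sees one numeric column on one scale, so it does not have to learn the value-unit interaction itself.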

Hope that helps!

Linda · ‎02-24-2022

Hey @Bruce - Thanks for your question. While you wait for a more complete response, I wanted to drop a link to the documentation - just in case you haven't seen this already. https://docs.datarobot.com/en/docs/data/analyze-data/data-quality.html#disguised-missing-values

Maybe there's something in there that gives you some insight?

The rest of the help will be up to the community!

- Linda
