Passing null values through DataRobot API


Hello,

Both my training and scoring datasets contain null values for some numeric variables; the files use a ? character where data is missing. DataRobot recognizes this and imputes an arbitrary value (e.g. -9999). It also calculates predictions on the scoring dataset with missing values (?) without any problem when I go through Predict -> Make Predictions.

But when I deploy the model and try to pass the same scoring dataset through the API, I get this error message:

400 Error: {"message":"Column P1_XXXX was expected to be numeric"}

I tried replacing ? with null and with a blank field, but I still get the same error. The dataset is UTF-8, comma-delimited. All three of these variants trigger the error:

,?,?,

,null,null,

,,,
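A sketch of a pre-scoring clean-up step that often resolves this kind of "expected to be numeric" error: parse the ? markers as proper NaNs so the columns come out numeric, then re-serialize with empty fields before POSTing the CSV to the deployment endpoint. Column names here are hypothetical, and this uses pandas rather than anything DataRobot-specific.

```python
import io
import pandas as pd

# Hypothetical scoring payload with "?" as the missing-value marker.
raw = "P1_A,P1_B\n1.5,2.0\n?,3.1\n4.2,?\n"

# Treat "?" as NaN on read so the columns parse as floats, not strings.
df = pd.read_csv(io.StringIO(raw), na_values=["?"])
assert df["P1_A"].dtype.kind == "f"  # numeric dtype, missing cells are NaN

# Re-serialize: NaN becomes an empty field, which keeps the column numeric.
clean_csv = df.to_csv(index=False)
```

If the API still rejects the empty-field form, that points to something else in the file (an encoding issue or a stray non-numeric character) rather than the missing-value representation itself.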


Any help would be appreciated.

Thanks

15 Replies

What if the data is log-normal rather than normal? Then the median seems rather less relevant. Is there any justification for the median beyond contemporary data-science standard practice?

In non-parametric statistics, when one doesn't know or care about the data distribution, the median is the statistic of choice. In many cases the median is a better choice than the mean: it is more robust against outliers.
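The robustness point is easy to see with a toy sample (hypothetical values, one mis-keyed outlier):

```python
import statistics

# Small sample with one extreme outlier (e.g. a mis-keyed reading).
values = [10, 11, 12, 13, 10_000]

mean = statistics.mean(values)      # dragged far from the bulk of the data
median = statistics.median(values)  # unaffected by the single outlier

print(median)  # -> 12
print(mean)    # -> 2009.2, dominated by the outlier
```

One bad value moves the mean by two orders of magnitude while the median stays put, which is the usual argument for median imputation by default.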

I see how this can be an issue. A feature should be consistent in its content. For instance, you can convert the values to a single unit, but still keep a separate feature describing which unit they were originally recorded in.


An example that comes to mind, pretty close to my current situation: what if one column is a numerical value and another is its units, say measurements in feet or meters? Replacing a missing feet value with any measure of central tendency that mixes in meter rows would likely guess too low, as a physical matter of fact.

@dalilaB 

This hits some points I have been trying to clarify in my own mind.

Thanks.

I suspect we are coming from different directions. My position is that substituting any value for missing values makes potentially self-supporting but unjustified assumptions about the data.

Effectively, one is pre-modelling the data with a very simple model (a constant) and injecting that into the DataRobot modelling process.

One could argue that this is a form of boosting, of course.

But surely it would be more correct for DataRobot to acknowledge that it cannot use information that is missing?

And if modelling the missing data is a good idea, then why not use somewhat more sophisticated but still simple pre-modelling based on conditional probabilities, to estimate a stochastic functional dependency on some of the other data? For example, one could look for correlations between the data fields and then guess the missing data from those correlations.
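The conditional-imputation idea sketched above can be illustrated with two correlated features and an ordinary least-squares fit (values are made up for illustration; this is not how DataRobot imputes):

```python
import numpy as np

# Two correlated features; one value of y is missing (NaN).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, np.nan, 8.2, 9.9])

# Fit a line y ~ a*x + b on the rows where y is observed...
mask = ~np.isnan(y)
a, b = np.polyfit(x[mask], y[mask], 1)

# ...and impute the missing entry from the fitted relationship,
# instead of using an unconditional constant like the median.
y_imputed = y.copy()
y_imputed[~mask] = a * x[~mask] + b
```

Here the imputed value lands near 6, where the trend predicts it, rather than at the median of the observed values. This is essentially model-based (regression) imputation; iterated versions of the same idea exist in standard libraries.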

What if the data is log normal rather than normal. Then the median seems rather less relevant. Is there any justification for the median beyond contemporary data science standard practice? 



It just depends on how much data you have. If you have a large sample, using the median makes sense: you can examine the data's characteristics and verify that the median is a good representative of your unseen data. One might argue for the mean, but the mean is very sensitive to outliers.


@dalilaB 


Thanks for that and the reference, but it raises a couple of questions. Why impute with the median if the data is missing? If the data is missing, guessing the median is hardly a universally good idea. And it says it imputes with -9999, but that seems to assume -9999 is less than all the other data. What if it is not? I have sometimes used 123456789 for this purpose, as it is more identifiable and further away from the rest of the data. Thoughts?
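One concrete reason sentinel values like -9999 (or 123456789) are risky on their own: they leak into any computation that treats the column as ordinary numbers, whereas an explicit NaN is skipped by NaN-aware reductions. A small illustration with made-up readings:

```python
import numpy as np

# -9999 used as a missing-value sentinel among real readings.
readings = np.array([12.0, 15.0, 14.0, -9999.0])

# The sentinel silently distorts summary statistics...
print(readings.min())   # -> -9999.0
print(readings.mean())  # -> -2489.5

# ...while an explicit NaN is simply excluded by NaN-aware functions.
readings_nan = np.where(readings == -9999.0, np.nan, readings)
print(np.nanmean(readings_nan))  # -> 13.666..., the mean of the real readings
```

This is why a sentinel generally needs to be paired with a missing-indicator flag, so the model can distinguish "sentinel" from "genuine value".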



Yes, DataRobot does handle missing values. For a numerical feature, it imputes them with the median and then adds a binary missing-indicator flag. Documentation is here. However, if you like, you can add your own R or Python imputation function to your Blueprint.
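The fill-plus-flag scheme described above (median imputation with a binary missing indicator) can be sketched with scikit-learn; this is an illustrative equivalent, not DataRobot's actual implementation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric feature with a missing entry.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])

# Median imputation plus a binary "was missing" indicator column.
imp = SimpleImputer(strategy="median", add_indicator=True)
Xt = imp.fit_transform(X)
# Column 0: imputed values (NaN -> median 2.0);
# column 1: 1.0 exactly where the original value was missing.
```

The indicator column is what lets the downstream model treat "imputed 2.0" differently from "observed 2.0".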

If I were to change this to put in a null, and (as I understand you) DataRobot adds a missing-value column, would that have a better effect?

Yes, since 0 is never treated as a representation of missing values.


@jarred Can you clarify whether DataRobot pays special attention to the missing-data column? I have some code that replaces missing numerics with 0s and then adds its own binary missing-data column. But I get the impression that DataRobot sees the 0s and treats them as genuine numeric zeros, thus reducing the quality of the conclusions.


If I were to change this to put in a null, and (as I understand you) DataRobot adds a missing-value column, would that have a better effect?


My worry with Excel is that you might not realize one row contains a space in a numeric field, or various other interesting things that can be hidden from view when you visually scan the file or save it as an xlsx. One of the larger risks is date formats: you receive some data, Excel "helps" by displaying it in a different format (and saving it that way), and then your production scoring job pulls database data in yet another format. With CSV, you know exactly what is in your raw data (in particular, that it is UTF-8 encoded).
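A quick way to surface those hidden non-numeric cells before scoring: if any cell fails to parse, pandas silently makes the whole column strings, and `to_numeric(errors="coerce")` pinpoints the offending rows. The stray "x" below stands in for whatever hidden character Excel left behind:

```python
import io
import pandas as pd

# A CSV where one "numeric" cell hides a stray character.
raw = "value\n1.5\nx\n3.5\n"

df = pd.read_csv(io.StringIO(raw))
# The whole column silently becomes strings, not numbers.
assert df["value"].dtype == object

# Coercing to numeric turns unparseable cells into NaN,
# so comparing against the original exposes exactly which rows are bad.
nums = pd.to_numeric(df["value"], errors="coerce")
bad = df.loc[nums.isna() & df["value"].notna()]
print(bad)  # the row(s) that would trigger "expected to be numeric"
```

Running a check like this on the exact file you POST to the API usually finds the culprit faster than eyeballing the spreadsheet.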


@Alex Thanks for the follow-up on this. Yes, you're experiencing the joys of datatyping based on file type. I'm glad to hear you were able to resolve this.
