
Passing null values through DataRobot API

alexshif
Image Sensor


Hello,

Both my training and scoring datasets contain null values for some numeric variables. In fact, the files have a ? character wherever data is missing. I see that DataRobot recognizes this and imputes with an arbitrary value (e.g., -9999). DataRobot calculates predictions from the scoring dataset with missing values (?) with no problems when I do it through the Predict > Make Predictions option.

But when I deploy the model and try to pass the same scoring dataset through the API, I get this error message:

400 Error: {"message":"Column P1_XXXX was expected to be numeric"}

I tried replacing ? with null and with a blank, but I still get the same error. The dataset is UTF-8, comma delimited. All three of these combinations trigger the error message:

,?,?,

,null,null,

,,,

 

Any help would be appreciated.

Thanks

15 Replies
jarred
DataRobot Alumni

Hey @alexshif, sorry you've encountered some challenges here. As you say, DataRobot will automatically handle missing values, but you've stumbled across the key question: how do you represent a missing value?

While using strings like '?' or 'null' or some other characters makes sense to humans, it's not as relevant for the automated checks. I would recommend using `numpy.nan` (if using Python code), `NULL` (if using R or SQL code), or just deleting the entry (if using Excel or some other system that allows it). Any of those 'programming language' null values should be handled appropriately by DataRobot.
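If you're preparing the file in Python, the clean-up might look something like this sketch (the file names are placeholders; the column name is just taken from your error message above):

```python
import pandas as pd

# Tell pandas which placeholder strings mean "missing", so the column
# parses as numeric with NaN in the gaps.
df = pd.read_csv("scoring.csv", na_values=["?", "null"])
print(df["P1_XXXX"].dtype)  # float64, with NaN where the '?' used to be

# Writing back out leaves missing cells empty (",,"), which DataRobot handles.
df.to_csv("scoring_clean.csv", index=False)
```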

Make sense?

0 Kudos

Thanks, @jarred

We use the Trifacta tool to prepare and generate datasets for DataRobot. The data in those datasets originates from multiple sources. Deleting all entries containing at least one missing value is not feasible in my case, since the vast majority of the 10,000 entries I need to predict contain at least one missing value across more than 100 features.

My question was primarily about DataRobot's different behavior when interpreting missing values in input datasets. The difference I observed was between uploading scoring data manually through the interface and deploying the model and sending the data over the API.

My workaround was to explicitly impute missing numeric values with -9999 in the scoring datasets; this is the same value DataRobot used to impute missing values when building models with the training data. The deployed model is a Gradient Boosted Greedy Trees Classifier with Early Stopping. The generated predictions were validated and proved to be identical when comparing the file generated through Predict > Make Predictions (using scoring data containing missing values) against the file generated by the deployed model (using scoring data with missing values imputed as -9999).
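For reference, here is a rough pandas equivalent of that imputation step (the real work happens in Trifacta; the file names are just illustrative):

```python
import pandas as pd

df = pd.read_csv("scoring.csv", na_values=["?"])

# Fill missing values in the numeric feature columns with the same
# sentinel value DataRobot used during training.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(-9999)

df.to_csv("scoring_imputed.csv", index=False)
```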

Do you see any flaws in my approach?

Thanks,

Alex

 

 

0 Kudos
jarred
DataRobot Alumni

@alexshif 
Yes, that all makes sense.

To clarify, I wouldn't recommend deleting rows containing NaN/null values, but it may be helpful to replace specific sub-optimal placeholder values like "?", "...", etc., if possible. As you've identified, replacing those null values with -9999 is one of the techniques used within DataRobot, so it isn't a bad approach: a tree-based model will know what to do with that -9999 value.

However, it won't be treated quite the same as if DataRobot had detected a null value, because we also create a new 'missing value in column X' indicator column in addition to replacing the null value (either with -9999 for tree-based models, or by imputing a value for coefficient-based models). It is difficult to know how much impact filling the null values with -9999 yourself, versus letting DataRobot do it, will have in your use case. As with most things, we are empirically driven, so if possible it would be interesting to run that small experiment and see whether the predictions for those rows differ between the two approaches. If you test it, I would be interested to hear the results, either in this thread or by email (jarred.bultema@datarobot.com).
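The comparison itself is simple; a sketch like the one below (assuming both prediction runs are exported to row-aligned CSVs, and the "prediction" column name is just a placeholder) would show any differences:

```python
import pandas as pd

# Predictions from Predict > Make Predictions (missing values left blank)
ui_preds = pd.read_csv("predictions_ui.csv")
# Predictions from the deployed model (missing values pre-filled with -9999)
api_preds = pd.read_csv("predictions_api.csv")

diff = (ui_preds["prediction"] - api_preds["prediction"]).abs()
print("max absolute difference:", diff.max())
print("rows that differ:", (diff > 1e-9).sum())
```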

My expectation is that, at a minimum, empty strings passed to DataRobot as CSV (i.e., ,,,) should suffice and work. Does this feature happen to use custom transforms you made inside DataRobot for the project? It shouldn't be necessary for you to take any extra steps at deployment time, and I would advise opening a ticket with support@datarobot.com to review the situation as well.

@jarred and @doyouevendata, thank you both for your replies.

I just wanted to update everyone regarding the issue I encountered.

It turned out that I had used an Excel file as the training dataset to build the original model. After the model was deployed, I realized that I could only use CSV or JSON formats to pass data through the API for predictions. So I took the scoring file in Excel format, saved it as CSV, and passed it through the DataRobot API. That is when I started seeing 400 errors:

400 Error: {"message":"Column P1_XXXX was expected to be numeric"}

To remediate the issue, I decided to rebuild the model using a training dataset in CSV format, with missing values passed as empty strings (,,,) and no explicit imputation. Then I deployed this model and used a CSV scoring file and the Python integration code to generate predictions. Everything worked this time with no errors.
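In case it helps anyone else, my scoring call looks roughly like the snippet below; the URL, key, and token are placeholders and should be copied from the integration snippet shown for your own deployment:

```python
import requests

# Placeholders -- copy the real values from your deployment's integration snippet.
API_URL = "https://<prediction-server>/predApi/v1.0/deployments/<deployment_id>/predictions"
API_TOKEN = "<your API token>"
DATAROBOT_KEY = "<your DataRobot key>"

# Stream the CSV scoring file to the deployment's prediction endpoint.
with open("scoring.csv", "rb") as f:
    response = requests.post(
        API_URL,
        data=f,
        headers={
            "Content-Type": "text/csv; charset=UTF-8",
            "Authorization": "Bearer " + API_TOKEN,
            "DataRobot-Key": DATAROBOT_KEY,
        },
    )

response.raise_for_status()
print(response.json())
```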

I guess the lesson I learned here is to use the same file format (CSV in this case) for both training and scoring datasets to avoid data type errors.

jarred
DataRobot Alumni

@Alex Thanks for the follow-up on this. Yes, you're experiencing the joys of data typing based on file type. I'm glad to hear that you were able to resolve this.

0 Kudos

My worry with Excel is that you might not realize a row contains a space in a numeric field, or various other interesting things that may be hidden from view when visually scanning the file or saving it as an .xlsx. One of the larger risks is date formats, especially if you received some data, Excel "helped" by displaying it in a different format (and saved it), and your production scoring job then pulls database data in yet another format. Working with CSV, you know exactly what is in your raw data (in particular, that it is UTF-8 encoded).
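A quick way to catch that kind of thing before scoring is to coerce the suspect column and look at what fails to parse; a pandas sketch, using the column name from the error message above:

```python
import pandas as pd

df = pd.read_csv("scoring.csv", dtype=str)

# Anything that cannot be parsed as a number becomes NaN here, so the mask
# flags cells containing stray spaces, '?', text, and so on.
coerced = pd.to_numeric(df["P1_XXXX"], errors="coerce")
bad_rows = df[coerced.isna() & df["P1_XXXX"].notna()]
print(bad_rows)
```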

0 Kudos

@jarred Can you clarify whether DataRobot pays special attention to the missing-data column? I have some code that replaces missing numerics with 0s and then adds its own binary missing-data column. But I get the impression that DataRobot sees the 0s and uses them as genuine numeric 0 values, thus reducing the quality of the conclusions.

 

If I were to change this to put in a null instead, and (as I understand you) DataRobot then adds a missing-value column, would that have a better effect?

 

dalilaB
Data Scientist

Yes, DataRobot does deal with missing values. If you have a numerical feature, it imputes them with the median and then adds a binary missing-indicator flag (see the sketch at the end of this reply). Documentation is here. However, if you like, you can add your own R or Python imputation function to your Blueprint.

If I were to change this to put in a null instead, and (as I understand you) DataRobot then adds a missing-value column, would that have a better effect?

Yes, as 0 is never considered a representation of missing values.  
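Conceptually, the numeric preprocessing behaves like the scikit-learn sketch below (an illustration only, not DataRobot's actual implementation):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 12.0],
              [3.0, np.nan]])

# Median imputation plus a binary "was missing" indicator column
# for each feature that contained missing values.
imputer = SimpleImputer(strategy="median", add_indicator=True)
print(imputer.fit_transform(X))
```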

 

 

@dalilaB 

 

Thanks for that and for the reference, but it brings to mind a couple of questions. Why impute with the median if the data is missing? If the data is missing, then guessing the median is hardly universally a good idea. And it says that it imputes with -9999, but seems to expect that to be less than all the other data. What if it is not? I have at times been using 123456789 for this purpose, as being more identifiable and further away from the rest of the data. Thoughts?

 

0 Kudos
dalilaB
Data Scientist

It just depends on how much data you have. If you have a large sample, then using the median makes sense, as you know the data's characteristics and can argue that the median is a good representative of your unseen data. One may argue for the mean, but the mean is very sensitive to outliers.
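A toy example of that sensitivity:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one outlier
print(np.mean(values))    # 202.0 -- dragged up by the outlier
print(np.median(values))  # 3.0   -- unaffected
```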

0 Kudos

@dalilaB 

This hits some points I have been trying to clarify in my own mind.

Thanks.

I suspect we are coming from different directions. My position is that using any fixed value in place of missing values makes potentially self-supporting but unjustified assumptions about the data.

Effectively, one is pre-modelling the data using a simple model (a constant) and injecting that into the DataRobot modelling process.

One could argue that this is a form of boosting, of course.

But surely it would be more correct for DataRobot to acknowledge that it cannot use information that is missing?

And if modelling the data is a good idea, then why not use some more sophisticated but still simple pre-modelling based on conditional probabilities to estimate a stochastic functional dependency on some of the other data? For example, one could look for correlation between the data fields and then estimate the missing data based on that correlation (something like the sketch at the end of this post).

What if the data is log-normal rather than normal? Then the median seems rather less relevant. Is there any justification for the median beyond contemporary data science standard practice?
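Something along the lines of scikit-learn's IterativeImputer is the sort of thing I have in mind, i.e. regressing each feature on the others (just a sketch):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, np.nan],
              [4.0, 8.0]])

# Each feature with missing values is modelled as a function of the others,
# so the gap in column 2 is filled from its near-linear relation to column 1.
print(IterativeImputer(random_state=0).fit_transform(X))
```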

 

 

0 Kudos

An example that comes to mind, which is pretty close to my current situation: what if one column holds a numerical value and another holds its units, for example measurements in feet or meters? Replacing a missing feet value with any measure of central tendency that includes meters would likely guess too low, as a simple matter of physical fact.

dalilaB
Data Scientist

I see how this can be an issue. A feature should be consistent in its content. For instance, you can convert all values to a single unit, but still keep a feature describing which unit they were originally recorded in.
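For instance, with hypothetical length and unit columns:

```python
import pandas as pd

df = pd.DataFrame({
    "length": [10.0, 3.0, None, 6.5],
    "unit": ["feet", "meters", "feet", "meters"],
})

# Convert everything to meters, but keep the original unit as its own feature.
FEET_TO_METERS = 0.3048
df["length_m"] = df["length"].where(df["unit"] == "meters",
                                    df["length"] * FEET_TO_METERS)
print(df)
```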

0 Kudos
dalilaB
Data Scientist

What if the data is log-normal rather than normal? Then the median seems rather less relevant. Is there any justification for the median beyond contemporary data science standard practice?

In non-parametric statistics, when one doesn't know or care about the data distribution, the median is the statistic of choice. In many cases, the median is a better choice than the mean: it is more robust against outliers.