cancel
Showing results for 
Search instead for 
Did you mean: 

Passing null values through DataRobot API

Highlighted
Blue LED

Hello,

Both of my training and scoring datasets contain null values for some numeric variables. In fact files have ? character in places where the data is missing. I see that DataRobot recognizes that and imputes with an arbitrary value (e.g. -9999).  Datarobot calculates predictions using scoring dataset with missing values (?) with no problems when I do it through Predict->Make Predictions option.

But when I deploy the model and try to pass the same scoring dataset using API I am getting error message

400 Error: {"message":"Column P1_XXXX was expected to be numeric"}

I tried replacing ? with null and with blank but still getting the same error. The dataset is UTF-8 comma delimited. So all these 3 combinations are triggering error message:

,?,?,

,null,null,

,,,

 

Any help would be appreciated.

Thanks

7 Replies
Highlighted
Data Scientist
Data Scientist

Hey @alexshif , sorry you've encountered some challenges here. As you say, DataRobot will automatically handle missing values, but you've stumbled across the key question: 'How do you represent a missing value?'

While using strings like '?' or 'null' or some other characters makes sense to humans, its not as relevant for the automated checks. I would recommend either using `numpy.nan` (if using python code) `NULL` (if using R or SQL code) or just delete the entry (if using excel or some other system that allows it). Any of those 'program language' null values should be appropriately handled by DataRobot.

Make sense?

0 Kudos
Highlighted
Blue LED

Thanks, @jarred

We use Trifacta tool to prepare and generate datasets for DataRobot. The data in those datasets originate from multiple data sources. The option of deleting all entries containing at least one missing value is not feasible in my case since vast majority of 10,000 entries that I need to predict contain at least 1 missing value across over 100 features.

I guess my question was primarily about DR different behaviors when interpreting missing values in input datasets. The difference that I observed was between uploading scoring data manually through the interface or deploying the model and sending it over API.

My workaround was to explicitly impute missing numeric values with -9999 in scoring datasets – this is the same value that DR was using to impute missing values when building models with training data. The model that was deployed is Gradient Boosted Greedy Trees Classifier with Early Stopping. The generated predictions were validated and proved to be identical when comparing the file generated through Predict->Make Predictions option with scoring data containing missing values and file generated by deployed model using scoring data with imputed (-9999) missing values.

Do you see any flaws in my approach?

Thanks,

Alex

 

 

0 Kudos
Highlighted
Data Scientist
Data Scientist

@alexshif 
Yes that all makes sense.

To clarify, I wouldn't recommend deleting rows containing NaN/Null-values, but it may be helpful to replace specific sub-optimal Null/NaN-values like "?", "...", etc. if possible. As you identify, replacing those Null-values with -9999 is one of the techniques used within DataRobot, so it isn't a bad approach as a tree-based model will know what to do with that -9999 value.

However, it won't be treated quite the same as if DataRobot detected a Null-value as we also create a new 'missing value in column X' indicator-column as well as replacing the Null-value (either with -9999 for tree-based models or imputing the value for coefficient based models). But it is difficult to know how much impact filling Null-values with -9999 yourself vs. letting DataRobot do it will have in your use-case. As with most things, we are empirically-driven so, if possible, it would be interesting to run that small experiment and see if the predictions for those rows differ with the two approaches. If you test it, I would be interested to hear the results either in this thread or shoot me an email (jarred.bultema@datarobot.com).

Highlighted
DataRobot Employee
DataRobot Employee

My expectation is that minimally the empty strings, if passed to DataRobot as csv with ,,, - should suffice and work.  Does this feature happen to use custom transforms you made inside DataRobot for the project?  It shouldn't be necessary for you to take any extra steps at deployment time, and I would advise opening up a ticket with support@datarobot.com to review the situation as well.

0 Kudos
Highlighted
Blue LED

@jarred and @doyouevendata, thank you both for your replies.

I just wanted to update everyone regarding the issue I encountered.

It turned out that I used Excel file for training dataset to build the original model. After the model was deployed, I realized that I can only use either CSV or JSON formats to pass data through API for predictions. So, I ended up taking scoring file in Excel format and saving it as CSV and passing it through DR API. That is when I started seeing 400 Errors:

400 Error: {"message":"Column P1_XXXX was expected to be numeric"}

To remediate the issue I decided to rebuild the model using training dataset in CSV format and passing missing values as empty strings ,,, without explicit imputation. Then I deployed this model and used CSV scoring file and python integration code to generate predictions. Everything worked this time with no errors.

I guess the lesson I learned here is to use the same file format (CVS in this case) for both training and scoring datasets to avoid data type errors.

Highlighted
Data Scientist
Data Scientist

@Alex Thanks for the follow up on this. Yes, you're experiencing the joys of datatyping based on file type. I'm glad to hear that you were able to resolve this.

0 Kudos
Highlighted
DataRobot Employee
DataRobot Employee

My worry with Excel is you might not realize one row contains a space in a numeric field, or various other interesting things that may be hidden from your view when visually scanning the file or saving it as an xlsx.  One of the larger risks you run is in date formats; especially if you got some data, Excel "helped" display it in a different format (and saved it), and then your production scoring job is pulling database data in another format.  Working with csv you will know exactly what is in your raw data however.  (In particular UTF-8 encoding, at that.)

0 Kudos