Hello,
Both my training and scoring datasets contain null values for some numeric variables; the files have a ? character where data is missing. I see that DataRobot recognizes this and imputes an arbitrary value (e.g. -9999). DataRobot calculates predictions on the scoring dataset with missing values (?) without any problem when I do it through the Predict -> Make Predictions option.
But when I deploy the model and try to pass the same scoring dataset through the API, I get this error message:
400 Error: {"message":"Column P1_XXXX was expected to be numeric"}
I tried replacing ? with null and with a blank, but I still get the same error. The dataset is UTF-8 comma delimited. All three of these combinations trigger the error message:
,?,?,
,null,null,
,,,
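For reference, one way to normalize these placeholder tokens to truly empty fields before sending the file to the prediction API. This is a minimal sketch using Python's standard csv module; the token list and sample data are illustrative, not DataRobot-specific:

```python
import csv
import io

# Tokens that should be treated as "missing" in numeric columns (assumed set)
MISSING_TOKENS = {"?", "null", "NULL", "NA"}

def blank_missing(csv_text: str) -> str:
    """Replace placeholder tokens with empty fields so that numeric
    columns contain only numbers or blanks."""
    out = io.StringIO()
    reader = csv.reader(io.StringIO(csv_text))
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(
            "" if cell.strip() in MISSING_TOKENS else cell for cell in row
        )
    return out.getvalue()

raw = "P1_A,P1_B\n1.5,?\nnull,2.0\n"
print(blank_missing(raw))
# P1_A,P1_B
# 1.5,
# ,2.0
```

Using the csv module (rather than a plain string replace) avoids corrupting quoted fields that legitimately contain a ? character.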
Any help would be appreciated.
Thanks
Thanks for that and the reference, but it raises a couple of questions. Why impute with the median if the data is missing? If the data is missing, then guessing the median is hardly a universally good idea. It also says that it imputes with -9999, but seems to assume that value is smaller than all the other data. What if it is not? I have at times used 123456789 for this purpose, as being more identifiable and further away from the rest of the data. Thoughts?
It just depends on how much data you have. If you have a large sample, then using the median makes sense: you know the data's characteristics well enough to say the median is a good representative of your unseen data. One may argue for the mean, but the mean is very sensitive to outliers.
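The median-fill strategy being discussed can be sketched in a few lines (toy values, with None standing in for the missing entries):

```python
from statistics import median

# Toy numeric column with missing entries represented as None
values = [3.2, None, 5.1, 4.0, None]

# Compute the median over the observed values only, then fill the gaps
observed = [v for v in values if v is not None]
fill = median(observed)  # median of [3.2, 5.1, 4.0] is 4.0
imputed = [fill if v is None else v for v in values]
print(imputed)
# [3.2, 4.0, 5.1, 4.0, 4.0]
```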
This hits some points I have been trying to clarify in my own mind.
Thanks.
I suspect we are coming from different directions. My position is that substituting any value for missing values makes potentially self-supporting but unjustified assumptions about the data.
Effectively, one is pre-modelling the data with a simple model (a constant) and injecting that into the DataRobot modelling process.
One could argue that this is a form of boosting, of course.
But surely it would be more correct for DataRobot to acknowledge that it cannot use information that is missing?
And if modelling the data is a good idea, then why not use some more sophisticated but still simple pre-modelling based on conditional probabilities to guess a stochastic functional dependency on some of the other data? For example, one could look for correlation between the data fields and then guess the missing data based on that correlation.
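The correlation-based idea above can be sketched as a simple least-squares fit on the complete rows, used to predict the missing values. This is a toy illustration with made-up data, not how DataRobot itself imputes:

```python
# Two correlated fields; ys has one missing entry (None)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, None, 8.0]

# Fit y = slope * x + intercept on the rows where both fields are present
pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
n = len(pairs)
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
slope = (sum((x - mx) * (y - my) for x, y in pairs)
         / sum((x - mx) ** 2 for x, _ in pairs))
intercept = my - slope * mx

# Use the fitted line to guess the missing value
filled = [slope * x + intercept if y is None else y
          for x, y in zip(xs, ys)]
print(filled)  # the gap at x=3.0 is filled with 6.0 for this toy data
```

A real pipeline would first check that the correlation is strong enough to justify this, which is exactly the caveat raised above.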
What if the data is log-normal rather than normal? Then the median seems rather less relevant. Is there any justification for the median beyond contemporary data-science standard practice?
An example, pretty close to my current situation, that comes to mind: what if one column is a numerical value and another is its units, for example measurements in feet or meters? Replacing a missing feet value with any measure of central tendency that includes meter values would likely guess too low, as a physical matter of fact.
I see this can be an issue. The feature should be consistent in its content. For instance, you can convert all the values to a single unit, while still keeping a feature describing which unit each was recorded in.
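The convert-then-keep-the-unit suggestion might look like this (a sketch; the field names and the feet/meters example follow the discussion above):

```python
# Normalize all measurements to feet, keeping the original unit as a feature.
# 1 meter = 3.28084 feet (standard conversion factor).
M_TO_FT = 3.28084

rows = [
    {"value": 10.0, "unit": "ft"},
    {"value": 3.0,  "unit": "m"},
]

for row in rows:
    # "value_ft" becomes the consistent numeric feature for modelling;
    # "unit" survives as a categorical feature in its own right.
    row["value_ft"] = (row["value"] * M_TO_FT
                       if row["unit"] == "m" else row["value"])
print([row["value_ft"] for row in rows])
```

With the column in one unit, any central-tendency imputation at least operates on commensurable numbers.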
What if the data is log-normal rather than normal? Then the median seems rather less relevant. Is there any justification for the median beyond contemporary data-science standard practice?
In non-parametric statistics, when one doesn't know or care about the data distribution, the median is the statistic of choice. In many cases the median is a better choice than the mean: it is more robust against outliers.
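The robustness point is easy to see with a toy sample containing one extreme value:

```python
from statistics import mean, median

sample = [1, 2, 3, 4, 100]  # one extreme outlier

# The mean is dragged far toward the outlier; the median barely notices it
print(mean(sample))    # 22
print(median(sample))  # 3
```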