I recently trained a classification model on the platform and am seeing a difference in data distribution between the training and scoring data.
Please see the screenshot.
But when I check my actual data for the COUNTRY variable, I do not see any new values. In the screenshot, the far right end shows NEW VALUES in the scoring set.
But here are the results from the actual data.
The values are PH, US, Other, GB, PK, IN, CN.
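For anyone wanting to repeat this check outside the platform, here is a minimal sketch of how the comparison could be done in plain Python. It does not reflect how the platform computes drift; the function name `unseen_values` and the sample scoring values are made up for illustration. It compares raw strings on purpose, so a value such as "US " with trailing whitespace would show up as new.

```python
from typing import Iterable, Optional, Set


def unseen_values(
    train: Iterable[Optional[str]], scoring: Iterable[Optional[str]]
) -> Set[str]:
    """Return values that appear in scoring but never in train.

    Missing values (None) are ignored. Strings are compared exactly,
    so differences in case or whitespace count as different values.
    """
    train_values = {v for v in train if v is not None}
    return {v for v in scoring if v is not None and v not in train_values}


# COUNTRY values from the post; the scoring values are hypothetical.
train_country = ["PH", "US", "Other", "GB", "PK", "IN", "CN"]
scoring_country = ["US", "IN", "GB", "PH", None]

new_values = unseen_values(train_country, scoring_country)

# Print with repr so stray whitespace or odd characters are visible.
print(sorted(repr(v) for v in new_values))
```

If this returns an empty set on the real data while the drift chart still reports new values, the difference probably comes from how the baseline is built rather than from the data itself.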
It sounds like sampling is not the explanation, then. At this point I would recommend filing a support ticket so that our team can do a more detailed investigation.
I appreciate your response. My final model (the one being used for predictions) is trained on the entire training data, consisting of 80K rows. Please see the screenshot. I don't think the size of the training data is the issue here, because the model is exposed to the entire training data. Let me know if more information is needed. It is crucial for me to know why this drift is occurring.
How large was your training dataset? I know that the training baseline only consists of a sample of the training data, but it's fairly large (about 500 MB, I think), so if your training dataset is small, the sample would encompass the entire thing.