Just as DataRobot might introduce its own definitions to stay within the scope of the DataRobot platform, how does DataRobot define, in its own way, the difference between an Anomaly and an Outlier?
I ask because different algorithms are used to detect each one. Aren't the two concepts one and the same?
Hey @jas0n !
Thank you for the awesome answer! This distinction immediately reminded me of something that is not the same but awfully close: the difference between how Feature Importance and Feature Impact are interpreted.
One looks at individual features and the other looks at the feature in the context of other features under the eye of the model.
Having said that, again thank you for the explanation!
Hi DREnthusiast,
Excellent questions!
To your point, Data Science terminology is inconsistent and constantly evolving, and there may be some overlap. In this case I think I can provide at least somewhat of a distinction, however.
We can think of an outlier as a value of a feature significantly outside of the normal range of that feature (in other words, it's a univariate concept). It's something we worry about at training time, where you can see if DR detected potential outliers in its Quality Assessment, and also at prediction time when we monitor Data Drift in MLOps.
Anomalies are definitely conceptually related, but a major difference is that an anomaly applies to an entire row of data across multiple features, rather than to a single value of one feature. Further, a row of data may be considered anomalous even if none of its individual feature values is an outlier, because it may contain a combination of feature values never seen before and quite different from the combinations seen during training.
For example, if a feature is "Number of medications prescribed" and has a range of 0 - 20 with a mean of 2.3 but a new record contains a value of 999, then that 999 is an outlier. On the other hand, suppose that in the same data set we have a "Patient's age" feature, ranging 0 - 110 with a mean of 27. If a new record (row of data) has a value of 0 for "Number of medications prescribed" and 110 for "Patient's age", then it technically would not contain outliers, but it may be anomalous if very elderly patients are almost always prescribed some medications.
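To make the example above a bit more concrete, here is a rough Python sketch. It is only an illustration, not how DataRobot detects outliers or anomalies internally. The training data is synthetic and made up (medication counts that rise steadily with age), the z-score rule and the Mahalanobis distance are stand-ins I picked for a univariate and a multivariate check, and the cutoff of 3.0 and the function names are arbitrary assumptions.

```python
import math
import statistics

# Synthetic training data, invented for illustration only:
# ages 0..110, with the medication count rising steadily with age.
ages = list(range(0, 111))
meds = [age // 10 for age in ages]


def z_score(value, column):
    """Number of sample standard deviations between value and the column mean."""
    return (value - statistics.mean(column)) / statistics.stdev(column)


def is_outlier(value, column, threshold=3.0):
    """Univariate check: looks at a single feature in isolation."""
    return abs(z_score(value, column)) > threshold


def mahalanobis(x, y, xs, ys):
    """Distance of the row (x, y) from the training data, taking the
    correlation between the two features into account."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((a - mx) ** 2 for a in xs) / (n - 1)
    syy = sum((b - my) ** 2 for b in ys) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = x - mx, y - my
    # Quadratic form using the closed-form inverse of the 2x2 matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


def is_anomalous(x, y, xs, ys, threshold=3.0):
    """Multivariate check: looks at the whole row (both features together)."""
    return mahalanobis(x, y, xs, ys) > threshold


# A medication count of 999 is checked on its own feature only.
print("999 medications is an outlier:", is_outlier(999, meds))

# Age 110 with 0 medications: each value is checked alone, then as a row.
print("age 110 is an outlier:", is_outlier(110, ages))
print("0 medications is an outlier:", is_outlier(0, meds))
print("row (110, 0) is anomalous:", is_anomalous(110, 0, ages, meds))
```

With this made-up data, 999 gets flagged by the single-feature check, while 110 and 0 each pass it on their own. The row (110, 0) is only flagged once both features are considered together, because it breaks the age/medication relationship seen in training.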
I hope this helps!