The Data Quality Assessment mentions that I have no inliers. I have a few questions on inliers:
@tobysimpson The page you linked is from the docs you access within the AI Cloud Platform Trial. The same information is also available in the public documentation at https://docs.datarobot.com/en/docs/data/analyze-data/data-quality.html#inliers.
Again, thank you for your reply tobysimpson!
Thanks for the detailed response @MR ! Makes sense. Great point that they generally wouldn't impact statistical results (if there are only a few, I'm assuming). Thanks for the references too!
I found a helpful explanation in the online help here: https://app2.datarobot.com/docs/data/data-mgmt/data-analysis/data-quality.html#inliers
Inliers are values that are consistent with the bulk of the data, but wrong for a particular row (for example, a car rental company using a local zip code for an international customer). If not handled, they could negatively affect model performance.
How they are detected: For each value recorded for a feature, DataRobot computes the value's frequency for that feature and makes an array of the results. Inlier candidates are the outliers in that array. To reduce false positives, DataRobot then applies another condition, keeping as inliers only those values for which:
frequency > 50 * (number of non-missing rows in the feature) / (number of unique non-missing values in the feature)
The algorithm allows inlier detection in numeric features with many unique values where, due to the number of values, inliers wouldn’t be noticeable in a histogram plot. Note that this is a conservative approach for features with a smaller number of unique values. Additionally, it does not detect inliers in features with fewer than 50 unique values.
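To make the detection steps above concrete, here is a rough Python sketch of how I read them. It is not DataRobot's code. The docs don't say how "outliers in that array" are found, so the Q3 + 1.5 * IQR fence below is my own assumption, and the function name `find_inlier_values` and its parameters are made up for illustration.

```python
import math
from collections import Counter
from statistics import quantiles


def find_inlier_values(values, factor=50, min_unique=50):
    """Return the set of values in one feature that look like inliers."""
    # Drop missing entries (None or NaN) before counting.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    counts = Counter(present)  # value -> frequency
    n_rows, n_unique = len(present), len(counts)

    # Per the docs, features with fewer than 50 unique values are skipped.
    if n_unique < min_unique:
        return set()

    # ASSUMPTION: an "outlier" in the frequency array is a frequency
    # above Q3 + 1.5 * IQR. The docs do not name the actual rule.
    q1, _, q3 = quantiles(list(counts.values()), n=4)
    upper_fence = q3 + 1.5 * (q3 - q1)

    # Extra condition from the docs, to reduce false positives:
    # frequency > 50 * (non-missing rows) / (unique non-missing values)
    threshold = factor * n_rows / n_unique

    return {
        value for value, freq in counts.items()
        if freq > upper_fence and freq > threshold
    }
```

Since (non-missing rows) / (unique values) is just the average frequency, the second condition amounts to "this value shows up more than 50 times as often as a typical value", which is why only a heavily repeated value in a high-cardinality feature gets flagged.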
How they are handled: A binary column is automatically added inside of a blueprint to flag rows with inliers. This allows the model to incorporate possible patterns behind abnormal values. No additional user action is required.
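For intuition only, the flag column described above could look something like the sketch below. The real column is added automatically inside the blueprint, so you don't need to write this yourself; `add_inlier_flag` is a hypothetical helper name, and the set of inlier values is assumed to come from a detection step like the one the docs describe.

```python
def add_inlier_flag(values, inlier_values):
    """Return a 0/1 list marking rows whose value was flagged as an inlier."""
    # inlier_values is assumed to be the set of flagged values for
    # this feature; every other row (including missing ones) gets 0.
    return [1 if v in inlier_values else 0 for v in values]


# Hypothetical example: a default zip code reused for many customers.
zip_codes = [10001, 90210, 90210, None, 60601]
zip_is_inlier = add_inlier_flag(zip_codes, {90210})
```

The original feature stays as it is; the extra binary column just gives the model a way to learn whether the flagged rows behave differently.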