What is an inlier?

The Data Quality Assessment mentions that I have no inliers. I have a few questions on inliers:

  • What exactly is an inlier?
  • What can be the danger of having them in my dataset? 
  • Are they something I can identify with my own eyes?

Accepted Solutions

  • What exactly is an inlier?
    • An inlier is a data value that lies in the interior of a statistical distribution and is in error.
    • Put another way, it is an observation that lies within the general distribution of the other observed values and generally does not perturb the results, but is nevertheless non-conforming and unusual.
    • Inliers are difficult to distinguish from good data values.
    • Example: a value in a record reported in the wrong units, say degrees Fahrenheit instead of degrees Celsius.
  • What can be the danger of having them in my dataset?
    • They will not generally affect the statistical results.
    • Identifying inliers can sometimes signal an incorrect measurement, and is thus useful for improving data quality.
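To illustrate why a wrong-units value is hard to spot, here is a minimal sketch in Python. The temperature readings are made up for illustration, and the z-score check with a cutoff of 3 is just one common outlier screen chosen for the example; it is not taken from the Data Quality Assessment.

```python
from statistics import mean, stdev

# Hypothetical daily temperatures, all correctly recorded in degrees Celsius.
celsius = [12.0, 14.5, 9.0, 17.0, 21.5, 19.0, 15.5, 11.0, 23.0, 18.0]

# One more reading was really -5 degrees Celsius but was entered in Fahrenheit.
true_celsius = -5.0
recorded = true_celsius * 9 / 5 + 32  # 23.0, which looks like a normal Celsius value


def z_score(x, sample):
    """Distance of x from the sample mean, in sample standard deviations."""
    return (x - mean(sample)) / stdev(sample)


z_recorded = z_score(recorded, celsius)

# Assumed outlier screen: flag values with |z| >= 3.
flagged_as_outlier = abs(z_recorded) >= 3
print(f"recorded={recorded}, z={z_recorded:.2f}, flagged={flagged_as_outlier}")
```

The wrongly entered reading sits well inside the cutoff, so an outlier check alone would not catch it, and it is unlikely to stand out by eye either.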

 

 


4 Replies

There is a helpful explanation in the online help here: https://app2.datarobot.com/docs/data/data-mgmt/data-analysis/data-quality.html#inliers

 

Inliers

Inliers are values that are consistent with the bulk of the data, but wrong for a particular row (for example, a car rental company using a local zip code for an international customer). If not handled, they could negatively affect model performance.

How they are detected: For each value recorded for a feature, DataRobot computes the value's frequency for that feature and makes an array of the results. Inlier candidates are the outliers in that array. To reduce false positives, DataRobot then applies another condition, keeping as inliers only those values for which:

frequency > 50 * (number of non-missing rows in the feature) / (number of unique non-missing values in the feature)

The algorithm allows inlier detection in numeric features with many unique values where, due to the number of values, inliers wouldn’t be noticeable in a histogram plot. Note that this is a conservative approach for features with a smaller number of unique values. Additionally, it does not detect inliers in features with fewer than 50 unique values.
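The rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not DataRobot's implementation: the documentation does not say how outliers in the frequency array are identified, so the sketch assumes Tukey's upper fence (Q3 + 1.5 × IQR). The function name `find_inlier_values` and the choice to treat `None` as missing are likewise made up for the example.

```python
from collections import Counter
from statistics import quantiles


def find_inlier_values(values, factor=50):
    """Return the set of values flagged by the frequency-based inlier rule."""
    present = [v for v in values if v is not None]  # assumption: None means missing
    counts = Counter(present)                        # frequency of each value
    n_rows, n_unique = len(present), len(counts)
    if n_unique < 2:
        return set()  # too few distinct values to compute quartiles

    # Assumption: "outliers in the frequency array" means frequencies above
    # Tukey's upper fence. The docs do not name the outlier test used.
    q1, _, q3 = quantiles(counts.values(), n=4)
    fence = q3 + 1.5 * (q3 - q1)

    # Second condition from the docs, applied to reduce false positives:
    # frequency > 50 * (non-missing rows) / (unique non-missing values)
    threshold = factor * n_rows / n_unique

    return {v for v, c in counts.items() if c > fence and c > threshold}


# Hypothetical feature: 1000 distinct readings plus a placeholder 0 repeated 200 times.
feature = list(range(1, 1001)) + [0] * 200
print(find_inlier_values(feature))
```

Because a value's frequency can never exceed the number of non-missing rows, the second condition cannot be met when a feature has 50 or fewer unique values. That is consistent with the note above about features with fewer than 50 unique values.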

How they are handled: A binary column is automatically added inside of a blueprint to flag rows with inliers. This allows the model to incorporate possible patterns behind abnormal values. No additional user action is required.
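As a rough picture of what such a flag column looks like, here is a minimal Python sketch. It is not DataRobot's blueprint code: the function name `add_inlier_flag`, the `_is_inlier` column suffix, and the sample rows are all hypothetical.

```python
def add_inlier_flag(rows, feature, inlier_values):
    """Return copies of the rows with an extra 0/1 column marking inlier values."""
    flag = f"{feature}_is_inlier"  # hypothetical column name
    return [
        {**row, flag: int(row.get(feature) in inlier_values)}
        for row in rows
    ]


# Hypothetical rental records; "02110" stands in for a local zip code
# that was also entered for international customers.
rows = [
    {"customer": "A", "zip": "02110"},
    {"customer": "B", "zip": "60614"},
]
flagged = add_inlier_flag(rows, "zip", {"02110"})
```

The original rows are left unchanged, and a model trained on the flagged rows can use the extra column to pick up any pattern behind the suspicious values.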

Thanks for the detailed response @MR ! Makes sense. Great point that they generally wouldn't impact statistical results (if there are only a few, I'm assuming). Thanks for the references too!

@tobysimpson You linked to the docs as accessed from the AI Cloud Platform Trial. This information is also available in the public documentation at https://docs.datarobot.com/en/docs/data/analyze-data/data-quality.html#inliers.

Again, thank you for your reply, tobysimpson!