Data Quality

Data Quality Assessment saves you time when dealing with data issues and reduces the risk of missing problems before you start modeling.

Figure 1. Data Quality

DataRobot can identify a number of data quality issues such as target leakage, outliers, missing values, and inconsistent gaps in time for time series projects, as well as missing images and broken links for visual AI projects. By surfacing excess zeros, leading zeros, and trailing zeros, DataRobot gives you better insight into values that may be disguised as missing.
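To make a few of these checks concrete, here is a minimal sketch in plain Python of how missing values, excess zeros, and outliers in a single numeric column might be flagged. It is not DataRobot's implementation: the helper name `summarize_column`, the 1.5 × IQR outlier rule, and the 50% zero-share threshold are all assumptions chosen for illustration.

```python
import statistics


def summarize_column(values, zero_share_threshold=0.5):
    """Hypothetical helper: flag simple quality issues in one numeric column.

    `values` is a list of numbers, with None marking a missing entry.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    zeros = sum(1 for v in present if v == 0)

    # Assumed outlier rule: anything beyond 1.5 * IQR from the quartiles.
    outliers = []
    if len(present) >= 4:
        q1, _, q3 = statistics.quantiles(present, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = [v for v in present if v < low or v > high]

    return {
        "missing": missing,
        "zeros": zeros,
        # A large share of zeros can mean zero is standing in for "missing".
        "excess_zeros": bool(present) and zeros / len(present) > zero_share_threshold,
        "outliers": outliers,
    }


# Illustrative data only; None marks a missing entry.
report = summarize_column([0, 0, 0, 12, None, 15, 0, 14, 900, None])
print(report)
```

A column that trips one of these flags is a candidate for closer inspection, not automatically a problem. For example, a count feature can legitimately contain many zeros.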

If you click View info on the Data Quality Assessment box, you will see a summary of what this process surfaced.

Figure 2. Quality Summary

You can look at any feature flagged by the Data Quality Assessment process in more detail by selecting it from the Project Data table and examining the histogram.

Figure 3. Histogram

Many data quality issues are handled at the blueprint level. From the Leaderboard (Models page), you can investigate how each model handles these issues. Select a model and click Describe > Data Quality Handling Report. You’ll see exactly what DataRobot did to handle those cases automatically in the current blueprint.

Figure 4. Quality Report

For example, in Figure 4 you can see that there were some missing values for mths_since_last_record. The report log explains that this value was imputed, and shows the imputed value.
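As a rough illustration of what imputation means here, the sketch below fills missing entries in a column with the median of the observed values and returns the fill value alongside the result. The article does not say which imputation rule the blueprint applied, so median imputation, the helper name `impute_median`, and the sample numbers are all assumptions.

```python
import statistics


def impute_median(values):
    """Hypothetical helper: replace None entries with the column median.

    Returns the filled column and the value used to fill it.
    """
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    fill = statistics.median(present)
    filled = [fill if v is None else v for v in values]
    return filled, fill


# Made-up values standing in for a column such as mths_since_last_record.
months = [24, None, 60, 36, None]
filled, imputed_value = impute_median(months)
print(filled, imputed_value)
```

Returning the fill value with the filled column mirrors the idea of a report that records both the fact of imputation and the value used.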

Thank you for reading. If you have any questions, please post them below.

More Information

If you’re a licensed DataRobot customer, search the in-app Platform Documentation for Data quality assessment.

Revision 6 of 6. Last update: 07-30-2020 03:14 PM