Data Quality Assessment saves you time when dealing with data issues and reduces the risk of missing problems before you start modeling.
Figure 1. Data Quality
DataRobot can identify a number of data quality issues, such as target leakage, outliers, missing values, and inconsistent gaps in time for time series projects, as well as missing images and broken links for Visual AI projects. By surfacing excess zeros, leading zeros, and trailing zeros, DataRobot gives you better insight into missing values that may be disguised as zeros.
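To make a few of these checks concrete, here is a minimal sketch in plain Python of how missing values, excess zeros, and outliers could be counted for a single numeric feature. It is not DataRobot's implementation: the function name quality_summary, the 50% zero-share threshold, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration.

```python
from statistics import quantiles


def quality_summary(values, zero_share=0.5, iqr_factor=1.5):
    """Summarize one numeric column given as a list where None marks a missing value.

    zero_share and iqr_factor are illustrative thresholds, not DataRobot settings.
    """
    present = [v for v in values if v is not None]
    zeros = sum(1 for v in present if v == 0)

    outliers = 0
    if len(present) >= 2:
        # Quartiles of the observed values; points far outside the
        # interquartile range are counted as outliers.
        q1, _, q3 = quantiles(present, n=4, method="inclusive")
        spread = q3 - q1
        low = q1 - iqr_factor * spread
        high = q3 + iqr_factor * spread
        outliers = sum(1 for v in present if v < low or v > high)

    return {
        "missing": len(values) - len(present),
        "zeros": zeros,
        # A large share of zeros can hint that 0 is standing in for "unknown".
        "excess_zeros": bool(present) and zeros / len(present) > zero_share,
        "outliers": outliers,
    }
```

Calling quality_summary on a column in which most observed values are 0 sets excess_zeros to True, which is the kind of signal that suggests zeros may be standing in for missing data.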
Click View info on the Data Quality Assessment box to see a summary of what the assessment surfaced.
Figure 2. Quality Summary
You can look at any feature flagged by the Data Quality Assessment process in more detail by selecting it from the Project Data table and examining the histogram.
Figure 3. Histogram
Many issues with data quality are handled at the blueprint level. From the Leaderboard (Models page) you can investigate how these issues are handled in each model built. Select one of the models and click Describe > Data Quality Handling Report. You’ll see exactly what DataRobot did to automatically handle those cases for you in the current blueprint.
Figure 4. Quality Report
For example, in Figure 4 you can see that there were some missing values for mths_since_last_record. The report log explains that this value was imputed, and shows the imputed value.
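As a rough illustration of what imputing a missing value means, the sketch below fills gaps in a numeric feature with the median of its observed values and records which rows were filled. The median strategy, the function name impute_with_median, and the sample values are assumptions for illustration only. The report itself shows the value a given blueprint actually used.

```python
from statistics import median


def impute_with_median(values):
    """Replace None with the median of the observed values.

    Returns the filled column, a per-row flag marking imputed entries,
    and the fill value itself (the analogue of the value a report would log).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")

    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing, fill


# Hypothetical sample values for a feature such as mths_since_last_record.
filled, flags, fill_value = impute_with_median([10, None, 30, 50, None])
```

Keeping the was_missing flags alongside the filled column preserves the fact that a value was absent, which can itself carry signal for a model.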
If you’re a licensed DataRobot customer, search the in-app Platform Documentation for Data quality assessment.