(Updated October 2020/Release 6.2)
DataRobot automatically performs exploratory data analysis (EDA) on every project. This article covers how the platform carries out EDA.
Once you upload your data, you can scroll down to see the features from your dataset.
Figure 1. Features
This dataset comes from a hospital readmission use case, where the goal is to predict the likelihood that a patient is readmitted to the hospital. Here the feature readmitted is the target, and the other features are used to predict it. You can see that DataRobot has identified each feature's type (Var Type) and computed summary statistics for it (Unique, Missing, Mean, etc.).
Figure 2. Stats
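To make the summary columns concrete, here is a minimal sketch of how statistics such as Unique, Missing, and Mean could be computed for a single feature. It is an illustration only, not DataRobot's implementation; the function name summarize, the use of None to mark missing values, and the sample column are all assumptions.

```python
import statistics

def summarize(values):
    """Return simple per-feature summary statistics.

    `values` is a list in which None marks a missing entry (an assumption
    made for this sketch).
    """
    present = [v for v in values if v is not None]
    summary = {
        "unique": len(set(present)),
        "missing": len(values) - len(present),
    }
    # Mean and standard deviation only make sense for numeric features.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["mean"] = statistics.mean(present)
        summary["std"] = statistics.stdev(present) if len(present) > 1 else 0.0
    return summary

# Hypothetical column: days spent in the hospital, with one missing value.
time_in_hospital = [3, 5, None, 3, 8, 1]
stats = summarize(time_in_hospital)
print(stats)
```

A categorical column passed to the same function would report only the unique and missing counts, which mirrors why some statistics are blank for non-numeric features.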
You can click on the different features to learn more about them. For example, when you click on a numeric feature, a Histogram will drop down. Figure 3 shows a histogram for the number of days spent in the hospital.
Figure 3. Histogram
Notice the histogram is binned. In the bottom left corner of the histogram, you can change the number of bins. If you click on Frequent Values for a numeric feature you will see the same information, unbinned. You can also view this data in a Table format.
Figure 4. Frequent Values
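The difference between the binned histogram and the unbinned Frequent Values view can be sketched in a few lines. This is a hedged illustration using equal-width bins; the helper binned_counts and the sample values are hypothetical, and the platform's actual binning strategy may differ.

```python
from collections import Counter

def binned_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1) if width else 0
        counts[index] += 1
    return counts

# Hypothetical values for the number of days spent in the hospital.
days = [1, 1, 2, 3, 3, 3, 4, 6, 8, 14]

# Binned view: changing n_bins changes the shape of the histogram.
histogram = binned_counts(days, n_bins=3)

# Unbinned view: exact values ranked by how often they occur.
frequent_values = Counter(days).most_common(3)

print(histogram)
print(frequent_values)
```

Both views count the same rows; the histogram groups nearby values together, while the frequent-values view keeps each distinct value separate.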
DataRobot is good at determining feature types, but you should always use domain expertise to make sure the results are correct. Sometimes a numeric feature isn’t really a meaningful number. For example, flight codes are numeric but have no inherent numeric value. In that case, you would want to use Var Type Transform to change the feature from numeric to categorical.
Figure 5. Transformation
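The following sketch shows why such a transformation matters, using made-up flight codes. It only illustrates the idea of reinterpreting numbers as labels; it does not reproduce what Var Type Transform does internally.

```python
from collections import Counter
import statistics

# Hypothetical flight codes that happen to be stored as integers.
flight_codes = [101, 2045, 101, 88, 2045, 101]

# Treated as numbers, the codes yield a statistic with no real-world meaning.
meaningless_mean = statistics.mean(flight_codes)

# Treated as categories, each code is a label, and counts are the useful summary.
as_categorical = [str(code) for code in flight_codes]
category_counts = Counter(as_categorical)

print(meaningless_mean)
print(category_counts)
```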
You can also see that some features have grayed-out text next to their names. This text can say things like “reference id,” “few values,” or “duplicate.” The gray text alerts you to inferences the platform has made about the informative value of that feature.
Figure 6. Informative Value/Gray Text
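As a rough illustration of what such labels might mean, the sketch below applies three simple rules to a small table. The rules are assumptions chosen for this example (all values distinct, at most one distinct value, identical to an earlier column); the criteria DataRobot actually uses are not described in this article and may be different.

```python
def flag_feature(name, columns):
    """Return hypothetical low-information flags for one column.

    `columns` maps feature names to equal-length lists of values.
    """
    values = columns[name]
    distinct = set(values)
    flags = []
    if len(distinct) == len(values):
        flags.append("reference id")  # every row has its own value
    if len(distinct) <= 1:
        flags.append("few values")    # a constant column carries no signal
    # Compare against columns that appear earlier in the table.
    for other in columns:
        if other == name:
            break
        if columns[other] == values:
            flags.append("duplicate")
            break
    return flags

# Hypothetical dataset with four rows.
columns = {
    "patient_id": [101, 102, 103, 104],
    "gender": ["F", "M", "F", "M"],
    "sex": ["F", "M", "F", "M"],
    "country": ["US", "US", "US", "US"],
}
flags = {name: flag_feature(name, columns) for name in columns}
print(flags)
```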
In addition to assessing the informative value of features, DataRobot also runs a Data Quality Assessment, both before and after you press Start:
- Before you press Start, the assessment will look for things like outliers, inliers, disguised missing values, and zero inflation.
- After you press Start, DataRobot will add Target Leakage detection to the assessment.
Figure 7. Data Quality
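Some of the checks listed above can be approximated with simple heuristics. The sketch below assumes an interquartile-range rule for outliers, a fixed list of sentinel codes for disguised missing values, and the share of zeros for zero inflation. These rules, thresholds, and function names are illustrative assumptions, not the platform's algorithms, and inlier and target leakage detection are not covered.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zero_share(values):
    """Fraction of rows equal to zero, a rough signal of zero inflation."""
    return sum(1 for v in values if v == 0) / len(values)

def disguised_missing(values, sentinels=(-999, -1, 9999)):
    """Sentinel codes (hypothetical list) that may stand in for missing data."""
    return sorted({v for v in values if v in sentinels})

# Hypothetical numeric feature with zeros, one extreme value, and a sentinel.
lab_result = [0, 0, 0, 0, 12, 15, 11, 14, 13, 250, -999]

outliers = iqr_outliers(lab_result)
zeros = zero_share(lab_result)
sentinel_hits = disguised_missing(lab_result)

print(outliers, round(zeros, 2), sentinel_hits)
```

Note that a sentinel such as -999 can also trip the outlier rule, which is one reason the two checks are worth reporting separately.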
After you press Start, DataRobot runs another round of EDA. This includes the target leakage detection just discussed, as well as detection of non-linear correlation between each feature and the target. The strength of that relationship is shown on the Data tab as green bars in the Importance column.
Figure 8. Importance
If you click on a feature after this second round of EDA is complete, you will also see the relationship between the feature and the target, indicated by an orange line. This graph has a dual axis. The left Y-axis (blue) represents the number of rows in each bin. The right Y-axis (orange) represents the percentage of rows in each bin that have a positive target value (here, patients who were readmitted). Importantly, this is computed before any predictive modeling takes place and is not itself a predictive model.
Figure 9. Histogram II
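The two axes of this chart correspond to two numbers per bin: a row count and a target rate. A minimal sketch, assuming a binary 0/1 target and hand-picked bin edges (the function name, data, and edges are hypothetical):

```python
def target_rate_by_bin(feature, target, edges):
    """For each bin [edges[i], edges[i+1]) return (row_count, positive_rate)."""
    result = []
    for lo, hi in zip(edges, edges[1:]):
        rows = [t for x, t in zip(feature, target) if lo <= x < hi]
        # An empty bin has no rate; report None rather than dividing by zero.
        rate = sum(rows) / len(rows) if rows else None
        result.append((len(rows), rate))
    return result

# Hypothetical data: days in hospital and whether the patient was readmitted.
days = [1, 2, 2, 3, 5, 6, 7, 9, 10, 12]
readmitted = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1]

bins = target_rate_by_bin(days, readmitted, edges=[0, 4, 8, 13])
for count, rate in bins:
    print(count, rate)
```

The counts would be drawn as the blue bars and the rates as the orange line. Nothing here is fitted or predicted; it is a plain aggregation of the raw data.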
You can also find a Feature Association matrix on the Data > Feature Associations tab. This matrix shows you the relationships among your features. Here you can quickly see the top ten associations and the number of clusters present in your data. Associations are calculated using the Mutual Information metric, but you can switch to Cramér's V.
Figure 10. Feature Association Matrix
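For readers unfamiliar with the two metrics, the sketch below computes both for a pair of categorical features using their textbook definitions. It is not DataRobot's implementation: it handles only categorical pairs, applies no normalization or bias correction, and the function names are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two categorical sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def cramers_v(xs, ys):
    """Cramér's V (no bias correction) from the chi-squared statistic."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    chi2 = 0.0
    for x in px:
        for y in py:
            expected = px[x] * py[y] / n
            chi2 += (pxy[(x, y)] - expected) ** 2 / expected
    k = min(len(px), len(py)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Two hypothetical features that always move together.
a = ["a", "a", "b", "b"]
linked = ["x", "x", "y", "y"]
# A feature that is independent of `a` in this sample.
unrelated = ["x", "y", "x", "y"]

print(mutual_information(a, linked), cramers_v(a, linked))
print(mutual_information(a, unrelated), cramers_v(a, unrelated))
```

Cramér's V is bounded between 0 and 1, whereas raw mutual information is not, which is one practical difference to keep in mind when comparing values from the two metrics.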
More Information
If you’re a licensed DataRobot customer, search the in-app Platform Documentation for “Overview and EDA” and “Time series modeling.”