Exploratory Data Analysis (EDA)

(Updated October 2020/Release 6.2)

DataRobot automatically conducts a variety of exploratory data analyses (EDA) for all of your projects. This article will cover how the DataRobot platform accomplishes EDA.

Once you upload your data, you can scroll down to see the features from your dataset. 

Figure 1. Features

This dataset comes from a hospital readmission use case, where the goal is to predict the likelihood that a patient is readmitted to the hospital. Here the feature readmitted is the target, and the other features are used to predict it. DataRobot has identified the feature types (Var Type) and provides summary statistics for each feature (Unique, Missing, Mean, etc.).

Figure 2. Stats
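
To make the summary columns concrete, here is a minimal sketch in plain Python of how Unique, Missing, and Mean could be computed for a single column. It is an illustration only, not DataRobot's implementation; the `summarize` helper and the sample values are hypothetical.

```python
from statistics import mean

def summarize(values):
    """Return unique count, missing count, and mean (numeric columns only)."""
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    return {
        "unique": len(set(present)),
        "missing": len(values) - len(present),
        "mean": mean(present) if numeric and present else None,
    }

# Hypothetical sample column; None marks a missing value.
time_in_hospital = [1, 3, 3, None, 7]
print(summarize(time_in_hospital))
```

Missing values are excluded from both the unique count and the mean in this sketch; a real tool may make different choices.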

You can click on the different features to learn more about them. For example, when you click on a numeric feature, a Histogram will drop down. Figure 3 shows a histogram for the number of days spent in the hospital.  

Figure 3. Histogram


Notice the histogram is binned. In the bottom left corner of the histogram, you can change the number of bins. If you click on Frequent Values for a numeric feature you will see the same information, unbinned. You can also view this data in a Table format. 

Figure 4. Frequent Values
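
The difference between the binned histogram and the unbinned Frequent Values view can be sketched as follows. This is a simplified illustration using equal-width bins, which is an assumption; the function names and sample data are hypothetical and do not reflect how DataRobot chooses its bins.

```python
from collections import Counter

def histogram(values, bins):
    """Count values in `bins` equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts

def frequent_values(values):
    """Count each distinct value, most frequent first (no binning)."""
    return Counter(values).most_common()

# Hypothetical days-in-hospital values.
days = [1, 1, 2, 3, 3, 3, 4, 7, 8, 14]
print(histogram(days, bins=4))
print(frequent_values(days))
```

Changing `bins` changes the shape of the histogram, while the frequent-values counts stay the same.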

DataRobot is good at determining feature types, but you should always use domain expertise to confirm the results. Sometimes a feature stored as a number isn’t a meaningful quantity. For example, flight codes are numeric but have no inherent numeric value. In that case, use Var Type Transform to change the feature from numeric to categorical.

Figure 5. Transformation
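
Outside the platform, the same idea can be sketched in plain Python: relabel the numeric codes as strings so that they are counted as categories rather than averaged. The helper name and the flight codes below are hypothetical, and this is not what Var Type Transform does internally.

```python
from collections import Counter

def to_categorical(values):
    """Relabel numeric codes as strings so they are treated as categories."""
    return [None if v is None else str(v) for v in values]

# Hypothetical flight codes: the numbers are identifiers, not quantities.
flight_codes = [1042, 88, 1042, 310, 88, 1042]

as_category = to_categorical(flight_codes)
level_counts = Counter(as_category)  # categories are counted, not averaged

print(level_counts.most_common())
```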

You can also see grayed-out text next to some feature names, such as “reference id,” “few values,” or “duplicate.” The gray text makes you aware of inferences the platform has made about the informative value of that feature.

Figure 6. Informative Value/Gray Text

In addition to assessing the informative value of features, DataRobot also performs a Data Quality Assessment. The platform does this both before and after you press Start:

  • Before you press Start, the assessment will look for things like outliers, inliers, disguised missing values, and zero inflation. 
  • After you press Start, DataRobot will add Target Leakage detection to the assessment. 
Figure 7. Data Quality
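
To give a feel for what checks of this kind involve, here is a rough sketch in plain Python of three of them: outliers, zero inflation, and disguised missing values. The rules used here (the 1.5 × IQR rule, a simple share of zeros, and a fixed list of sentinel values) are common textbook heuristics chosen for illustration. They are assumptions and not the rules DataRobot applies.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < lo or v > hi]

def zero_share(values):
    """Fraction of rows that are exactly zero (a hint of zero inflation)."""
    return sum(1 for v in values if v == 0) / len(values)

def sentinel_candidates(values, sentinels=(-999, -1, 999, 9999)):
    """Values that match common placeholder codes for 'missing'."""
    return sorted(set(values) & set(sentinels))

# Hypothetical column with many zeros and one extreme value.
column = [0, 0, 0, 0, 2, 3, 4, 5, 6, 500]
print(iqr_outliers(column))
print(zero_share(column))
print(sentinel_candidates([12, 7, -999, 30]))
```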

After you hit Start, DataRobot does another round of EDA. This includes the target leakage detection just discussed as well as detecting non-linear correlation between the features and the target. This is indicated in the Data tab through green bars in the Importance column.

Figure 8. Importance

If you click on a feature after this second round of EDA is complete, you will also see the relationship between the feature and the target, indicated by an orange line. This graph has a dual axis. The left Y-axis (blue) represents the number of rows in each bin. The right Y-axis (orange) represents the percentage of rows in each bin with a positive target value (here, patients who were readmitted). Importantly, this is computed before any predictive modeling takes place and is not itself a predictive model.

Figure 9. Histogram II
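
The two axes of this chart can be reproduced with a simple per-bin aggregation. The sketch below assumes a binary target coded as 0/1 and bin edges supplied by hand; the function name, edges, and data are hypothetical.

```python
def bin_target_rates(feature, target, edges):
    """For each bin [edges[i], edges[i+1]), return (row_count, pct_positive)."""
    out = []
    for lo, hi in zip(edges, edges[1:]):
        rows = [t for f, t in zip(feature, target) if lo <= f < hi]
        pct = 100 * sum(rows) / len(rows) if rows else None
        out.append((len(rows), pct))
    return out

# Hypothetical data: days in hospital and whether the patient was readmitted.
days = [1, 2, 2, 5, 6, 9]
readmitted = [0, 0, 1, 1, 0, 1]

for count, pct in bin_target_rates(days, readmitted, edges=[0, 4, 8, 12]):
    print(count, pct)
```

The first number in each pair corresponds to the blue bars (rows per bin) and the second to the orange line (percentage of positive rows). No model is fitted at any point.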

You can also find a Feature Association matrix on the Data > Feature Associations tab. This matrix shows the relationships among your features. Here you can quickly see the top ten associations and the number of clusters present in your data. Associations are calculated using the Mutual Information metric, but you can switch to Cramér's V.

Figure 10. Feature Association Matrix
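
For readers unfamiliar with the two metrics, the sketch below computes the textbook versions of mutual information (in nats) and Cramér's V for a pair of categorical features in plain Python. The platform may scale or normalize these values differently, so treat this as an illustration of the definitions only; the function names and sample data are hypothetical.

```python
from collections import Counter
from math import log, sqrt

def mutual_information(xs, ys):
    """Mutual information in nats: sum of p(x,y) * log(p(x,y) / (p(x) * p(y)))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def cramers_v(xs, ys):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), between 0 and 1."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    chi2 = sum(
        (pxy[(x, y)] - px[x] * py[y] / n) ** 2 / (px[x] * py[y] / n)
        for x in px
        for y in py
    )
    k = min(len(px), len(py)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Hypothetical features: the second is fully determined by the first.
a = ["x", "x", "y", "y"]
b = [1, 1, 2, 2]
print(mutual_information(a, b), cramers_v(a, b))
```

Both metrics are zero when two features are independent and grow as one feature becomes more predictable from the other.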

More Information

If you’re a licensed DataRobot customer, search the in-app Platform Documentation for Overview and EDA and Time series modeling.

Last update: 02-04-2021 07:54 PM