At this point, your data is already uploaded. (If not, go back to the Scope and Upload post and follow along from there.) Now let's see if we can learn anything new about it from DataRobot.
In the Data tab, scroll down to Data Quality to see the number of features and data points in this dataset, along with the Data Quality Assessment. This assessment is one of the many checks that DataRobot runs for you automatically, saving you the time of doing them by hand.
The Data Quality Assessment shows the results of the automated check DataRobot performs to detect and flag common data quality issues. You can see that our dataset has outliers in 8 features (columns). Let's filter the dataset to look at only those features with outliers, i.e., data points that lie unusually far from the rest of the values in a feature. In modeling, outliers can hurt accuracy, which is why DataRobot searches for them and alerts you before modeling begins. Use the toggle at the bottom of the Data Quality Assessment box to filter and unfilter affected features by the type of issue detected.
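If you want to build some intuition for what an outlier check does, here is a minimal sketch in pandas. It uses the common 1.5 × IQR rule, which is an assumption for illustration only and not necessarily the method DataRobot applies. The function name, the column names, and the toy values are all hypothetical.

```python
import pandas as pd

def columns_with_outliers(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column.

    Illustrative only: this is the textbook IQR rule, not DataRobot's check.
    """
    counts = {}
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
        if outside.any():
            counts[col] = int(outside.sum())
    return counts

# Toy data with hypothetical column names; one weight is clearly off.
toy = pd.DataFrame({
    "weight_kg": [10, 12, 11, 13, 12, 500],
    "unit_price": [1.0, 1.1, 0.9, 1.0, 1.2, 1.1],
})
outlier_counts = columns_with_outliers(toy)
print(outlier_counts)
```

Only columns with at least one flagged value appear in the result, which mirrors the idea of filtering the feature list down to the affected features.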
Now let's scroll further to the ProjectData table, and click the column header Missing to see which features in the dataset have empty values. You can see here which features have the highest number of empty cells: Vendor INCO Term has over 4000 blanks. We know from our data dictionary that "Vendor INCO Term" is a 3-letter trade term developed by the International Chamber of Commerce for the sale of goods. Although it's okay to have features with missing values, you need to decide whether those features are important to what you are modeling. In our late delivery project, for example, it probably doesn't matter if the missing values relate to the color of the packaging, but it does matter if they are the name of the product. Use your subject matter expertise to decide whether you need to revisit your dataset and try to supplement it.
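For readers who like to double-check the numbers outside the UI, sorting by missing values can be approximated in pandas as follows. This is a rough sketch on a toy frame: the rows are invented, and only the column names Vendor INCO Term and Late_delivery come from our dataset ("Product Name" is a hypothetical stand-in).

```python
import pandas as pd

# Toy stand-in for the uploaded dataset; the values are made up.
toy = pd.DataFrame({
    "Vendor INCO Term": ["EXW", None, None, "FCA"],
    "Product Name": ["A", "B", None, "D"],
    "Late_delivery": [0, 1, 0, 1],
})

# Count empty cells per feature and list the worst offenders first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

The feature at the top of the list is the one to review first against your data dictionary.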
Finally, let's set the target. This is the feature in your data about which you want to gain a deeper understanding. In our scenario, we want to know which deliveries will be late. To do this, we set the target to the feature that records whether a delivery was late in the past. For our dataset, that column is Late_delivery, which is based on the difference between the estimated and actual delivery dates. The possible values are 1 (meaning yes, the delivery was late) and 0 (meaning no, it was not late).
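To make the 1/0 encoding concrete, here is one way a flag like Late_delivery could be derived from two date columns. This is a sketch under assumptions: the date column names and values are hypothetical, and treating a delivery that arrives exactly on the estimated date as "not late" is our choice for the example, not something the dataset documents.

```python
import pandas as pd

# Hypothetical date columns; the real dataset may name them differently.
toy = pd.DataFrame({
    "scheduled_delivery": pd.to_datetime(["2023-01-10", "2023-01-15", "2023-02-01"]),
    "actual_delivery": pd.to_datetime(["2023-01-09", "2023-01-18", "2023-02-01"]),
})

# Positive difference means the delivery arrived after the estimate.
days_late = (toy["actual_delivery"] - toy["scheduled_delivery"]).dt.days

# 1 = late, 0 = not late (on-time or early), matching the target's encoding.
toy["Late_delivery"] = (days_late > 0).astype(int)
print(toy)
```

A binary column like this is what turns the project into a classification problem: the model learns to predict 1 or 0 for deliveries it has not seen yet.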
Specify Late_delivery as the target and hit Start.