After you’ve uploaded your data into DataRobot and EDA1 has completed, you're ready to explore your data and set up your project to begin building models. To explore your data, you can either click the link labeled Explore (and your dataset name) at the bottom of the page, or you can simply scroll down.
Figure 1. Exploring data
You will see a list of all features in the uploaded dataset, along with the data type that DataRobot automatically identified for each one. DataRobot supports the following data types: numeric, categorical, date, percentage, currency, length, and free text. For numeric features, you see summary statistics such as min, max, mean, median, and standard deviation, as well as the number of unique and missing values.
Figure 2. List of all features in the dataset
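To make the summary statistics concrete, here is a minimal Python sketch of how such figures could be computed for one numeric column. It is not DataRobot's implementation: the function name, the sample column, and the choice of the sample (rather than population) standard deviation are all assumptions made for illustration.

```python
import statistics

def summarize_numeric(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std_dev": statistics.stdev(present),
        "unique": len(set(present)),
        "missing": len(values) - len(present),
    }

# Hypothetical column with one missing entry.
ages = [34, 45, None, 29, 45, 52]
summary = summarize_numeric(ages)
print(summary)
```

Missing values are excluded before the statistics are computed and are counted separately, which mirrors how the feature list reports them as their own column.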
You can further explore any feature by clicking on it, which displays a histogram of the data within that feature at selectable levels of bin granularity. The data may also be displayed as a list of the most frequent values or as a table. You can also change the data type that DataRobot automatically assigned, for example from numeric to categorical or from categorical to text.
Figure 3. Histogram of data for a selected feature
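The effect of bin granularity can be illustrated with a short Python sketch. This is an assumption-laden illustration, not how DataRobot builds its histograms: the function name, the equal-width binning scheme, and the sample values are all hypothetical.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

values = [1, 2, 2, 3, 5, 8, 9, 10]
coarse = histogram(values, 3)  # few wide bins: overall shape only
fine = histogram(values, 9)    # many narrow bins: gaps become visible
print(coarse, fine)
```

With few bins the distribution looks roughly two-peaked; with more bins the empty ranges between the low and high values become visible, which is the kind of detail a finer granularity setting exposes.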
A check box appears to the left of each feature name when you hover the mouse over the feature. Use these check boxes to select features for a feature list (feature lists are discussed in greater detail in other materials).
Once the dataset is uploaded, DataRobot needs to know the target (that is, the feature you want to predict) before you can proceed. You can either hover over a feature and click Use as Target, or type the name of the feature in the text field in the upper left of the screen under What would you like to predict?
Once you select a target, DataRobot displays a histogram of it. Based on the data type of the target feature, DataRobot recognizes the data science problem as either classification or regression. If the dataset contains a suitable date/time feature, DataRobot’s time series option is also available to select.
Figure 4. Specify the target feature here
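The idea of inferring the problem type from the target can be sketched in Python as follows. DataRobot's actual detection rules are not described here, so this is purely an illustrative heuristic: the function name and the `max_classes` threshold are hypothetical.

```python
def infer_problem_type(target_values, max_classes=10):
    """Guess the problem type from the target column (illustrative heuristic)."""
    distinct = set(target_values)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    # Assumption: a numeric target with many distinct values is continuous.
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    return "classification"

churned = ["yes", "no", "yes", "no"]                # text labels
defaulted = [0, 1, 1, 0]                            # numeric, but only two values
prices = [100.0 + 1.5 * i for i in range(50)]       # numeric, many distinct values

print(infer_problem_type(churned))    # classification
print(infer_problem_type(defaulted))  # classification
print(infer_problem_type(prices))     # regression
```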
After the target is selected, a Show Advanced Options link also appears at the bottom of the page. It lets you set a variety of configurations, including the optimization metric to use for modeling, different partitioning schemes, downsampling, and many more. Other materials discuss these settings in detail, but it is important to note that the default settings provide guardrails that enable less experienced data scientists, engineers, analysts, etc. to build excellent models without additional understanding or configuration. DataRobot also provides fine-grained control for users who would like to specify those settings themselves.
Figure 6. Advanced Options for modeling configuration
Back at the top of the page is the Start button, which initiates the modeling process when clicked. Underneath the Start button you see Modeling Mode, Feature List, and Optimization Metric.
Modeling Mode specifies the process and workflow DataRobot uses to build models. The options are Autopilot, Quick, and Manual.
Feature List points DataRobot to the set of features to use to train the models.
Optimization Metric is the metric that the models are trained to optimize; for example, LogLoss or RMSE.
Figure 7. Initiate model building with selected options
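For readers unfamiliar with the two metrics named above, the following Python sketch shows the standard textbook definitions of RMSE (for regression) and binary LogLoss (for classification). The function names and the clipping constant `eps` are illustrative choices, and the code is not taken from DataRobot.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def log_loss(actual, predicted_probability, eps=1e-15):
    """Binary log loss; actual labels are 0 or 1."""
    total = 0.0
    for y, p in zip(actual, predicted_probability):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)

print(rmse([3, 5], [1, 5]))
print(log_loss([1, 0], [0.5, 0.5]))
```

Both metrics are errors, so lower is better. LogLoss penalizes confident wrong probabilities heavily, while RMSE penalizes large numeric misses more than small ones.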
After you click the Start button, DataRobot begins training models. Based on the types of features in the dataset (e.g., text, categorical, dates), the type of target, and the type of project, DataRobot selects a subset of models to train, score, and rank, and presents them on the Leaderboard for further evaluation and analysis.
Figure 8. Leaderboard with built models
To keep processing fast, DataRobot trains models in a sequence of rounds, at first using only a portion of the data to find the best-performing models. After each round, only the models that perform best proceed to the next round. Each successive round uses a larger amount of training data, moving toward building the best models with the full training dataset. DataRobot refers to this as a ‘survival of the fittest’ modeling competition.
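The round-based competition can be sketched in a few lines of Python. This is only a conceptual illustration under stated assumptions, not DataRobot's actual procedure: the sample fractions, the rule of keeping the top half each round, the model names, and the stand-in scoring function are all hypothetical.

```python
import math

def run_rounds(candidates, score, sample_fractions, keep_fraction=0.5):
    """Train on growing samples, keeping only the best scorers each round."""
    survivors = list(candidates)
    for fraction in sample_fractions:
        # Lower score is better (an error metric such as RMSE or LogLoss).
        ranked = sorted(survivors, key=lambda name: score(name, fraction))
        keep = max(1, math.ceil(len(ranked) * keep_fraction))
        survivors = ranked[:keep]
    return survivors

# Stand-in for real training: hypothetical base errors per model,
# plus a penalty that shrinks as the training sample grows.
BASE_ERROR = {"model_a": 0.30, "model_b": 0.22, "model_c": 0.41, "model_d": 0.25}

def fake_score(name, fraction):
    return BASE_ERROR[name] + 0.1 * (1 - fraction)

winners = run_rounds(BASE_ERROR, fake_score, sample_fractions=[0.25, 0.5, 1.0])
print(winners)
```

The saving comes from the fact that only the first round trains every candidate, and it does so on the smallest sample; the expensive full-data training is reserved for the few models that survive.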