(Article updated January 2021.)
This section provides answers to frequently asked questions related to setting up modeling. If you don't find an answer for your question, you can ask it now; use Post your Comment (below) to get your question answered.
Yes, you can apply transformations such as Log(x) and x^2, and you can create custom transformations using the f(x) transform option. Custom transformations allow you to create new variables that are a function of other variables in your data.
In addition, when DataRobot identifies a feature column as variable type ‘date,’ it automatically creates transformations of those qualifying features.
You can also change the variable type that DataRobot assigned: select the features and choose Change Variable Types from the Menu. This is useful, for example, when area codes are interpreted as numeric but you would rather treat them as categories.
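As a rough illustration of what such derived features look like outside the product, the sketch below computes Log(x) and x^2 columns and recasts a numeric area code as a category in plain Python. The column names (`income`, `area_code`) are hypothetical, the natural logarithm is an assumption, and this is not DataRobot's implementation.

```python
import math

# Hypothetical rows; the column names are assumptions for illustration only.
rows = [
    {"income": 100.0, "area_code": 415},
    {"income": 1000.0, "area_code": 212},
]

def add_transforms(row):
    """Return a copy of the row with log, squared, and categorical features."""
    out = dict(row)
    out["log_income"] = math.log(row["income"])   # Log(x); natural log assumed
    out["income_sq"] = row["income"] ** 2         # x^2
    out["area_code_cat"] = str(row["area_code"])  # numeric value -> category label
    return out

transformed = [add_transforms(r) for r in rows]
```

The original columns are kept alongside the derived ones, which mirrors the idea that a transformation creates a new variable rather than overwriting the source feature.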
Train-time image augmentation (shown below) is a Visual AI public beta feature for Release 6.3. This processing step in the DataRobot blueprint creates new images for training by randomly transforming existing images, thereby increasing the size of (i.e., “augmenting”) the training data. If you're a customer and are interested in this beta feature, contact your CFDS or account executive.
More information for DataRobot users: search in-app Platform Documentation for Feature transformations or Train-time image augmentation.
For a binary classification problem, you can choose what target value is assigned as the positive class from Advanced Options > Additional tab, after EDA1 finishes. After model building, you can then set the threshold (records with values above the threshold are assigned to the positive class) in various places in the app:
More information for DataRobot users: search in-app Platform Documentation for Show Advanced Options link, then locate information for “Positive class assignment (binary classification only)." For threshold information, search for ROC Curve and locate the "Prediction threshold" section.
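To make the threshold rule concrete, here is a minimal sketch of how predicted probabilities map to classes. The function name, the class labels, and the choice to treat a value exactly equal to the threshold as negative are assumptions for illustration, not a description of DataRobot internals.

```python
def assign_class(probabilities, threshold=0.5, positive="yes", negative="no"):
    """Label each predicted probability; values above the threshold are positive."""
    return [positive if p > threshold else negative for p in probabilities]

labels = assign_class([0.10, 0.50, 0.51, 0.90], threshold=0.5)
lenient = assign_class([0.10, 0.50, 0.51, 0.90], threshold=0.05)
```

With a threshold of 0.5 only the last two records are positive; lowering it to 0.05 moves every record into the positive class, which is why the threshold is worth tuning after model building.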
DataRobot chooses from a comprehensive set of metrics and recommends one well-suited for your data. If desired, you can change the selection from the Advanced Options > Additional tab after EDA1 completes (as shown in the image below). If there is another metric you would like to see implemented in DataRobot, you can contact Support or your CFDS.
More information for DataRobot users: search in-app Platform Documentation for Optimization metrics.
No. Class imbalance is an issue only if we evaluate models using simple metrics like percentage accuracy. To address this, DataRobot automatically optimizes models for objectives that are both aligned with the project metric and robust to imbalanced targets. See the FAQ "How does DataRobot handle imbalanced data (class imbalance)?"
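The point about simple metrics can be seen in a small sketch. On a hypothetical target with 5% positives, two very different sets of predictions score the same percentage accuracy, while LogLoss separates them. The data and function names below are illustrative assumptions.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of rows whose thresholded prediction matches the actual class."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

def log_loss(y_true, y_prob):
    """Mean negative log-likelihood of the actual classes."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1] * 5 + [0] * 95       # imbalanced target: 5% positive
always_negative = [0.01] * 100    # confidently "negative" for every row
base_rate = [0.05] * 100          # predicts the observed positive rate
```

Both prediction sets reach 95% accuracy because neither ever crosses the 0.5 threshold, yet `log_loss(y_true, base_rate)` is lower than `log_loss(y_true, always_negative)`, since the overconfident predictions are penalized on the five positive rows.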
To do this, create a feature list containing the features you want to be monotonically increasing (and a separate list for features that you want to be monotonically decreasing, if needed).
Then, specify those feature lists in the Advanced Options > Feature Constraints tab. These feature lists will be a subset of the feature list you use for modeling. You can also create lists and retrain models from the Leaderboard after the initial model run.
Once models are built you can expand a constrained model and, from the Describe > Constraints tab, review the features that were constrained.
More information for DataRobot users: search in-app Platform Documentation for Monotonic constraints or Monotonic modeling considerations.
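If you want to sanity-check a constrained model yourself, one simple approach is to sweep a single feature while holding the others fixed and confirm that the predictions never move in the wrong direction. The helper below is a hypothetical sketch of that check; it does not call DataRobot and assumes you already have the swept feature values and the matching predictions.

```python
def is_monotonic_increasing(feature_values, predictions):
    """True if predictions never decrease as the feature value increases."""
    ordered = [p for _, p in sorted(zip(feature_values, predictions))]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))

ok = is_monotonic_increasing([3, 1, 2], [0.9, 0.1, 0.5])
violated = is_monotonic_increasing([1, 2, 3], [0.1, 0.5, 0.4])
```

For a monotonically decreasing feature, the same check can be applied to the negated predictions.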
Exploratory data analysis, or EDA, is DataRobot's approach to analyzing datasets and summarizing their main characteristics. It consists of two phases:
More information for DataRobot users: search in-app Platform Documentation for Overview and EDA then locate “Understanding Exploratory Data Analysis (EDA).”
DataRobot creates a histogram for each feature when EDA1 completes and displays it on the Data page. These histograms use all of the data, up to 500MB. For datasets larger than 500MB, a random 500MB sample is used. For histograms produced by EDA2 (i.e., after you have told DataRobot what the target variable is), all the data except for the holdout (if any) and any rows with missing target values are used.
More information for DataRobot users: search in-app Platform Documentation for Data tab, then locate information for "Working with the Histogram chart."
DataRobot handles missing values differently, depending on the model and/or value type. It recognizes special NaN values and reports missing values in several of the visualizations, starting with the EDA1 report on the Data page. Additionally, DataRobot runs a check for (and reports) “disguised missing values,” the term applied to a situation when a value (for example, -999) is inserted to encode what would otherwise be a missing value.
More information for DataRobot users: search in-app Platform Documentation for Data Quality Assessment, or search for Modeling process details and then locate information for "Handling missing values."
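A disguised missing value is easy to picture with a small heuristic. The sketch below flags candidate sentinel codes that appear suspiciously often in a numeric column. The sentinel list and the 5% cutoff are assumptions chosen for illustration; the actual check DataRobot runs is not described here.

```python
from collections import Counter

SENTINELS = {-999, -99, 9999}  # hypothetical candidate codes

def find_disguised_missing(values, min_share=0.05):
    """Return sentinel codes that make up at least min_share of the column."""
    counts = Counter(v for v in values if v in SENTINELS)
    return {v: n for v, n in counts.items() if n / len(values) >= min_share}

ages = [34, 51, -999, 27, -999, 45, 62, 38, 29, 41]
flagged = find_disguised_missing(ages)
```

Here `-999` accounts for two of ten rows, so it is reported as a likely stand-in for a missing age.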
Yes, you can share a project with other active accounts in your organization. The recipient can be granted consumer (read-only access), editor (read/write access), or owner (read/write/administrator) privileges.
You can share the active project from the Project share tool:
You can share a single project from the Actions menu in the Manage Projects control center:
Or, also from the Manage Projects control center, you can select multiple projects to share at once:
More information for DataRobot users: search in-app Platform Documentation for Create and Manage projects or Authentication, roles, and permissions.
Note: Only on-premise users (i.e., non-Managed AI Cloud users) can have projects recovered, so use caution when deleting projects.
Use the Manage Projects control center to see all your projects, then select Delete from the Actions menu for the applicable project.
Or, also from the Manage Projects control center, you can select multiple projects to delete at once:
More information for DataRobot users: search in-app Platform Documentation for Create and Manage projects.
No, you can use the Manage Projects control center to view all your projects, then select Duplicate from the Actions menu next to the applicable project.
You have an option to copy just the data, or for non-time series projects, the data and the settings.
More information for DataRobot users: search in-app Platform Documentation for Create and Manage projects.
Setting a weight designates a single feature as a differential weight that indicates the relative importance of each row of data. Weighting is used when building or scoring a model (for computing metrics on the Leaderboard) but not for making predictions on new data.
Make sure to name the column clearly, for example, "PriorityWeight." You can find this setting in the advanced modeling parameters (Advanced Options > Additional tab) after EDA1 completes.
Under Additional, you will find Weight. Enter the name of the feature containing the weight information, such as "PriorityWeight" in our example:
More information for DataRobot users: search in-app Platform Documentation for Show Advanced Options link then locate information for "Additional weighting details.”
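To see how a weight column changes a score, consider a weighted error metric in which each row counts in proportion to its weight. The sketch below uses a weighted RMSE as an example; the function and the sample numbers are illustrative assumptions rather than DataRobot's exact computation.

```python
def weighted_rmse(y_true, y_pred, weights):
    """Root mean squared error where each row counts in proportion to its weight."""
    num = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return (num / sum(weights)) ** 0.5

y_true = [10.0, 20.0]
y_pred = [12.0, 20.0]
unweighted = weighted_rmse(y_true, y_pred, [1.0, 1.0])
prioritized = weighted_rmse(y_true, y_pred, [3.0, 1.0])
```

Giving the first row three times the weight raises the score, because that is the row the model got wrong; the predictions themselves are unchanged.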
Exposures and offsets are commonly used for insurance loss modeling. Both are treated as special features in data analysis and prediction. You can add exposures and offsets to your data from the Advanced Options page.
More information for DataRobot users: search in-app Platform Documentation for Show Advanced Options link, then locate information for “Additional weighting details.”
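For readers new to these terms, the sketch below follows the usual generalized linear model convention for a log link: the offset is added to the linear score, and the exposure multiplies the predicted rate (equivalent to an offset of log(exposure)). The function name and numbers are hypothetical, and DataRobot's internal handling may differ.

```python
import math

def expected_count(linear_score, exposure=1.0, offset=0.0):
    """Log-link prediction: exposure scales the rate, offset shifts the linear score."""
    return exposure * math.exp(linear_score + offset)

# A policy with an annual claim rate of 0.2, observed for a full and a half year.
full_year = expected_count(math.log(0.2))
half_year = expected_count(math.log(0.2), exposure=0.5)
same_via_offset = expected_count(math.log(0.2), offset=math.log(0.5))
```

Halving the exposure halves the expected count, and passing log(0.5) as an offset gives the same result.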
The importance bars show the degree to which a feature is correlated with the target. These bars are based on ACE scores, or "Alternating Conditional Expectations" scores. ACE scores are capable of detecting non-linear relationships with the target, but as they are univariate they are unable to detect interaction effects between features. Importance is calculated using an algorithm that measures the information content of the variable; this calculation is done independently for each feature in the dataset.
More information: see the paper, “Estimating Optimal Transformations for Multiple Regression Using the ACE Algorithm”.
For DataRobot users: search in-app Platform Documentation for Modeling process details, then locate information for “Importance bars.”
These informational tags identify feature characteristics discovered by DataRobot during EDA1. Features with these tags are deemed uninformative, and the gray text in front of each feature name describes the reason that feature was found to be uninformative. These features are excluded from the list of Informative Features that DataRobot creates.
More information for DataRobot users: search in-app Platform Documentation for Feature lists and then locate information for "Data page informational tags.”
DataRobot attempts to detect reference IDs (unique sequential numbers) in datasets. If found, these features will be labeled with an informational tag in the Data page.
For smaller datasets (typically those with fewer than 2000 records), attempting to automatically identify Reference ID columns can lead to false positives (i.e., incorrectly labeling columns as Reference ID). Therefore, especially with smaller datasets, you should manually preprocess the data to remove reference IDs or create a feature list that excludes them.
More information for DataRobot users: search in-app Platform Documentation for Feature lists and then locate information for "Data page informational tags."
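The false-positive risk on small datasets is easy to reproduce with a naive detector. The heuristic below, which is an assumption for illustration and not DataRobot's detection logic, treats a column as a reference ID when its values are distinct integers forming a consecutive run.

```python
def looks_like_reference_id(values):
    """Heuristic: all values are distinct integers forming a consecutive run."""
    if not values or not all(isinstance(v, int) for v in values):
        return False
    distinct = len(set(values)) == len(values)
    return distinct and max(values) - min(values) == len(values) - 1

order_ids = [1001, 1002, 1003, 1004]
bedrooms = [2, 1, 3]  # a genuine feature in a tiny dataset
```

Both `order_ids` and `bedrooms` pass the check, even though only the first is a real identifier. With only three rows, a legitimate feature can look sequential by chance, which is why manual review matters for small datasets.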
DataRobot creates a histogram for each feature when EDA1 completes and displays it on the Data page. The histogram shows the number of rows of data that have a specific feature value.
This symbol indicates that the feature has been transformed—either by you or automatically by DataRobot. For more information about transformations, see the FAQ “Is it possible to do manual feature transformation in DataRobot?”
DataRobot starts by performing EDA1 on your data. This includes activities such as encoding variables, cleaning up missing values, transforming features, searching for interactions, identifying non-linearities, and so forth. Preparation tasks such as merging multiple data sources into a single dataset can be accomplished with the DataRobot Paxata data prep tools.
After EDA1 runs, data can be manually prepared using DataRobot’s inbuilt SparkSQL engine and interface. If you want to use a code-free interface instead, DataRobot Paxata provides robust point-and-click data preparation capabilities.
Lastly, DataRobot offers unique Feature Discovery capabilities that not only join your primary table with many secondary tables but also create aggregates, perform text mining, and use a variety of other approaches to bring the best possible features into the modeling process.
More information for DataRobot users: search in-app Platform Documentation for Overview and EDA. Also, you can find information for DataRobot Paxata data prep in the DataRobot Paxata Official Cloud Documentation.
DataRobot supports a wide array of Natural Language Processing (NLP) tasks. When text fields are detected in your data, DataRobot automatically detects the language and applies appropriate preprocessing. This may include advanced tokenization, data cleaning (stop word removal, stemming, etc.), and vectorization methods. DataRobot supports n-gram matrix (bag-of-words, bag-of-characters) analysis as well as word embedding techniques such as Word2Vec and fastText with both CBOW and Skip-Gram learning methods. Additional capabilities include Naive Bayes SVM and cosine similarity analysis. For visualization, there are per-class word clouds for text analysis. DataRobot is continuously expanding the NLP capabilities.
More information for DataRobot users: search in-app Platform Documentation for Coefficients (and preprocessing details).
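As a simplified picture of the n-gram (bag-of-words) representation mentioned above, the sketch below counts word n-grams in a single text value. Lowercasing and whitespace tokenization are assumptions made to keep the example short; real preprocessing is considerably more involved.

```python
from collections import Counter

def word_ngrams(text, n):
    """Count word n-grams using lowercase, whitespace-separated tokens."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "The loan was repaid the loan closed"
unigrams = word_ngrams(text, 1)
bigrams = word_ngrams(text, 2)
```

Each distinct n-gram becomes a column in the matrix, and the count is the value for that row of text.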
Yes. By default, DataRobot splits your data into a 20% holdout (test) partition and the remaining 80% over five-fold cross-validation (training and validation) partitions.
After loading data and selecting a target, you can change these values from the Advanced Options > Partitioning tab. From there, you can set the method, sizes for data partitions, number of partitions for cross-validation, and the method by which those partitions are created.
The default partitioning method is “random” for regression and “stratified” for classification, but other appropriate partitioning methods are possible. For time-dependent data, you can select Date/Time partitioning (aka Out-of-Time Validation or OTV). Column-based partitioning (Partition Feature) or Group partitioning (Group) can be used to create a more deterministic partitioning method.
More information for DataRobot users: search in-app Platform Documentation for Show Advanced Options then locate information for "Setting partitioning and model validation" and for "Partitioning methods."
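The default split described above can be sketched in a few lines. The function below randomly assigns row indices to a 20% holdout and five cross-validation folds; the function name, the fixed seed, and the round-robin fold assignment are assumptions, and stratified or group partitioning would need extra logic.

```python
import random

def partition_rows(n_rows, holdout_share=0.2, n_folds=5, seed=0):
    """Randomly assign row indices to a holdout set and cross-validation folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(n_rows * holdout_share))
    holdout = indices[:n_holdout]
    remaining = indices[n_holdout:]
    # Deal the remaining rows into folds round-robin so sizes differ by at most one.
    folds = [remaining[i::n_folds] for i in range(n_folds)]
    return holdout, folds

holdout, folds = partition_rows(1000)
```

For 1,000 rows this yields 200 holdout rows and five folds of 160 rows each, with every row assigned exactly once.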
DataRobot has a number of guardrails to make sure that imbalanced data is treated appropriately. One guardrail is a set of metrics that remain robust even when the target variable is imbalanced, including LogLoss and the Matthews Correlation Coefficient (MCC). You can find the LogLoss metric and Max MCC in the Metric dropdown menu on the Leaderboard. MCC is also found in the ROC Curve tab.
More information for DataRobot users: search in-app Platform Documentation for Optimization metrics, then locate information for "LogLoss / Weighted LogLoss."
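For reference, MCC is computed from the four cells of the confusion matrix. The sketch below uses the standard formula and returns 0 when the denominator is zero, a common convention that is assumed here; the example counts are hypothetical.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# 5 positives and 95 negatives in each case.
always_negative = mcc(tp=0, fp=0, tn=95, fn=5)
useful_model = mcc(tp=4, fp=2, tn=93, fn=1)
perfect = mcc(tp=5, fp=0, tn=95, fn=0)
```

A model that always predicts the majority class is 95% accurate on this data but scores an MCC of 0, while a model that actually finds most of the positives scores well above it.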
DataRobot will not allow you to delete a model that is deployed.
Yes, you can create new features that are derived from others in your data. See the FAQ "Is it possible to do manual feature transformation in DataRobot?"
By default, DataRobot splits the data into 20% holdout (test) and 80% over five-fold cross-validation (training and validation). These values can be changed in the Advanced Options > Partitioning page. For more information, see the FAQ "Can I control how to group or partition my data for model training?"
If the project is classified as regression and eligible for multiclass conversion, you can click the Switch To Classification link below the target entry box. This changes the project to a classification project, and DataRobot will interpret values as classes instead of continuous values.
If the number of unique values falls outside the allowable range, i.e., more than 100, the Switch To Classification link is not enabled.
From the Leaderboard: Select the model you wish to delete by clicking the box to the left of the model name, and then click Menu from above the list.
Clicking Menu will open a dropdown menu. Click Delete Selected Model.
Confirm the deletion in the displayed message.
Pro tip: You can delete all, or multiple, models at the same time by clicking Model Name & Description. Deselect any that you do not want to delete and then use the same Menu process to delete all selected models.
Upon uploading data, DataRobot will automatically detect and identify common data quality issues. The Data Quality Assessment report displays the data quality issues, for example:
And the Data Quality column in the Project Data table identifies the quality issues detected for the related features. You can hover over a yellow triangle to see the related quality issues, such as Target leakage or Outliers:
You can see additional information in this Data Quality community post.
More information for DataRobot users: search in-app Platform Documentation for Data Quality Assessment.
You are not limited to using the DataRobot UI; DataRobot also provides a REST API. The UI and API provide nearly matching functionality.
Additionally, the DataRobot R and Python clients provide a subset of what you can do with the full API. If you find functionality that is not exposed via the API, DataRobot wants to know so that developers can prioritize that feature. (You can comment here on this article or tell your DataRobot representative.)
More information for DataRobot users: see the in-app API Documentation. (The API documentation is also available from the DataRobot Support site.)
Yes, DataRobot-licensed users can install an Excel add-in to help harness the power of DataRobot within a familiar Excel environment. The DataRobot Excel Add-In supports Microsoft Excel client-installed (not cloud-based) Windows versions 2010 through Office 365. The add-in helps orchestrate model training, validating, and selection, and enables you to deploy models to a dedicated prediction server, leverage model monitoring, and get predictions.
More information for DataRobot users: search in-app Platform Documentation for Excel Add-in.
Yes, DataRobot Tools enable you to create projects and make predictions without leaving the Alteryx interface.
You can download the DataRobot Tools installer for Alteryx from https://s3.amazonaws.com/datarobot-public-external-connectors/DataRobotTools.yxi. Once the download completes, double-click the file and follow the instructions to install.
More information for DataRobot users: search in-app Platform Documentation for Tools for Alteryx.
Yes, the DataRobot extensions for Tableau, downloadable from the Tableau Extensions Gallery, are configured to work with DataRobot Managed AI Cloud (i.e., app.datarobot.com). If your organization runs On-Premise AI Cluster, Private AI Cloud, Hybrid AI Cloud, or EU Managed AI Cloud, you must change the extension configuration to work with your deployment.
More information for DataRobot users: search in-app Platform Documentation for Modifying the Tableau Extension URL.
DataRobot has built-in capabilities to extract data from more than 15 prominent databases, data warehouses, and data lakes. This data can be transformed with DataRobot’s SparkSQL engine. With the acquisition and integration of Paxata, DataRobot has expanded its ETL (Extract-Transform-Load) capabilities.
See the DataRobot press release, "DataRobot Acquires Paxata to Bolster its End-to-End AI Capabilities."
Modeling servers power all the analysis you do from the UI and from the R/Python clients. Modeling worker resources are typically used to build models, hence they are called "modeling workers."
Prediction servers are used solely for making predictions and handling prediction requests on deployed models. These separate, stand-alone resources ensure that a queue for different request types and worker types doesn't become a bottleneck in your AI processes. If your deployed model makes real-time predictions, using a dedicated prediction server will ensure its performance.
More information for DataRobot users: search in-app Platform Documentation for UI prediction options or, for non-Managed AI Cloud, Standalone Prediction Server.
DataRobot can import a SAS file directly (*.sas7bdat). You can also call DataRobot via the API from within SAS by using Proc HTTP.
DataRobot can ingest text, Excel, SAS, and various zipped files. Supported file formats are listed at the bottom of the new project page:
Dataset requirements are dependent on the type of project you're creating, such as AutoML, time series, or Visual AI.
More information for DataRobot users: search in-app Platform Documentation for Dataset requirements.
DataRobot can ingest data from a JDBC-enabled data source, S3, Azure Blob, Google Cloud Storage, URL, Hadoop Distributed File System, a local file, or from the DataRobot AI Catalog.
More information for DataRobot users: search in-app Platform Documentation for Non-catalog import methods and AI Catalog.