FAQs: Setting Up Models


(Article updated January 2021.)

This section provides answers to frequently asked questions about setting up models. If you don't find an answer to your question, you can ask it now; use Post your Comment (below) to get your question answered.

Is it possible to do manual feature transformation in DataRobot?

Yes, you can apply transformations such as Log(x) and x^2, and you can create custom transformations using the f(x) transform option. Custom transformations allow you to create new variables that are a function of other variables in your data.


In addition, when DataRobot identifies feature columns as variable type ‘date,’ it automatically creates transformations of those qualifying features.

You can also modify the variable type assigned by DataRobot by selecting the features and choosing Change Variable Types from the Menu. You might want this, for example, if area codes are interpreted as numeric but you would rather they map to categories.
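If you prefer to script this kind of re-typing, the DataRobot Python client exposes a type-transform call; the sketch below is illustrative only, with a placeholder token, project ID, and feature name:

```python
import datarobot as dr

# Connect to DataRobot (token and endpoint are placeholders).
dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")

project = dr.Project.get("YOUR_PROJECT_ID")

# Re-map a numeric area-code column to a categorical feature. This creates
# a new derived feature rather than modifying the original column.
project.create_type_transform_feature(
    name="area_code (categorical)",
    parent_name="area_code",
    variable_type=dr.enums.VARIABLE_TYPE_TRANSFORM.CATEGORICAL_INT,
)
```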


Train-time image augmentation is a Visual AI public beta feature for Release 6.3. This processing step in the DataRobot blueprint creates new images for training by randomly transforming existing images, thereby increasing the size of (i.e., "augmenting") the training data. If you're a customer and are interested in this beta feature, contact your CFDS or account executive.


More information for DataRobot users: search in-app Platform Documentation for Feature transformations or Train-time image augmentation.

Can I specify which target value will be used as the positive class in DataRobot?

For a binary classification problem, you can choose which target value is assigned as the positive class from the Advanced Options > Additional tab, after EDA1 finishes. After model building, you can then set the threshold (records with values above the threshold are assigned to the positive class) in various places in the app.
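If you start projects with the Python client, the positive class can be set in the same call that selects the target; in this minimal sketch, the token, project ID, and target/class values are placeholders:

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")
project = dr.Project.get("YOUR_PROJECT_ID")

# Start modeling with "yes" treated as the positive class.
project.set_target(target="churned", positive_class="yes")
```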


More information for DataRobot users: search in-app Platform Documentation for Additional, then locate information for "Positive class assignment (binary classification only)." For threshold information, search for ROC Curve and locate the "Prediction threshold" section.

Can I define the optimization metric myself?

DataRobot chooses from a comprehensive set of metrics and recommends one well-suited for your data. If desired, you can change the selection from the Advanced Options > Additional tab after EDA1 completes. If there is another metric you would like to see implemented in DataRobot, you can contact Support or your CFDS.
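With the Python client, you can list the metrics DataRobot considers valid for a given target and override the recommendation at project start; this sketch assumes placeholder credentials and a hypothetical target column:

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")
project = dr.Project.get("YOUR_PROJECT_ID")

# See which metrics are valid for this target...
print(project.get_metrics("churned")["available_metrics"])

# ...then override the recommended metric when starting modeling.
project.set_target(target="churned", metric="LogLoss")
```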


More information for DataRobot users: search in-app Platform Documentation for Optimization metrics.

Do I have to fix class imbalance on the dataset before loading my data into DataRobot?

No. Class imbalance is an issue only if we evaluate models using simple metrics like percentage accuracy. To address this, DataRobot automatically optimizes models for objectives that are both aligned with the project metric and robust to imbalanced targets. See the FAQ "How does DataRobot handle imbalanced data (class imbalance)?"

How do I force a feature to have a monotonic relationship with the target?

To do this, create a feature list containing the features you want to be monotonically increasing (and a separate list for features that you want to be monotonically decreasing, if needed).

Then, specify those feature lists in the Advanced Options > Feature Constraints tab. These feature lists will be a subset of the feature list you use for modeling. You can also create lists and retrain models from the Leaderboard after the initial model run.

Once models are built, you can expand a constrained model and, from the Describe > Constraints tab, review the features that were constrained.
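The same constraints can be set up programmatically. In this Python client sketch, the feature list names and column names are hypothetical:

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")
project = dr.Project.get("YOUR_PROJECT_ID")

# Feature lists holding the features to constrain.
up = project.create_featurelist("monotonic_up", ["credit_limit", "income"])
down = project.create_featurelist("monotonic_down", ["utilization_rate"])

project.set_target(
    target="default_flag",
    advanced_options=dr.AdvancedOptions(
        monotonic_increasing_featurelist_id=up.id,
        monotonic_decreasing_featurelist_id=down.id,
    ),
)
```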


More information for DataRobot users: search in-app Platform Documentation for Monotonic constraints or Monotonic modeling.

What are EDA1 and EDA2?

Exploratory data analysis, or EDA, is DataRobot's approach to analyzing datasets and summarizing their main characteristics. It consists of two phases:

  • EDA1 describes the state of your project after data finishes uploading. It provides summary statistics based on up to 500MB of your data: if the dataset is under 500MB, DataRobot uses the entire dataset; otherwise, it uses a 500MB random sample. This phase determines feature types, summary statistics, and the frequency distribution of the top 50 values, and also identifies informative features.
  • DataRobot calculates EDA2 on the portion of the data used for EDA1, excluding rows that are also in the holdout data (if there is a holdout) and rows where the target is "N/A." DataRobot also does additional calculations on the target column using the entire dataset, recalculating summary statistics and computing ACE scores.

More information for DataRobot users: search in-app Platform Documentation for Overview and EDA, then locate "Understanding Exploratory Data Analysis (EDA)."

What data partition is used in the histograms on the Data page?

DataRobot creates a histogram for each feature when EDA1 completes and displays it on the Data page. These histograms use all of the data, up to 500MB. For datasets larger than 500MB, a random 500MB sample is used. For histograms produced by EDA2 (i.e., after you have told DataRobot what the target variable is), all the data except for the holdout (if any) and any rows with missing target values are used.


More information for DataRobot users: search in-app Platform Documentation for Data tab, then locate information for "Working with the Histogram chart." 

What does DataRobot do if there are missing values in my data?

DataRobot handles missing values differently, depending on the model and/or value type. It recognizes special NaN values and reports missing values in several of the visualizations, starting with the EDA1 report on the Data page. Additionally, DataRobot runs a check for (and reports) "disguised missing values," the term applied when a value (for example, -999) is inserted to encode what would otherwise be a missing value.


More information for DataRobot users: search in-app Platform Documentation for Data Quality Assessment, or search for Modeling process details and then locate information for "Handling missing values."

Can I share my project for others to view and/or work on?

Yes, you can share a project with other active accounts in your organization. The recipient can be granted owner (read/write/administrator access), editor (read/write access), or consumer (read-only access) privileges.

You can share the active project from the Project share tool. You can also share a single project from the Actions menu in the Manage Projects control center or, also from Manage Projects, select multiple projects to share at once.
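Sharing can be scripted as well. This sketch assumes your version of the Python client exposes project sharing through SharingAccess; the email address and role value are examples:

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")
project = dr.Project.get("YOUR_PROJECT_ID")

# Grant a colleague editor-level (read/write) access to the project.
project.share([dr.SharingAccess("colleague@example.com", role="READ_WRITE")])
```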

More information for DataRobot users: search in-app Platform Documentation for Create and Manage projects or Authentication, roles, and permissions.

How do I delete a project?

Note: Only on-premise users (i.e., non-Managed AI Cloud users) can have projects recovered, so use caution when deleting projects.

Use the Manage Projects control center to see all your projects, then select Delete from the Actions menu for the applicable project. You can also select multiple projects in the Manage Projects control center and delete them at once.

More information for DataRobot users: search in-app Platform Documentation for Create and Manage projects. 

Do I have to re-upload my data if I want to start a project over?

No. Use the Manage Projects control center to view all your projects, then select Duplicate from the Actions menu next to the applicable project. You can copy just the data or, for non-time series projects, the data and the settings.

Also, if you know you want to (or may want to) use the same training data source in multiple projects, add it to the AI Catalog. Later, you can simply select that data source from the AI Catalog when building new models.
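With the Python client, registering a dataset in the AI Catalog and reusing it looks roughly like this (the file and project names are placeholders):

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")

# Register the training data once in the AI Catalog...
dataset = dr.Dataset.create_from_file("train.csv")

# ...then start as many projects from it as you need.
project = dr.Project.create_from_dataset(dataset.id, project_name="Churn v2")
```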


More information for DataRobot users: search in-app Platform Documentation for Create and manage projects or for AI Catalog.

How can I apply weights to my data?

Setting weighting for a feature configures that single feature to be used as a differential weight, indicating the relative importance of each row of data for that feature. Weighting is used when building or scoring a model (for computing metrics on the Leaderboard) but not for making predictions on new data.

Make sure to name the column clearly, for example, "PriorityWeight." You can find this setting in the advanced modeling parameters (Advanced Options > Additional tab) after EDA1 completes.


Under Additional, you will find Weight. Enter the name of the feature containing the weight information, such as "PriorityWeight" in our example.
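The equivalent Python client call passes the weight column through AdvancedOptions; the target name in this sketch is hypothetical:

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")
project = dr.Project.get("YOUR_PROJECT_ID")

# Use the PriorityWeight column as the row weight.
project.set_target(
    target="claim_amount",
    advanced_options=dr.AdvancedOptions(weights="PriorityWeight"),
)
```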


More information for DataRobot users: search in-app Platform Documentation for Additional, then locate information for "Additional weighting details."

How do I set exposures and offsets?

Exposures and offsets are commonly used for insurance loss modeling. Both are treated as special features in data analysis and prediction. You can add exposures and offsets to your data from the Advanced Options page.
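With the Python client, both are set through AdvancedOptions at project start; the column names in this sketch are hypothetical:

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")
project = dr.Project.get("YOUR_PROJECT_ID")

# offset accepts a list of one or more columns; exposure takes a single column.
project.set_target(
    target="claim_count",
    advanced_options=dr.AdvancedOptions(
        exposure="Exposure",
        offset=["LogPremium"],
    ),
)
```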


More information for DataRobot users: search in-app Platform Documentation for Additional, then locate information for “Additional weighting details.”

What do the green "importance" bars represent on the Data tab?

The importance bars show the degree to which a feature is correlated with the target. They are based on ACE ("Alternating Conditional Expectations") scores, which can detect non-linear relationships with the target but, because they are univariate, cannot detect interaction effects between features. Importance is calculated using an algorithm that measures the information content of the variable; this calculation is done independently for each feature in the dataset.


More information: see the paper, “Estimating Optimal Transformations for Multiple Regression Using the ACE Algorithm”.

For DataRobot users: search in-app Platform Documentation for Modeling process details, then locate information for “Importance bars.”

What do the feature name prefix tags ("few values," "duplicate," etc.) mean?

These informational tags identify feature characteristics discovered by DataRobot during EDA1. Features with these tags are deemed uninformative, and the gray text in front of each feature name describes the reason that feature was found to be uninformative. These features are excluded from the list of Informative Features that DataRobot creates.


More information for DataRobot users: search in-app Platform Documentation for Feature lists and then locate information for "Data page informational tags."

Does DataRobot detect reference IDs in datasets?

DataRobot attempts to detect reference IDs (unique sequential numbers) in datasets. If found, these features are labeled with an informational tag on the Data page.

For smaller datasets (typically those with fewer than 2000 records), attempting to automatically identify Reference ID columns can lead to false positives (i.e., incorrectly labeling columns as Reference ID). Therefore, especially with smaller datasets, you should manually preprocess the data to remove reference IDs or create a feature list that excludes them.


More information for DataRobot users: search in-app Platform Documentation for Feature lists and then locate information for "Data page informational tags."

What do the histograms in the Data page represent?

DataRobot creates a histogram for each feature when EDA1 completes and displays it on the Data page. The histogram shows the number of rows of data that have a specific feature value.

  • For numeric features, the values are grouped into ranges (bins), and the histogram shows the number of rows for which the feature has a value within the range of that bin.
  • For categorical features, the height of the bar indicates the number of rows of data which have that feature value.


What does the small 'i' symbol in the feature list signify?

This symbol indicates that the feature has been transformed, either by you or automatically by DataRobot. For more information about transformations, see the FAQ "Is it possible to do manual feature transformation in DataRobot?"


Data preparation is a large part of my job. How does DataRobot help me with these tasks?

DataRobot starts by performing EDA1 on your data. This includes activities such as encoding variables, cleaning up missing values, transforming features, searching for interactions, identifying non-linearities, and so forth. Preparation tasks such as merging multiple data sources into a single dataset can be accomplished with the DataRobot Paxata data prep tools.

After EDA1 runs, data can be prepared manually using DataRobot’s built-in SparkSQL engine and interface. If you want to use a code-free interface instead, DataRobot Paxata provides robust point-and-click data preparation capabilities.

Lastly, DataRobot offers unique Feature Discovery capabilities that take care of joining your primary table with many secondary tables, creating aggregates, performing text mining, and applying a variety of other approaches to bring the best possible features into the modeling process.

More information for DataRobot users: search in-app Platform Documentation for Overview and EDA. Also, you can find information for DataRobot Paxata data prep in the DataRobot Paxata Official Cloud Documentation.

How does DataRobot handle Natural Language Processing (NLP)?

DataRobot supports a wide array of Natural Language Processing (NLP) tasks. When text fields are detected in your data, DataRobot automatically detects the language and applies appropriate preprocessing. This may include advanced tokenization, data cleaning (stop word removal, stemming, etc.), and vectorization methods. DataRobot supports n-gram matrix (bag-of-words, bag-of-characters) analysis as well as word embedding techniques such as Word2Vec and fastText with both CBOW and Skip-Gram learning methods. Additional capabilities include Naive Bayes SVM and cosine similarity analysis. For visualization, there are per-class word clouds for text analysis. DataRobot is continuously expanding its NLP capabilities.

More information for DataRobot users: search in-app Platform Documentation for Coefficients (and preprocessing details).

Can I control how to group or partition my data for model training?

Yes. By default, DataRobot splits your data into a 20% holdout (test) partition and the remaining 80% over five-fold cross-validation (training and validation) partitions.

After loading data and selecting a target, you can change these values from the Advanced Options > Partitioning tab. From there, you can set the method, sizes for data partitions, number of partitions for cross-validation, and the method by which those partitions are created.

The default partitioning method is “random” for regression and “stratified” for classification, but other appropriate partitioning methods are possible. For time-dependent data, you can select Date/Time partitioning (aka Out-of-Time Validation or OTV). Column-based partitioning (Partition Feature) or Group partitioning (Group) can be used to create a more deterministic partitioning method.
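With the Python client, a partitioning specification is passed to set_target; this sketch shows the stratified default and, commented out, a group-partitioning variant with a hypothetical grouping column:

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")
project = dr.Project.get("YOUR_PROJECT_ID")

# Stratified five-fold CV with a 20% holdout (the classification defaults).
spec = dr.StratifiedCV(holdout_pct=20, reps=5)

# Or group partitioning, so rows sharing a customer_id never cross partitions:
# spec = dr.GroupCV(holdout_pct=20, reps=5, partition_key_cols=["customer_id"])

project.set_target(target="churned", partitioning_method=spec)
```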


More information for DataRobot users: search in-app Platform Documentation for Partitioning and model validation or Data partitioning and validation.

How does DataRobot handle imbalanced data (class imbalance)?

DataRobot has a number of guardrails to make sure that imbalanced data is treated appropriately. One guardrail is a set of metrics that are robust even when the target variable is imbalanced, including LogLoss and the Matthews Correlation Coefficient (MCC). You can find the LogLoss metric and Max MCC in the Metric dropdown menu on the Leaderboard; MCC is also found in the ROC Curve tab.


For more on how DataRobot handles imbalanced data, see this Imbalanced Data community post. Also, see "MCC" in Wikipedia to learn more about that coefficient.

More information for DataRobot users: search in-app Platform Documentation for Optimization metrics, then locate information for "LogLoss / Weighted LogLoss."

What happens to my deployment if I delete a model that it is using?

DataRobot will not allow you to delete a model that is deployed.

Can I derive a new feature in DataRobot?

Yes, you can create new features that are derived from others in your data. See the FAQ "Is it possible to do manual feature transformation in DataRobot?"

What is the default partitioning used in DataRobot?

By default, DataRobot splits the data into 20% holdout (test) and 80% over five-fold cross-validation (training and validation). These values can be changed in the Advanced Options > Partitioning page. For more information, see the FAQ "Can I control how to group or partition my data for model training?"

DataRobot is suggesting regression, but can I force it to do classification?

If the project is classified as regression and eligible for multiclass conversion, you can click the Switch To Classification link below the target entry box. This changes the project to a classification project, and DataRobot will interpret values as classes instead of continuous values.

If the number of unique values falls outside the allowable range, i.e., more than 100, the Switch To Classification link is not enabled.


How do I delete a model?

From the Leaderboard: Select the model you wish to delete by clicking the box to the left of the model name, then click Menu above the list. Clicking Menu opens a dropdown; click Delete Selected Model.

Confirm the deletion in the displayed message.

Pro tip: You can delete multiple models (or all of them) at the same time by clicking Model Name & Description to select all models, deselecting any that you do not want to delete, and then using the same Menu process to delete the selected models.
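Models can also be deleted with the Python client. A minimal sketch, assuming a hypothetical "scratch" feature list name used to flag throwaway models:

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")
project = dr.Project.get("YOUR_PROJECT_ID")

# Delete every Leaderboard model trained on the "scratch" feature list.
for model in project.get_models():
    if model.featurelist_name == "scratch":
        model.delete()
```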


What do yellow triangle warnings indicate in the Data tab?

When you upload data, DataRobot automatically detects common data quality issues and displays them in the Data Quality Assessment report.


The Data Quality column in the Project Data table identifies the quality issues detected for the related features. You can hover over a yellow triangle to see the related quality issues, such as Target leakage or Outliers.


You can see additional information in this Data Quality community post.

More information for DataRobot users: search in-app Platform Documentation for Data Quality Assessment.

Do I have to use the UI, or can I interact programmatically?

You are not limited to using the DataRobot UI; DataRobot also provides a REST API. The UI and API provide nearly matching functionality.

Additionally, the DataRobot R and Python clients provide a subset of what you can do with the full API. If you find functionality that is not exposed via the API, DataRobot wants to know so that developers can prioritize that feature. (You can comment here on this article or tell your DataRobot representative.)
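As a taste of the Python client, here is a minimal end-to-end sketch; the file, project, and target names are placeholders:

```python
import datarobot as dr

# The token and endpoint are placeholders; find yours in the DataRobot UI.
dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")

# Create a project from a local file, run full Autopilot, and wait for it.
project = dr.Project.create("train.csv", project_name="FAQ example")
project.set_target(target="churned", mode=dr.AUTOPILOT_MODE.FULL_AUTO)
project.wait_for_autopilot()

# Review the Leaderboard.
for model in project.get_models():
    print(model.model_type, model.metrics[project.metric]["validation"])
```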

More information: DataRobot Python client documentation or DataRobot R client documentation.

More information for DataRobot users: see the in-app API Documentation. (The API documentation is also available from the DataRobot Support site.)


Does DataRobot integrate with Excel?

Yes, DataRobot-licensed users can install an Excel add-in to help harness the power of DataRobot within a familiar Excel environment. The DataRobot Excel Add-In supports client-installed (not cloud-based) Windows versions of Microsoft Excel, from Excel 2010 through Office 365. The add-in helps orchestrate model training, validation, and selection, and enables you to deploy models to a dedicated prediction server, leverage model monitoring, and get predictions.

More information for DataRobot users: search in-app Platform Documentation for Excel Add-in.

Does DataRobot integrate with Alteryx?

Yes, DataRobot Tools enable you to create projects and make predictions without leaving the Alteryx interface.

You can download the DataRobot Tools installer from https://s3.amazonaws.com/datarobot-public-external-connectors/DataRobotTools.yxi. Once the download completes, double-click the file and follow the instructions to install.

More information for DataRobot users: search in-app Platform Documentation for Tools for Alteryx.

Does DataRobot integrate with Tableau?

Yes, the DataRobot extensions for Tableau, downloadable from the Tableau Extensions Gallery, are configured to work with DataRobot Managed AI Cloud (i.e., app.datarobot.com). If your organization runs On-Premise AI Cluster, Private AI Cloud, Hybrid AI Cloud, or EU Managed AI Cloud, you must change the extension configuration to work with your deployment.

More information for DataRobot users: search in-app Platform Documentation for Modifying the Tableau Extension URL.

Does DataRobot have ETL Capabilities?

DataRobot has built-in capabilities to extract data from over 15 prominent databases, data warehouses, and data lakes. This data can be transformed with DataRobot’s SparkSQL engine. With the acquisition and integration of Paxata, DataRobot has expanded its ETL (Extract-Transform-Load) capabilities.

See the DataRobot press release, "DataRobot Acquires Paxata to Bolster its End-to-End AI Capabilities."

Also, you can find information from DataRobot Paxata Official Cloud Documentation. If you have any questions about Paxata and data prep, please ask them here.

What is the difference between prediction and modeling servers?

Modeling servers power all the analysis you do from the UI and from the R/Python clients. Modeling worker resources are typically used to build models, hence they are called "modeling workers."

Prediction servers are used solely for making predictions and handling prediction requests on deployed models. These separate, stand-alone resources ensure that a queue for different request types and worker types doesn't become a bottleneck in your AI processes. If your deployed model makes real-time predictions, using a dedicated prediction server will ensure its performance.

More information for DataRobot users: search in-app Platform Documentation for UI prediction options or, for non-Managed AI Cloud, Standalone Prediction Server.

How does DataRobot interact with SAS? What can I do with my SAS models?

DataRobot can import a SAS file directly (*.sas7bdat). You can also call DataRobot via the API from within SAS by using Proc HTTP.
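For example, with the Python client a SAS file uploads directly, like any other supported format (the file name here is a placeholder):

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")

# A .sas7bdat file can be passed to Project.create just like a CSV.
project = dr.Project.create("claims.sas7bdat", project_name="SAS ingest example")
```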

What file types can DataRobot ingest?

DataRobot can ingest text, Excel, SAS, and various zipped files. Supported file formats are listed at the bottom of the new project page.


Dataset requirements are dependent on the type of project you're creating, such as AutoML, time series, or Visual AI.

More information for DataRobot users: search in-app Platform Documentation for Dataset requirements.

What sources can DataRobot ingest from?

DataRobot can ingest data from a JDBC-enabled data source, S3, Azure Blob, Google Cloud Storage, URL, Hadoop Distributed File System, a local file, or from the DataRobot AI Catalog.
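With the Python client, ingesting from a URL is the same call as for a local file; the URL below is a placeholder:

```python
import datarobot as dr

dr.Client(token="YOUR_API_TOKEN", endpoint="https://app.datarobot.com/api/v2")

# Project.create also accepts a URL reachable by the DataRobot cluster.
project = dr.Project.create(
    "https://example.com/datasets/train.csv",
    project_name="URL ingest example",
)
```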


More information for DataRobot users: search in-app Platform Documentation for Non-catalog import methods and AI Catalog.
