Interactive Lessons for DataRobot Users
Learning paths are designed to help DataRobot users learn the platform so they can create and deploy successful machine learning models. Self-paced videos and hands-on activities guide you along the way. You can find links to all Learning Center material in the index.

What's your path?
Have a look at the learning path content related to your role and interest in machine learning: Data Scientist, Developer, Business Analyst.

Knowledge Base Articles

Congratulations on choosing to be a DataRobot user! As a new user, you will likely navigate our platform and work through an AI use case in a structured sequence that we call a 'learning path,' which replicates the common stages our users go through—from having a dataset, to building a model, and finally to delivering business value.

Instructions
We've provided a series of videos, text, and interactive exercises that give you the fundamentals for starting to use DataRobot. The videos visually demonstrate how to navigate the platform at the various stages of a typical use case workflow, and the interactive exercises are meant for data scientists to 'learn through doing.' We suggest you work through all the videos on this page in the order indicated here. Each section below includes exercises you can work through to test your learning; detailed solutions are provided at the bottom of each exercise page so you can check your work and even learn new tips. Typical viewing times for the videos are indicated so you can plan accordingly.

There are nine sections, each with a short description and learning content:
1. Introduction
2. Use Case Demonstration
3. Importing Data
4. Exploratory Data Analysis
5. Modeling Options
6. Comparing Models
7. Investigating Models
8. Deploying Models
9. API Access (R/Python)

1. Introduction (4 minutes)
These exercises touch upon questions commonly asked about DataRobot.
Types of Data Science Problems that DataRobot Addresses—Video
Is DataRobot a Black Box?—Video
What are the Deployment Options with DataRobot?—Video

2. Use Case Demonstration (12 minutes)
Overview of how to use DataRobot, starting with a dataset all the way through to scoring new data.
Hospital Readmissions (Classification)—Video

3. Importing Data (8 minutes)
This section covers how to pull data into DataRobot for modeling, as well as how much data preparation DataRobot requires.
Importing Data Overview—Video
Automated Feature Engineering in DataRobot—Video
Exercises for Importing Data—Questions

4. Exploratory Data Analysis (8 minutes)
This path shows how to explore your data while understanding the automation and guardrails DataRobot has in place.
Feature Lists—Video
Target Based Insights—Video
Exercises for Exploratory Data Analysis—Questions

5. Modeling Options (5 minutes)
This path focuses on the modeling setup processes (such as partitioning) that precede the building of models.
Modeling Options Overview—Video
Exercises for Modeling Options—Questions

6. Comparing Models (7 minutes)
DataRobot's automation builds many models. This section explains tools for comparing them.
Evaluating the Leaderboard—Video
Exercises for Comparing Models—Questions

7. Investigating Models (20 minutes)
DataRobot offers many tools for evaluating your model and for explaining how it works.
Model Insights—Video
Describing and Evaluating Models—Video
Understanding Models Overview—Video
Exercises for Investigating Models—Questions

8. Deploying Models (6 minutes)
Deployment is a critical component of gaining real value from a model. DataRobot offers many ways to deploy one.
Deployment—Make Predictions Tab—Video
Deployments Dashboard—Video
Using the API—Video
Exercises for Deploying Models—Questions

9. API Access (R/Python) (15 minutes)
DataRobot is available to use programmatically through our API. Advanced data scientists prefer this approach for integration with other data science tools and for setting up automation pipelines.
This set of resources is intended to introduce you to the DataRobot API.
DataRobot API Python Client—Text
DataRobot API R Client—Text
Introduction to a Model Factory—Text
Exercises for API—Text

Next Steps
Congratulations! If you have made it this far, you should have a working AI use case that you can use to address a critical business need. However, this is just the beginning of your DataRobot journey. Should you need additional content to answer specific questions, feel free to search through the various labels in the DataRobot Community. Also use the 'search' functionality to identify advanced topics as you become an expert user of the DataRobot platform. Of course, should you have additional questions, feel free to ask them in the DataRobot Community as discussions or replies to related learning posts; DataRobot experts will reply with answers to keep you moving forward on your journey!
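If you want a feel for the Python client before diving into those resources, here is a minimal, hypothetical "model factory" sketch using the DataRobot Python client (`pip install datarobot`); the endpoint, token, file name, and target list are placeholders you would replace with your own.

```python
import datarobot as dr

# Connect to DataRobot (endpoint and token are placeholders).
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# A toy "model factory": build one project per target of interest.
targets = ["readmitted"]  # hypothetical list of target columns to model
for target in targets:
    project = dr.Project.create(sourcedata="10k_diabetes.csv",  # placeholder file
                                project_name=f"Factory - {target}")
    project.set_target(target=target, mode=dr.AUTOPILOT_MODE.FULL_AUTO)
    project.wait_for_autopilot()
    best = project.get_models()[0]  # Leaderboard comes back ranked
    print(project.project_name, best.model_type, best.metrics)
```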
View full article
Congratulations on choosing to be a DataRobot user! As a new user, you will likely navigate our platform and work through an AI use case in a structured sequence that we call a 'learning path,' which replicates the common stages our users go through—from having a dataset, to building a model, and finally to delivering business value.

Instructions
We've provided a series of training modules with interactive exercises that give you the fundamentals for starting to use DataRobot. They demonstrate how to navigate the platform at the various stages of a typical use-case workflow, and the interactive exercises are meant for you to 'learn through doing.' There are five sections, each with an approximate time you should set aside to complete it. Because they are designed to go at your own pace, you may choose to spend more or less time in each module.
DataRobot for the Business Analyst (15 minutes)
Exploratory Data Analysis (20 minutes + Activity)
Predictions with DataRobot (30 minutes + Activity)
Predictive Insights with DataRobot (20 minutes + Activity)
Final Exercise

Next Steps
Congratulations! If you have made it this far, you should have a working AI use case that you can use to address a critical business need. However, this is just the beginning of your DataRobot journey. Should you need additional content to answer specific questions, feel free to search through the various labels in the DataRobot Community. Also, use the 'search' functionality to identify advanced topics as you become an expert user of the DataRobot platform. Of course, should you have additional questions, feel free to ask them in the DataRobot Community as discussions or replies to related learning posts; DataRobot experts will reply with answers to keep you moving forward on your journey!
View full article
This article summarizes how to solve a classification problem with DataRobot. Specifically, you'll learn about importing data, exploratory data analysis, and target selection, as well as modeling options, evaluation, interpretation, and deployment.

For this example, we are using a dataset from a readmissions use case in which a hospital is trying to predict whether or not a patient will be readmitted within 30 days of a diabetic event. The hospital wants to predict this so it can avoid discharging patients too early. This is a historical dataset with a known outcome for our target feature. Within this dataset, rows represent patients, and columns (or features) represent information about those patients. Some of these columns represent demographic features while others represent clinical features. The target column, "readmitted," is a binary true/false variable, which gives us a binary classification problem.
Figure 1. Snapshot of the training dataset

Importing Data
Figure 2. Data import options
There are five ways to get data into DataRobot:
Import data via a database connection using Data Source.
Use a URL, such as an Amazon S3 bucket, using URL.
Connect to Hadoop using HDFS.
Upload a local file using Local File.
Create a project from the AI Catalog.

Exploratory Data Analysis
After you import your data, DataRobot performs an exploratory data analysis (EDA). This gives you the mean, median, and the number of unique and missing values for each feature in your dataset. If you want to look at a feature in more detail, simply click on it and a distribution will drop down.
Figure 3. Exploratory Data Analysis

Target Selection
When you are done exploring your features, it is time to tell DataRobot what the target feature is. You do this simply by scrolling up and entering it into the text field (as indicated in Figure 4). DataRobot will identify the problem type and give you a distribution of the target.
Figure 4. Target Selection example

Modeling Options
At this point, you could simply hit the Start button to run Autopilot; however, there are some defaults that you can customize before building models. For example, under Advanced Options > Advanced, you can change the optimization metric:
Figure 5. Optimization Metric
Under Partitioning, you can also change the default partitioning:
Figure 6. Partitioning Options
Once you are happy with the modeling options and have pressed Start, DataRobot creates 30–40 models; it does this by building what are called blueprints (see Figure 7). Blueprints are sets of preprocessing steps and modeling techniques specifically assembled to best fit the shape and distribution of your data. Every model that the platform creates contains a blueprint.
Figure 7. Blueprint example

Model Evaluation
The models that DataRobot creates are ranked on the Leaderboard (see Figure 8), which you can find under the Models tab.
Figure 8. Leaderboard example
You can find a set of the evaluation metrics typically used in data science under Evaluate > ROC Curve (Figure 9), including a Confusion Matrix, ROC Curve, and Prediction Distributions.
Figure 9. Evaluation tools

Model Interpretation
Once you have evaluated your model for fit, it is time to look at how the different features are affecting predictions. You can find a set of interpretability tools in the Understand division. Feature Impact allows you to see which features are most important to your model (Figure 10).
Figure 10. Feature Impact example
Feature Impact is calculated using model-agnostic approaches, and you can run a Feature Impact analysis for every model that DataRobot creates. You can also examine how these features affect predictions using Feature Effects (shown in Figure 11), which is also in the Understand division. Below, you can see an example of how the number of inpatient visits increases the likelihood of readmission. This is calculated using a model-agnostic approach called partial dependence.
Figure 11. Feature Effects example
Feature Impact and Feature Effects show you the global impact of features on your predictions. You can find how features impact your predictions locally under Understand > Prediction Explanations (Figure 12).
Figure 12. Prediction Explanations example
Here you will find a sample of row-by-row explanations that tell you the reasons for each prediction, which is very useful for communicating modeling results to non-data scientists. Someone with domain expertise should be able to look at these specific examples and understand what is happening. You can get explanations for every row within your dataset.

Model Deployment
There are four ways to get data out of DataRobot under the Predict division. The first is to use the GUI in the Make Predictions tab to upload scoring data and compute predictions directly in DataRobot (Figure 13). You can then download the results with the push of a button. Customers usually use this for ad hoc analysis or when they don't have to make predictions very often.
Figure 13. GUI Predictions
You can create a REST API endpoint to score data directly from your applications in the Deploy tab (shown in Figure 14). An independent prediction server is available to support low-latency, high-throughput prediction requirements. You can set this up to score your data periodically.
Figure 14. Create a Deployment
Through the Deploy to Hadoop tab (Figure 15), you can deploy to Hadoop. Users who do this generally have large datasets and are already using Hadoop.
Figure 15. Hadoop Deployment
Finally, using the Downloads tab, you can download scoring code to score your data outside of DataRobot (shown in Figure 16). Customers who do this generally want to score their data off the network or at very low latency.
Figure 16. Download Scoring Code
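For reference, the same upload-and-score workflow offered by the Make Predictions tab can also be driven from the DataRobot Python client. This is a sketch only: the project ID, token, and file name below are placeholders.

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")
project = dr.Project.get("YOUR_PROJECT_ID")
model = project.get_models()[0]  # top Leaderboard model

# Upload a scoring dataset to the project and request predictions.
dataset = project.upload_dataset("patients_to_score.csv")  # placeholder file
predict_job = model.request_predictions(dataset.id)
predictions = predict_job.get_result_when_complete()  # pandas DataFrame
print(predictions.head())
```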
View full article
This article gives a quick overview of the different ways you can import your data into the DataRobot platform. Specifically, we will cover dragging files onto the user interface, browsing the local file system, and fetching data from a remote server, a JDBC-enabled database, or a Hadoop file system. We will also briefly describe the DataRobot AI Catalog. This video explains how to import data into DataRobot.

To get started with DataRobot, you log in and load a prepared training dataset. DataRobot currently supports .csv, .tsv, .dsv, .xls, .xlsx, .sas7bdat, .bz2, .gz, .zip, .tar, and .tgz file types. These files can be uploaded from a local file system or from a URL. DataRobot supports ingest from Amazon S3, Azure Blob Storage, and Google Cloud Storage. It also supports Hadoop/HDFS and can read directly from a variety of enterprise databases via JDBC, as shown in Figure 1.
Figure 1. Ways to import data into the DataRobot platform

DataRobot supports any database that provides a JDBC driver; this means most databases on the market today can connect to DataRobot. Drivers for Postgres, Oracle, MySQL, Amazon Redshift, Microsoft SQL Server, and Hadoop Hive are the most commonly used. DataRobot also offers the ability to import data from the DataRobot AI Catalog (see Figure 1), a centralized data store tightly integrated into the DataRobot platform.

Once you have uploaded the desired file, DataRobot immediately starts reading the raw data and kicks off a process called Exploratory Data Analysis (EDA). This process detects the data types and shows the number of unique and missing values for each feature, along with the mean, median, standard deviation, minimum, and maximum (see Figure 2).
Figure 2. Results of the DataRobot EDA process

More Information
If you're a licensed DataRobot customer, search the in-app documentation for Importing Data.
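As a rough illustration of the local-file and URL ingest options described above, here is a sketch using the DataRobot Python client; the file name and S3 URL are placeholders.

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# From a local file on disk...
project = dr.Project.create(sourcedata="hospital_readmissions.csv",
                            project_name="Readmissions (local file)")

# ...or directly from a URL, e.g., a public S3 object (placeholder URL).
project = dr.Project.create(
    sourcedata="https://s3.amazonaws.com/example-bucket/readmissions.csv",
    project_name="Readmissions (URL)")
```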
View full article
This article explains some of the automated feature engineering techniques in DataRobot. The first step for modeling is to ensure your data is all in one table for DataRobot. Once this is done, DataRobot can perform its automated feature engineering. DataRobot makes changes to features in the dataset based on data type:
For numeric features, DataRobot automatically performs an imputation step and even creates a flag for what was imputed. It applies various scaling transformations, such as ridit, standardization, squaring, and log transformation, and also creates features based on ratios and differences of numeric features.
For date features, DataRobot generates additional features, such as day of the week and day of the month, based on the original date field.
For categorical features, DataRobot tries multiple techniques, including one-hot encoding, ordinal encoding, and advanced techniques like credibility or target encoding.
For text features, DataRobot tries many different techniques. A common one is TF-IDF (term frequency–inverse document frequency). Other text approaches include n-grams, character grams, and word-embedding techniques such as word2vec and fastText. If you have multiple columns of text, DataRobot generates features based on the cosine similarity between them. DataRobot text processing works across many languages, including English, Japanese, French, Spanish, Chinese, and Portuguese.

Attachment
For more information on some of DataRobot's tools for preparing data for machine learning, see the PPTX file attached to this article.
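DataRobot performs this text preprocessing automatically, but if TF-IDF is new to you, this small scikit-learn sketch (an illustration of the general technique only, not DataRobot's implementation) shows the core idea on made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["chronic kidney disease",
        "anemia and hypertension",
        "chronic manifestations of diabetes"]

# TF-IDF up-weights terms that are frequent in a document
# but rare across the corpus as a whole.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
matrix = vectorizer.fit_transform(docs)
print(matrix.shape)  # (3 documents, number of distinct n-grams)
```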
View full article
These exercises/labs help a data scientist learn fundamental skills with DataRobot.
Dataset attached: For this lab, use the Lending Club Guardrails dataset attached to this exercise. These exercises are intended to ensure you are able to import data into DataRobot and navigate DataRobot projects.

Setup
Load DataRobot in your Chrome web browser.
Have the Lending Club Guardrails dataset downloaded to your computer.

Exercises
Download and import the dataset into DataRobot.
Using the DataRobot in-app documentation, find the requirements and formats that DataRobot supports for datasets.
Rename the default filename of your project.

Congratulations on completing these exercises! When you're ready, click Spoiler to reveal the solutions and check your work.

Solutions
1. After downloading the data onto your computer, you can import it into DataRobot in two ways: drag and drop it onto the DataRobot platform, or use the Local File button, which opens a file browser so you can locate the file on your computer.
2. One way to access the DataRobot documentation is to click the booklet icon at the top right corner of the DataRobot user interface, then search for the term 'formats' to find the information. The requirements and formats DataRobot supports for datasets are found at https://app.datarobot.com/docs/modeling/load/load-data/file-types.html (or equivalent path).
3. You can rename the default filename of your project by double-clicking on the name that DataRobot gave the project; this creates a text box that you can edit. Alternatively, click the folder icon in the upper right corner to open the Projects dropdown, click the Manage Projects link, and then click the hamburger icon as indicated in the picture below. You will then see a Rename Project option.
View full article
Feature lists control the subset of features that DataRobot uses to build models. After you've uploaded your data, DataRobot creates a few feature lists by default.
Figure 1. Default DataRobot feature lists

Raw Features is a list of all the features present in your dataset at upload. Additionally, DataRobot identifies any features that it knows will not be informative for modeling purposes. Some features are determined to be non-informative because they have too few values, such as categorical features that contain only a single value or only duplicate values. Other features are non-informative because they are reference IDs (such as row identifiers where every value is unique), contain only empty values, or are derived from the target and so are highly correlated with it or exhibit target leakage. (We have other videos that explain target leakage in greater detail.)
Figure 2. DataRobot identifies non-informative features

The other feature list that DataRobot creates by default is called Informative Features. This is a subset of the raw features with the non-informative features removed. DataRobot also does some feature creation, such as deriving features from date-type features (for example, day of the week and day of the month).

Creating feature lists
You can create a feature list by selecting features manually (via the check boxes to the left of the feature names), selecting Create Feature List, and giving the new feature list a name such as "My List 1."
Figure 3. Creating a new feature list

Some of the same feature list tools are available by selecting Menu. The displayed menu contains all the feature lists, both default and custom; you can select a list from this menu. There are also handy tools to select features of a given variable type, such as only the categorical or only the numeric features.
Figure 4. All feature lists shown under the Menu

Management of feature lists is provided via the Feature Lists link. This presents all the created feature lists, including additional information such as the list name and description, the number of models created using the list, the creation date, etc. It also provides the ability to view the features that make up a list, edit the name and description of a list, and rerun Autopilot with a specific feature list.
Figure 5. Feature Lists panel

Both DataRobot-generated feature lists and user-created lists may be used to run Autopilot.
Figure 6. Rerunning Autopilot with a selected feature list

After the models are built, the Data tab shows two new default lists: DR Reduced Features, a subset of the top features from the best-performing non-blender model, and Univariate Selections, which contains the features most highly correlated with the target variable (the correlation is shown with a green bar in the Importance column).
Figure 7. Importance column shows the significance of features to the target

The Feature List & Sample Size information in the Leaderboard (Models tab) shows the feature list used to train each model. You can change the feature list for any particular model and rerun the model with the newly selected list.
Figure 8. Feature list used to train the model

You can also create a feature list from a subset of features by navigating to Understand > Feature Impact for a given model. You can select a number of the top most important or impactful features to use in a new feature list. Similar to the Data tab, there is an option to create a feature list. This list can be used to rerun the particular model, or to rerun the entire Autopilot process with this list (or any other feature list you select). All the default and custom feature lists are displayed.
Figure 9. Creating a new feature list with some top features

Likewise, because feature lists are available throughout the entire project, you can find the new feature list in the Data tab under Custom Feature Lists.
Figure 10. Custom feature lists are available across the project
Figure 11. Autopilot can run on custom feature lists
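Feature lists can also be created and used to rerun Autopilot programmatically. Here is a sketch with the DataRobot Python client, assuming a project already exists; the project ID, token, and feature names below are placeholders.

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")
project = dr.Project.get("YOUR_PROJECT_ID")

# Build a custom feature list from a few column names (placeholders).
my_list = project.create_featurelist(
    name="My List 1",
    features=["number_inpatient", "number_emergency", "age"])

# Rerun Autopilot on the new feature list.
project.start_autopilot(featurelist_id=my_list.id)
```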
View full article
This article explains feature importance, target leakage detection, and the Feature Association Matrix. These are all calculated after the target has been selected and the Start button is pressed, as shown in this Target-Based Insights video.

Feature Importance
Importance is a column highlighted on the Data page that shows the relationship between each feature and the target, as shown in Figure 1. Feature importance is analogous to a correlation and is calculated using an algorithm called Alternating Conditional Expectations.
Figure 1. Feature Importance

DataRobot shows the relationship between the target and a feature using an orange line, as shown in Figure 2. For this numeric feature, DataRobot shows that when the number of inpatient visits is between 4 and 6, there is about an 80% likelihood of readmission.
Figure 2. Relationship between the number of inpatient visits and the likelihood of readmission

Feature importance is also available for text features, as shown in Figure 3. The size of the letters reflects the frequency of the words, while the color reflects the strength of the relationship to the target. In this example, words in red have a much higher likelihood of readmittance than words in blue.
Figure 3. Word Cloud for Text Features

Leakage Detection
If you see a red or yellow indicator, then DataRobot has identified the feature as target leakage, as shown in Figure 4. DataRobot may then remove the feature from the feature list, giving you an informative feature list with target leakage removed. This automatic removal is one of DataRobot's guardrails: not only identifying target leakage, but also acting on it.
Figure 4. Leakage Detection

Feature Association Matrix
The Feature Association Matrix shows the relationships between numeric and categorical features, as shown in Figure 5. The colors indicate the strength of association; the different colors represent different clusters or groups of features that DataRobot has detected as being somewhat associated with each other. You can sort this matrix as well as run the analysis on different feature lists.
Figure 5. Feature Association Matrix

The Feature Associations tab also allows you to look at the relationship between any two features, as shown in Figure 6.
Figure 6. Pairwise relationships

More Information
DataRobot users: for more information on feature details, leakage detection, and feature associations, search in-app documentation for Curate data.
View full article
These exercises focus on ensuring you can explore your data while understanding the automation and guardrails DataRobot has in place.

Part I. Exploring your Data
The exercises in this section focus on creating feature lists, DataRobot's data quality detection, and feature transforms.
Dataset attached: For this lab, use the Lending Club Guardrails dataset attached to this exercise.

Setup
Import the Lending Club Guardrails dataset into DataRobot. This starts the Exploratory Data Analysis process. Wait for a minute and then answer the questions in the Exercises section.

Exercises
DataRobot has some automation around both identifying useful features and creating new ones. This set of questions aims to help you understand the decisions DataRobot makes for feature selection.
1. For the feature list named Raw Features: how many total features does it contain?
2. Looking at All Features: how many features does DataRobot suggest excluding?
3. Looking at All Features: how many features has DataRobot created?
4. How many features are included in Informative Features?
Sometimes it is necessary to convert a numeric feature to a categorical feature. DataRobot tries to predict the type of a feature, but sometimes it can't know. For instance, a dataset may encode a person's occupation with a number; when this is the case, we would want to treat that feature as categorical instead of numeric.
5. Convert 'delinq_2yrs' to a categorical feature.
One of the most common tasks in DataRobot is creating new feature lists. This is an important part of iterating to find the best model for your use case.
6. Create a new feature list with everything from Informative Features, except replace the numeric version of 'delinq_2yrs' with the categorical version. Name the new feature list The Most Awesome Feature List.

Part II. Target Based Insights
The exercises in this section are based on target-based insights. This section covers target leakage detection and exploring the relationships between features in your dataset.

Setup
To complete these exercises, you will need to:
Select the Informative Features feature list.
Select 'is_bad' as the target, set the Modeling Mode to Manual, and click Start.
Wait, and then press Dismiss on Manual Blueprint Setup. (The exercises focus on exploring data, not modeling, at this point.)

Exercises
1. Why is 'loan_status' marked as target leakage?
2. Did DataRobot create a new feature list that removed the target leakage feature?
3. When you look at the feature association matrix for Informative Features, which features are associated with each other?
4. In the Feature Associations tab, select your most awesome feature list from the Feature List dropdown. What do you observe in the association matrix?
5. Using the View Feature Association Pairs tab, drill into the relationship between 'dti' and 'loan_status.' What story does the graph at the bottom of this tab tell you?

Congratulations on completing these exercises! When you're ready, click Spoiler to reveal the solutions and check your work.

Solutions
Part I. Exploring your Data
1. For the feature list named Raw Features, how many total features does it contain? (26)
2. Looking at All Features, how many features does DataRobot suggest excluding? (6)
3. Looking at All Features, how many features has DataRobot created? (4)
4. How many features are included in Informative Features? (23)
5. Convert 'delinq_2yrs' to a categorical feature. You do this as a variable transformation: find the 'delinq_2yrs' feature in the Data page, select Var Type Transform, and then click Create feature to transform it to a categorical feature.
6. Create a new feature list with everything from Informative Features, except replace the numeric version of 'delinq_2yrs' with the categorical version, and name it The Most Awesome Feature List. Watch the video for the solution.

Part II. Target Based Insights
1. Why is 'loan_status' marked as target leakage? This is a case of target leakage: every category in 'loan_status' is directly related to the target. DataRobot looks for this type of leakage and identifies it.
2. Did DataRobot create a new feature list that removed the target leakage feature? Yes.
3. Which features are associated with each other? By selecting Feature Associations, you can see a list on the right side that indicates which features are most strongly associated with each other.
4. Run the association matrix on your Most Awesome Feature List and observe how the associations change.
5. Drill into the relationship between 'dti' and 'loan_status.' Click View Feature Association Pairs and then select 'dti' and 'loan_status' to see their association. For some values of 'loan_status', you see wide variation in 'dti.' For In Grace and Late, there is very little data, so there is not the same amount of variation.
View full article
After you've uploaded your data into DataRobot and EDA1 has completed, you're ready to explore your data and set up your project to begin building models. To explore your data, you can either click the link labeled Explore (and your dataset name) at the bottom of the page, or simply scroll down.
Figure 1. Exploring data

You will see a list of all features in the uploaded dataset. The display presents the automatic identification of data types that DataRobot has performed. DataRobot supports the following data types: numeric, categorical, date, percentage, currency, length, and free text. For numeric data, you see summary statistics such as min, max, mean, median, and standard deviation, as well as the number of unique and missing values.
Figure 2. List of all features in the dataset

You can further explore any feature by clicking on it, which displays a histogram of the data within that feature at selectable levels of bin granularity. The data may also be displayed as the most frequent values or as a table. You can change the data type that DataRobot automatically assigned, such as from numeric to categorical, from categorical to text, etc.
Figure 3. Histogram of data for a selected feature

To the left of every feature name is a check box that appears when you hover the mouse over it. This allows you to select features to create feature lists (discussed in greater detail in other materials).

Once the dataset is uploaded, in order to proceed DataRobot needs to know the target (that is, which feature you want to predict). You can either hover over a feature and click Use as Target, or simply type the name of the feature into the text field in the upper left of the screen under "What would you like to predict?" Once the target is selected, you will see a histogram of it displayed. Based on the data type of the target feature, DataRobot recognizes the type of data science problem: classification or regression. If a suitable date/time feature is available, DataRobot's time series option will also be selectable.
Figure 4. Specify the target feature here

Also after the target is selected, a link at the bottom of the page displays Show Advanced Options. This allows you to set a variety of configurations, including the optimization metric to use for modeling, different partitioning schemes, downsampling, and many more. Other materials discuss these settings in detail, but it is important to note that the default settings provide guardrails enabling less experienced data scientists, engineers, analysts, etc. to build excellent models without additional understanding or configuration. For users who want it, DataRobot also provides fine-grained control over those settings.
Figure 6. Advanced Options for modeling configuration

Going back up to the top of the page, you see the Start button; when clicked, this initiates the modeling process. Underneath the Start button you see Modeling Mode, Feature List, and Optimization Metric. Modeling Mode indicates how to build models, with the options Autopilot, Quick, and Manual; this specifies the process and workflow DataRobot uses to build models. Feature List points DataRobot to the set of features to use to train the models. The Optimization Metric is the measure by which the models are trained (or optimized); for example, LogLoss, RMSE, etc.
Figure 7. Initiate model building with selected options

After you click the Start button, DataRobot begins model training. Given the types of features present in the dataset (e.g., text, categorical, dates, etc.), the type of target, and the type of project, DataRobot selects a subset of models to train, score, and rank, and presents these models on the Leaderboard for further evaluation and understanding.
Figure 8. Leaderboard with built models

DataRobot trains models in a sequence of rounds to provide fast processing; only a portion of the data is used to find the best-performing models. After each round, DataRobot selects only the models that perform best to proceed to the next round. Each successive round uses a greater amount of training data, moving towards building the best models with the full training dataset; DataRobot refers to this as a 'survival of the fittest' modeling competition.
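The same target, optimization metric, and modeling mode choices can be made through the DataRobot Python client. A minimal sketch, with placeholder credentials and file name:

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")
project = dr.Project.create(sourcedata="hospital_readmissions.csv",
                            project_name="Readmissions")

# Pick the target, optimization metric, and modeling mode, then start modeling.
project.set_target(
    target="readmitted",
    metric="LogLoss",                  # optimization metric
    mode=dr.AUTOPILOT_MODE.FULL_AUTO)  # or QUICK / MANUAL
project.wait_for_autopilot()
```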
View full article
These exercises focus on the modeling setup that precedes the building of models, such as partitioning. This section highlights the defaults that DataRobot provides and also shows the options for advanced data science modeling.
Dataset attached: For this lab, use the Lending Club Guardrails dataset attached to this exercise.

Setup
Start a new project and import the Lending Club Guardrails dataset. The simplest way to start a new project is to click the DataRobot logo at the top left of the page. Load the dataset and set the target to 'is_bad.' Now, let's explore the Advanced Options!

Exercises
1. Try setting a different partitioning scheme, such as TVH.
2. What other optimization metrics are available?
3. How would you set your project to use accuracy-optimized blueprints?

Congratulations on completing these exercises! When you're ready, click Spoiler to reveal the solutions and check your work.

Solutions
After you load the dataset and set the target to 'is_bad,' you can explore the Advanced Options. The image here shows the DataRobot logo, which you can click to create a new project, and where Advanced Options is located (it only becomes available after a target is entered under "What would you like to predict?").
1. Try setting a different partitioning scheme, such as TVH. Under Partitioning, you can try different schemes, from more folds to TVH. Make sure you set this back to 5-fold cross-validation if you want your model to perform the same as in the rest of these solutions.
2. What other optimization metrics are available? The Optimization Metric dropdown shows that LogLoss is the recommended metric, but many other metrics are available. We recommend that you stay with our recommendations and suggestions.
3. How would you set your project to use accuracy-optimized blueprints? The option Use accuracy-optimized metablueprint is just below the Optimization Metric. Use this if you want the most accurate models and are willing to wait a little longer. This approach, for example, will run XGBoost models with a lower learning rate and more trees.
View full article
Now that you've run your project and have many models built from your data, you can evaluate each of those models on the Leaderboard, found by clicking Models on the top menu.
Figure 1. Models

First, note the number printed to the right of Models: it indicates the number of models that have been built in this project. Clicking Models opens the Leaderboard, which lists all the models ranked by the selected performance metric (for binary classification, the default metric is LogLoss). You can click any of the models to expand further information. A menu of the following tools appears: Evaluate, Understand, Describe, Predict, and Compliance.
Evaluate—provides model performance information.
Understand—provides model composition information.
Describe—provides a description of the model's blueprint, which is a combination of various preprocessing steps and the machine learning algorithm, along with other data points from pipeline processing. Displayed from left to right, the flow proceeds from ingesting the data at upload, through various preprocessing steps (and possibly some other algorithms), and then into the final machine learning algorithm. (In Figure 2, this algorithm is an eXtreme Gradient Boosted Trees Classifier.) Finally, the completed model is available to generate predictions.
Predict—provides multiple ways to issue prediction requests and retrieve results.
Compliance—generates a detailed document describing all of the steps and configurations DataRobot performs, in order to provide transparency.
Figure 2. Tools to understand and evaluate a model

Stepping back up one level from this menu, you can see a number of different items highlighted in gray (Figure 3).
Figure 3. Identification badges for a model
These identification badges provide several ways to filter the models on the Leaderboard. The information provided by the badges depends on the type of machine learning algorithm and may include the model and blueprint number, whether coefficients are available, scoring code availability, etc. (Each badge is described in detail in the DataRobot in-app documentation.) You can click a badge to filter the Leaderboard to all models with that badge.

The Leaderboard comprises three main column sections:
Model Name & Description—shows the name of the model and a short text description of the blueprint steps.
Feature List & Sample Size—indicates the feature list the model was trained on and the amount of data used to train it; for example, Informative Features and 100% of the training data. You can click the feature list and sample size to change them; this reruns the model with the new selections.
Metric <metric name>—identifies the model's metric scores for validation and holdout, and for cross-validation if it was run. The metric may be changed via the dropdown menu.
Figure 4. Leaderboard information for models

Next to the name of the model algorithm is an icon, as shown in Figure 5.
Figure 5. Model icons
Each icon indicates the open source language and/or library used to build the model: for example, Python, R, XGBoost, DMTK, TensorFlow, and others. There is also an icon for DataRobot, which denotes our own implementation of various libraries with adjustments that we've made.

Above the Leaderboard table but below the main menu is another set of items in orange text.
Figure 6. Tools for viewing models and creating new types of models
Menu provides options to combine models into 'blenders' (a process also called 'ensembling'). Blenders combine multiple models, mixing prediction results into a single output in one of a variety of ways. There are also tools to Search for models and Filter the view of models. You can use Add New Model to add a model from the Repository to train and add to the Leaderboard. Export compiles the Leaderboard contents into a downloadable file.

Above that is another menu containing Leaderboard, Learning Curves, Speed vs Accuracy, Model Comparison, and Prediction Apps. The Leaderboard is described here; the others are discussed in other materials.
Figure 7. Leaderboard tools
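The Leaderboard contents can also be read programmatically. A sketch using the DataRobot Python client (project ID and token are placeholders); models come back ranked by the project's optimization metric:

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")
project = dr.Project.get("YOUR_PROJECT_ID")

# Print the top five models with their feature list, sample size, and score.
for model in project.get_models()[:5]:
    scores = model.metrics.get(project.metric, {})
    print(model.model_type, model.featurelist_name,
          f"{model.sample_pct}%", scores.get("validation"))
```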
View full article
DataRobot's automation builds many models. These exercises help you start to navigate these models as well as build new ones based on new feature lists, sample sizes, or other models.
Dataset attached: For this lab, use the Lending Club Guardrails dataset attached to this exercise.

Setup
These exercises require a completed Autopilot run with the Lending Club Guardrails dataset (attached to this exercise). Load the data, select a target, and leave the default modeling options (including the Informative Features feature list and Autopilot modeling mode). Then, press the Start button. It takes approximately 15 minutes for DataRobot to finish building the models. After that process has completed, you can work through these exercises.

Exercises
1. Try to filter your view so you only see the Nystroem Kernel SVM Classifier.
2. On the model Leaderboard, why are there three entries for the Nystroem Kernel SVM Classifier? What is different about each entry?
3. Find the RuleFit model and create a new model that is built with 64% of the training data.
4. Create a new RuleFit model that is built on your previously created most awesome feature list (created during the EDA exercises).
5. Select several different models and create a blender. How is the performance?
6. Go to the Repository and add a TensorFlow model to the Leaderboard. How is the performance?

Congratulations on completing these exercises! When you're ready, click Spoiler to reveal the solutions and check your work.

Solutions
1. Filter your view so you only see that particular type of model. There are several ways to filter the view, from using the Search feature to clicking the blueprint filter as shown below.
2. On the Leaderboard, why are there three entries for the Nystroem Kernel SVM Classifier? They are the same model trained at different sample sizes, meaning they were built with different amounts of training data. DataRobot does this as part of the survival-of-the-fittest approach in Autopilot, where it starts with many approaches and pares down to a small set of winners. A nice side effect is that DataRobot can show you learning curves to help you understand the marginal effect of adding more data to your problem.
3. Rebuild the RuleFit model at 64%. Rebuilding a model at a different training size is a common modeling task. If you like the RuleFit model and its Hotspot insights, it is useful to build the RuleFit using a larger amount of your training data. First, find the RuleFit model; next, choose a larger sample size.
4. Run the RuleFit model on your most awesome feature list. DataRobot makes it easy to retrain a model on a different feature list: just select a different feature list from the fork icon and DataRobot will build a new model.
5. Create a blender. Select several models via the check boxes, then click Menu and choose one of the blender options, e.g., ENET. Blenders typically have better performance and are recommended when slower prediction speeds are acceptable.
6. Go to the Repository and add a TensorFlow model. Click Repository at the top of the page. Then you can either browse or search for the TensorFlow models. To run a model, select it, along with the Feature List and Sample Size, and then click Run Task.
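Several of these exercises, such as retraining a model at a new sample size, can also be done through the Python client. A hedged sketch with placeholder IDs; the waiting helper used below is the one shipped with the client, to the best of our knowledge:

```python
import datarobot as dr
from datarobot.models.modeljob import wait_for_async_model_creation

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")
project = dr.Project.get("YOUR_PROJECT_ID")

# Find a RuleFit model on the Leaderboard and retrain it at 64% of the data.
rulefit = next(m for m in project.get_models() if "RuleFit" in m.model_type)
job_id = rulefit.train(sample_pct=64)
new_model = wait_for_async_model_creation(project_id=project.id,
                                          model_job_id=job_id)
print(new_model.model_type, new_model.sample_pct)
```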
View full article
The Insights menu provides several additional graphical representations of model details. Some are model agnostic and applicable to any model or the data as a whole, while others are representations of model details that apply to a particular model that you select.
Tree-based variable importance provides a ranking of the most important variables in a model, using techniques specific to tree-based models.
Hotspots indicate predictive performance as a set of rules, the rules being combinations of feature values of a subset of important features.
Variable Effects illustrates the magnitude of existing and derived features by way of coefficient values.
Word Cloud visualizes the relevance of text related to the target variable.
Anomaly Detection provides a summary table of anomalous results, sorted by a score of the most anomalous rows.
Text Mining, similar to Variable Effects, visualizes the relevance of words and short phrases, also by way of coefficient values.
Figure 1. Insights menu

Now let's take a look at each in greater detail.

Tree-based variable importance
Tree-based variable importance shows the sorted relative importance of all key variables driving a specific model, relative to the most important feature for predicting the target. In models based on random forests, this can be derived using entropy or Gini calculations, which are based on measurements of impurity or information gain. The dropdown list shown in Figure 2 contains all tree-based models in the project, and each can be selected and displayed. This is helpful for quickly comparing models; it is also useful for comparing how feature importance changes for the same model with different feature lists. Generally, we recommend using Feature Impact to understand a model, but tree-based variable importance may provide additional insights. For example, the features recognized as important on a reduced dataset might differ substantially from those recognized on the full dataset. Or, if a feature is included in only one model out of the dozens that DataRobot builds, it may not be that important; if this is the case, excluding it from the feature set can speed up model building and predictions.
Figure 2. Tree-based models

Hotspots
This investigation tool shows hot spots and cold spots, which represent simple rules with highly predictive performance either in the direction of the target (a hot spot) or in the opposite direction (a cold spot). These rules are often good predictors and can be easily translated and implemented as business rules. Note that Hotspots is available when you have a RuleFit classification or regression model, requiring at least one numeric feature and fewer than one hundred thousand features. In Figure 3, the size of a spot indicates the number of observations that follow the rule, and the color indicates the difference between the average target value for the group defined by the rule and the overall population.
Figure 3. Hot and cold spots, size and color

Variable Effects
Variable Effects shows the relevance of different variables, many derived from raw features in the model. The Variable Effects chart shows the impact of each variable on the prediction outcome. Notably, this chart is useful for displaying and comparing variables via different constant splines from applicable linear models, and for checking that the relative rank of feature importance across models doesn't vary wildly. If a feature is regarded as very important in one model but not in another, then it's worth double-checking both the dataset and the model with Variable Effects. You can sort the Variable Effects chart using the dropdown menu at the bottom, by coefficient value or alphabetically by feature name.
Figure 4. Variable Effects

Word Cloud
This tool displays the most relevant words and short phrases in a word cloud format. The size of a word indicates its frequency in the dataset, and the color indicates its relationship to the target variable. Text features can contain words that are highly indicative of a relationship to the target. You can use the Word Cloud's dropdown list to easily view and compare text-based models; the Word Cloud is also available on the Leaderboard for a specific model via the Understand division.
Figure 5. Word Cloud

Anomaly Detection
Also referred to as outlier or novelty detection, anomaly detection is an unsupervised method for detecting abnormalities in your dataset. Similar to supervised learning, anomaly detection works on historical data, but it is unsupervised in that it does not take the target into account when making predictions; DataRobot does this by simply ignoring the target when building anomaly models. Because you still enter a target, however, DataRobot can also build accurate non-anomaly models. (Anomaly detection will be discussed in greater detail in a future article.)
Figure 6. Anomaly Detection

Text Mining
Lastly, the Text Mining chart displays the most relevant words and short phrases in any features detected as text. Like Variable Effects, you can use the dropdown list at the bottom of the page to sort by coefficient value or alphabetically by feature name.
Figure 7. Text Mining
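For readers new to impurity-based importance, this minimal scikit-learn sketch (an illustration of the general technique only, not DataRobot's internal code) shows Gini-based importances from a random forest on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity (Gini) based importances, normalized to sum to 1.
for i, score in enumerate(forest.feature_importances_):
    print(f"feature_{i}: {score:.3f}")
```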
View full article
This article explains evaluation techniques for DataRobot models, including the Blueprint, Compliance, Lift Chart, ROC Curve, and Prediction Distribution graphs, and the Cumulative Lift and Cumulative Gain charts. In this example, these are all calculated after Autopilot has finished running on the readmissions dataset.

Blueprint
To find the model blueprint, click on the model of interest on the Leaderboard. If you need to navigate to it, select Describe > Blueprint.
Figure 1. Blueprint of a Logistic Regression
The blueprint is the visual representation of what DataRobot is doing under the hood to build a specific model. In Figure 1, we see the blueprint of a logistic regression model. Each node represents a method or a conceptual task that DataRobot completed in order to produce the model. In this example, type-specific transformations are applied to numeric and categorical data, and then a logistic regression model is trained. Each section of the blueprint is clickable, and you can get references and papers for the specific method. For example, you can select One-Hot Encoding and then click DataRobot Model Docs to see the specifics.
Figure 2. Clicking on the One-Hot Encoding node
Figure 3. One-Hot Encoding DataRobot docs
To conclude, here is a blueprint for a more complicated and robust model:
Figure 4. Blueprint of an XGBoost model

Compliance
To find compliance documentation, click on the Compliance division.
Figure 5. Compliance division
DataRobot automates many critical compliance tasks associated with developing a model and, by doing so, decreases time to development in highly regulated industries. For each model, you can generate individualized documentation that provides comprehensive guidance on what constitutes effective model risk management. To generate the report, just click the Generate Report option and wait for it to finish.

Lift Chart
To find the Lift chart, click on the model of interest on the Leaderboard,
Figure 6. Leaderboard
and then click the Evaluate division.
Figure 7. Lift chart
The Lift chart is the first chart shown on the screen. It sorts the predictions the model made from lowest to highest and then groups them into bins. The blue and orange lines depict the average predicted and average actual probability (respectively) for a particular bin. In a good Lift chart, the orange and blue lines "hug" each other, meaning your model is making predictions close to the actual values. At the bottom of the chart, you see multiple options: you can change the subset of data the Lift chart was created from, change the number of bins, or enable drill-down, which lets you see the exact predicted and actual values for each bin.

ROC Curve
To continue exploring how well the model is performing, click the ROC Curve tab.
Figure 8. ROC Curve tab
The ROC Curve tab provides a rich selection of methods for assessing model performance.

Common evaluation metrics
At the top left you have the absolute values for some common evaluation metrics.
Figure 9. Common evaluation metrics
As you can see, the model currently exhibits a sensitivity score of 83% combined with a precision score of 51%.

Confusion Matrix
In the top right-hand corner, we have the Confusion Matrix.
Figure 10. Confusion Matrix
On a row basis, you have the actually readmitted (top row) and actually not readmitted (bottom row) patients. The left and right columns represent the patients predicted to be readmitted and predicted not to be readmitted, respectively.

ROC Curve
At the bottom left of the ROC Curve tab, we have the ROC Curve plot.
Figure 11. ROC Curve
For the ROC Curve plot, we want a green line with an arch that comes as close as possible to the top left corner, nearly touching both the Y-axis and the top of the plot. Our current model has an area under the curve (AUC) score of 0.71, while a baseline model would have an AUC score of 0.5. You can see the model's AUC in the bottom right-hand corner.

Prediction Distribution
In the middle of the graphs, we have the Prediction Distribution graph.
Figure 12. Prediction Distribution Plot
The Prediction Distribution graph visualizes the distribution of actual values around the current probability threshold. All of the predictions above the threshold are classified as patients likely to be readmitted, while the opposite happens for people with a low readmittance probability. The current threshold is 0.31, which is the value that maximizes the F1 score. Purple represents people not readmitted, while green represents people readmitted to the hospital. Ideally, we want the purple and green distributions to overlap as little as possible.

Cumulative Lift
The Cumulative Lift chart is in the bottom right-hand corner.
Figure 13. Cumulative Lift chart
Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the model. It means that if we focused on the top x% of patients based on their readmittance probability, we would do n times better than baseline.

Cumulative Gain
The Cumulative Gain chart is in the bottom right-hand corner; it becomes available after clicking Chart Type at the top of the Cumulative Lift chart.
Figure 14. Cumulative Gain chart
The Cumulative Gain chart depicts how sensitivity (on the Y-axis) changes as we focus on the top x% of patients based on their readmittance probability.

More Information
If you're a licensed DataRobot customer, search the in-app documentation for Lift Charts and ROC Curve.
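The numbers behind these charts are also exposed through the Python client. A sketch with placeholder IDs; the ROC point field names below follow the client's documented structure, so treat this as an approximation rather than a definitive reference:

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")
project = dr.Project.get("YOUR_PROJECT_ID")
model = project.get_models()[0]

# Headline metrics for each partition (e.g., AUC and LogLoss).
print(model.metrics.get("AUC"), model.metrics.get("LogLoss"))

# The data behind the ROC Curve tab: one point per candidate threshold.
roc = model.get_roc_curve("validation")
mid_point = roc.roc_points[len(roc.roc_points) // 2]
print(mid_point["threshold"],
      mid_point["true_positive_rate"],
      mid_point["false_positive_rate"])
```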
View full article
DataRobot offers many Explainable AI tools to aid understanding for your models. Three model-agnostic approaches can be found in the Understand division: Feature Impact, Feature Effects, and Prediction Explanations. This post covers these methods as well as the Word Cloud which is useful in models with text data. Feature Impact To find Feature Impact, select the model of interest in the Leaderboard. Figure 1. Leaderboard Then click the Understand tab. Feature Impact is shown by default. Figure 2. Feature Impact Feature Impact is a model-agnostic method that informs us of the most important features of our model. The methodology used to calculate this impact, permutation importance, normalizes the results, meaning that the most important feature will always have a feature impact score of 100%. One way to understand feature impact is like this: for a given column, feature impact measures how much worse a model would perform if DataRobot made predictions after randomly shuffling that column (while leaving other columns unchanged). If you want to aim for parsimonious models, you can remove features with a low feature impact score. To do this, create a new feature list (in the Feature Impact tab) that has the top features and build a new model for that feature list. You can then compare the difference in model performance and decide whether the parsimonious model is better for your use case. Furthermore, even though it is not that common, features can also have a negative feature impact score. When this is the case, it will appear as if the features are not improving model performance. You may consider removing them and evaluating the effect on model performance. Lastly, be aware that feature impact differs from the importance measure shown in the Data page. The green bars displayed in the Importance column of the Data page are a measure of how much a feature, by itself, is correlated with the target variable. By contrast, feature impact measures how important a feature is in the context of a model. In other words, feature impact measures how much (based on the training data) the accuracy of a model would decrease if that feature were removed. Feature Effects Because of the complexity of many machine learning techniques, models can sometimes be difficult to interpret directly. The Feature Effects insights provide model details on a per-feature basis. Feature Effects can be found by clicking the Feature Effects tab right next to Feature Impact. Figure 3. Feature Effects On the left side of the page, you have the model features ordered by their Feature Impact score, from highest to lowest. By clicking on each one of the features, a partial dependence plot appears on the right-hand side. Feature Effects tells us how the individual changes in the values of a feature affect the target outcome if everything else remains steady. For example, in Figure 3, we see that as the number of inpatient stays increases (X-axis), the probability of being readmitted into the hospital (Y-axis) also increases. There seems to be some diminishing effects though as above 4 to 5 inpatient stays, the probability of being readmitted does not increase significantly. Prediction Explanations Prediction Explanations can be found by clicking on the Prediction Explanations tab right next to Feature Effects. Figure 4. Prediction Explanations After you build models, you can use Prediction Explanations to help understand the reasons DataRobot generated individual predictions. 
They provide a qualitative indicator of the effect variables have on predictions, answering why a given model made a certain prediction. By default, DataRobot gives the top three reasons for each prediction, but you can request up to ten. Note that, in order to view Prediction Explanations for a model, you must first calculate Feature Impact; you can do this from either the Prediction Explanations or Feature Impact tab. Once the computation completes, DataRobot displays the Prediction Explanations results.

The top row of the table in Figure 4 shows that a patient was assigned a 94.7% probability of being readmitted into the hospital. DataRobot's Prediction Explanations attribute this to the rather high number of inpatient stays, the patient's weight, and the missing admission type id. This insight is very powerful because DataRobot is providing explanations at the level of individual predictions. You can compute Prediction Explanations for both the training dataset and a testing dataset you upload yourself by clicking the Compute & Download option in the top right-hand corner.

Word Cloud

To find the Word Cloud, click the Insights tab at the top of the DataRobot application. Figure 5. Insights

Then click Word Cloud. Figure 6. Word Cloud

DataRobot runs multiple natural language processing models for each distinct text feature. To pick a specific model, click the dropdown menu in the top left corner to see all available options (similar to Figure 7). Figure 7. Natural Language Processing Models

The Word Cloud itself is a visual representation of the correlation of free-text words with the target column. Blue means that the appearance of a word decreases the probability of readmittance, and red means that it increases the probability of readmittance. The size of a word indicates how commonly it appears in the dataset. From Figure 6, we see that the appearance of words like "kidney," "chronic," and "manifestations" increases the probability of being readmitted, while words such as "anemia" and "hypertension" tend to occur in patients who are less likely to be readmitted. If you want to export the results and get the absolute values, click the Export button on the left side of the page (as shown in Figure 6).

More Information

If you’re a licensed DataRobot customer, search the in-app documentation for Feature Impact, Feature Effects, Prediction Explanations, and Word Cloud.
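These insights can also be pulled programmatically with the datarobot Python client. A minimal sketch; the IDs are placeholders, and the get_or_request_feature_impact and get_word_cloud methods follow the client docs for recent versions, so verify against your installed version:

import datarobot as dr

# Connect first; the token and IDs below are hypothetical placeholders
dr.Client(token='YOUR_API_KEY', endpoint='https://app.datarobot.com/api/v2')
project = dr.Project.get('YOUR_PROJECT_ID')
model = dr.Model.get(project.id, 'YOUR_MODEL_ID')

# Compute (if not already computed) and retrieve permutation-based Feature Impact
impact = model.get_or_request_feature_impact()
for row in sorted(impact, key=lambda r: r['impactNormalized'], reverse=True)[:10]:
    print(row['featureName'], round(row['impactNormalized'], 3))

# For models trained on text features, retrieve the Word Cloud data
word_cloud = model.get_word_cloud()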
DataRobot offers many tools for evaluating your model as well as for explaining how the model works. This set of exercises goes through some of the tools for evaluating model performance, explaining the features driving a model, and even doing advanced feature selection.

Dataset attached: For this lab, use the Lending Club Guardrails dataset attached to this exercise.

Setup

The exercises require a completed Autopilot run with the Lending Club Guardrails dataset. (If you are continuing from the Comparing Models exercise, there is no need for additional setup; otherwise, download the dataset attached to this exercise.)

Exercises

1. Use the DataRobot modeling documentation to find the different kernel approximation methods available for the Nystroem Kernel SVM Classifier.
2. How does adjusting the prediction threshold higher affect False Positives and False Negatives for the Nystroem Kernel SVM Classifier?
3. Download all the coefficients for the Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) model.
4. Try some feature selection on the Nystroem Kernel SVM Classifier based on the top features for this model. Create a feature list with the top 10 features ranked by Feature Impact. Run the SVM classifier on this new feature list.
5. Change the hyperparameters for this model. For instance, change the Approximation Method to "fourier." Then rerun the model.

Congratulations on completing these exercises!

When you're ready, click Spoiler to reveal the solutions and check your work.

Solutions

1. Use the modeling documentation to find the different kernel approximation methods available for the Nystroem Kernel SVM Classifier. To find the documentation, go to the Describe > Blueprint tab. This view shows the preprocessing steps and algorithm being used. To view the documentation, click the box for the algorithm. A popup appears with information on the algorithm and an orange DataRobot Model Docs link that brings you to the full documentation.

2. How does adjusting the prediction threshold higher affect False Positives and False Negatives? Adjusting the classifier's threshold changes the balance of False Positives and False Negatives. In most use cases, false positives and false negatives carry different costs; based on those costs, set the threshold to whatever is best for your use case.

3. Download all the coefficients for the Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance) model. The coefficients, if available, are found under the Describe > Coefficients tab. To export all the coefficients, click the Export tab and download a CSV file with all the coefficients.

4. Try some feature selection on the Nystroem Kernel SVM Classifier. Create a feature list with the top 10 features ranked by Feature Impact, and run the SVM classifier on this new feature list. To do this, click Create Feature List, give the feature list a name, enter 10 for the number of features, and click Create Feature List. This task is very common: to get the best performance, you typically want to try several feature lists. (A scripted version of this workflow is sketched below.)

5. Change the hyperparameters for this model, setting the Approximation Method to "fourier," and rerun the model. To find Advanced Tuning, go to Evaluate > Advanced Tuning. While DataRobot automatically searches for the best hyperparameters, you can also create your own grid searches in DataRobot. To change the Approximation Method, click in the box and select fourier.
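For exercise 4, the feature list creation and retraining can also be scripted. A hedged sketch with the datarobot Python client; the IDs are placeholders, and the method names follow the client docs, so verify against your client version:

import datarobot as dr

dr.Client(token='YOUR_API_KEY', endpoint='https://app.datarobot.com/api/v2')
project = dr.Project.get('YOUR_PROJECT_ID')  # hypothetical project ID
model = dr.Model.get(project.id, 'YOUR_MODEL_ID')

# Rank features by normalized Feature Impact and keep the top 10
impact = model.get_or_request_feature_impact()
ranked = sorted(impact, key=lambda r: r['impactNormalized'], reverse=True)
top_features = [r['featureName'] for r in ranked[:10]]

# Create the new feature list and retrain the model on it
featurelist = project.create_featurelist('Top 10 by Feature Impact', top_features)
job_id = model.train(featurelist_id=featurelist.id)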
Deployment is a critical part of gaining value from a model. DataRobot offers many ways to deploy a model; here we focus on two widely used methods: the GUI and the API.

Dataset attached: For this lab, use the Lending Club Guardrails dataset attached to this exercise.

Setup

Make sure you have the Lending Club Guardrails dataset downloaded on your computer.

Exercises

1. Using the GUI, score the data with the Predict > Make Predictions option. Add a column for 'addr_state' to your predictions.
2. Get four prediction explanations for every row of the scoring dataset.
3. Deploy the model using the API via Predict > Deploy.

Congratulations on completing these exercises!

When you're ready, click Spoiler to reveal the solutions and check your work.

Solutions

1. Using the GUI, score the data with the Predict > Make Predictions option, adding a column for 'addr_state' to your predictions. To score data, use the Make Predictions tab. To add columns, use the Optional Features option. This is a very common method for scoring data when there is no need for automation or real-time predictions.

2. Get four prediction explanations for every row of the scoring dataset. The prediction explanations can be downloaded through the GUI. In this case, set the number of explanations to 4. The shaded part of the prediction distribution represents the predictions for which DataRobot will calculate explanations. To get explanations for every prediction, drag the two bars together so the shaded area covers all predictions. You can then compute and download the explanations. (A scripted version of this step is sketched below.)

3. Deploy a model using the API. To deploy a model through the API, go to the Deploy tab and click Add New Deployment. Any model in DataRobot can be deployed this way. Using the REST API for predictions lets DataRobot support real-time predictions as well as automated prediction pipelines.
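The same explanations can be produced through the Python client instead of the GUI. A hedged sketch; the IDs and file name are placeholders, and the method names follow the datarobot client docs, so verify against your installed version:

import datarobot as dr

dr.Client(token='YOUR_API_KEY', endpoint='https://app.datarobot.com/api/v2')
project = dr.Project.get('YOUR_PROJECT_ID')
model = dr.Model.get(project.id, 'YOUR_MODEL_ID')

# Upload the scoring data and compute predictions for it
dataset = project.upload_dataset('lending_club_guardrails.csv')
model.request_predictions(dataset.id).wait_for_completion()

# Explanations require Feature Impact and an initialization step
model.get_or_request_feature_impact()
dr.PredictionExplanationsInitialization.create(project.id, model.id).wait_for_completion()

# Request four explanations per row and pull everything into a DataFrame
job = dr.PredictionExplanations.create(
    project.id, model.id, dataset.id, max_explanations=4
)
explanations = job.get_result_when_complete()
df = explanations.get_all_as_dataframe()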
This article explains the basics of model factories.

Note: Linked scripts were developed using Python 3.7.3 and DataRobot API version 2.19.0. Small adjustments might be needed depending on the Python and DataRobot API versions you are using.

Model Factory Definition

A model factory, in the context of data science, is a system or set of procedures that automatically generates predictive models with little or no human intervention. Model factories can have multiple layers of complexity, often called modules: one module might train models while other modules deploy or retrain them.

Why build a model factory? Consider the following scenarios:

You have 20,000 SKUs and you need to do sales forecasting for each one of them.
You have multiple types of customers and you are trying to predict churn.

How would you tackle these? Would you build a single model? And would that single model (with a single preprocessing method) be enough?

Model Factory Architecture

If you wish to find the code to reproduce a model factory using DataRobot, use this notebook. For the purposes of this post, we will only look at the DataRobot model factory architecture: Figure 1. Model factory architecture

You start by splitting data based on a group column. The group column can be anything: a feature that differentiates between your company's products, different customer segments, or a feature that splits data by geography. After splitting the data, you create a new DataRobot project for each of the datasets. DataRobot finds the best algorithm and preprocessing technique for each one; then you can deploy the best model and make it ready to receive new data. (A minimal sketch of this loop appears below.)

The above architecture is the absolute minimum for a model factory. You could add another layer of automation in the form of automated retraining and redeployment based on accuracy and data drift, or you could add your own custom functionality on top of the DataRobot models. The procedure becomes seamless when you work with the DataRobot API in either Python or R, since you will not have to waste time splitting data and creating multiple projects manually. The real power of model factories with DataRobot is that you can fit the best model for each subset of your observations while still automating out-of-sample validation, machine learning preprocessing, and deployment. In high-cardinality data, where accuracy is of importance, the model factory approach will almost always outperform the single-model approach, and that increase can translate into substantial business value.

You can find a Python notebook, media files, and sample training and test datasets for this model factory introduction in the DataRobot Community GitHub.
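To make the architecture concrete, here is a minimal sketch of the split-and-train loop using pandas and the datarobot client. The 'segment' group column, the 'readmitted' target, and the file name are hypothetical placeholders:

import datarobot as dr
import pandas as pd

dr.Client(token='YOUR_API_KEY', endpoint='https://app.datarobot.com/api/v2')

data = pd.read_csv('training_data.csv')  # hypothetical dataset with a 'segment' column

projects = {}
for segment, subset in data.groupby('segment'):
    # One DataRobot project per group; Autopilot picks the best blueprint for each
    project = dr.Project.create(subset, project_name=f'Factory - {segment}')
    project.set_target(target='readmitted', mode=dr.AUTOPILOT_MODE.FULL_AUTO)
    projects[segment] = project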
This page will help you get started with the datarobot R package for interacting with DataRobot. You can import data, build models, evaluate metrics, and make predictions right from the R console. There are several advantages to interacting with DataRobot programmatically:

You can set up a series of tasks and walk away while DataRobot and R do the rest.
You can get more customized analyses by combining the power of R with the various outputs you can get from DataRobot.
You can more easily reproduce prior results with code.

You can find the full documentation of the datarobot R package here.

Code samples

You can access code samples on our public DataRobot Community GitHub. This section lists the currently available samples and provides links to the related GitHub locations.

Initiating Projects
Learn how to get started with R and DataRobot by importing data and starting a project.
Starting a Binary Classification Project
Starting a Multiclass Classification Project
Starting a Regression Project
Starting a Time Series Project
Starting a Project with Selected Blueprints

Advanced Tuning and Partitioning
Learn how to customize components of the modeling process.
Advanced Tuning
Datetime Partitioning

Model Evaluation
Learn how to export and visualize key metrics for evaluating and interpreting your models.
Getting Confusion Chart
Getting Feature Impact
Getting Lift Chart
Getting ROC Curve
Getting Word Cloud

Compliance Docs
Learn how to download full documentation files for the models you created.
Getting Compliance Documentation

Feature Lists Manipulation
Learn how to manipulate feature lists and do advanced feature selection.
Advanced Feature Selection
Feature Lists Manipulations
Transforming Feature Types

Make Predictions
Learn how to make predictions in DataRobot from R.
Getting Predictions from Prediction Explanations

Model Management
Learn how to manage and monitor your models.
Model Management and Monitoring

Use Cases
Explore end-to-end use cases for integrating R and DataRobot.
Detecting Droids with DataRobot
Hospital Readmissions
Model Factory with Readmissions Dataset
Time Series Model Factory
This landing page will help you get started with the datarobot Python package for interacting with DataRobot. You can import data, build models, evaluate metrics, and make predictions right from the console. There are several advantages to interacting with DataRobot programmatically:

You can set up a series of tasks and walk away while DataRobot and Python do the rest.
You can get more customized analyses by combining the power of Python with the various outputs you can get from DataRobot.
You can more easily reproduce prior results with code.

View the full documentation for the datarobot Python package here. (A minimal connection example appears at the end of this section.)

Code samples

You can access code samples from our public DataRobot Community GitHub. This section lists the currently available samples and provides links to the related GitHub locations.

API Training
Learn how to use the DataRobot API through a series of exercises.
API Training

Initiating Projects
Learn how to get started with Python and DataRobot by importing data and starting a project.
Starting a Binary Classification Project
Starting a Multiclass Classification Project
Starting a Regression Project
Starting a Time Series Project
Starting a Project with Selected Blueprints

Advanced Tuning and Partitioning
Learn how to customize components of the modeling process.
Advanced Tuning
Datetime Partitioning

Model Evaluation
Learn how to export and visualize key metrics for evaluating and interpreting your models.
Getting Confusion Chart
Getting Feature Impact
Getting Lift Chart
Getting Partial Dependence Plot
Getting ROC Curve
Getting Word Cloud

Compliance Docs
Learn how to download full documentation files for the models you created.
Getting Compliance Documentation

Feature Lists Manipulation
Learn how to manipulate feature lists and do advanced feature selection.
Advanced Feature Selection
Feature List Manipulation
Transforming Feature Types

Making Predictions
Learn how to make predictions in DataRobot from Python.
Getting Predictions with Prediction Explanations
Scoring Big Datasets—Batch Prediction API

Model Management
Learn how to manage and monitor your models.
Model Management and Monitoring
Uploading Actuals to a DataRobot Deployment
Sharing Projects

Helper Functions
A list of ready-to-use helper functions that will help you solve common problems fast.
Helper Functions

Use Cases
Explore end-to-end use cases that integrate Python and DataRobot.
Hospital Readmissions
Medical Claims Fraud
Lead Scoring
Predicting COVID-19 at the county level
Model Factory with Readmissions Dataset
Double Pendulum with Eureqa Models
Lithofacies and One vs Rest with DataRobot
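All the Python samples above start the same way: install the package (pip install datarobot), connect the client, and work with projects. A minimal sketch, with placeholder token and endpoint:

import datarobot as dr

# Connect once per session; find your token under your profile in the platform
dr.Client(token='YOUR_API_KEY', endpoint='https://app.datarobot.com/api/v2')

# Quick sanity check: list the projects your account can see
for project in dr.Project.list():
    print(project.id, project.project_name)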
You can import data into DataRobot in a number of ways: with curl commands against the REST API or using our Python SDK. DataRobot supports three types of data sources:

Local File
URL
JDBC Connection

Once you have imported your data, the next step is to build models. Learn how to build models here. You can find the full Python client documentation here. You can get the sample code for this workflow and snippets in the DataRobot Community GitHub.

Import Data with the REST API

From a Local File:

Requirements
api_key—find this in your profile within the platform
file_path—identify the path to the file you want to import

Request Code

curl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -X POST \
  -F 'file=@YOUR_FILE_PATH' \
  https://app.datarobot.com/api/v2/projects/

Example Request (cURL code sample: importing from file)

API_KEY=YOUR_KEY
FILE_PATH=~YOUR_PATH
DR_ENDPOINT=YOUR_DR_URL/api/v2/projects

curl -v \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -X POST \
  -F file=@$FILE_PATH \
  $DR_ENDPOINT

From a URL:

Requirements
api_key—find this in your profile within the platform
url—the URL that leads to the data file

Request Code

curl -v \
  -X POST \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"url\": \"YOUR_DATA_FILE_URL\"}" \
  https://app.datarobot.com/api/v2/projects/

Example Request (cURL code sample: importing from URL)

DATA_FILE_URL=https:/user/10k_diabetes_test.xlsx
API_KEY=YOUR_API_KEY
DR_ENDPOINT=YOUR_DR_URL/api/v2/projects/

curl -v \
  -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"url\": \"$DATA_FILE_URL\"}" \
  $DR_ENDPOINT

From a JDBC Connection:

Requirements
api_key—find this in your profile within the platform
dataSourceId—the ID of the data source object
user—username for the database
password—password for the database

Request Code

curl -v \
  -X POST \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{"dataSourceId": "DATASOURCE_ID", "user": "DB_USERNAME", "password": "DB_PASSWORD"}' \
  https://app.datarobot.com/api/v2/projects/

Example Request

API_KEY=YOUR_API_KEY
DATASOURCE_ID=YOUR_DATASOURCE_ID
DB_USERNAME=user
DB_PASSWORD=password
DR_ENDPOINT=YOUR_APP_URL/api/v2/projects/

curl -v \
  -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"dataSourceId\": \"$DATASOURCE_ID\", \"user\": \"$DB_USERNAME\", \"password\": \"$DB_PASSWORD\"}" \
  $DR_ENDPOINT

Import Data using Python

From a Local File:

Requirements
API key—find this in your profile in the platform
Import the datarobot package and connect to DataRobot (learn here)
filepath—path to the data file
project_name—the name you want to assign to the project

Code

import datarobot as dr
dr.Client(token='YOUR_API_KEY', endpoint='https://app.datarobot.com/api/v2')
project = dr.Project.create('<filepath>', project_name='<project name>')

Example (Python code sample: importing from file)

import datarobot as dr
dr.Client(token='YOUR_API_KEY', endpoint='https://app.datarobot.com/api/v2')
project = dr.Project.create('/Users/Desktop/10k_diabetes.csv', project_name='Diabetes')
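If you prefer Python over curl for the raw REST calls, the URL-import request maps directly onto the requests library. A minimal sketch with the same placeholders as the curl example; the Location-header behavior assumes DataRobot's asynchronous project-creation response:

import requests

API_KEY = 'YOUR_API_KEY'
DR_ENDPOINT = 'https://app.datarobot.com/api/v2/projects/'

# Mirrors the curl URL-import request above
response = requests.post(
    DR_ENDPOINT,
    headers={'Authorization': f'Bearer {API_KEY}'},
    json={'url': 'YOUR_DATA_FILE_URL'},
)
response.raise_for_status()
# Project creation is asynchronous; the status URL is returned in the Location header
print(response.headers.get('Location'))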
Models are the main object you create with DataRobot; you use them to make predictions. You can build a model using curl commands against the REST API or using our Python SDK. There are two steps to building a model:

1. Create a project.
2. Build a model.

Once Autopilot has built your models, you can create a deployment using the best model. Learn how to create a deployment here. Learn about evaluating models here. You can access the DataRobot Python documentation here. You can get the sample code for this workflow and snippets in the DataRobot Community GitHub.

Build Models with the REST API

Requirements
api_key—found in your profile in the platform
A created project (learn here)
projectId—returned from the project-creation request or taken from the project URL (first number). For example, app.datarobot.com/projects/<projectId>/models.
target_feature—the feature (column) that you are trying to predict

Terminal Request

curl -v \
  -X PATCH \
  -H 'Authorization: Bearer API_KEY' \
  -H 'Content-Type: application/json' \
  --data '{"target": "TARGET_FEATURE"}' \
  YOUR_DR_URL/api/v2/projects/PROJECT_ID/aim/

Example Request (cURL sample)

API_KEY=YOUR_API_KEY
PROJECT_ID=YOUR_PROJECT_ID
TARGET=YOUR_TARGET_FEATURE
ENDPOINT=YOUR_DR_URL/api/v2/projects/$PROJECT_ID/aim

curl -v \
  -X PATCH \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  --data "{\"target\":\"$TARGET\"}" \
  $ENDPOINT

Build a Model using Python

Requirements
API key—found in your profile in the platform
Import the datarobot package and connect to DataRobot (learn here)
A created project (learn here)
target—the column that you want to predict
mode—the Autopilot mode to run

Code

import datarobot as dr
project.set_target(target=<target>, mode=<mode>)

Example (Python sample)

import datarobot as dr
project.set_target(target='readmitted', mode=dr.AUTOPILOT_MODE.FULL_AUTO)
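Once the target is set, Autopilot runs asynchronously. A minimal sketch for blocking until it finishes and pulling the top Leaderboard model; the project ID is a placeholder, and the ordering comment reflects the client's documented behavior, so verify against your version:

import datarobot as dr

dr.Client(token='YOUR_API_KEY', endpoint='https://app.datarobot.com/api/v2')
project = dr.Project.get('YOUR_PROJECT_ID')  # hypothetical project ID

# Block until Autopilot finishes, then inspect the Leaderboard
project.wait_for_autopilot()
models = project.get_models()  # returned roughly in Leaderboard order
best_model = models[0]
print(best_model.model_type, best_model.metrics[project.metric]['validation'])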
When you deploy a model, DataRobot creates an endpoint for it. You can send new data to this deployment/endpoint and get predictions from the model. You can achieve this with a curl command against the REST API or using our Python SDK.

Deploying your model allows you to easily apply it to new data. You can also monitor things like service health, accuracy, and data drift using a deployment. You must have a completed model in order to deploy one. You can also deploy a model in the GUI (see the in-app documentation on Deployments for more information). You can learn how to use this deployment to make predictions here. You can find our full Python client documentation here. You can get the sample code for this workflow and snippets in the DataRobot Community GitHub.

Deploy a Model with the REST API

Requirements
api_key—found in your profile in the platform
A completed model—either built by you or someone else (learn here)
modelId—you can find this in the URL of the model (second number)
defaultPredictionServerId—you can learn how to find this here

Terminal Request (cURL sample)

curl -v \
  -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  --data "{\"modelId\": \"$MODEL_ID\", \"defaultPredictionServerId\": \"$PREDICTION_SERVER_ID\", \"description\": \"...\", \"label\": \"...\"}" \
  https://app.datarobot.com/api/v2/deployments/fromLearningModel/

Terminal Response

deploymentId—you can use this to refer to this specific deployment.

Example Request

API_KEY=YOUR_API_KEY
MODEL_ID=YOUR_MODEL_ID
PREDICTION_SERVER_ID=YOUR_PREDICTION_SERVER_ID
ENDPOINT=YOUR_DR_URL/api/v2/deployments/fromLearningModel/

curl -v \
  -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  --data "{\"modelId\": \"$MODEL_ID\", \"defaultPredictionServerId\": \"$PREDICTION_SERVER_ID\", \"description\": \"A description\", \"label\": \"A label\"}" \
  $ENDPOINT

Example Response

{"id": "abcdef1234567890"}

Deploy a Model with Python

Requirements
API key—found in your profile in the platform
Import the datarobot package and connect to DataRobot (learn here)
A completed model—either built by you or someone else (learn here)
The model ID—you can find this in the URL of the model (second number). For example: app.datarobot.com/projects/54fd7e51426479da/models/<modelId>/blueprint
The prediction server ID—you can learn how to find this here

Code (Python sample)

dr.Deployment.create_from_learning_model(
    model_id, label, description=None, default_prediction_server_id=None)

Example

deployment = dr.Deployment.create_from_learning_model(
    model_id='1d102du0zd22e2d122u09s',
    label='New Deployment',
    description='A new deployment',
    default_prediction_server_id='5a22dza0fbd723001a2f70d9')
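To send new data to the deployment for scoring, here is a hedged Python sketch using the requests library. The predApi route and DataRobot-Key header match what DataRobot-managed prediction servers typically expect, but details vary by installation, so copy the exact snippet shown with your deployment in the application; all IDs and the file name below are placeholders:

import requests

API_KEY = 'YOUR_API_KEY'
DATAROBOT_KEY = 'YOUR_DATAROBOT_KEY'  # shown with your prediction server details
PREDICTION_SERVER = 'https://YOUR_PREDICTION_SERVER'
DEPLOYMENT_ID = 'YOUR_DEPLOYMENT_ID'

url = f'{PREDICTION_SERVER}/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictions'

# Stream a CSV of new rows to the deployment and print the JSON predictions
with open('scoring_data.csv', 'rb') as f:
    response = requests.post(
        url,
        data=f,
        headers={
            'Content-Type': 'text/csv; charset=UTF-8',
            'Authorization': f'Bearer {API_KEY}',
            'DataRobot-Key': DATAROBOT_KEY,
        },
    )
response.raise_for_status()
print(response.json())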