These on-demand resources are intended to help DataRobot users learn about specific capabilities of the platform, get answers to common questions, and learn best practices for AI. You can find links to all Learning Center material in the
Knowledge Base Articles
Quick Index for the Learning Center
The Learning Center has many resources to help you at all stages of your machine learning journey. This page provides a quick overview index of the material. (If you see something that's missing from our lists here, please click Comment and let us know!)

Types of content
Quick Takes
Use Case Demonstrations
Data Prep
Importing Data
Exploratory Data Analysis
Modeling Options
Comparing Models
Investigating Models
Deploying Models / MLOps
API Access (R/Python)
Applications
FAQs

1. Quick Takes
Types of Data Science Problems that DataRobot Addresses
Is DataRobot a Black Box?
Automated Machine Learning Walkthrough
Automated Time Series Walkthrough
What are the Deployment Options with DataRobot?
How can I Import Data into DataRobot?
What are the modeling options in DataRobot?
How can I evaluate my model?
How can I explain a prediction?
How can I understand my model?
What features are important in my model?

2. Use Case Demonstrations
Overview of using DataRobot to train models on your data and score new data.
Hospital Readmissions (Classification)
NBA Player Performance (Regression)
Churn Playbook
Next Best Offer (Multiclass)
Lead Scoring
DataRobot + RPA Ticket Classification Demo
Predicting COVID-19 at the County Level
Demand Forecast—Single Series
Demand Forecast—Multi-Series
Detecting Droids with R and DataRobot

3. Data Prep (DataRobot Paxata)
Provides best practices and other helpful information for preparing data for machine learning.

Best practice series
Best Practices for Building ML “Learning Datasets” (overview)
01: Welcome to data science: now where do I begin?
02: Best practices for sourcing data to teach your ML models
03: Building a "Learning Dataset" for your ML model
04: Preparing and Exploring the data in your “Learning Dataset”
05: Target Leakage--the model killer: how to recognize it and prevent it

Data prep insights and helpful tips
Overview of Data Prep with Paxata
DataRobot Paxata and Data Prep for Data Science
Quick Help: Data Import
Paxata+DataRobot Demo: Lending Club
Joining Datasets for Feature Enrichment
How Do I Add a Target to the Training Dataset?
Unwanted Observations in My Dataset—How Do I Remove Them?
Binning Ranges into Categories
EDA: Histograms to Help You Understand and Prep Your Data
How do I Normalize my Categorical Variables?
How Do I Format Dates for Recognition by My Training Model?
Selecting only the variables that matter for your model

4. Importing Data
Explains how to pull data into DataRobot for modeling as well as how much data preparation DataRobot requires for modeling.
Importing Data Overview
Automated Feature Engineering in DataRobot
JDBC Integration
Importing from AWS S3
Blending Datasets
AI Catalog—Overview
Exercises for Importing Data

5. Exploratory Data Analysis
Shows how to explore your data while understanding the automation and guardrails DataRobot has in place.
Feature Lists
Target Based Insights
Configurable Autopilot
Exercises for Exploratory Data Analysis

6. Modeling Options
Focuses on the processes for modeling setup (such as partitioning) that precede the building of models.
Modeling Options Overview
Modeling Options (Advanced)
Using Text
Visual AI: Quick Overview
Feature Discovery
One-vs-Rest Models with DataRobot
Multicollinearity
Visual AI: Classify Bell Pepper Leaves
Anomaly Detection
Exercises for Modeling Options

7. Comparing Models
DataRobot’s automation builds many models. This section explains tools for comparing models, including building models at different sample sizes, with different feature lists, and as ensembles.
Evaluating the Leaderboard
Blenders and Ensembles
Introduction to Eureqa
Testing External Datasets
Exercises for Comparing Models

8. Investigating Models
DataRobot offers many tools for evaluating your model and for explaining how the model works. This section covers evaluating the overall accuracy of a model as well as interpretability/explainability tools within DataRobot.
Model Insights
Describing and Evaluating Models
Understanding Models Overview
How Can I Evaluate My Model?
How Can I Explain a Prediction?
How Can I Understand My Model?
How to Understand a DataRobot Model
What Features are Important to My Model?
Tuning Hyperparameters
Interaction Discovery and Control using DataRobot GA2M
Exercises for Investigating Models

9. Deploying Models
Deployment is a critical component of gaining real value from a model. DataRobot offers many ways to deploy a model.
Deploying Model Overview
Exercises for Deploying Models
Deployment—Make Predictions Tab
Introduction to Model Monitoring
Deploying a Model to Hadoop
Model Drift / Replacement
Codegen/Scoring Code
Exporting models with DataRobot Prime
Using the API
Batch Prediction API
MLOps Deployments Dashboard
Deploying a DataRobot Model from the Model Registry
Creating a Model Package for a DataRobot Model
Deployment Details
Monitoring Performance with Accuracy and Data Drift
Uploading Actuals
Replacing a Model
A Complete Deployment Workflow for DataRobot Models
MLOps Governance
MLOps Notifications

10. API Access (R/Python)
DataRobot is available to use programmatically through our API. Advanced data scientists prefer this approach for integration with other data science tools as well as for setting up automation pipelines. This set of resources is intended to introduce you to the DataRobot API.
Importing Data into DataRobot
Build Models with Autopilot
Deploy a Model
Get the Prediction Server ID
Make Predictions
DataRobot API Python Client
DataRobot API R Client
Introduction to a Model Factory
Advanced Feature Selection with R
Exercises for API

11. Applications
Pre-built application accelerators you can use to quickly create AI applications from deployed models.
What-If Predictor

12. FAQs
Provides answers to frequently asked questions related to machine learning, from thinking about data to building, evaluating, and deploying models.
FAQs: Setting Up Models
FAQs: Building Models
FAQs: Interpreting Models
FAQs: Evaluating Models
FAQs: Deploying Models
Best Practices for Building ML “Learning Datasets”
The science of building datasets to teach your ML models

Did you know that 80% of a data scientist’s time is spent finding, cleaning, and reorganizing data? The great news: DataRobot has a data prep product that empowers you to significantly reduce the amount of time you spend preparing your data. But before you can even begin your data prep work, you’ve got to know where to start your predictive analytics journey. For example: How do you define or frame the business problem you need to solve? How do you identify the kinds of data you require? And how do you structure your data in order to successfully teach your ML models?

The articles in this series assist you in tackling these data science fundamentals so that you can jump-start your predictive analytics journey.

01: Welcome to data science: now where do I begin?
02: Best practices for sourcing data to teach your ML models
03: Building a "Learning Dataset" for your ML model
04: Preparing and Exploring the data in your “Learning Dataset”
05: Target Leakage--the model killer: how to recognize it and prevent it
01 Welcome to data science: Where do I begin?
This is the 1st article in our Best Practices for Building ML Learning Datasets series.

In this article you’ll learn:
- How to articulate the business problem you need to solve.
- How to acquire subject matter expertise to assist you in creating a strategy for solving the problem.
- How to define the essential elements required to build your first ML Learning Dataset.

Are you beginning your professional journey as a citizen data scientist? Perhaps your background as a business intelligence professional or SQL analyst has led you to your current citizen data scientist role? Or are you thinking about a career in data science? If so, you probably won’t be surprised to hear the Gartner group predicts that “citizen data scientists will surpass data scientists in the amount of advanced analysis they produce, largely due to the automation of data science tasks.”

The purpose of this article is to address some of the fundamental questions that every citizen data scientist must ask before even beginning to work on building predictive models: questions that allow you to clearly understand and articulate the business problem you want to solve through AI and predictive analytics.

Before jumping into the business problem you need to solve and the data you’ll need to teach your Machine Learning models, let’s take a big step back and look at the entire machine learning life cycle:

[Figure: the machine learning life cycle, with three project objectives highlighted]

DataRobot and DataRobot Paxata can help you almost every step of the way in this Life Cycle. But before you can even get the Life Cycle for your project off the ground, you’ve got to define some very salient project objectives. The three that are highlighted above are the ones we’ll review in this article.

1. Specify the business problem

It’s essential to define the business problem that you want to solve, and to define it in very specific ways. In short, the problem statement that you articulate will define the data to acquire for the purpose of teaching your models.
So, when you’re thinking about your problem statement, ask yourself: What do I want to predict? For whom? When? For example:
- I want to predict where a 10th grade student will attend university in the fall of 2022, and I want to predict that now.
- I want to predict if a discharged patient will be readmitted to the hospital during the 30 days following discharge, for the same issue, and I want to predict this at the time of the patient’s discharge.

We’ve reviewed a couple of solid examples for business problems above. But sometimes it’s also helpful to see an example of something that misses the mark. Here’s an example of a problem statement that is not specific enough:
- Readmissions cost our hospital $65M last year, and we don’t have a way to determine which patients are at risk of readmission.

Notice the problem statement is missing actionable details. Sure, it’s a fact that readmissions are expensive, so it would be great to have a way to determine readmission rates in advance. But how to go about this? Notice that in the good problem statements above, we are predicting readmission “for the same issue”, which leads us to ask: what’s the issue? If we can get insights or find data that highlights patterns of readmission rates (for example, diabetic patients have a high readmission rate for issues related to managing their diabetes), then we know we are getting very precise in our problem statement. That in turn informs the next step in our project objectives: acquiring subject matter expertise.

2. Acquire subject matter expertise to assist in creating a strategy for solving the problem

Expert insights are essential before you even begin to build datasets to teach your models. These insights can come in the form of existing data that may be available to you or through persons in the organization who have particular business knowledge required to solve the problem.
Let’s go back to the well-defined readmission example:
- I want to predict if a discharged patient will be readmitted to hospital during the 30 days following discharge, for the same issue, and I want to predict this at the time of the patient’s discharge.

It’s not sufficient to simply look at readmissions data alone. In fact, it’s entirely possible that any patient can be readmitted to hospital for an entirely different health issue. Perhaps diabetes management brought a patient in for the first admission, but a car accident brought the patient back. With no pattern or common variable for readmission, it’s difficult, if not impossible, to spot which types of patients are likely to be readmitted.

This is your clue to speak with subject matter experts at the hospital, perhaps even people who work in the admissions department, to see if they have noticed a pattern for the types of patients who are being readmitted. The pattern that gets noticed can then be vetted with data that you can request. If you’re told that it seems diabetic patients are readmitted often for issues related to managing diabetes, then you can use the data to back up that claim. And with the supposition backed up by actual data, you can now move towards the very interesting business problem of predicting which of those diabetic patients will be readmitted to the hospital, for issues related to their diabetes, within 30 days of their initial discharge.

Now that you know your business problem precisely, you’re ready to define your prediction task and unit of analysis.

3. Define your prediction task and unit of analysis for your Learning Dataset

Now it’s time to dig into some data science terminology. With your business problem clearly articulated, you need to define your prediction task and unit of analysis because these directly inform the data you need to source in order to build your Learning Dataset. But what are these?

Prediction task: This is *what you want to predict*.
Using the hospital readmission example, your prediction task is to identify if a diabetic patient will be readmitted within 30 days of discharge for issues related to diabetes. The task therefore is answering “Yes” or “No” for each patient in the prediction analysis. Note that sometimes you’ll hear “prediction task” used interchangeably with prediction target and target variable. The prediction target is simply what you want to predict with your prediction task. And the target variable is simply the 'variable' (or column) in a Learning Dataset that provides the historical record of what actually happened, for example whether a person actually was readmitted to the hospital. That's how the model learns: by seeing examples of what occurred. For a few more technical details, see the Target Variable wiki page.

Unit of analysis: This is the *for whom* or *what* of your prediction. Again, circling back to the hospital readmission data, the *for whom* is:
i. A diabetes patient
ii. Who has been discharged from the hospital

Notice how specifically we articulate the unit of analysis. This precisely defines the kind of data we’ll need to source: data for diabetic patients who have been recently discharged, in which each row of your data represents a single record for a patient. And for each row, our prediction will be a binary “yes” or “no” regarding a patient’s readmission status to the hospital. Note: Though you may come across “unit of observation” as a term that is defined as a subset of “unit of analysis,” these two terms are used interchangeably when working with DataRobot.

Learning Dataset: This is the initial dataset you create in order to feed your models so that they can learn. After the models are fed data from your Learning Dataset, DataRobot presents a leaderboard that lists the top-performing models.
You can explore these models to compare their accuracy scores and then further explore each model to review the importance, or impact, each variable (feature) has on making the prediction. You will then begin iterating on your initial Learning Dataset to create a Training Dataset that ultimately becomes the Prediction Dataset you deploy to production. Note that the articles in this series are directed at assisting you in building your first Learning Dataset. When you are ready to start refining that data to continue training models from the Leaderboard, then you’ll want to begin your journey towards understanding the DataRobot Models.

Putting it all together

Up to this point we’ve used hospital readmission as our working example. Here are a few more examples of business problems that are well-suited to predictive analytics. For each problem, see if you can identify: a strategy for solving the problem, the prediction task, and a unit of analysis.

[Table: additional example business problems]

When you’re ready to start sourcing and preparing the data you’ll need to satisfy your prediction task and unit of analysis, carry on to the next article in this series: Best practices for sourcing data to teach your ML models. If you want a deeper dive on more data science concepts, be sure to explore the other articles here in our Community and also check out the Artificial Intelligence wiki.
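As a rough illustration of how the unit of analysis and the target variable show up in an actual table, the sketch below checks two things described above: that each row is a single patient, and that the target column holds only “Yes” or “No”. The column names (`patient_id`, `readmitted_30d`), the sample values, and the helper `check_learning_dataset` are all hypothetical and are not part of DataRobot.

```python
from collections import Counter

# Hypothetical Learning Dataset: one row per discharged diabetic patient
# (the unit of analysis). "readmitted_30d" plays the role of the target variable.
rows = [
    {"patient_id": "P001", "age": 64, "num_medications": 7, "readmitted_30d": "Yes"},
    {"patient_id": "P002", "age": 51, "num_medications": 3, "readmitted_30d": "No"},
    {"patient_id": "P003", "age": 72, "num_medications": 9, "readmitted_30d": "No"},
]


def check_learning_dataset(rows, unit_key, target, allowed=("Yes", "No")):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []

    # Unit of analysis: every row should describe a different patient.
    counts = Counter(row[unit_key] for row in rows)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    if duplicates:
        problems.append(f"more than one row for: {duplicates}")

    # Target variable: a binary task should only contain the allowed labels.
    unexpected = sorted({row[target] for row in rows} - set(allowed), key=str)
    if unexpected:
        problems.append(f"unexpected target values: {unexpected}")

    return problems


problems = check_learning_dataset(rows, "patient_id", "readmitted_30d")
print(problems or "basic checks passed")
```

A check like this is only a starting point; it says nothing about whether the features themselves are appropriate for the prediction task.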
02 Best practices: Sourcing data to teach your ML models
This is the 2nd article in our Best Practices for Building ML Learning Datasets series.

In this article you’ll learn:
- How to source and organize the data for your Learning Dataset.
- How to ensure your data is diverse and large enough.
- Appropriate data types and file formats for your Learning Dataset.

When you’re ready to start sourcing data to train your ML models, it’s imperative that you source good data from which your models can learn. You can be the most skilled data scientist in the room with access to a ton of data for teaching your models, but if your data is not ‘good data’ (meaning the kind of data that’s required to teach your models well), then your ML project won’t succeed. There are some fundamental best practices that you can follow to ensure the models are fed ‘good data.’ The purpose of this article is to review some of those practices.

Sourcing and organizing the data for your Learning Dataset

Once you have clearly articulated your business problem, prediction task and unit of analysis, you’re ready to start looking for data to teach your model. That data comes in the form of a “dataset”, which is simply a single database table or a data matrix in which every column represents a variable, or a “feature” of the dataset, and every row corresponds to a single observation or occurrence. Let’s use an example here to illustrate. If you’ve been following our other articles in this series, you’ll recognize the hospital readmission example:

[Figure: hospital readmission example dataset]

Data Diversity and Depth

The next important consideration for your Learning Dataset is its diversity. Keep in mind that the Learning Dataset you are building is yours to build, meaning that the data doesn’t come from just one table or file. You are essentially collating various data sources into a single dataset that will become your own Learning Dataset. You are creating a unique dataset that has the features you believe are required to teach a model to make accurate predictions.
This doesn’t necessarily mean your Learning Dataset has to be of a specific size. But you must have enough data to feed a model with enough features (represented as columns) and rows (units of occurrence). Let’s take a look at some suggested guidelines for how this equates to the size of the dataset you ultimately create:
- Start smaller, using data sampling techniques.
- If you’re having trouble finding enough data, consider techniques discussed in Breaking the Curse of Small Datasets in Machine Learning to extract the most value from the data that is available to you.

No matter the size of your dataset, it’s important that you have a balance in your data. Otherwise, you don’t provide a balanced representation to feed your model. This can lead to what is termed “class imbalance.” For a deeper dive on this issue, check out “Dealing with Imbalanced Classes in Machine Learning.”

Appropriate data types and file formats

Bearing all of the above in mind, where can you begin looking for the types of data you require to build your Learning Dataset? There are three major types of data:
- Internal to your organization: this type of data is the basis for most modeling projects. It’s usually highly relevant and, hopefully, easy to obtain.
- External 3rd party data: this includes data you can source online for a fee, for example marketing survey results, credit reports from reporting agencies like Experian, etc.
- Public data sources: these include ‘open sources’ for data, for example census data, economic indicator data from FRED (Federal Reserve Economic Data), weather information, LinkedIn, etc.

Finally, what formats are out there? What data formats can I use to create my Learning Dataset? Check out our Quick Help for Data Import article for details.

Once you’ve identified all of the data that you want to use for teaching a model, it’s time to collate it all into a single dataset.
There are some important rules to follow when creating that dataset, and we have an article dedicated to that topic when you’re ready: Building a Learning Dataset for your ML model.
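One quick way to see whether a candidate dataset is balanced is to look at the share of rows in each class of the target column. The sketch below does that with the Python standard library only. The labels, the helper `class_balance`, and the 20% threshold are illustrative assumptions, not DataRobot guidance.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the rows, e.g. {"No": 0.9, "Yes": 0.1}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical target column: 18 patients not readmitted, 2 readmitted.
labels = ["No"] * 18 + ["Yes"] * 2

shares = class_balance(labels)
minority = min(shares, key=shares.get)

# Assumed threshold for flagging a class as under-represented.
MIN_SHARE = 0.2
if shares[minority] < MIN_SHARE:
    print(f"class {minority!r} is only {shares[minority]:.0%} of the rows")
```

What counts as “too imbalanced” depends on the problem, so the threshold here should be treated as a prompt to look closer, not as a rule.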
03 Building a "Learning Dataset" for your ML model
This is the 3rd article in our Best Practices for Building ML Learning Datasets series.

In this article you’ll learn:
- Three important questions to ask yourself before building your Learning Dataset.
- The steps and process for building a Learning Dataset.

Important questions to ask before you begin assembling the data

When you are ready to start building your Learning Dataset, ask yourself:
- What do I want to predict?
- For what or whom?
- When do I want to make this prediction?

Let’s look at how to answer these questions for concrete examples.
- Hospital readmission: if you’ve been following this series of articles, then you are now familiar with the hospital readmission prediction, in which you want to predict if a diabetic patient will be readmitted to hospital, for an issue related to his or her diabetes, and make this prediction at the time the patient is discharged from the hospital.
- Customer churn: anyone working with customer renewals is aware of the importance of anticipating customer churn rates for service renewals. In this example you want to predict the churn probability, for customers who subscribe to a SaaS offering, during the next three weeks.

Notice that we have a time component in both concrete examples above. Also notice that in the second example, we are defining a window of time, not a specific moment in time. However, what you want to predict may not always require a time feature. For example, a classic linear regression problem that forecasts the future cost of real estate (at no specific time in the future, just in the ongoing window of time) does not require a “for when” to be answered. But as a best practice, it’s always good to ask yourself the “when” for your prediction to ensure you’re very precise in understanding and articulating the problem you want to solve. All of these dimensions regarding time are important to consider as you begin to assemble the data for teaching your model.
With the three questions answered, you’re ready to start building your ML Learning Dataset.

Steps for building your ML Learning Dataset

When you’re ready, the following steps provide a guide for how to start building your Learning Dataset.
1. Find appropriate data.
2. Merge data into a single table to create your Primary Table, enrich it with secondary tables, and create your Target Variable.
3. Conduct exploratory data analysis.
4. Remove any target leakage.

This article primarily focuses on step 2 in the process outlined above. Refer to the other articles in this series that detail the remaining steps above.

1. Find appropriate data

When you’re ready to start sourcing data to teach your ML models, it’s imperative that you source good data from which your models can learn. You can be the most skilled data scientist in the room with access to a ton of data, but if your data isn’t ‘good data’ (meaning the kind of data that’s required for your models to learn well), then your ML project won’t succeed. For more details on this important first step, refer to the second article in this series, Best practices for sourcing data to teach your ML models.

2. Merge data into a single “Primary Table”, enrich with secondary tables, and create your Target Variable

2a) The Primary Table for building your ML Learning Dataset

After you’ve sourced all of the data you want to use for your business problem, Xavier Conort, DataRobot’s Chief Data Scientist and one of the world’s leading data scientists, advises you to create a “primary table,” which he defines as: “a cleaned version of a learning example... it should exclude all of the information that is not available at prediction time--except the Target Variable (column).” Applying Xavier’s advice to the hospital readmission example, the first step is to source a dataset in which you have a solid learning example or unit of analysis.
If you read the first article in this series, Welcome to data science, you’ll remember that a unit of analysis is the “for whom” or “what” of your prediction. So for the hospital readmission primary table, each row corresponds to a single patient and each column corresponds to features for each patient. The next step is to create a “cleaned” version of this dataset, meaning if there are columns (features) in the dataset providing data that would only be known after the prediction time, then remove those columns, except for the target feature, which is also referred to as the “Target Variable.” In the hospital readmission example, the primary table would look something like this, with one row for each unique patient instance:

[Figure: example Primary Table for hospital readmission, one row per patient]

Next, following Xavier’s advice, identify any data that needs to be removed because it provides data that is only available after our prediction time. This kind of data is known as target leakage because it will skew your model’s ability to correctly learn. Put another way, leakage is kind of like cheating with your learning data because you’re providing the model with a feature whose value cannot be known at prediction time. We’ll cover target leakage, in more detail, later in this article. For our hospital readmission data, notice there are two columns of data that capture information about each patient after the patient has returned to the hospital for readmission. These columns constitute leakage and so must be removed from our Primary Table:

[Figure: Primary Table with the two leakage columns marked for removal]

IMPORTANT NOTE: what if one of the columns in the data provides the answer (the Target Variable) for the question you are asking of the data? Sometimes you may actually have data that has the answer to the question you want to predict, especially if the data you have is extracted from a table of past events. If the Target Variable already exists in your data, then you’re in luck and won’t need to perform the operational step(s) to create it.
In the following example, the Target Variable already exists in the initial data we sourced for hospital admissions. This column has the data we need to answer for each patient: was the patient readmitted within 30 days of initial discharge?

[Figure: initial hospital admissions data that already includes the Target Variable column]

If you don’t find Target Variable data in any of the data you source, not to worry. It’s common practice to actually create that variable yourself after you’ve enriched the Primary Table. We’ll cover that step a little later in this section.

Here are some other tips for building a good Primary Table:
- Use a distribution of example rows that reflects the distribution you expect at prediction time. In other words, use real-world data that reflects the real-world problem your prediction aims to solve. Example: if you have a classification problem you are aiming to solve regarding whether or not a customer will purchase a particular product, then you need healthy examples of customers who did and customers who did not make a purchase. Additionally, if there is a seasonal component to the product, then you need to have enough event (purchase) history to cover multiple season cycles.
- Avoid ‘example overlap’. For the hospital readmission example, there should be only one row per patient, not multiple instances of the same patient within the same time window.
- Don’t perform fill-downs or aggregations on the data. In the screenshot example above, notice there are blank cells with missing information. By attempting to resolve those blanks, you are actually preventing the model from learning how to associate this missing information with other variables in the dataset. And as you continue enriching your Primary Table, the missing information may provide key correlations with other data you join into the Primary Table.
- If there is a time or seasonal component to your prediction (like “within 30 days” in the hospital example, or “within three weeks” in the customer churn example), then ensure there is enough history in the data to provide your model with enough examples.
- If your Primary Table does not have any sort of date feature and you anticipate working with date type data, then it's advisable to create one in your Primary Table. This allows you to concretely know the point beyond which leakage can occur. Also, such a date feature will afford you the flexibility to do computations with the date.
- The Primary Table, and in fact any data that’s used for teaching models, should not have Target Leakage. This is such an important topic that we’ve devoted an entire best practices article to Target Leakage: how to recognize and prevent it.

2b) Enrich with secondary tables

When your clean Primary Table is complete, start enriching it with secondary tables, which are datasets that include the additional features you think are fundamental to teaching the models. Again, using the hospital readmission example, you may have datasets that include additional, important details about many, if not all, of the patients in your Primary Table, for example their age, gender, current medications, etc. In this case, you should join the data on a common key for both your Primary and secondary tables, for example Patient ID. In this way, you are building out your data from the Primary and enhancing it with new features. Additionally, as you enrich the Primary with more data, you may also find that you can generate additional desired features (columns) by performing sums, subtraction, division, cosine similarity, etc. on columns within the dataset.

Finally, when creating a Primary Table and enriching it, you should not prep your data as you enrich it. The objective of the enrichment step is to generate more features for the Learning Dataset, not to clean the data up as you go.
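The enrichment join described in 2b can be sketched as a left join on the common key: every Primary Table row is kept, and columns from the secondary table are added where the key matches. The tables, the column names, and the helper `left_join` below are hypothetical, and the sketch assumes the secondary table has at most one row per Patient ID (a real project would more likely do this in Paxata, SQL, or a dataframe library).

```python
# Hypothetical Primary Table: one row per patient.
primary = [
    {"patient_id": "P001", "admit_date": "2020-01-03"},
    {"patient_id": "P002", "admit_date": "2020-01-10"},
    {"patient_id": "P003", "admit_date": "2020-02-01"},
]

# Hypothetical secondary table with extra patient details.
# Note that P002 has no entry here.
secondary = [
    {"patient_id": "P001", "age": 64, "gender": "F"},
    {"patient_id": "P003", "age": 72, "gender": "M"},
]


def left_join(primary, secondary, key):
    """Keep every primary row; add secondary columns where the key matches."""
    # Assumes at most one secondary row per key value.
    lookup = {row[key]: row for row in secondary}
    new_columns = sorted({col for row in secondary for col in row} - {key})

    enriched = []
    for row in primary:
        match = lookup.get(row[key], {})
        # Unmatched patients get None instead of a filled-in value, so the
        # missing information stays visible instead of being "cleaned" away.
        enriched.append({**row, **{col: match.get(col) for col in new_columns}})
    return enriched


enriched = left_join(primary, secondary, "patient_id")
for row in enriched:
    print(row)
```

Leaving the unmatched values empty follows the advice above not to prep the data during enrichment; any cleanup is deferred to the data prep step.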
2c) Create your Target Variable, if it’s not already in your data

Once you have what you believe is a good dataset with enough rows, with enough variety of examples, and enough essential features (columns), it’s time to create the Target Variable column, if it doesn’t already exist in the data. Sometimes the target is created through a simple lookup operation with another table that has the data. Sometimes it needs to be generated through a calculated column operation. And sometimes it needs to be generated by a complex SQL script; in the customer churn example, for instance, we want to make a churn prediction for a three-week window of time. For the hospital readmission example, in which you have both an admit date and a readmit date, you can create a Target Variable column based on the logic: if the readmit date is thirty days or fewer after the admit date, then Y; if the readmit date is more than thirty days after the admit date, or the patient has no readmit date, then N.

3. Perform your data prep steps and exploratory analysis

After you have pulled all of your data together into a single Learning Dataset, that’s when you want to begin your data prep and exploration. Your data prep steps will likely include things like standardizing date formats and removing unwanted observations. Then you’ll perform your own exploratory analysis on the data to gain additional insights into how best you can finally prepare the data before you start feeding it to the models. The next article in this best practices series, Data Prep and Exploratory Analysis on your Learning Dataset, takes a closer look at these exercises.

4. Find and remove Target Leakage

The final article in this best practice series, Target Leakage: how to recognize and prevent it, provides guidance on how to avoid introducing data leakage into your Learning Dataset.
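The target logic from step 2c and the leakage removal from step 4 can be sketched together for the hospital example. The code below derives a Y/N target from the admit and readmit dates, then drops the readmit date, since that value is only known after prediction time. Column names, sample dates, and the helper `add_target` are hypothetical. Treating exactly thirty days as Y and a missing readmit date as N are assumptions, and if your problem statement counts the window from discharge, you would pass a discharge date column as `start_col` instead.

```python
from datetime import date

WINDOW_DAYS = 30  # assumed to be inclusive: exactly 30 days counts as "Y"

# Hypothetical rows; a readmit_date of None means no readmission on record.
rows = [
    {"patient_id": "P001", "admit_date": date(2020, 1, 3), "readmit_date": date(2020, 1, 20)},
    {"patient_id": "P002", "admit_date": date(2020, 1, 10), "readmit_date": date(2020, 3, 15)},
    {"patient_id": "P003", "admit_date": date(2020, 2, 1), "readmit_date": None},
]


def add_target(rows, start_col="admit_date", readmit_col="readmit_date",
               target_col="readmitted_30d"):
    """Add a Y/N target column, then drop the column that would leak it."""
    result = []
    for row in rows:
        readmit = row.get(readmit_col)
        if readmit is None:
            label = "N"
        else:
            days = (readmit - row[start_col]).days
            label = "Y" if 0 <= days <= WINDOW_DAYS else "N"

        # The readmit date is only known after prediction time, so keeping it
        # as a feature would be target leakage.
        new_row = {k: v for k, v in row.items() if k != readmit_col}
        new_row[target_col] = label
        result.append(new_row)
    return result


learning_rows = add_target(rows)
for row in learning_rows:
    print(row["patient_id"], row["readmitted_30d"])
```

The same pattern applies to any column that was recorded after the moment of prediction: use it to build the target if needed, then remove it from the features.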
04 Data Prep & Exploratory Analysis in your Learning Dataset
This is the 4th article in our Best Practices for Building ML Learning Datasets series. In this article you’ll learn: When your Learning Dataset should be cleaned. Examples of data that should be cleaned. Data Prep Protips. If you read the first article in this series, you’ll remember that 80% of a data scientist’s time is spent finding, cleaning, and reorganizing data. But there’s great news: DataRobot has a data prep product, DataRobot Paxata, that empowers you to significantly reduce the amount of time you spend preparing your data. The purpose of this article is to help you quickly assess whether you’re ready to start prepping your Learning Dataset and to show where you’ll find more help content when you’re ready to use DataRobot Paxata for your prep work. When should my data be cleaned? Your Learning Dataset should be cleaned before you begin any feature engineering. With DataRobot Paxata you can clean your data, and then prep it to add and remove features, all in a single project. What data should be cleaned If you’re looking for starting points, here are a few common examples of when you’ll need to clean your data. Deduplication: you have a particular value represented in various ways and you need to standardize on one value, for example “New York”, “NY”, “New York City”, “NYC”, etc. Remove leading values: for example, you need to remove leading zeros for Eastern US zip codes. Standardizing date formats: for example, you have a dates column but the values in that column are represented in various formats: “mm/dd/yyyy”, “dd/mm/yy”, etc. Data Prep Protips before you begin your prep Consider before you aggregate. When you aggregate rows in your data, you lose signals from the detailed records. If you think you must aggregate, then take the opportunity to use feature engineering to represent the data in another way and restore some of those lost signals. For example, you can add sums, means, standard deviations, etc. to create new features.
Consider your data point outliers before removing them from your data. Ask yourself if those observations in the data are valuable for the model to learn. Ultimately, you want to optimize for features that are important at prediction time because this results in faster computations with lower memory consumption. Ready to start prepping your data? Visit DataRobot Paxata and Data Prep for Data Science where you’ll learn how to use our data prep product to: join datasets for feature enrichment; add a target variable to a training dataset; normalize messy categorical variables (for example NY versus New York, CA versus California); select only specific variables to save in a training dataset; remove unwanted observations; format dates so they are recognized by the training model; identify and redress missing or incomplete values; bin ranges into categories; compute windowed aggregates (for example rolling up sales transactions to a daily level); profile a dataset to understand your data prior to prep; perform exploratory data analysis.
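The "consider before you aggregate" protip can be illustrated with a short plain-Python sketch: when detail rows are rolled up to one row per key, summary statistics are kept as new features so that some of the lost signal survives. The column names and sample transactions are hypothetical, and using the population standard deviation is an arbitrary choice made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def aggregate_with_features(rows, key, value):
    """Roll detail rows up to one row per key, keeping summary statistics."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return [
        {
            key: group_key,
            "count": len(values),
            "sum": sum(values),
            "mean": mean(values),
            "std": pstdev(values),  # population standard deviation
        }
        for group_key, values in groups.items()
    ]

# Hypothetical transaction-level data.
transactions = [
    {"customer": "a", "amount": 10},
    {"customer": "a", "amount": 20},
    {"customer": "a", "amount": 30},
    {"customer": "b", "amount": 5},
]
customer_features = aggregate_with_features(transactions, "customer", "amount")
```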
05 Target Leakage: how to recognize and prevent it
This is the 5th article in our Best Practices for Building ML Learning Datasets series. In this article you’ll learn: How to recognize Target Leakage and why it’s problematic. Protips for avoiding Target Leakage. Recognizing Target Leakage and why it’s problematic Target leakage is defined as: including, in your dataset, future information that would not be known at prediction time. When Target Leakage occurs, you are teaching your models with ‘contaminated’ data that results in overly optimistic expectations about how your model will perform in production. In other words, the performance you observe during the model building phase will not match what you’ll see when that model is put into production because the model was unable to properly learn. Think of Target Leakage as looking like the following visual example for interest rate data in which our Learning Dataset includes information that is only available after prediction time. Protips to avoid Target Leakage Create a prediction date feature for transactional data: if you’re using data from transactional tables, you must have a prediction date, which serves as a cutoff date, in the data. This prediction date is a feature (column) that you create in your data and it serves as a boundary in time beyond which you should not include additional transaction data. Avoid having more than one time value in an observation (row): if you have a single row of data with more than one time value in the row, then it's very easy to mistakenly run the prediction without considering both times. Consider how critical data may be affected at prediction time: what if the data changed from the point in time a prediction is needed to the point in time the dataset is created, for example, today? For example: you are predicting if a credit card transaction is fraud. 
However, when creating your Learning Dataset, you need to be mindful of the fact that, after a fraud event, the bank may automatically close an account until the card user is notified. So if the transactional data you want to use for creating your Learning Dataset has a column for “number of accounts” and it uses the number of accounts from *today* instead of at the time of transaction, then you have target leakage. Check out the following resources for a deeper dive on this topic: Blog: What is Target Leakage and How Do I Avoid it? DataRobot wiki on Target Leakage
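The prediction date protip can be sketched as a simple cutoff filter. This is a minimal illustration that assumes each transaction carries its own date; the field names and sample data are hypothetical, and excluding rows dated on the prediction date itself is an assumption that depends on when the prediction is actually made.

```python
from datetime import date

def rows_known_at_prediction(transactions, prediction_date):
    """Keep only transactions dated strictly before the prediction date."""
    return [t for t in transactions if t["transaction_date"] < prediction_date]

# Hypothetical transactional table for one account.
transactions = [
    {"account": "A", "transaction_date": date(2020, 3, 1), "amount": 40},
    {"account": "A", "transaction_date": date(2020, 3, 9), "amount": 15},
    {"account": "A", "transaction_date": date(2020, 3, 12), "amount": 900},
]
prediction_date = date(2020, 3, 10)

# Only information available before the cutoff reaches the Learning Dataset.
learning_rows = rows_known_at_prediction(transactions, prediction_date)
```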
DataRobot Paxata and Data Prep for Data Science
What is data preparation for machine learning? Data preparation is the process of transforming raw data so that it's properly prepared for the machine learning algorithms used to uncover insights and make predictions. Why is data preparation important? Most machine learning algorithms require data to be formatted in very specific ways, which means your raw datasets generally require some amount of preparation before they can yield useful insights. For example, some datasets have values that are missing or invalid. If data is missing, the algorithm can’t use it. And if data is invalid, the algorithm produces less accurate or even misleading outcomes. Good data preparation produces clean and well-curated data that leads to more practical, accurate model outcomes. So what can I do to prep my data? DataRobot Paxata provides the transformation tools you need to clean, normalize, and shape your data. And once you've cleaned your data, DataRobot Paxata also provides the tools you need to prepare your features for optimal feature engineering. Here's just a short list of how DataRobot Paxata can help you to quickly prep your data to train your machine learning models: join datasets for feature enrichment; add a target variable to a training dataset; normalize messy categorical variables (for example NY versus New York, CA versus California); select only specific variables to save in a training dataset; remove unwanted observations; format dates so they are recognized by the training model; identify and redress missing or incomplete values; bin ranges into categories; compute windowed aggregates (for example rolling up sales transactions to a daily level); profile a dataset to understand your data prior to prep; perform exploratory data analysis.
Blenders and Ensembles
Ensemble learning allows you to create more accurate models by combining the power of multiple models: DataRobot calls these types of models “blenders.” Figure 1: Ensemble Learning Blending and Stacking DataRobot uses two major techniques to create blenders. The first is to simply take the means or medians over several models for each observation and use that as a prediction. The second is to take the predictions from several models and use them as features in a final model; this is called stacking (Figure 2). Figure 2: Ensemble example Why is this done? Creating blenders often results in more accurate models. Blenders allow you to use multiple blueprints. And blenders allow you to leverage the wisdom of crowds principles. What are some considerations? It is generally advised to blend models that rely on different algorithms and have good accuracy. While blenders can give you a boost in accuracy, they can often take more time to create and score because they are more complex. Blender models can make the final model more complicated to communicate to regulators; however, DataRobot model interpretability tools and documentation can help you with this process. Leaderboard By default, DataRobot will create four blenders for each Autopilot run (Figure 3). This includes two blenders that blend the top 3 models and two blenders that blend the top 8 models. Figure 3. Leaderboard Blueprints If you're curious about any of these models, all you have to do is click on the blender (model) and the blender blueprint will be displayed (Figure 4). You can see within this blueprint the preprocessing steps for the blender. Notice that these steps include other models that you can find on the Leaderboard. The final step is the actual blending. If you want to look at any of these steps in more detail you can simply click one of the blueprint boxes and, in the displayed pop-up, select to view DataRobot Model Docs. Figure 4. Blueprint
Evaluation The same metrics that you use to evaluate and interpret your other models apply here as well (see Figure 5). Figure 5. Evaluation Interpreting Models You can also go to the Understand division to see Feature Impact, Feature Effects, and Prediction Explanations (Figure 6). Figure 6. Interpretation Create your own blender If you want to create your own blender, all you have to do is select models that you want to blend. These two strategies can help you get the most out of your blenders. Stay near the top of the Leaderboard, so you will use the most accurate models. Use models that rely on different algorithms. You can then simply go to the menu and, under the section called blending, you can select a number of different types of blenders (Figure 7). Figure 7. Types of blenders This is going to start the blending process and you can see this processing on the right-hand side of the screen. When the blender is complete, it will be added to the Leaderboard and ranked among the other models that you've created.
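The two blending techniques described in this article can be sketched in a few lines of plain Python. This is a conceptual illustration only, not how DataRobot implements blenders; the three lists of predictions are made-up values. The mean and median blenders combine predictions per observation, and for stacking the same predictions are arranged as feature rows for a final model (fitting that final model is left out of the sketch).

```python
from statistics import mean, median

def blend(predictions_by_model, how=mean):
    """Combine per-observation predictions from several models into one."""
    return [how(row) for row in zip(*predictions_by_model)]

# Hypothetical predictions from three models for the same three observations.
model_a = [0.25, 0.5, 0.75]
model_b = [0.5, 0.5, 1.0]
model_c = [0.75, 0.5, 0.5]

average_blend = blend([model_a, model_b, model_c])
median_blend = blend([model_a, model_b, model_c], how=median)

# For stacking, each observation's predictions become the input features
# of a final model.
stacking_features = [list(row) for row in zip(model_a, model_b, model_c)]
```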
Quick Help: Target Selection
To build a predictive model, you need to specify the target, which is simply a column in your data that represents what you would like to predict. To select a target in the Data page (What would you like to predict field), start typing a few letters of the column name; DataRobot will show you a list of columns you can select from. After you enter your target, DataRobot will recognize this as either a classification problem (if you have categories) or as a regression problem (if your target is numerical). DataRobot then displays the distribution of the target feature. If you are unsure about your target, please go back and check the Data Import video that explains more about identifying the Target column.
Quick Help: Data Import
Importing AI-ready data into DataRobot is a simple process. Make sure that your data is in a tabular format that adheres to the following minimum data format requirements. Data format requirements Supported File Types: csv, tsv, dsv, xls, xlsx, sas7bdat, bz2, gz, zip, tar, tgz. Supported Variable Types: numeric, categorical, boolean, text, date, currency, percentage, and length. Minimum Rows Required: 20. Maximum Rows Allowed: The maximum rows allowed for Trial users is 100,000. Specifying a target column To build a predictive model using DataRobot, you need to specify a target for your data. A target is simply a column in your data that represents what you would like to predict; give it a header name that is easy to remember. DataRobot will automatically determine the type of machine learning problem based on the data in your dataset: multiclass classification, regression, or even Time Series. (The dataset attached to this article shows an example of a defined target column. You can download this dataset and use it to test out model building.) Importing from other data sources While the AI Platform Trial is limited to local file imports, DataRobot supports a wide range of JDBC-compliant data sources. Importing by URL supports a variety of sources, from HTTP to S3. You can use HDFS for ingesting data from Hadoop. Is your data AI-ready? Preparing data for Machine Learning can be an arduous task. Thankfully, DataRobot Paxata makes data prep a snap. Learn more about DataRobot Paxata and get started with a 14-day trial here.
What Features are Important to My Model?
This article provides an introduction to Feature Impact. Feature Impact To find Feature Impact, select the model of interest on the Leaderboard: Figure 1. Leaderboard Then click the Understand division. Feature Impact is shown by default. Figure 2. Feature Impact Feature Impact is a model-agnostic method that informs us of the most important features of our model. The methodology used to calculate this impact, permutation importance, normalizes the results, meaning that the most important feature will always have a feature impact score of 100%. One way to understand feature impact is like this: for a given column, feature impact measures how much worse a model would perform if DataRobot made predictions after randomly shuffling that column (while leaving other columns unchanged). If you want to aim for parsimonious models, you can remove features with a low feature impact score. To do this, create a new feature list (in the Feature Impact tab) that has the top features and build a new model for that feature list. You can then compare the difference in model performance and decide whether the parsimonious model is better for your use case. Furthermore, even though it is not that common, features can also have a negative feature impact score. When this is the case, the model appears to perform better when the feature is shuffled, which suggests that the feature is not improving model performance. You may consider removing such features and evaluating the effect on model performance. Lastly, be aware that feature impact differs from the importance measure shown in the Data page. The green bars displayed in the Importance column of the Data page are a measure of how much a feature, by itself, is correlated with the target variable. By contrast, feature impact measures how important a feature is in the context of a model. In other words, feature impact measures how much (based on the training data) the accuracy of a model would decrease if the information in that feature were scrambled by shuffling.
More Information If you’re a licensed DataRobot customer, search the in-app documentation for Feature Impact.
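The shuffling idea behind permutation importance can be sketched in plain Python. This is a generic illustration of the technique and not DataRobot's implementation; the toy model, the error metric (mean absolute error), the number of repeats, and the data are all assumptions. Each column is shuffled in turn, the increase in error is recorded, and the results are scaled so the most important feature scores 100%.

```python
import random

def mean_abs_error(model, rows, targets):
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)

def permutation_impact(model, rows, targets, n_repeats=5, seed=0):
    """Return per-column impact, scaled so the top feature scores 100.0."""
    rng = random.Random(seed)
    base_error = mean_abs_error(model, rows, targets)
    raw = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Replace only this column; all other columns stay unchanged.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(mean_abs_error(model, permuted, targets) - base_error)
        raw.append(sum(increases) / n_repeats)
    top = max(raw)
    return [100.0 * (v / top) if top > 0 else 0.0 for v in raw]

# Toy model that uses only the first column and ignores the second.
toy_model = lambda row: 2 * row[0]
rows = [[i, (7 * i) % 5] for i in range(1, 9)]
targets = [toy_model(r) for r in rows]
impact = permutation_impact(toy_model, rows, targets)
```

Because the toy model ignores the second column, shuffling it changes nothing and its impact is zero, while the first column receives the full 100%.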
This article summarizes how DataRobot handles text features using state-of-the-art Natural Language Processing (NLP) tools such as Matrix of Word Ngram, Auto Tuned Word Ngram Text Modelers, Word2Vec, Fasttext, cosine similarity, and Vowpal Wabbit. It also covers NLP visualization techniques such as the frequency value table and word clouds. The following video explains how DataRobot uses text features for machine learning models. Your dataset contains one or more text variables as shown in Figure 1 and you are wondering whether DataRobot can incorporate this information into the modeling process. Figure 1. Input dataset with one or more text variables DataRobot lets you explore the frequency of the words by giving you a frequency value table, which is the histogram of the most frequent terms in your data, and a general table where you can see the same information in a tabular format (Figure 2). Figure 2a. Frequency Values Table for word frequency visualization Figure 2b. General Table for word frequency visualization Moving to modeling, DataRobot commonly incorporates the Matrix of Word Ngram in blueprints (Figure 3). This is a matrix produced using a widely used technique, TF-IDF values, and combines multiple text columns. For dense data, DataRobot offers the Auto Tuned Word Ngram text modelers (Figure 4), which only look at one individual text column at a time. The latter approach fits a single n-gram model to each text feature in the input dataset, and then uses the predictions from these models as inputs to other models. Figure 3. An example blueprint that uses a Matrix of Word Ngram as a preprocessing step Figure 4. An example blueprint that uses an Auto Tuned Word Ngram text modeler as a preprocessing step Auto Tuned models for a given sample size are visualized as Word Clouds (Figure 5). These can be found in the Insights > Word Cloud tab.
The top 200 terms with the highest coefficients are shown, along with the frequency with which each term appears in the text. Figure 5. Text visualization using Word Cloud In Figure 5, terms are displayed in a color spectrum from blue to red with blue indicating a negative effect and red indicating a positive effect relative to the target values. Terms that appear more frequently are displayed in a larger font size, and those that appear less frequently are displayed in a smaller font size. There are a number of things you can do with this display: view the coefficient value specific to a term by mousing over the term; view the word cloud of another model by clicking the dropdown arrow above the word cloud; view class-specific word clouds (for multiclass classification projects). The coefficients for the Auto Tuned Word Ngram text modelers are available in the Insights > Text Mining tab (see Figure 6). It shows the most relevant terms in the text variable, and the strength of the coefficient. You can download all the coefficients in a spreadsheet by clicking on the Export button. Figure 6. Text Mining tab Finally, DataRobot also offers more NLP approaches in the Repository, such as Fasttext (Figure 7a) and Word2Vec (Figure 7b). You can find these by typing ‘Word2Vec’ or ‘Fasttext’ in the search box; DataRobot will retrieve all blueprints that contain these preprocessing steps. Figure 7a. Example blueprints with Fasttext as part of their preprocessing steps Figure 7b. Example blueprints with Word2Vec as part of their preprocessing steps Besides all of these, DataRobot has other techniques such as cosine similarity (Figure 8a) when there are multiple text features and Vowpal Wabbit-based classifiers; the latter use Ngrams (Figure 8b). Figure 8a. Example blueprints with Pairwise Cosine Similarity as part of their preprocessing steps Figure 8b. Example blueprints with Vowpal Wabbit-based classifiers
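To make the idea of a word n-gram matrix with TF-IDF values more concrete, here is a minimal plain-Python sketch. The article does not state which TF-IDF variant, tokenization, or n-gram range DataRobot uses, so the choices below (lowercased whitespace tokens, unigrams and bigrams, raw term count times log(N / document frequency)) are assumptions for illustration only, as are the sample documents.

```python
import math
from collections import Counter

def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a text."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    ]

def tfidf_matrix(documents, n_max=2):
    """Build a documents x n-grams matrix of TF-IDF values."""
    counts = [Counter(word_ngrams(doc, n_max)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    n_docs = len(documents)
    doc_freq = {term: sum(1 for c in counts if term in c) for term in vocabulary}
    matrix = [
        [c[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary]
        for c in counts
    ]
    return vocabulary, matrix

# Hypothetical text feature values.
vocabulary, matrix = tfidf_matrix(["chest pain", "back pain"])
```

Terms that occur in every document (here "pain") get a weight of zero under this variant, while terms specific to one document keep a positive weight.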
What are the Modeling Options in DataRobot?
What are the modeling options that you can experiment with when building your models? This post provides an overview of some of the options available through the GUI, specifically: specifying a target variable, choosing a data partition strategy, optimization metrics, and the model building selections. This video shows an overview of modeling options in DataRobot. After uploading your data, DataRobot performs some exploratory data analysis (EDA) on it and displays the results under the project Data tab, where you can review the shape and distribution of each variable, apply basic transformations, and create derived features (see Figure 1). Figure 1. Results of EDA process After you indicate what the target variable is, DataRobot will provide recommendations for the best modeling options given the shape and distribution of the target variable in relation to all other variables in your data. Figure 2. The target variable after DataRobot analyzes it Next, you can customize how the project is set up by selecting specific model building options in the Advanced Options tab (see Figure 3). For example, you can select the partition method that will be used for training and validating your data. You can also select the number of cross validation folds and the percentage of data that will be held out during the model building process. Figure 3. Advanced Options tab Before starting the model building process, you must select the modeling mode: Manual mode (DataRobot runs user-selected models), Quick mode (DataRobot runs selected models at maximum sample size), or Autopilot (DataRobot selects the best predictive models given your target variable and the distribution of the other variables in your dataset) (see Figure 4). Figure 4: DataRobot modeling modes: Autopilot, Quick, and Manual At this point you can press the Start button, which will make DataRobot kick off the automation for your model-building process.
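The two partitioning settings mentioned above, the number of cross validation folds and the holdout percentage, can be pictured with a small plain-Python sketch of a random partition. This is a conceptual illustration rather than DataRobot's partitioning code; the default values of 5 folds and 20 percent holdout are assumptions chosen for the example.

```python
import random

def partition_rows(n_rows, n_folds=5, holdout_pct=20, seed=0):
    """Randomly split row indices into a holdout set and cross validation folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = n_rows * holdout_pct // 100
    holdout, training = indices[:n_holdout], indices[n_holdout:]
    # Deal the remaining rows round-robin into the folds.
    folds = [training[i::n_folds] for i in range(n_folds)]
    return folds, holdout

folds, holdout = partition_rows(100)
```

Each fold serves once as validation data while the other folds are used for training; the holdout rows are not used in any fold.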
Paxata+DataRobot Demo: Lending Club
This video demonstration highlights how DataRobot Paxata self-service data preparation can accelerate the preprocessing of raw data sources into clean, consumable training data for modeling in DataRobot.
Joining Datasets for Feature Enrichment
Do you have several datasets that you need to combine in order to stack all the features into a single dataset? For example, you have a predictive model for whether or not a patient will be readmitted to the hospital within 30 days of discharge. And you have three datasets: one with admissions data, another with diagnostic codes, and then one with hospital codes that map to categorical values. Now you need to join everything together to get all of your features combined into a single dataset. When you have multiple datasets that you want to join together in order to stack all features into one dataset for your ML models, you can easily do that with the Lookup tool. The Lookup tool provides a join type operation that allows you to combine another dataset with your Base (driving) dataset. Additionally, the tool provides a "Detect Joins" option that visually shows you the variable(s) that your datasets share, along with a percentage score that guides you on how best to combine the datasets. For complete details on the Lookup tool, with plenty of examples, see the official documentation for: Joining data with the Lookup tool.
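Conceptually, a lookup of this kind behaves like a left join on a shared key. The sketch below shows that idea in plain Python; it is not the Lookup tool's implementation, and the dataset contents and column names are hypothetical. The small `key_match_pct` helper only illustrates what a match percentage between two datasets could look like; how "Detect Joins" actually computes its score is not described here.

```python
def lookup_join(base_rows, lookup_rows, key):
    """Left join: add lookup columns to each base row that has a matching key."""
    index = {row[key]: row for row in lookup_rows}
    # Base values win if both datasets share a column name.
    return [{**index.get(row[key], {}), **row} for row in base_rows]

def key_match_pct(base_rows, lookup_rows, key):
    """Percentage of base rows whose key value appears in the lookup dataset."""
    lookup_keys = {row[key] for row in lookup_rows}
    matched = sum(1 for row in base_rows if row[key] in lookup_keys)
    return 100 * matched / len(base_rows)

# Hypothetical admissions (base) and hospital code (lookup) datasets.
admissions = [
    {"patient_id": 1, "hospital_code": "H1"},
    {"patient_id": 2, "hospital_code": "H2"},
    {"patient_id": 3, "hospital_code": "H9"},
    {"patient_id": 4, "hospital_code": "H1"},
]
hospitals = [
    {"hospital_code": "H1", "hospital_type": "Teaching"},
    {"hospital_code": "H2", "hospital_type": "Community"},
]
enriched = lookup_join(admissions, hospitals, "hospital_code")
match_pct = key_match_pct(admissions, hospitals, "hospital_code")
```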
How Do I Add a Target to the Training Dataset?
The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding. A supervised machine learning algorithm uses historical data to learn patterns and uncover relationships between other features of your dataset and the target. When you are ready to create a target variable in your dataset, that's easily done through DataRobot Paxata's Computed Column feature. Here's a quick video that shows just how easy it is to create your target variable.
Unwanted Observations in My Dataset—How Do I Remove Them?
Removal of unwanted observations, including deleting duplicate or irrelevant values from your dataset, is super simple in DataRobot Paxata with three tools that quickly get you on your prepping way: Deduplicate: when you have multiple rows for the same data and want to remove the duplicates. For example, if you have a housing model and your dataset has several records for the same address, then the Deduplicate tool allows you to remove those duplicates and condense the data into a single row. See the Deduplicate documentation for details. Remove Rows: when you want to remove unwanted rows of data from your dataset, the Remove Rows tool is the one for you. For example, you have a feature for housing data that includes values for single family homes and apartments, but you don't want the apartments in your data. In this example, you'll first use a Filtergram to select the "apartment" values, and then the Remove Rows tool to remove them. When you're ready to start removing those rows, see the official documentation for the Remove Rows tool. Columns Management tool: in DataRobot Paxata, your variables are managed through columns in the dataset. The Columns tool allows you to rename, remove and reorder any column variables in your dataset. We have another Community article on this topic that gives you an overview of how to locate and use this tool. And there is also official documentation here: Columns management tool. If you're not quite sure where or how to begin looking for duplicates or irrelevant values in your data, we recommend checking out our Community article on Exploratory Data Analysis: histograms to help you better understand your data. You can also check out our official docs on filtergrams, the tool of choice for understanding exactly what's buried in your variables.
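The two row-level operations described above, deduplicating and removing rows that match a filter, can be sketched in plain Python. This only illustrates the concepts with hypothetical housing data; it does not reproduce the DataRobot Paxata tools, and keeping the first record seen for each address is an assumption of the sketch.

```python
def deduplicate(rows, key):
    """Keep the first row seen for each distinct key value."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

def remove_rows(rows, column, unwanted_value):
    """Drop every row whose column equals the unwanted value."""
    return [row for row in rows if row[column] != unwanted_value]

# Hypothetical housing dataset.
housing = [
    {"address": "1 Elm St", "type": "single family"},
    {"address": "1 Elm St", "type": "single family"},
    {"address": "9 Oak Ave", "type": "apartment"},
    {"address": "4 Pine Rd", "type": "single family"},
]
cleaned = remove_rows(deduplicate(housing, "address"), "type", "apartment")
```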
Binning Ranges into Categories
When preparing your data for machine learning, there are times when you will want to do a binning exercise that allows you to further categorize your data. For example, in the dataset in this video, a Filtergram on the "Ownership" variable displays all of the values for that variable. But looking closely, we see that all of the values really do distill down into two types: Government and Non-Profit. So let's create a new variable in our dataset to capture that distinction, and here's a short binning video to show you how we can do that.
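As a rough sketch of what a binning step does, the plain-Python example below maps detailed values to a coarser category and, separately, bins a numeric range into labeled buckets. The raw "Ownership" values, the length-of-stay edges, and the labels are hypothetical; the video's actual dataset values are not reproduced here.

```python
from bisect import bisect_right

def bin_ownership(value):
    """Collapse detailed ownership values into two categories."""
    return "Government" if value.lower().startswith("government") else "Non-Profit"

def bin_range(value, edges, labels):
    """Assign a numeric value to a labeled bin; each edge starts a new bin."""
    return labels[bisect_right(edges, value)]

# Hypothetical raw values for the "Ownership" variable.
ownership_values = ["Government - Federal", "Government - State", "Voluntary non-profit - Church"]
ownership_bins = [bin_ownership(v) for v in ownership_values]

# Hypothetical numeric binning: length of stay in days.
stay_bins = [bin_range(days, [3, 8], ["short", "medium", "long"]) for days in [1, 3, 7, 8, 20]]
```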
How Can I Explain a Prediction?
This article explains Prediction Explanations. Prediction Explanations To find Prediction Explanations, click on the model of interest in the Leaderboard. Figure 1. Leaderboard Then select Understand > Prediction Explanations. Figure 2. Prediction Explanations Prediction Explanations tell us why our model assigned probability x to a specific observation. By default, DataRobot will give the top three reasons why this prediction was made, but you can get a maximum of ten reasons. At the bottom of this panel, for the top row, we see that a patient was assigned a 94.7% probability of being readmitted into the hospital. DataRobot Prediction Explanations attribute this to the rather high number of inpatient stays, the patient's weight, and the missing admission type id. These explanations provide context for a prediction and are helpful for explaining it. More Information If you’re a licensed DataRobot customer, search the in-app documentation for Prediction Explanations.
How Can I Import Data into DataRobot?
This page provides an overview of the data file types and data import methods that DataRobot supports on a new project page. This video demonstrates the process for importing data. DataRobot currently supports .csv, .tsv, .dsv, .xls, .xlsx, .sas7bdat, .bz2, .gz, .zip, .tar, and .tgz file types. These files can be uploaded locally by either dragging them onto the user interface or using a file browser. You can also fetch them from a remote server using a URL or a local HDFS file system, or read directly from a variety of enterprise JDBC enabled databases as shown in Figure 1. DataRobot also offers the ability to import data from the DataRobot AI Catalog. This is a centralized data store that is tightly integrated into the DataRobot AI platform. Figure 1. Methods for importing data into DataRobot More information If you're a licensed DataRobot customer, search in-app documentation for Import data for an overview of importing data.
How Can I Understand My Model?
This page provides a short summary of Feature Impact, Feature Effects, and Prediction Explanations. For more detailed coverage, see Understanding Models Overview. Feature Impact To find Feature Impact, click on the model of interest in the Leaderboard. Figure 1. Leaderboard Then click on the Understand division. Feature Impact is shown. Figure 2. Feature Impact Feature Impact is a model-agnostic method that informs us of the most important features. The results are normalized so the most important feature will always have a feature impact score of 100%. If the Feature Impact is small, it suggests that removing the feature will not reduce predictive power. Feature Effects Feature Effects can be found by clicking on the Feature Effects tab (next to the Feature Impact tab). Figure 3. Feature Effects Feature Effects tell us how the individual changes in the values of a feature affect the target outcome if everything else remains steady. For example, in Figure 3, we see that as the number of inpatient stays increases (X-axis), the probability of being readmitted into the hospital (Y-axis) also increases. There seem to be diminishing effects, though: above 4 to 5 inpatient stays, the probability of being readmitted does not increase significantly. Prediction Explanations Prediction Explanations can be found by clicking on the Prediction Explanations tab (next to the Feature Effects tab). Figure 4. Prediction Explanations Prediction Explanations tell us why our model assigned probability x to a specific observation. By default, DataRobot will give the top three reasons why this prediction was made, but you can get a maximum of ten reasons. This is very useful for getting a better understanding of a specific prediction. More Information If you’re a licensed DataRobot customer, search the in-app documentation for Feature Impact, Feature Effects, and Prediction Explanations.
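The Feature Effects idea, varying one feature while everything else remains steady, is commonly computed as a partial dependence curve. The plain-Python sketch below shows that general technique; whether DataRobot computes Feature Effects exactly this way is not stated in this article, and the toy scoring model and data are hypothetical.

```python
def partial_dependence(model, rows, col, grid):
    """Average prediction when column `col` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        predictions = [model(r[:col] + [value] + r[col + 1:]) for r in rows]
        curve.append(sum(predictions) / len(predictions))
    return curve

# Toy risk score: rises with inpatient stays (column 0) but flattens above 4.
toy_model = lambda row: min(row[0], 4) * 10 + row[1]
rows = [[0, 2], [3, 4]]
curve = partial_dependence(toy_model, rows, col=0, grid=[0, 2, 4, 6])
```

The flat segment at the end of the curve mirrors the diminishing effect described for inpatient stays: past a certain value, increasing the feature no longer changes the average prediction.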
How to Understand a DataRobot Model
DataRobot AutoML gives you multiple models and tools to compare them—this ebook explains why that's important to successful machine learning. To explain model accuracy, DataRobot provides drill down tools such as Lift charts, ROC curve, and Residuals. DataRobot's various interpretability methods like Feature Impact, Feature Effects, and Prediction Explanations are covered in depth. Also included in this book are explanations of how DataRobot provides rules through Hotspots and a formula with Eureqa.
This page explains DataRobot tools for model hyperparameter tuning. Advanced Tuning DataRobot will automatically search for the best hyperparameters to optimize your models. If you wish to investigate the best hyperparameters, or if you wish to change and tweak them in search of a better model, start by clicking on that model in the Leaderboard. Figure 1. Leaderboard Then, click on Evaluate > Advanced Tuning. Figure 2. Advanced Tuning options If you followed the instructions properly, you should be looking at something similar to Figure 2. At the top, you have three options to choose from: New Search, Searched, and Best of Searched. New Search The New Search option is for those data scientists looking to initiate a new search for the best parameters. The parameters are split into two groups: preprocessing parameters, like missing value imputation, and model parameters, like the number of trees. Click on the parameters you want to change and type in the value you would like to try out. In some cases, you can even enter multiple values in the form of a list. When you are done, go to the bottom of the menu, where you can initiate a Smart Search or a Brute Force search, as seen in Figure 3. Please keep in mind that extensive searches can take a long time to compute. Figure 3. Initiating a new hyperparameter search Searched By clicking on Searched, you will be able to see all of the values DataRobot tried out when searching for the best set of hyperparameters. Best of Searched Best of Searched will yield just the best parameters based on the search that DataRobot conducted. More Information If you’re a licensed DataRobot customer, search the in-app documentation for Advanced Tuning.
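A brute force hyperparameter search simply tries every combination of the listed values and keeps the best one, which is why extensive searches take a long time. The plain-Python sketch below shows that general idea; it is not DataRobot's Advanced Tuning code, the parameter names and the toy scoring function are hypothetical, and how Smart Search chooses its candidates is not covered.

```python
from itertools import product

def brute_force_search(score_fn, grid):
    """Try every parameter combination; lower scores are better."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    tried = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        tried.append((params, score))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, tried

# Toy error surface standing in for "train the model and measure its error".
toy_error = lambda p: abs(p["n_trees"] - 200) + 1000 * abs(p["learning_rate"] - 0.1)
grid = {"n_trees": [100, 200, 400], "learning_rate": [0.01, 0.1]}
best_params, best_score, tried = brute_force_search(toy_error, grid)
```

The `tried` list plays a role similar to the Searched view (every combination evaluated), and `best_params` is comparable to Best of Searched.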
Comparing Models Overview
In addition to the information available from the Leaderboard, the Models page provides other tabs to help compare model results. These are Learning Curves, Speed vs Accuracy, and Model Comparison. Figure 1. Tools for comparing model results Learning Curves shows how performance changes as the sample size increases. You can use learning curves to help determine whether it's worthwhile to increase the size of your dataset. Getting additional data can be expensive, but it may be worthwhile if it improves the model accuracy. In Figure 2, we see lines in the graph on the left with dots that connect each line segment. Each dot represents a portion of the training data. So by default we have 16 percent, 32 percent, and 64 percent of our training data. And if the holdout is unlocked, then validation performance is shown for up to 80 percent as well. Hovering the mouse over any of the line segments highlights the name of the associated machine learning algorithm (listed to the right). Each line represents a machine learning algorithm and the feature list that was used to train it, so each line is a group that consists of the models for each of the different training data set sizes. Figure 2. Learning Curves graph The Speed vs Accuracy analysis plot shows the tradeoff between prediction runtime and predictive accuracy, and helps you choose the best model with the lowest overhead as a combination of the two. On the Y-axis we see the currently selected metric, which in this case is LogLoss. On the X-axis we see the prediction speed as the time in milliseconds to score two thousand records. Like the learning curves display, we can hover the mouse over each dot or the name of the machine learning algorithm to highlight its counterpart on the opposite graph. Figure 3. Speed vs Accuracy Model Comparison provides more detailed ways to compare two models in your project. Comparing models can help identify a model that more precisely meets your requirements.
It can also help in selecting candidates for ensembling, that is, for building blender models, as they are called. For example, two models may diverge considerably, but by blending them, you can improve your predictions. Or maybe you have two relatively strong models and, by blending them, you can create an even better result. To create a model comparison, first select the two models you want to compare, shown at the top of the page in blue on the left and yellow on the right. Clicking either of those lets you select a model. Next, choose the chart type that you want to display for comparing the selected models. The first option, the ROC curve, helps to explore classification projects in terms of performance and statistics, namely the balance between the true positive rate and the false positive rate as a function of a cutoff threshold. Figure 4. Model Comparison The Lift chart depicts how effective each model is at predicting the target at different value ranges. We can look at this as a distribution of the predictions that each model makes, ordered from lowest to highest and grouped into the number of bins that we select in the Number of Bins dropdown list. Figure 5. Lift chart The Dual Lift chart is a mechanism for visualizing how two competing models perform against each other; that is, their degree of divergence in relative performance. Whereas the Lift chart sorts predictions from lowest to highest for a single model, the Dual Lift chart sorts the rows by the magnitude of the difference between the two models’ prediction scores. The plot's color coding matches the color of each model at the top, and the divergence between the two models widens from the left, flips over at the midpoint, and then widens again on the right. The Dual Lift chart is a good tool for assessing candidates for ensemble modeling.
Finding different models with large divergences in the target rate (as shown with the orange line) could indicate good pairs of models to blend. That is, does a model show strength in a particular quadrant of the data? You might be able to create a strong ensemble by blending a model that is strong in an opposite quadrant. Figure 6. Dual Lift chart
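The Dual Lift idea described above can be sketched in a few lines of Python. This is a minimal illustration, not DataRobot's implementation: the function name `dual_lift`, the equal-sized binning, and the choice to sort by the signed difference (model A minus model B, which is what makes the lines cross at the midpoint) are all assumptions made for the example.

```python
from statistics import mean


def dual_lift(pred_a, pred_b, actual, n_bins=4):
    """Bin rows by the signed difference between two models' predictions.

    Hypothetical helper: rows are sorted by (pred_a - pred_b), split into
    roughly equal bins, and each bin reports the mean prediction of each
    model next to the mean actual value.
    """
    rows = sorted(zip(pred_a, pred_b, actual), key=lambda r: r[0] - r[1])
    size = len(rows) / n_bins
    bins = []
    for i in range(n_bins):
        chunk = rows[round(i * size):round((i + 1) * size)]
        if not chunk:
            continue  # more bins than rows: skip empty bins
        bins.append({
            "mean_a": mean(r[0] for r in chunk),
            "mean_b": mean(r[1] for r in chunk),
            "mean_actual": mean(r[2] for r in chunk),
        })
    return bins


if __name__ == "__main__":
    # Toy data: the two models disagree strongly on the first two rows.
    a = [0.1, 0.9, 0.5, 0.3]
    b = [0.8, 0.2, 0.5, 0.4]
    y = [1, 1, 0, 0]
    for row in dual_lift(a, b, y, n_bins=2):
        print(row)
```

In each bin, the model whose mean prediction sits closer to `mean_actual` is the stronger one for that slice of the data. Two models that win in opposite bins are the kind of pair the article suggests blending.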
Modeling Options (Advanced)
This article covers advanced options for modeling, which include Partitioning, Smart Downsampling, Feature Constraints, and Additional. Advanced Options After you upload your data and select a target variable, DataRobot will automatically choose what the best settings are for your specific dataset. If you wish to tweak these settings, click on the Show Advanced Options menu at the bottom of the page. Figure 1. Data page The Advanced Options menu will now be available. Figure 2. Advanced Options menu Partitioning The first tab selected will be Partitioning. Partitioning describes the method DataRobot uses to “clump” observations (or rows) together for evaluation and model building. DataRobot supports the following partitioning methods: Random Stratified Partition Feature Group Date/Time Random partitioning With Random partitioning, DataRobot randomly assigns observations (rows) to the training, validation, and holdout sets. By default, as you can see in Figure 2, random partitioning will initiate with 5 cross validation Folds and with a 20% holdout percentage. (BEWARE: This may change depending on the size of the dataset you upload.) All of the settings are completely customizable. Stratified partitioning With stratified partitioning, you sample each subpopulation of your dataset separately. This means, that if you have an imbalanced dataset with 10% of your observations being positive, all of your partitions will preserve that ratio of 9:1. Figure 3. Stratified partitioning options Stratified partitioning has the exact same settings as random partitioning. Partition Feature partitioning With Partition Feature option, you can pick a partition feature of your choosing and DataRobot will create a distinct partition for each unique value of that feature. This is useful when you want DataRobot to respect some partitioning you made outside of DataRobot. Figure 4. 
Partition Feature options The only limitation is that the partition feature should have a cardinality between 2 and 100. Group partitioning With Group partitioning, you choose a group feature and DataRobot ensures that all observations with the same value are in the same partition. This sounds similar to Partition Feature partitioning, but the difference is that with Group partitioning a single partition can contain multiple values of the feature. The same value will never appear in two partitions, though. Figure 5. Group Partitioning options Date/Time partitioning With Date/Time partitioning, you can train on an earlier portion of the dataset and test on the most recent data. This is useful when your model has a temporal dependency and you want to account for that by testing on the most recent data. Smart Downsampling Another advanced option is smart downsampling. To get there, click the Smart Downsampling option. Figure 6. Smart Downsampling options Smart downsampling is useful when you have a big, imbalanced dataset. In those cases, you can downsample the majority class to speed up modeling. DataRobot assigns weights to the downsampled class of your data, ensuring that the reported accuracy metrics do not overestimate the accuracy of your models. To activate smart downsampling, click Downsample Data. You can then choose how much you want to downsample your data. Feature Constraints Sometimes you want to force the directional relationship between a feature and the target. For example, a higher home value should always lead to a higher home insurance rate. Click Feature Constraints to access this functionality. Figure 7. Feature Constraints options For feature constraints to work, you have to create feature lists that contain only numerical features. In this particular example, I created a feature list that I called ‘positives,’ and I chose that feature list from the Monotonic Increasing options.
The models DataRobot builds will force this positive relationship between all of the features in the ‘positives’ feature list and the target column. Additional To access Additional options, click Additional. Figure 8. Additional options, page 1 and Figure 9. Additional options, page 2 Figures 8 and 9 list the many options you have when working with DataRobot. Most of them are self-explanatory, but if they are not, you can hover your mouse over them to learn more. Let’s describe some of these options. To start with, you can change the optimization metric of your project. By default, this will be LogLoss for binary classification problems. Furthermore, you can use `accuracy-optimized metablueprints`, which should take longer to run but will probably create more accurate models. In addition, you have scaleout models that can be deployed in a Hadoop environment. DataRobot also supports adding columns for weights, exposures, and offsets. This is only a small glimpse of the options you can use. If you are faced with a problem that requires specific tweaking, we recommend you take a closer look at these options. More Information If you’re a licensed DataRobot customer, search the in-app documentation for the Show Advanced Options link.
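To make the weighting idea behind smart downsampling concrete, here is a minimal sketch in plain Python. It is an illustration of the general technique only; the article does not specify how DataRobot computes its weights. The function name `smart_downsample`, the `keep_rate` parameter, and the `weight` column are all hypothetical. The majority class is sampled down and every kept majority row gets a weight of (original majority count / kept count), so the weighted class totals match the original data.

```python
import random


def smart_downsample(rows, target_key, majority_value, keep_rate, seed=0):
    """Downsample the majority class and attach compensating weights.

    Hypothetical helper: minority rows keep weight 1.0; kept majority rows
    get weight = (majority rows before) / (majority rows after).
    """
    rng = random.Random(seed)
    majority = [r for r in rows if r[target_key] == majority_value]
    minority = [r for r in rows if r[target_key] != majority_value]
    result = [{**r, "weight": 1.0} for r in minority]
    if not majority:
        return result
    n_keep = max(1, round(len(majority) * keep_rate))
    weight = len(majority) / n_keep
    for r in rng.sample(majority, n_keep):
        result.append({**r, "weight": weight})
    return result


if __name__ == "__main__":
    # 90 negatives and 10 positives: the 9:1 imbalance used in the article.
    data = [{"y": 0} for _ in range(90)] + [{"y": 1} for _ in range(10)]
    sample = smart_downsample(data, "y", majority_value=0, keep_rate=0.5)
    print(len(sample), sum(r["weight"] for r in sample if r["y"] == 0))
```

Because the weights restore the original class proportions, weighted metrics computed on the smaller sample remain comparable to metrics on the full dataset.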
NBA Player Performance (Regression)
This article summarizes how to solve a regression problem with DataRobot. Specifically, the topics include importing data, exploratory data analysis, and target selection, as well as modeling options, evaluation, interpretation, and deployment. For this example we are using a historical sports dataset from NBA games. It contains a combination of raw and engineered features from various sources. We're going to use this dataset to predict game_score, which is an advanced single statistic that attempts to quantify player performance and productivity. The rows represent different players and the columns represent features about those players. At the end of the dataset, we have our target column indicated in yellow: this is the outcome that we're trying to predict. The target here is a continuous variable, which makes this machine learning problem a ‘regression’ problem. Figure 1. Snapshot of training dataset Importing Data Figure 2. Data import options There are five ways to get data into DataRobot: Import data via a database connection using Data Source. Use a URL, such as an Amazon S3 bucket, using URL. Connect to Hadoop using HDFS. Upload a local file using Local File. Create a project from the AI Catalog. Exploratory Data Analysis After you import your data, DataRobot performs an exploratory data analysis (EDA). This gives you the mean, median, and number of unique and missing values for each feature in your dataset. If you want to look at a feature in more detail, simply click on it and a distribution will drop down. Figure 3. Exploratory Data Analysis Target Selection When you are done exploring your features, it is time to tell DataRobot what the target feature is. You do this by scrolling up and entering it into the text field (as indicated in Figure 4). DataRobot will identify the problem type and give you a distribution of the target. Figure 4.
Target Selection example Modeling Options At this point, you could simply hit the Start button to run Autopilot; however, there are some defaults that you can customize before building models. For example, under Advanced Options > Advanced, you can change the optimization metric: Figure 5. Optimization Metric Then also, under Partitioning, you can also change the default partitioning: Figure 6. Partitioning options Once you are happy with the modeling options and have pressed Start, DataRobot creates 30–40 models; it does this through a process of building something called blueprints (see Figure 7). Blueprints are a set of preprocessing steps and modeling techniques specifically assembled to best fit the shape and distribution of your data. Every model that the platform creates contains a blueprint. Figure 7. Blueprint example Model Evaluation The models that DataRobot created will be ranked on the Leaderboard (see Figure 8). You can find this under the Models tab. Figure 8. Leaderboard example After you select a model from the Leaderboard and examine the blueprint, the next step is to evaluate the model. You can find a set of the evaluation metrics typically used in data science under Evaluate > Residuals (Figure 9a) and Evaluate > Lift (Figure 9b). In the Residual chart you can see predicted and actual values. Figure 9a. Residuals Chart example In the Lift chart you can see how well the model fits across the prediction distribution. Figure 9b. Lift Chart example Model Interpretation Once you have evaluated your model for fit, it is time to take a look at how the different features are affecting predictions. You can find a set of interpretability tools in the Understand division. Feature Impact allows you to see which features are most important to your modeling. Figure 10. Feature Impact example This is calculated using model-agnostic approaches. You can do a feature impact analysis for every model that DataRobot creates. 
Figure 11. Feature List Creation You can also examine how these features are affecting predictions using Feature Effects (shown in Figure 12), which is also in the Understand division. Below, you can see an example of how the number of in-patient visits increases the likelihood of readmission. This is calculated using a model-agnostic approach called partial dependence. Figure 12. Feature Effects example Feature Impact and Feature Effects show you the global impact of features on your predictions. You can find how features are impacting your predictions locally under Understand > Prediction Explanations (Figure 13). Figure 13. Prediction Explanations example Here you will find a sample of row-by-row explanations that tell you the reason for the prediction, which is very useful for communicating modeling results to non-data scientists. Someone who has domain expertise should be able to look at these specific examples and understand what is happening. You can get these for every row within your dataset. Model Deployment There are four ways to get predictions out of DataRobot under the Predict division. The first is to use the GUI in the Make Predictions tab to simply upload scoring data and compute directly in DataRobot (Figure 14). You can then download the results with the push of a button. Customers usually use this for ad-hoc analysis or when they don’t have to make predictions very often. Figure 14. GUI Predictions Second, in the Deploy tab you can create a REST API endpoint to score data directly from your applications (shown in Figure 15).
An independent prediction server is available to support low latency, high throughput prediction requirements. You can set this up to score your data periodically. Figure 15. Create a Deployment Through the Deploy to Hadoop tab (Figure 16), you can deploy to Hadoop. Users who do this generally have large data and are using Hadoop. Figure 16. Hadoop Deployment Finally, using the Downloads tab, you can download the scoring code to score your data outside of DataRobot (shown in Figure 17). Customers who do this generally want to score their data off of a network or at a very low latency. Figure 17. Download Scoring Code
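Partial dependence, the model-agnostic calculation mentioned above for Feature Effects, can be sketched as follows. This is a generic illustration rather than DataRobot's code: the `partial_dependence` helper, the toy `predict` function, and the feature names `minutes` and `age` are hypothetical. For each grid value, the chosen feature is overwritten in every row and the model's predictions are averaged.

```python
from statistics import mean


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`.

    `predict` is any callable mapping a row (dict) to a number, so the
    calculation needs no knowledge of the model's internals.
    """
    curve = []
    for value in grid:
        forced = [{**row, feature: value} for row in rows]
        curve.append((value, mean(predict(r) for r in forced)))
    return curve


if __name__ == "__main__":
    # Toy stand-in for a trained regression model (an assumption).
    def predict(row):
        return 2 * row["minutes"] + row["age"]

    players = [{"minutes": 10, "age": 20}, {"minutes": 30, "age": 30}]
    for value, avg in partial_dependence(predict, players, "minutes", [0, 10]):
        print(value, avg)
```

Plotting the grid values against the averaged predictions gives the kind of curve shown in a Feature Effects chart.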
Next Best Offer (Multiclass)
This page summarizes how to build a multiclass classification project with DataRobot by going over the following topics: exploratory data analysis, target specification, modeling options, model evaluation, and model deployment. The goal is to highlight features specific for multiclass classification. For a general introductory demo to DataRobot, please watch this binary classification demo. The data to be modeled contains historic information about customers of a specific bank (See Figure 1). The bank hopes to use this data to build a model that can predict the next best communication action to take for each customer. The data has about 20,000 rows and 8 variables. Each row represents an individual customer. Each customer has one or more of the following attributes: age, marital status, income, credit rating, average spending, historic touch points with the bank, and the count of those touch points. The target variable to be predicted is ‘next_best_action.’ The data has numeric, categorical, and text variable types. Figure 1. A snapshot of the dataset to be modeled Next the spreadsheet is imported into DataRobot. This can be done by dragging the file onto the user interface or searching for it using a file browser. You could also fetch it from a URL if the file resides on a remote server. If the file was hosted on a local Hadoop or an enterprise JDBC-enabled database, DataRobot would have been able to fetch it from there using the HDFS or Data Source buttons, respectively. Figure 2. DataRobot data import options After DataRobot uploads the data, it does some Exploratory Data Analysis (EDA) on it and displays the results in the Project Data tab, where you can review the shape and distribution of each variable, apply basic transformations, and create derived features (See Figure 3). Figure 3. Results of EDA process for the income variable Next, DataRobot expects you to identify a target variable by typing it in the “Enter the target” box. 
DataRobot will immediately identify this as a classification problem, specifically a multiclass classification problem, because ‘next_best_action’ has 10 distinct categorical values. The distribution of the values in next_best_action is displayed for your convenience (see Figure 4). Figure 4. The target variable after DataRobot analyzes it At this point, you could press the Start button and use the default modeling options that DataRobot has selected for this project. Alternatively, you could customize the modeling options. In the Advanced Options tab you can customize the data partitioning strategy, the size of the holdout versus the training dataset, and the optimization metric, to name a few of the available options (Figure 5a and Figure 5b). Figure 5a. Advanced Options tab for customizing modeling options and Figure 5b. Advanced Options tab for customizing modeling options Once the modeling process has been customized and the Start button has been pressed, DataRobot creates about 30–40 entities called blueprints. Each blueprint is a set of preprocessing steps and modeling techniques specifically assembled to best fit the shape and distribution of your data (Figure 6a and Figure 6b). The modeling techniques come from state-of-the-art data science tools such as sklearn, XGBoost, DMTK, Eureqa, R, TensorFlow, Vowpal Wabbit, and more. Figure 6a. An example of a blueprint showing both the data preprocessing steps Figure 6b. An example of a blueprint showing the modeling algorithm DataRobot fits the blueprints to your data in a ‘survival of the fittest’ mode. The models that do best survive the first round of this competition and get fed more data. The models that do well in the second round get fed even more data and move to the third round, and so forth. At the end of this process all models are ranked by the chosen performance metric, with the best models at the top of the list in the Leaderboard (Figure 7).
DataRobot then builds blender (or ensemble) models on the best performing models. Figure 7. Leaderboard ranking of all models based on performance, with best performers at the top Once you have decided to move forward with a particular model, DataRobot provides multiple ways to review the model. In the Describe division, you can view the end-to-end model blueprint containing details of the specific feature engineering tasks and algorithms DataRobot used to run the model (See Figure 8). Figure 8. The end-to-end model In the Evaluate division, DataRobot provides a multiclass lift chart and confusion matrix for multiclass classification projects. The lift chart depicts how well a model segments the target population and how capable it is of predicting the target (Figure 9). By default, the lift chart sorts the predicted and actual values from low to high, and bins the result into 10 bins. If you want to see a different sorting order or number of bins, you can do that by toggling the respective buttons. For multiclass classification projects, you can use the Select Class dropdown to view the lift chart of a given class. Each class’s lift chart was calculated in a one-vs-all manner (up to the top 20 classes, by count of instances of class). (See Figure 9). Figure 9. Multiclass Lift Chart The multiclass confusion matrix compares actual data values with predicted data values, making it easy to see if any mislabeling has occurred and which values are affected (Figure 10). DataRobot reports class prediction results using different colored and sized circles. Color indicates prediction accuracy: green circles represent correct predictions while red circles represent incorrect predictions. The bigger the size of a circle, the greater the number of rows associated with it (See Figure 10). You can view and analyze additional details for a given class in the display to the right of the multiclass confusion matrix. 
Just click on any of the green circles in the multiclass confusion matrix and DataRobot will display a smaller confusion matrix associated with that class. The same can be achieved by clicking the dropdown arrow under Details for selected class. The data for the confusion matrix is sourced from the validation, cross-validation, or holdout (if unlocked) partitions, and it can be viewed in three modes. Global provides F1 Score, Recall, and Precision metrics for each class. Actual provides details of the Recall score and a partial list of classes that the model confused with the selected class. Predicted provides details of the Precision score. Figure 10. Multiclass Confusion Matrix In the Understand division, the Feature Impact tab helps explain what drives the model’s predictions. Feature Impact measures how much each feature contributes to the overall accuracy of the model. For multiclass projects you can view the impact each feature has on the overall model performance, or on the performance of an individual class, as shown in Figure 11. Figure 11. Feature Impact for multiclass classification Once the model has been reviewed, it can be deployed in a number of ways (see Figures 12a through 12d). You can upload a new dataset to DataRobot to be scored in batch and downloaded (Figure 12a). You can create a REST API endpoint to score data directly from your applications (Figure 12b); an independent prediction server is available to support low latency, high throughput prediction requirements. You can export the model for in-place scoring in Hadoop (Figure 12c). Or you can download scoring code (Figure 12d), either as editable source code or self-contained executables, to embed directly in applications to speed up computationally intensive operations. Figure 12a. Score in a batch Figure 12b. Score via REST endpoint Figure 12c. In-place scoring with Hadoop Figure 12d. Download scoring code
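The per-class Precision, Recall, and F1 figures reported alongside the multiclass confusion matrix follow the standard one-vs-all definitions, which the sketch below computes from scratch. The helper name `per_class_metrics` and the class labels (`call`, `email`, `none`) are made up for illustration. The article does not list the actual values of next_best_action, and this is not DataRobot's code.

```python
from collections import Counter


def per_class_metrics(actual, predicted):
    """One-vs-all precision, recall, and F1 for every class label."""
    counts = Counter(zip(actual, predicted))  # confusion-matrix cells
    classes = sorted(set(actual) | set(predicted))
    metrics = {}
    for c in classes:
        tp = counts[(c, c)]
        fp = sum(counts[(a, c)] for a in classes if a != c)
        fn = sum(counts[(c, p)] for p in classes if p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


if __name__ == "__main__":
    actual = ["call", "call", "email", "email", "none", "none"]
    predicted = ["call", "email", "email", "email", "none", "call"]
    for label, m in per_class_metrics(actual, predicted).items():
        print(label, m)
```

Recall is computed over the rows whose actual class is the selected one, and precision over the rows predicted as that class. That is why the Actual mode reports Recall and the Predicted mode reports Precision.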
How Can I Evaluate My Model?
This article explains model evaluation techniques, including the Lift Chart, ROC Curve, Prediction Distribution graph, and Cumulative Lift and Cumulative Gain charts. These are all calculated after Autopilot has finished running on the data. Lift Chart To find the Lift chart, click the model of interest on the Leaderboard. Figure 1. Leaderboard Then click the Evaluate division. The Lift chart is the first chart shown on the page. Figure 2. Lift chart This chart sorts the predictions the model made from lowest to highest and then groups them into bins. The blue and orange lines depict the average predicted and average actual probability (respectively) for each bin. A good Lift chart has the orange and blue lines “hugging” each other, as that means your model is making predictions close to the actual values. ROC Curve To continue exploring how well the model is performing, click the ROC Curve tab. Figure 3. ROC Curve tab The ROC Curve tab provides a rich selection of methods to assess model performance. Common evaluation metrics In the top left corner you have the values for some common evaluation metrics. Figure 4. Common evaluation metrics Confusion Matrix In the top right-hand corner, we have the confusion matrix, which is useful for inspecting how many False Positives and False Negatives your model is producing. Figure 5. Confusion Matrix ROC Curve On the bottom left of the ROC Curve tab, we have the ROC Curve plot, which directly benchmarks this model against a baseline one. Figure 6. ROC Curve Prediction Distribution In the middle of the graphs, we have the Prediction Distribution plot. Figure 7. Prediction Distribution Plot The Prediction Distribution graph can help you find the optimal probability threshold for your target variable. Cumulative Lift and Cumulative Gain charts The Cumulative Lift and Cumulative Gain charts are in the bottom right-hand corner. Figure 8. Cumulative Lift chart Figure 9.
Cumulative Gain chart These charts tell you how many times your effectiveness increases by using this model instead of a naive method. More Information If you’re a licensed DataRobot customer, search the in-app documentation for Lift Charts and ROC Curve.
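As a rough illustration of what the Cumulative Gain and Cumulative Lift charts measure, the sketch below ranks rows by predicted probability and, for each top fraction of the ranking, reports the share of all positives captured (gain) and how many times better that is than picking rows at random (lift). The function name `cumulative_gain_lift` and the equal-width cut points are assumptions for the example; DataRobot's exact binning is not described in the article.

```python
def cumulative_gain_lift(actual, predicted, n_bins=5):
    """Return (fraction_of_rows, cumulative_gain, cumulative_lift) tuples.

    `actual` holds 0/1 labels; `predicted` holds scores where higher means
    more likely positive. A random ranking has a lift of about 1.0.
    """
    total_pos = sum(actual)
    if total_pos == 0:
        raise ValueError("at least one positive row is required")
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    results = []
    for b in range(1, n_bins + 1):
        top = order[:round(len(order) * b / n_bins)]
        if not top:
            continue  # bin too small to contain any row
        fraction = len(top) / len(order)
        gain = sum(actual[i] for i in top) / total_pos
        results.append((fraction, gain, gain / fraction))
    return results


if __name__ == "__main__":
    y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    for fraction, gain, lift in cumulative_gain_lift(y, p):
        print(f"top {fraction:.0%}: gain={gain:.2f} lift={lift:.2f}")
```

In this toy data, the top 20 percent of rows contains two of the three positives, so contacting only those rows is a little over three times as effective as a naive random selection.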
EDA: Histograms to Help You Understand and Prep Your Data
Exploratory analysis is a critical piece of your data prep work. At a minimum, you want to perform some initial investigations to spot potential patterns and anomalies across your entire dataset (not just a sample), from which you can begin to form hypotheses. With DataRobot Paxata, your exploratory work is made easier through our powerful Filtergrams tool, which produces histograms for your dataset's variables. Filtergrams are also super useful for data cleansing work after your initial exploration is complete. Filtergrams are used to identify and redress missing or incomplete values, and to flag unwanted observations in a dataset that can then be removed using the Remove Rows tool. For a deep dive on every angle of filtergrams, of which there are many, check out our official Data Filtergrams documentation. Example time Let's explore a simple example of how to open a filtergram in DataRobot Paxata, use it for exploratory analysis, and then cover a couple of quick examples of how to further clean our data based on what the histogram reveals. Let's look at a dataset with medicare data. Just for convenience of quickly illustrating what kind of data we're looking at, I've used the columns tool to quickly peek at all of the variables in this dataset. You'll find the columns tool on the left side of the screen and it looks like this: Quick ProTip: The columns tool is a super-powerful tool that allows you to manipulate the feature variables in your data (e.g., removing ones you don't want to keep). But it's also a great little tool for quickly viewing all of the variables, and their types, in your dataset, which is much faster than scrolling across the dataset to view each one on the data grid. See Columns in a DataRobot Paxata Project for a deep dive on the tool. 
So here are the variables (columns) in the data we're going to explore: In this example, I'm interested in "Period" because I happen to know that it has data related to hospital admission and discharge dates, which tie into the average spend data. I want to better understand all of the values for that variable because I was surprised to see (in the image above) that it's a text type variable. I was expecting it to be numeric, i.e., a number indicating a period of time. Using a filtergram, I can quickly produce a list of every value used for the "Period" variable. To open the filtergram, first ensure you're looking at the data grid. Then, click the down arrow for the variable's column and select FILTER values: A histogram opens and reveals that "Period" has been coded in these four fairly untidy ways: Now what? I think it makes sense for me to replace each of these 'period' string values with a numeric value. For example, "During Index Hospital Admission" = 1, "1-3 days Prior to Index Hospital Admission" = 2, etc. This replacement process is simple to do in DataRobot Paxata: perform a find and replace for each of the text string instances and replace it with the numeric assignment. The Find and Replace option is in the variable's dropdown menu, the same menu from which you opened the filtergram. ProTip: You can double-click in any cell to automatically invoke the Find and Replace menu. After you have replaced all of the string values with your numeric assignments, convert the column to type numeric. You can also do this from the column menu: CHANGE into... numeric: With those two transformations complete, I open a filtergram once more for the variable to ensure the resulting transformations are as I want them. Now I see not only tidier values for my "Period" variable, but also a new type of histogram, because this variable is now of type numeric. And with that variable tidy, I can carry on with my prep work. :)
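For readers who want to reproduce the same cleanup outside of DataRobot Paxata, the two steps above (inspect the distinct values, then recode them to numbers) can be sketched in plain Python. This is an illustration only, not Paxata's API: the `recode` helper and the `PERIOD_CODES` name are hypothetical, and the mapping contains only the two values the article spells out. The other two are left unmapped here because the article does not list them.

```python
from collections import Counter

# Only the two assignments given in the article; the remaining values are
# not listed there, so they are deliberately left out rather than guessed.
PERIOD_CODES = {
    "During Index Hospital Admission": 1,
    "1-3 days Prior to Index Hospital Admission": 2,
}


def recode(values, codes):
    """Replace known strings with numeric codes; report anything unmapped."""
    recoded, unmapped = [], Counter()
    for value in values:
        if value in codes:
            recoded.append(codes[value])
        else:
            recoded.append(None)  # leave a gap instead of inventing a code
            unmapped[value] += 1
    return recoded, unmapped


if __name__ == "__main__":
    period = [
        "During Index Hospital Admission",
        "1-3 days Prior to Index Hospital Admission",
        "During Index Hospital Admission",
    ]
    print(Counter(period))  # histogram-style view of the distinct values
    print(recode(period, PERIOD_CODES))
```

Checking the `unmapped` counter before converting the column to a numeric type plays the same role as reopening the filtergram: it confirms that no stray string values remain.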
© 2020 DataRobot, Inc