(Article updated October 2020.)
The DataRobot Automated Time Series product accelerates your AI success by combining cutting-edge machine learning and automation with the team you already have in place. Automated Time Series incorporates the knowledge, experience, and best practices of the world's leading data scientists, delivering unmatched levels of automation, accuracy, transparency, and collaboration to help your business become an AI-driven enterprise.
This guide demonstrates the basics of how to build, select, deploy, and monitor a time series model using the automated machine learning capabilities of DataRobot. Time series forecasting is one of the most valuable yet difficult problems in data science that businesses face today; because it is difficult, many organizations miss out on its benefits for lack of expertise and resources. DataRobot solves this challenge and puts the technology into the hands of both novice users and experienced data scientists.
Time series models learn from recent history to forecast future values. The data for time series use cases comes in many different shapes ranging from daily data to individual transactions.
The use case that will be highlighted throughout these examples comes from the retail industry where we will forecast store sales for the next seven days. Accurately forecasting sales allows companies to do more than just prevent overstocking: it enables businesses to assess store performance while also managing staffing, inventory, and their supply chain. With the automated time series capabilities of DataRobot, we can quickly create an accurate forecasting model across thousands of different stores or product lines, evaluate the most important factors that impact sales, and get future predictions.
Since we want to predict future daily sales for each store, we need to aggregate raw transactions into daily totals. An example file is shown below and includes an identifier for store, our date column, the daily number of sales, and many other attributes about the stores, internal promotions, and external factors like holidays or the level of inflation. (Figure 1) As is often the case, adding more data will increase the predictive power of your model. It's always easier to start with the data you have available today and experiment with other information once you have a working model.
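The aggregation step described above can be sketched with pandas. The column names (store_id, timestamp, amount) and sample values here are hypothetical stand-ins for whatever your raw transaction feed actually uses:

```python
import pandas as pd

# Hypothetical raw transaction data; one row per individual sale.
transactions = pd.DataFrame({
    "store_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2020-06-01 09:15", "2020-06-01 17:40", "2020-06-02 11:05",
        "2020-06-01 10:30", "2020-06-02 14:20",
    ]),
    "amount": [19.99, 5.49, 12.00, 7.25, 30.00],
})

# Roll individual transactions up into one row per store per day,
# which is the shape a daily forecasting project expects.
daily_sales = (
    transactions
    .assign(date=transactions["timestamp"].dt.floor("D"))
    .groupby(["store_id", "date"], as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "daily_sales"})
)
print(daily_sales)
```

Other attributes (promotions, holidays, store details) would then be joined onto this frame by store and date before upload.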
To get started with DataRobot, you will log in and load a prepared training dataset. DataRobot currently supports .csv, .tsv, .dsv, .xls, .xlsx, .sas7bdat, .bz2, .gz, .zip, .tar, and .tgz file types, plus Apache Avro, Parquet, and ORC (Figure 1). (Note: If you wish to use Avro or Parquet data, contact your DataRobot representative for access to the feature.)
Directly loading data from production databases for model building allows you to quickly train and retrain models, and eliminates the need to export data to a file for ingestion.
DataRobot supports any database that provides a JDBC driver—meaning most databases on the market today can connect to DataRobot. Drivers for Postgres, Oracle, MySQL, Amazon Redshift, Microsoft SQL Server, Snowflake, kdb+, and Hadoop Hive are most commonly used (Figure 2).
After you load your data, DataRobot performs exploratory data analysis (EDA), detecting the data types and showing the number of unique, missing, mean, median, standard deviation, and minimum and maximum values. This information is helpful for getting a sense of the dataset shape and distribution (Figure 3).
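A rough pandas equivalent of these EDA summaries — unique counts, missing counts, mean, median, standard deviation, minimum, and maximum — looks like the following; the sample frame is invented for illustration:

```python
import pandas as pd

# Tiny illustrative dataset with one numeric and one categorical feature.
df = pd.DataFrame({
    "daily_sales": [120.0, 98.5, None, 143.2, 110.0],
    "store_type": ["mall", "street", "mall", "street", "mall"],
})

# One row per feature, one column per summary statistic; numeric-only
# statistics are left blank (NaN) for categorical features.
summary = pd.DataFrame({
    "unique": df.nunique(),
    "missing": df.isna().sum(),
    "mean": df.mean(numeric_only=True),
    "median": df.median(numeric_only=True),
    "std": df.std(numeric_only=True),
    "min": df.min(numeric_only=True),
    "max": df.max(numeric_only=True),
})
print(summary)
```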
Next, select a prediction target (what you are trying to forecast) from the uploaded dataset. DataRobot will analyze your training dataset and automatically determine the type of analysis (in this case, regression).
DataRobot will automatically recognize date and/or time-based features and ask if you want to set up time-aware modeling. To do so, simply select the recommended feature (i.e., the “Date” in our dataset); DataRobot will display a chart of your prediction target across time (Figure 4).
DataRobot allows you to set the forecast point, i.e., the moment in time when you want to make a prediction. In our case, we want to predict sales for the next seven days because that’s how often we need to restock our stores. The setting on the right side of the window below ensures the model will generate predictions for each of the next seven days. In other words, if we make our forecast on Sunday, we'll get the predicted sales by store for the next day (Monday) and the coming week (Figure 5).
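The relationship between the forecast point and the seven forecast distances can be illustrated with a few lines of Python (the specific Sunday chosen here is arbitrary):

```python
from datetime import date, timedelta

# The forecast point: the moment we run predictions (a Sunday).
forecast_point = date(2020, 6, 7)

# One prediction per forecast distance, 1 through 7 days ahead.
forecast_window = [forecast_point + timedelta(days=d) for d in range(1, 8)]

for d, target_date in enumerate(forecast_window, start=1):
    print(f"distance +{d}: predict sales for {target_date}")
```

Distance +1 is Monday's forecast, and distance +7 completes the coming week.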
The left side of the window below will determine the types of features DataRobot will build. The defaults work well, but you can always adjust them for the problem at hand.
Two rules of thumb here:
To prevent target leakage while still including valuable features in the model, DataRobot allows you to specify features that are known ahead of time (“known in advance” features). In our example, we know about upcoming marketing events and holidays, and we know that a store's square footage won't change. We can select all of these variables and mark them as “known in advance.” They add a lot to the predictive power of the model: for example, telling DataRobot that a holiday falls within the next seven days can improve our sales forecast. Holiday and special event calendars can be uploaded separately.
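A sketch of how known-in-advance features shape a prediction dataset: for future dates, the target and any covariates that are not known in advance stay empty, while holiday flags and fixed store attributes can be filled in because we already know them. The column names here are illustrative, not DataRobot requirements:

```python
import pandas as pd

# Seven future dates covering the forecast window.
future_dates = pd.date_range("2020-06-08", periods=7, freq="D")

prediction_rows = pd.DataFrame({
    "store_id": 1,
    "date": future_dates,
    "daily_sales": pd.NA,                  # target: unknown for the future
    "is_holiday": [0, 0, 0, 0, 0, 1, 0],   # known in advance
    "square_footage": 12000,               # known in advance, constant
    "foot_traffic": pd.NA,                 # NOT known in advance
})
print(prediction_rows)
```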
The default modeling mode is “Quick,” which employs a very effective and efficient use of DataRobot’s automation capabilities. For more control over which algorithms DataRobot runs, there are Manual and full Autopilot options. If you want to further customize the model building process, you can modify a variety of advanced parameters, optimization metrics, feature lists, transformations, partitioning, and sampling options.
Click the Start button to begin training models. Once the modeling process begins, DataRobot analyzes the target and implements time series best practices. DataRobot also creates time-based features to use in the different blueprints.
You can easily see how many features contain useful information, and edit feature lists used for modeling (Figure 6).
There are also options to drill down on variables to view distributions and trends (Figure 7).
In addition to traditional time series models like ARIMA, DataRobot automatically builds modern algorithms such as XGBoost, Light GBM, Keras LSTM, TensorFlow, DeepAR, as well as proprietary models like Eureqa, and even open source Prophet models from Facebook that can be compared directly to the traditional models. DataRobot optimizes data automatically for each algorithm, performing operations like one-hot encoding, missing value imputation, text mining, and standardization to transform features for optimal results.
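The preprocessing operations named above — missing value imputation, standardization, and one-hot encoding — can be sketched in plain pandas. This is a generic illustration of the techniques, not DataRobot's internal implementation, and the columns are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "square_footage": [12000.0, 8000.0, None],
    "store_type": ["mall", "street", "mall"],
})

# Missing value imputation: fill gaps with the column median.
df["square_footage"] = df["square_footage"].fillna(df["square_footage"].median())

# Standardization: rescale to zero mean and unit variance.
df["square_footage"] = (
    (df["square_footage"] - df["square_footage"].mean())
    / df["square_footage"].std()
)

# One-hot encoding: expand the categorical column into indicator columns.
encoded = pd.get_dummies(df, columns=["store_type"], prefix="store_type")
print(encoded)
```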
DataRobot streamlines model development by automatically ranking models (or ensembles of models) based on their performance on backtesting and holdout partitions. By cost-effectively evaluating a near-infinite number of combinations of data transformations, features, algorithms, and tuning parameters in parallel across a cluster of servers, DataRobot delivers the best predictive model in the shortest amount of time.
After automated modeling is complete, the model Leaderboard (Figure 8) ranks each machine learning model so you can evaluate and select the one you want to use. Click on a model to access options to Evaluate, Understand, Describe, and Predict.
To estimate model performance, the Evaluate options include the industry-standard Lift Chart, Feature Fit, Accuracy over Time (Figure 9), Forecast vs. Actual, and Advanced Tuning. There are also options for comparing models by Learning Curves and Speed versus Accuracy. The interactive evaluation charts are very detailed, but don't require a background in data science to understand what they convey.
DataRobot offers superior transparency, interpretability, and explainability, so you can easily understand how models were built and have the confidence to explain to others why a model made the prediction it did.
In the Describe tab, you can view the end-to-end model blueprint containing details of the specific feature engineering tasks and algorithms DataRobot uses to run the model (Figure 10).
In the Understand tab, popular exploratory capabilities include Feature Impact, Feature Effects, Prediction Explanations, and Word Cloud. These all help you understand what drives the model’s predictions.
Feature Impact measures how much each feature contributes to the overall accuracy of the model. For example, the reason a patient was discharged from a hospital has a direct relationship to the likelihood of that patient being readmitted. This insight can be invaluable for guiding an organization to focus on what matters most (Figure 11).
The Feature Effects chart displays model details on a per-feature basis (a feature's effect on the overall prediction), depicting how a model understands the relationship between each variable and the target (Figure 12). It provides specific values within each column that are likely large factors in determining sales over the next seven days.
Every model built in DataRobot is immediately ready for deployment (Figure 13). You can:
We can easily explore the sales predictions for the next seven days for each store (Figure 14), download the values, and understand our confidence using our estimated prediction interval (blue area).
With DataRobot you can proactively monitor and manage all deployed machine learning models (including models created outside of DataRobot) to maintain peak prediction performance (Figure 15). This ensures that the machine learning models driving your business are accurate and consistent throughout changing market conditions.
At a glance you can view a summary of metrics from all models in production, including the number of requests (predictions) and key health statistics:
From here you can draw on DataRobot's embedded data science expertise to review model performance and detect model decay. By clicking on a model, you can see how its predictions have changed over time. Dramatic changes here can indicate that your model has gone off track.
You can also analyze data drift (Figure 16) to assess if the model is reliable, even before you get the actual values back. You’re essentially analyzing the difference between the data you’ve scored this model on vs. the data the model was trained on. DataRobot compares the most important features in the model (as measured by its Feature Impact score) and how different each feature’s distribution is from the training data.
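One common way to quantify this kind of drift is the Population Stability Index (PSI), which compares a feature's training distribution against its scoring distribution bucket by bucket. This is a generic sketch of the technique, not DataRobot's exact drift metric:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """PSI between a training sample (expected) and a scoring sample (actual)."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) and division by zero in empty buckets.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(100, 10, 5000)    # feature distribution at training time
stable = rng.normal(100, 10, 5000)   # scoring data that has not drifted
shifted = rng.normal(115, 10, 5000)  # scoring data whose mean has drifted

print(round(psi(train, stable), 3))   # small: distribution unchanged
print(round(psi(train, shifted), 3))  # large: investigate this feature
```

A common rule of thumb is that PSI below 0.1 indicates a stable feature, while values above 0.25 suggest significant drift worth investigating.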
If you decide to replace a model that has drifted, simply paste the URL of a retrained DataRobot model (a model trained on more recent data from the same data source), or of one with compatible features (Figure 17). After DataRobot validates that the model matches, you can select a reason for the replacement, which is kept in a permanent archive. From this point forward, new prediction requests will go against the new model with no impact to downstream processes. If you ever decide to restore the previous model, you can easily do so through the same process.
DataRobot’s time series capabilities are available as part of a fully-managed software service (SaaS), or in several Enterprise configurations to match your business needs and IT requirements. All configurations feature a constantly expanding set of diverse, best-in-class algorithms from R, Python, Spark, Eureqa, and other sources, giving you the best set of tools for your machine learning and AI challenges.
DataRobot can also automate the development of sophisticated regression and classification models when time series calculations are not required. A similar overview document is available that describes how general regression and classification models can be built in DataRobot. In addition to using the GUI, you can achieve everything covered in this document with Python or R through the API. If you have any questions then leave a comment below.
Attachment: We've attached a PDF file of this article.