How are DataRobot users working with time-series models handling the following message?
Series "state" contains multiple rows with the same timestamp. Consider removing duplicate dates first, or aggregating the data to distinct time values (e.g., daily rows).
Hi @ChrisKehl - trying to understand your question so we get you the right guidance. It looks to me like this is copied from somewhere, but I can't find the source. Can you give us some more information/context around this content?
"Series "state" contains multiple rows with the same timestamp. Consider removing duplicate dates first, or aggregating the data to distinct time values (e.g., daily rows)."
will look for your answer!
Hello, thanks for getting back with me. I am trying to import data for a time-series model. The data loads, and I select the target and then the series ID. While the series ID validation runs, it produces this message.
Chris, what does your data look like? It sounds like DataRobot thinks there are duplicate rows because the same series identifier and timestamp appear more than once. E.g., daily store sales:
store_id,date,sales
3,1-Dec-2020,100
4,1-Dec-2020,104
4,1-Dec-2020,104
5,1-Dec-2020,97
I would expect store 4 to trigger some kind of error along these lines.
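If you want to check for this yourself before uploading, here's a quick pandas sketch (using the toy store-sales rows above) that flags any rows sharing the same (series id, timestamp) pair, and one simple way to aggregate them away:

```python
# Reproduce the duplicate-timestamp check with pandas on the toy data above.
import pandas as pd
from io import StringIO

csv = """store_id,date,sales
3,1-Dec-2020,100
4,1-Dec-2020,104
4,1-Dec-2020,104
5,1-Dec-2020,97"""

df = pd.read_csv(StringIO(csv))

# Rows sharing the same (series id, timestamp) pair are what trigger the error.
dupes = df[df.duplicated(subset=["store_id", "date"], keep=False)]
print(dupes)  # both store-4 rows

# One simple fix: aggregate to one row per series per date.
clean = df.groupby(["store_id", "date"], as_index=False)["sales"].sum()
```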
Can you also provide some details on what granularity you are trying to use for your time-series problem?
Here are two common examples where we see users unsure how to structure a time series problem:
1. Often customers will forecast individual items (SKUs) on a store level. If you selected the SKU-id as your series id, then you'd have duplicated time-stamps from the various stores. In that case, you'd create a new unique identifier column that was a concatenation of store_id + SKU_id ('Store_1_SKU_1') and use that as the series id, which would eliminate the 'duplicate timestamp' problem.
2. Alternatively, you might have transaction-level data, where each transaction for every item is present as an individual row in your dataset. If you want to predict the item or amount for the next transactions, then this dataset structure may be appropriate. If not, you'd need to think about aggregating the dataset. Maybe you want to aggregate sales of each item to the hourly level, daily, or weekly level. Perhaps you'd rather aggregate across all items within a store, or aggregate across all stores for a given item. There are many different ways you can structure the problem.
Can you briefly describe the dataset and problem you're trying to predict? That way I can help walk you through adjusting your dataset.
Thank you for the reply. I have signed up for the free trial so I can use Paxata to see if it makes aggregation easier. The dataset I'm using is from the Kaggle Predict Future Sales competition: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/overview
So the submission CSV is ID and item_cnt_month.
@ChrisKehl I grabbed the data from that competition, and ran through the process using Paxata to prep the data for modeling.
The process is as follows, using DataRobot Paxata and DataRobot Automated Time Series modeling.
DataPrep and Export using Paxata
1. Download the datasets locally (done here for convenience; there are lots of data import options).
2. Navigate to Paxata -> Library. Add Dataset
3. Next, select 'Upload Local File', and then select all of the files you want to add.
4. Now Paxata will upload the data and process each file or dataset. There are lots of options for controlling the import config, but we'll leave everything as default here before clicking 'Finish'.
5. After you've pressed 'Finish', the datasets will be available in the Dataset Library. The next step will be to create a 'Project' in Paxata. You can do that several ways, but we'll do this here by starting a project from an individual dataset. To do this, you'll just click the 'Create Project' button on the dataset.
6. You'll give the project a name and description, and then be transported to a new 'Project' window and view.
7. Next up is to join in the additional datasets that are relevant here (items, item categories, shops, etc). You do that by selecting the 'attach' button (paperclip symbol on the left-hand panel) and selecting 'join'.
8. Now you select the dataset you'd like to join in; for ease, I am using datasets that I also uploaded to the library. Click the 'select' button on the dataset you'd like to join.
9. Now configure how you'd like to join the datasets. Again, there are lots of options, but I'm using the defaults. I just need to select the columns in each dataset that I'd like to use to join. Using the drop-down menu for each dataset, I select one or more columns to use as identifiers or keys. You'll see a preview of what the dataset will look like with these join conditions.
Then click 'Save' and you'll move on.
10. Now the dataset shown includes the new columns. Repeat this join-process for any other datasets you want to include. In this case, I've started with 'sales_train' and joined in 'items', 'shops', and 'item_categories'.
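For anyone who'd rather do this prep outside Paxata, the same join chain is straightforward in pandas. The frames below are tiny stand-ins that mirror the Kaggle file schemas (sales_train, items, shops, item_categories); with the real files you'd load each one with pd.read_csv first:

```python
# The same join chain in pandas, on tiny frames mirroring the Kaggle schema.
import pandas as pd

sales_train = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013"],
    "shop_id": [25, 25],
    "item_id": [2552, 2554],
    "item_cnt_day": [1.0, 2.0],
})
items = pd.DataFrame({"item_id": [2552, 2554], "item_category_id": [58, 58]})
shops = pd.DataFrame({"shop_id": [25], "shop_name": ["Shop A"]})
item_categories = pd.DataFrame({"item_category_id": [58],
                                "item_category_name": ["Music"]})

# Left joins keep every sales row even if a lookup value is missing.
merged = (sales_train
          .merge(items, on="item_id", how="left")
          .merge(shops, on="shop_id", how="left")
          .merge(item_categories, on="item_category_id", how="left"))
```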
12. Now I'll create a 'Series ID' column by concatenating the 'shop_id' and 'item_id' columns. Here, I am doing this using the 'Compute' function.
In this 'Compute' operation, I am selecting the 'CONCATENATE()' function, clicking the column names in the preview dataset, and adding an underscore as a spacer.
Here is what the completed 'Compute' operation looks like.
'Save' this compute operation and move on.
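If you're scripting instead, the pandas equivalent of this CONCATENATE() compute step is a one-liner: cast both id columns to strings and join them with an underscore spacer.

```python
# pandas equivalent of the Paxata CONCATENATE() compute step: build a
# Series_ID by joining shop_id and item_id with an underscore spacer.
import pandas as pd

df = pd.DataFrame({"shop_id": [25, 25, 31],
                   "item_id": [2552, 2554, 2552]})

df["Series_ID"] = df["shop_id"].astype(str) + "_" + df["item_id"].astype(str)
print(df["Series_ID"].tolist())  # ['25_2552', '25_2554', '31_2552']
```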
13. Now the data would be ready for a traditional machine learning problem, but because this is a Time Series problem we will want to aggregate the data to the minimum date-step represented here in the 'date' column. To do that, we'll use a 'Window' -> 'Aggregate' operation.
You'll need to configure the 'groupby' columns for this aggregation, and also specify what sort of aggregation operations you want to perform on columns from the dataset. For example, here I am taking a 'Sum' of the 'item_cnt_day' column (our target) and calling that new column 'Target'. There are many different operations available for numerical and non-numerical data.
Categorical or numerical data might be handled well by creating 'Array' operations that provide aggregation-encoded representations of the previous row-level data.
But there is lots of room for experimentation here for you to decide which features you want to provide to DataRobot for modeling, and what sort of operations you think might be relevant or useful.
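As a sketch, the core of this Window -> Aggregate step in pandas is a groupby-sum to daily granularity; the toy data and the 'Target' column name below mirror the configuration described above:

```python
# Sketch of the Window -> Aggregate step: group by Series_ID and date,
# summing item_cnt_day into a new 'Target' column (daily granularity).
import pandas as pd

df = pd.DataFrame({
    "Series_ID": ["25_2552", "25_2552", "25_2554"],
    "date": ["02.01.2013", "02.01.2013", "02.01.2013"],
    "item_cnt_day": [1.0, 3.0, 2.0],
})

agg = (df.groupby(["Series_ID", "date"], as_index=False)
         .agg(Target=("item_cnt_day", "sum")))
```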
14. After completing the Window-Aggregation configuration and saving the operation, we can 'Publish' the dataset for use in modeling. To do that, click 'Publish'.
Provide a name and details for your records and to make life easier on your future-self. Then click 'Publish' again and wait a few moments for the dataset to appear in the Library of datasets.
15. Now in the Library of Datasets, click 'Export' on the dataset that you've just created.
Select where you'd like to export the dataset. In this case, I'll send it directly to the AI Catalog of my DataRobot account ('Jarred AI Catalog').
Once I've selected the destination, I pick a format and name, and export to the AI Catalog by pressing the 'Export' button.
Modeling with DataRobot Automated Time Series

1. Start a project. Navigate to app.datarobot.com, and you'll be presented with a number of different options.
Because we just exported our dataset to the AI Catalog, we'll use that option. Click the orange 'AI Catalog' button.
Next, select the dataset and click 'Create Project'
2. Configure the project. Because there are lots of options here, I'll just show you the basic config.
Set the target to 'Target', then select 'Time Aware Modeling' and choose the 'date' column.
3. DataRobot will now detect that this is a multi-series dataset, and suggest a potential column to use as a series identifier. Previously, we created the 'Series_ID' column for this purpose.
DataRobot has detected that this dataset is irregular because there are many missing dates across the 400k+ series. This means that DataRobot will run in 'row-based' mode and handle the problem differently.
While the goal of this competition is to predict monthly sales for each 'item_id' at each 'shop_id', the dataset is not structured for that forecast yet. This dataset is quite messy and large, and because it isn't already structured for the Time Series modeling problem you'd like to solve, you'd need to do quite a bit of other work to get it modeling-ready.
While many of those things will be handled automatically within DataRobot, there are some decisions you need to make yourself.
As an example, there are a number of 'missing dates' where a record of an 'item_id' for a 'shop_id' does not exist. Likely the record-keeping system didn't record zero-sales days in the database, and you should decide how you want to handle that. Do you want to build models that are aware of zero-sales days and can predict them? Or would you rather build a model that only predicts what sales will be when you know sales are non-zero (excluding any zero-sales records)? If you want to add those zero-sales records, should you add them before the first sale of an item was observed, or after the last record was observed (zero-padding)? Or should you only add them between the first and last records for each 'item_id' + 'shop_id' combo?
Those decisions are somewhat outside the scope of this post, but this should give you the technical run-through to make some of those decisions and test out the results.
Let me know if this doesn't resolve your questions.
Finally, DataRobot can handle a lot of data clean-up, but there are some bounds to its capabilities. In this case, the dataset is so irregular that it can't be modeled as-is (in essence, too many massive holes in the data records). If you fill the missing dates between the first and last record for each series with zero values, or even create records with a missing target value, then this dataset should be ready for some modeling.
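Here's one way to do that zero-fill in pandas: pad each series with zero-valued rows between its first and last observed date. The column names ('Series_ID', 'date', 'Target') follow the prepped dataset from the steps above; the rows here are just toy data.

```python
# Zero-pad each series between its first and last observed date.
import pandas as pd

df = pd.DataFrame({
    "Series_ID": ["A", "A", "B"],
    "date": pd.to_datetime(["2020-12-01", "2020-12-04", "2020-12-02"]),
    "Target": [5.0, 3.0, 7.0],
})

pieces = []
for sid, g in df.groupby("Series_ID"):
    # Complete daily range spanning this series' observed history.
    full = pd.date_range(g["date"].min(), g["date"].max(), freq="D")
    padded = (g.set_index("date")["Target"]
                .reindex(full, fill_value=0.0)   # missing dates become 0-sales rows
                .rename_axis("date")
                .reset_index())
    padded["Series_ID"] = sid
    pieces.append(padded)

zero_padded = pd.concat(pieces, ignore_index=True)
```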
I appreciate the tutorial, but I can't get past step 3. The sales_train file must exceed the limits of my 14-day trial period. I have attached the message I get. I may have to do these aggregation steps in R.
Ah, I see. Yes, this dataset is quite large (400k+ series; ~360k rows, and ~3M rows once you fill the missing dates) and may exceed the trial limits for Paxata. I'd be happy to send you a fully processed version of the dataset if you'd like to explore the modeling side in DataRobot. If you're interested, send me an email at firstname.lastname@example.org and I'll be happy to send over the file I cleaned up.
@ChrisKehl As a last point, I'll mention that the linked Kaggle dataset is going to be generally unpleasant to work with because most of the series are so short (among the 400k+ series, the mean duration is ~7 days, the median is 3 days, and the max is 867). That makes it a highly zero-inflated problem, which is best handled using a zero-inflated modeling strategy: a first model predicts 'is this a zero-sales day?', and a second model predicts 'given that sales are non-zero, what are the sales?'. We have approaches that do this automatically, but in my experience it's not really the most representative dataset for Time Series forecasting, especially given that about 97% of the data points are zeros.
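For illustration, here's a minimal sketch of that two-stage ('hurdle') strategy on synthetic data using scikit-learn; this shows the general idea only, not how DataRobot implements it.

```python
# Two-stage ("hurdle") sketch for a zero-inflated target: a classifier
# decides whether a day has any sales, and a regressor, trained only on
# non-zero days, predicts how much. All names here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic zero-inflated target: mostly zeros, otherwise positive sales.
nonzero = rng.random(200) < 0.3
y = np.where(nonzero, np.abs(X[:, 0] * 5 + 10), 0.0)

clf = LogisticRegression().fit(X, (y > 0).astype(int))  # stage 1: zero vs non-zero
reg = LinearRegression().fit(X[y > 0], y[y > 0])        # stage 2: amount, given non-zero

# Combined expectation: P(non-zero) * E[sales | non-zero]
pred = clf.predict_proba(X)[:, 1] * reg.predict(X)
```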
Something like the Walmart Recruiting - Store Sales Forecasting dataset (https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data?select=train.csv.zip) might be a better place to start, as it has interesting covariates, is multi-series, and isn't mostly zeros or missing values.