
Image Sensor

How are DataRobot users working with time-series models dealing with the following message?

 

Series "state" contains multiple rows with the same timestamp. Consider removing duplicate dates first, or aggregating the data to distinct time values (e.g., daily rows).

Community Team

Hi @ChrisKehl - I'm trying to understand your question so we can get you the right guidance. It looks to me like this is copied from somewhere, but I can't find the source. Can you give us some more information/context around this content?

"Series "state" contains multiple rows with the same timestamp. Consider removing duplicate dates first, or aggregating the data to distinct time values (e.g., daily rows)."

I'll look for your answer!

- linda

Image Sensor

Hello, thanks for getting back with me. I am trying to import data for a time-series model. The data loads, and I select the target and then the series ID. While the series ID is processing, it populates this message.

DataRobot Employee

Chris, what does your data look like? It sounds like DataRobot thinks there are duplicate rows because the same series identifier and timestamp exist more than once. E.g., daily store sales:

store_id,date,sales
3,1-Dec-2020,100
4,1-Dec-2020,104
4,1-Dec-2020,104
5,1-Dec-2020,97

I would expect store 4 to trigger some kind of error along these lines.
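If it helps, here is a quick pandas check to spot those rows before uploading. It is only a sketch; the file name and the store_id/date column names are placeholders, so substitute your own series ID and date columns.

# Flag every row that shares a series identifier and timestamp with another row.
import pandas as pd

df = pd.read_csv("daily_store_sales.csv")  # hypothetical file
dupes = df[df.duplicated(subset=["store_id", "date"], keep=False)]
print(dupes)  # these are the rows that would trigger the duplicate-timestamp message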

Data Scientist

@ChrisKehl 

Can you also provide some details on what granularity you are trying to use for your time-series problem?

Here are two common examples where users are unclear about how to structure a time series problem:
1. Often customers will forecast individual items (SKUs) on a store level. If you selected the SKU-id as your series id, then you'd have duplicated time-stamps from the various stores. In that case, you'd create a new unique identifier column that was a concatenation of store_id + SKU_id ('Store_1_SKU_1') and use that as the series id, which would eliminate the 'duplicate timestamp' problem. 

2. Alternatively, you might have transaction-level data, where each transaction for every item is present as an individual row in your dataset. If you want to predict the item or amount for the next transactions, then this dataset structure may be appropriate. If not, you'd need to think about aggregating the dataset. Maybe you want to aggregate sales of each item to the hourly level, daily, or weekly level. Perhaps you'd rather aggregate across all items within a store, or aggregate across all stores for a given item. There are many different ways you can structure the problem.
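If it helps, here is a rough pandas sketch of both ideas. The file and column names (store_id, sku_id, date, sales) are just placeholders for illustration.

# Build a unique series identifier and aggregate transactions to daily rows.
import pandas as pd

df = pd.read_csv("transactions.csv", parse_dates=["date"])  # hypothetical transaction-level file

# Example 1: concatenate store and SKU into one unique series identifier.
df["series_id"] = df["store_id"].astype(str) + "_" + df["sku_id"].astype(str)

# Example 2: aggregate to one row per series per day, removing duplicate timestamps.
daily = (
    df.groupby(["series_id", pd.Grouper(key="date", freq="D")])["sales"]
      .sum()
      .reset_index()
)

daily.to_csv("daily_sales.csv", index=False)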

Can you briefly describe the dataset and problem you're trying to predict? That way I can help walk you through adjusting your dataset.

Image Sensor

Thank you for the reply. I have signed up for the free trial so I can use Paxata to see if it makes aggregation easier. The dataset I'm using is from the Kaggle Predict Future Sales competition: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/overview

So the submission CSV is ID and item_cnt_month.

Data Scientist

@ChrisKehl Thanks for the dataset info. I'll take a look and get back to you with specific instructions for what needs to be adjusted in the dataset.

Image Sensor
If you have a good tutorial on data aggregation with Paxata, please post it.
Image Sensor

Thanks, I'm setting it up for time series.

Data Scientist

@ChrisKehl I grabbed the data from that competition, and ran through the process using Paxata to prep the data for modeling.

The process is as follows, using DataRobot Paxata and DataRobot Automated Time Series modeling.

DataPrep and Export using Paxata

1. Download the datasets locally (done here for convenience; there are lots of other data import options).
2. Navigate to Paxata -> Library and select 'Add Dataset'.

Screen Shot 2020-12-21 at 10.49.18 AM.png

3. Next, select 'Upload Local File', and then select all of the files you want to add.

Screen Shot 2020-12-21 at 10.49.44 AM.png

4. Now Paxata will upload the data and process each file or dataset. There are lots of options for controlling the import config, but we'll leave everything as default here before clicking 'Finish'.

Screen Shot 2020-12-21 at 10.50.31 AM.png

5. After you've pressed 'Finish', the datasets will be available in the Dataset Library. The next step is to create a 'Project' in Paxata. You can do that several ways; here we'll start a project from an individual dataset by clicking the 'Create Project' button on that dataset.

Screen Shot 2020-12-21 at 11.02.01 AM.png

6. You'll give the project a name and description, and then be transported to a new 'Project' window and view.

Screen Shot 2020-12-21 at 11.12.32 AM.png

7. Next up is to join in the additional datasets that are relevant here (items, item categories, shops, etc). You do that by selecting the 'attach' button (paperclip symbol on the left-hand panel) and selecting 'join'.

Screen Shot 2020-12-21 at 11.12.40 AM.png

8. Now select the dataset you'd like to join in; for ease, I am using datasets that I also uploaded to the Library. Click the 'Select' button on the dataset you'd like to join.

Screen Shot 2020-12-21 at 11.12.58 AM.png

9. Now configure how you'd like to join the datasets. Again, there are lots of options, but default settings are used here. I just need to select the columns in each dataset that I'd like to use to join. Using the drop-down menu for each dataset, I select one or more columns from each dataset to use as identifiers or keys. You'll see a preview of what the dataset will look like with these join conditions.

Screen Shot 2020-12-21 at 11.19.38 AM.png

Then click 'Save' and you'll move on.


10. Now the dataset shown includes the new columns. Repeat this join-process for any other datasets you want to include. In this case, I've started with 'sales_train' and joined in 'items', 'shops', and 'item_categories'.

Screen Shot 2020-12-21 at 11.24.05 AM.png
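If you'd rather do the equivalent joins in code instead of the Paxata UI, a rough pandas sketch follows. It assumes the standard file names and key columns from the Kaggle competition download; adjust them if your copies differ.

# Left-join the descriptive tables onto the sales records, as in the Paxata steps above.
import pandas as pd

sales = pd.read_csv("sales_train.csv")
items = pd.read_csv("items.csv")
shops = pd.read_csv("shops.csv")
categories = pd.read_csv("item_categories.csv")

merged = (
    sales.merge(items, on="item_id", how="left")
         .merge(shops, on="shop_id", how="left")
         .merge(categories, on="item_category_id", how="left")
)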

11. Now I'll create a 'Series ID' column by concatenating the 'shop_id' and 'item_id' columns. Here, I am doing this using the 'Compute' function.

Screen Shot 2020-12-21 at 11.27.06 AM.png

In this 'Compute' operation, I am selecting the 'CONCATENATE()' function, clicking the column names in the preview dataset, and adding an underscore as a spacer.

Screen Shot 2020-12-21 at 11.23.02 AM.png

Here is what the completed 'Compute' operation looks like.

Screen Shot 2020-12-21 at 11.25.25 AM.png

'Save' this compute operation and move on.

12. Now the data would be ready for a traditional machine learning problem, but because this is a Time Series problem we will want to aggregate the data to the minimum date-step represented in the 'date' column. To do that, we'll use a 'Window' -> 'Aggregate' operation.

Screen Shot 2020-12-21 at 11.29.00 AM.png

You'll need to configure the 'groupby' columns for this aggregation, and also specify what sort of aggregation operations you want to perform on columns from the dataset. For example, here I am taking a 'Sum' of the 'item_cnt_day' column (our target) and calling that new column 'Target'. There are many different operations available for numerical and non-numerical data.  

Screen Shot 2020-12-21 at 11.08.34 AM.png

Categorical or numerical data might also be handled well by creating 'Array' operations that provide aggregated, encoded representations of the previous row-level data.

Screen Shot 2020-12-21 at 11.09.06 AM.png

But there is lots of room for experimentation here for you to decide which features you want to provide to DataRobot for modeling, and what sort of operations you think might be relevant or useful.
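For reference, a pandas sketch of this same aggregation step on the raw sales file might look like the following. It assumes sales_train.csv from the Kaggle download, where dates are stored as day.month.year strings; adjust the date format and the aggregations to taste.

# Aggregate the raw daily sales rows to one row per series per date.
import pandas as pd

sales = pd.read_csv("sales_train.csv")
sales["date"] = pd.to_datetime(sales["date"], format="%d.%m.%Y")
sales["Series_ID"] = sales["shop_id"].astype(str) + "_" + sales["item_id"].astype(str)

agg = (
    sales.groupby(["Series_ID", "shop_id", "item_id", "date"])
         .agg(Target=("item_cnt_day", "sum"),           # summed daily count becomes the target
              mean_item_price=("item_price", "mean"))   # one example of an extra aggregated feature
         .reset_index()
)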

13. After completing the Window-Aggregation configuration and saving the operation, we can 'Publish' the dataset for use in modeling. To do that, click 'Publish'.

Screen Shot 2020-12-21 at 11.35.53 AM.png

Provide a name and details for your records and to make life easier on your future-self. Then click 'Publish' again and wait a few moments for the dataset to appear in the Library of datasets.

Screen Shot 2020-12-21 at 11.35.58 AM.png

14. Now, in the Library of Datasets, click 'Export' on the dataset that you've just created.

Screen Shot 2020-12-21 at 11.10.30 AM.png

Select where you'd like to export the dataset. In this case, I'll send it directly to the AI Catalog of my DataRobot account ('Jarred AI Catalog').

Screen Shot 2020-12-21 at 11.10.50 AM.png

Once I've selected the destination, I pick the format and name, and export to the AI Catalog by pressing the 'Export' button.

Screen Shot 2020-12-21 at 10.45.48 AM.png

 

Modeling time...

1. Start a project. Navigate to app.datarobot.com, and you'll be presented with a number of different options. 

Screen Shot 2020-12-21 at 11.44.02 AM.png

Because we just exported our dataset to the AI Catalog, we'll use that option. Click the orange 'AI Catalog' button.


Next, select the dataset and click 'Create Project'.

Screen Shot 2020-12-21 at 11.31.24 AM.png

2. Configure the project. Because there are lots of options here, I'll just show you the basic config: set the target to the 'Target' column, select 'Time Aware Modeling', and choose the 'date' column.

3. DataRobot will now detect that this is a multi-series dataset, and suggest a potential column to use as a series identifier. Previously, we created the 'Series_ID' column for this purpose.

Screen Shot 2020-12-21 at 11.41.54 AM.png

DataRobot has detected that this dataset is irregular because there are many missing dates across the 400k+ series. This means that DataRobot will run in 'row-based' mode and handle the problem differently. 

While the goal of this competition is to predict sales for each 'item_id' at each 'shop_id' for a month, the dataset is not yet structured to forecast this. The dataset is quite messy and large, and because it isn't already structured for the Time Series modeling problem you'd like to solve, you'd need to do quite a bit of additional work to get it modeling-ready.

While many of those things will be handled automatically within DataRobot, there are some things you need to make decisions about.

As an example, there are a number of 'missing dates' where a record of an 'item_id' for a 'shop_id' does not exist, and you need to decide what you'd like to do about that. Most likely, the record-keeping system simply didn't record zero-sales days in the database. Do you want to build models that are aware of zero-sales days and can predict them? Or would you like to build a model that only predicts what sales will be when you already know sales are non-zero (not including any zero-sales records)? If you want to add those zero-sales records, should you add them before the first sale of an item was observed, or after the last record was observed (zero-padding)? Or should you only add them between the first and last records for each 'item_id' + 'shop_id' combo?

Those decisions are somewhat outside the scope of this post, but the walkthrough above should give you the technical run-through you need to make some of those decisions and test out the results.

Let me know if this doesn't resolve your questions.
Data Scientist

Sorry, I forgot to include this image of the DataRobot Project configuration first steps:

Screen Shot 2020-12-21 at 12.05.37 PM.png

Data Scientist

Finally, DataRobot can handle a lot of data clean-up, but there are some bounds to its capabilities. In this case, the dataset has so much irregularity that it can't be modeled on as-is (in essence, too many massive holes in the data records). If you fill the missing dates between the first and last record for each series with zero values, or even create records with a missing target value, then this dataset should be ready for some modeling.

Screen Shot 2020-12-21 at 12.09.50 PM.png
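One way to create those zero-filled records in pandas is sketched below. It assumes the aggregated dataset from earlier, here read from a hypothetical aggregated_sales.csv with Series_ID, date, and Target columns; any other columns are simply left missing on the added rows.

# For each series, insert zero-sales rows for every date between its first and last record.
import pandas as pd

agg = pd.read_csv("aggregated_sales.csv", parse_dates=["date"])  # hypothetical export

def fill_series(group):
    # Full daily range spanning this series' observed history.
    full_range = pd.date_range(group["date"].min(), group["date"].max(), freq="D")
    out = group.set_index("date").reindex(full_range)
    out["Series_ID"] = group["Series_ID"].iloc[0]  # restore the id on the new rows
    out["Target"] = out["Target"].fillna(0)        # zero sales on the added dates
    return out.rename_axis("date").reset_index()

filled = agg.groupby("Series_ID", group_keys=False).apply(fill_series)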

Image Sensor

Great, thank you for the help. I will see what I can figure out with Paxata and whether it can add in the missing values.

Image Sensor

I appreciate the tutorial, but I can't get past step 3. The sales_train file must exceed the limits of my 14-day trial. I have attached the message I get. I may have to do these aggregation steps in R.

Screen Shot 2020-12-21 at 8.36.48 PM.png

Data Scientist

Ah, I see. Yes, this dataset is quite large (400k+ series, 360k rows, and ~3M rows when you fill the missing dates) and may exceed the trial limits for Paxata. I'd be happy to send you a fully processed version of the dataset if you'd like to explore the modeling side in DataRobot. If you're interested, send me an email at jarred.bultema@datarobot.com and I'll be happy to send over the file I cleaned up.

Jarred

Data Scientist

@ChrisKehl As a last point, I'll just mention that the Kaggle dataset linked is going to be generally unpleasant to work with because most of the series are so short (among the 400k+ series, the mean duration is ~7 days, the median is 3 days, and the max is 867), so it'll be a highly zero-inflated problem. This is best handled using a zero-inflated modeling strategy, where a first model predicts 'is this a zero-sales day?' and a second model predicts 'given that sales are non-zero, what are the sales?'. We have approaches that do this automatically, but in my experience it's not really the most representative dataset for Time Series forecasting, especially given how skewed it is, with about 97% of the data points being zero.
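To make that two-stage idea concrete, here is a minimal sketch using scikit-learn on synthetic data. This only illustrates the general 'hurdle' approach, not how DataRobot implements it; the features and sales values are random placeholders.

# Stage 1 classifies zero vs. non-zero days; stage 2 regresses the amount on non-zero days.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))                      # placeholder features
zero_day = rng.random(5000) < 0.97                  # ~97% zero-sales days, as in this dataset
y = np.where(zero_day, 0.0, rng.poisson(5, 5000))   # synthetic daily sales counts

# Stage 1: probability that a day has any sales at all.
clf = GradientBoostingClassifier().fit(X, (y > 0).astype(int))

# Stage 2: sales amount, trained only on the non-zero days.
reg = GradientBoostingRegressor().fit(X[y > 0], y[y > 0])

# Combined prediction: expected sales = P(non-zero) * predicted non-zero amount.
y_pred = clf.predict_proba(X)[:, 1] * reg.predict(X)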

 

Something like the Walmart Recruiting Store Sales Forecasting dataset (https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data?select=train.csv.zip) might be a better place to start, as it has interesting covariates, is multi-series, and isn't mostly zeros or missing values.