cancel
Showing results for 
Search instead for 
Did you mean: 

How do I calculate a running/cumulative total in Dataprep?

calamari
DataRobot Employee
DataRobot Employee

How do I calculate a running/cumulative total in Dataprep?

A question that has come up a few times over the last few months, is How can I calculate a running/cumulative total in Dataprep?

 

Let's look at two methods to achieve this:

  1. Clicking - Using Dataprep's Window/Aggregate function.
  2. Coding - Using Dataprep's Spark SQL function (and using the cumsum function)

The below dataset will be used in both examples:

calamari_0-1649802103483.png

 

Importing it into DataRobot Dataprep looks like this:

calamari_1-1649802103528.png

 

Now let's get started....

calamari_9-1649802383184.gif

Method 1) Clicking – Windowing Aggregation


Once you have ensured you have a column to sort by (Sale Date in this example) and a column to compute the running total for (Sales/Returns in this example)

Step 1)  Add a Compute column by clicking the icon

calamari_2-1649802165981.png

 

Call the Column Dummy

Enter the value 1 (the value is of no consequence). The screen will look like this:

calamari_3-1649802166036.png

Step 2)  Add a Window -> Aggregate step by clicking

calamari_4-1649802166037.png

then

calamari_5-1649802166039.png

 

To show a cumulative total populate the window with:

  • Sum (from the drop down)
  • Sales/Returns (for the Column we’re making the calculation on)
  • Cumulative Total (for the new columns name)
  • Dummy as the Window Grouped by
  • Sale Date as the Window Sorted by
  • Starting from top of window
  • Ending at current row

calamari_6-1649802166094.png

Method 2) Coding– Spark SQL (cumsum)

Alternatively one can use the Spark SQL transformation by clicking the icon

calamari_7-1649802166096.png

 

(Note for convenience I renamed the columns to have no spaces or special characters).

 

The syntax I have used to use the Spark SQL cumsum function is:
SELECT SaleDate, SalesReturns, SUM(SalesReturns)

          over (ORDER BY SaleDate ROWS BETWEEN unbounded preceding AND CURRENT ROW ) cumsum

 FROM   dataset

 ORDER  BY SaleDate

 

Screenshot below:

calamari_8-1649802166118.png

 

And there you have it, two ways to calculate a running/cumulative total in DataRobot Dataprep!

 

calamari_10-1649802480204.gif

 

 

 

 

0 Replies