A question that has come up a few times over the last few months, is How can I calculate a running/cumulative total in Dataprep?
Let's look at two methods to achieve this:
- Clicking - Using Dataprep's Window/Aggregate function.
- Coding - Using Dataprep's Spark SQL function (and using the cumsum function)
The below dataset will be used in both examples:
Importing it into DataRobot Dataprep looks like this:
Now let's get started....
Method 1) Clicking – Windowing Aggregation
Once you have ensured you have a column to sort by (Sale Date in this example) and a column to compute the running total for (Sales/Returns in this example)
Step 1) Add a Compute column by clicking the icon
Call the Column Dummy
Enter the value 1 (the value is of no consequence). The screen will look like this:
Step 2) Add a Window -> Aggregate step by clicking
To show a cumulative total populate the window with:
- Sum (from the drop down)
- Sales/Returns (for the Column we’re making the calculation on)
- Cumulative Total (for the new columns name)
- Dummy as the Window Grouped by
- Sale Date as the Window Sorted by
- Starting from top of window
- Ending at current row
Method 2) Coding– Spark SQL (cumsum)
Alternatively one can use the Spark SQL transformation by clicking the icon
(Note for convenience I renamed the columns to have no spaces or special characters).
The syntax I have used to use the Spark SQL cumsum function is:
SELECT SaleDate, SalesReturns, SUM(SalesReturns)
over (ORDER BY SaleDate ROWS BETWEEN unbounded preceding AND CURRENT ROW ) cumsum
ORDER BY SaleDate
And there you have it, two ways to calculate a running/cumulative total in DataRobot Dataprep!