Filling in Gaps for Time Series Data Using Database SQL

Time series datasets for creating projects and scoring data require data at a regular level of granularity: daily sales, weekly campaign clicks, or whatever the use case calls for. Daily sales for a single series may have records missing if no data was available or a store was closed for a day. In these cases, the gaps need to be filled in. How can this be done? Often the data comes from a database, where this task can be accomplished fairly easily via SQL. Below I present two examples for filling in the gaps: a single series/store where sales aggregates are already available, and a multiseries example combining multiple stores with a set of sales records (akin to a customer-dimension table and sales-fact table join within a data mart or data warehouse). Although likely an unrealistic use case, to demonstrate the technique the multiseries example is shown at a “per minute” granularity rather than daily.

Technique

The technique to create the full set of data is fairly straightforward: a master time dataset (which must be available) is cross-joined (i.e., joined to every record) to a master series dataset, and the result is then outer-joined to the actual values wherever that time period and series combination has data. For example, all of a time master is joined to a store master to create a list of every possible time+store combination, which is then joined to the actual time+store values. When sales exist, the value (sales) is used; when they don’t, an empty value (0 sales) is used. Finally, time filters are applied to the list where applicable. For example, the date must be greater than or equal to the date a particular store opened, and the time can run only through the date a store closed, or through the prior day (if complete sales data is available only for finished days).
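In SQL terms, the pattern looks roughly like the sketch below. The table and column names here (time_master, series_master, sales_fact, series_start_dt) are hypothetical placeholders; the concrete Snowflake examples that follow use real tables.

select tm.period_dt
, sm.series_id
, coalesce(f.sales_amt, 0) as sales_amt       -- empty value (0 sales) when no record exists
from time_master tm
cross join series_master sm                   -- every time period for every series
left outer join sales_fact f
    on f.period_dt = tm.period_dt
    and f.series_id = sm.series_id            -- keep actual values where they exist
where tm.period_dt >= sm.series_start_dt      -- e.g., the date a particular store opened
order by sm.series_id, tm.period_dt;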

Time Dimension

A master calendar of some sort is quite often already available in a database. There is likely no data warehouse without a time dimension table, often represented at the daily level. There are multiple ways to construct a calendar via SQL statements; for simplicity’s sake, I will be using the Snowflake database and leveraging the code snagged from this Snowflake Community article. To keep things very simple, this will be an abbreviated table containing only the days in December 2019 and January 2020.

 

CREATE OR REPLACE TABLE time_dim (
   TIME_ID           BIGINT      NOT NULL
  ,MY_DATE           DATE        NOT NULL
  ,YEAR              SMALLINT    NOT NULL
  ,MONTH             SMALLINT    NOT NULL
  ,MONTH_NAME        CHAR(3)     NOT NULL
  ,DAY_OF_MON        SMALLINT    NOT NULL
  ,DAY_OF_WEEK       VARCHAR(9)  NOT NULL
  ,WEEK_OF_YEAR      SMALLINT    NOT NULL
  ,DAY_OF_YEAR      SMALLINT    NOT NULL
) AS 
  WITH CTE_MY_DATE AS (
    SELECT DATEADD(DAY, SEQ4(), '2019-12-01') AS MY_DATE
      FROM TABLE(GENERATOR(ROWCOUNT=>62))  -- Number of days after reference date in previous line
  )
  SELECT SEQ8(), 
        MY_DATE
        ,YEAR(MY_DATE)
        ,MONTH(MY_DATE)
        ,MONTHNAME(MY_DATE)
        ,DAY(MY_DATE)
        ,DAYOFWEEK(MY_DATE)
        ,WEEKOFYEAR(MY_DATE)
        ,DAYOFYEAR(MY_DATE)
    FROM CTE_MY_DATE;

 

Although it’s generally inefficient to materialize a time dimension table at the granularity of hours/minutes/seconds, some small helper tables can assist in generating master time sets at these levels when needed. Two supporting tables that will be used in the more complex multiseries example (below) represent the hours in a day and the minutes in an hour (the latter can double as the seconds in a minute). These tables simply contain the values 0–23 and 0–59, respectively.

 

create table hours as
SELECT SEQ4() as hr
FROM TABLE(GENERATOR(ROWCOUNT=>24));  -- values 0 through 23

create table minutes as
SELECT SEQ4() as minute
FROM TABLE(GENERATOR(ROWCOUNT=>60));  -- values 0 through 59

 

From these tables, the time granularity can be moved up or down from the daily grain of time_dim. The calendar can be aggregated and filtered to create a master set of weeks ending on Saturdays or months starting on the first; cross-joined to the hours helper to create an hourly set; or cross-joined further to the minutes helper (twice, reusing it for seconds) to get a master time dataset at the granularity of every second.
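For instance, a minimal sketch of the hourly case simply cross joins the daily calendar to the hours helper to produce one row per hour:

-- One row per hour for every day in time_dim
select dateadd(hour, h.hr, t.my_date) as master_hour
from time_dim t
cross join hours h
order by master_hour;

The same pattern extends to minutes and seconds by adding further cross joins, as the multiseries example below does.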

Single Series with Pre-Aggregated Sales

To demonstrate a single series with some gaps, a simple sales table will be created and used. The intent is to produce a sales record for each of the first ten days of January 2020, even though the table holds only 5 sales records, and only 4 of them fall in January.

 

create or replace table sales as
select '25-Dec-2019'::date as sale_dt, 5000 as sales_amt
union all select '1-Jan-2020', 3432 
union all select '3-Jan-2020', 1231 
union all select '4-Jan-2020', 2221 
union all select '7-Jan-2020', 2791;

 

The SQL to produce the desired state is straightforward. The time dimension and the aggregate values are both already at the day level. An outer join between the two tables, limited to the date period of interest, is all that is required to produce this dataset.

 

select t.my_date, s.sales_amt
from time_dim t
left outer join sales s on t.my_date = s.sale_dt
where t.my_date between '1-Jan-2020' and '10-Jan-2020'
order by t.my_date;

 

[Screenshot: query results showing all ten January dates, with NULL sales_amt on the gap days]

This example was pretty simple; let's try a trickier one.

Multiseries Data Mart with Raw Sales

In this case, we'll leverage a more typical data mart data model in which there are two dimensions (time and store) and a fact table consisting of sales recorded on a per-minute basis. The sales were retrieved from the operational website at second-level timestamp granularity, as shown in the inline raw transaction values. A store dimension with a store opening date and a sales fact can be constructed as follows.

 

create or replace table store_dim as
select 5000 as store_id, 'Chicago' as store_city, '5-Dec-2018'::date as store_opening_date
union all select 6000 as store_id, 'Boston' as store_city, '2-Jan-2020';

create or replace table store_sales as
select t.time_id, date_trunc('minute', raw_sales.sale_timestamp)::timestamp as sales_minute  -- roll second-level timestamps up to the minute
, s.store_id, sum(raw_sales.sale_amt) as minute_sales_amt  -- total sales per store per minute
from (
  select '2019-12-25 09:37:04'::timestamp as sale_timestamp, 45.37 as sale_amt, 'Chicago' as city
  union all select '2020-01-01 08:22:14', 25.41, 'Chicago'
  union all select '2020-01-03 18:03:43', 12.92, 'Chicago'
  union all select '2020-01-03 07:31:11', 17.36, 'Boston'
  union all select '2020-01-03 07:33:27', 41.72, 'Boston'
  union all select '2020-01-03 07:33:51', 57.11, 'Boston'
) raw_sales,
time_dim t, store_dim s
where raw_sales.sale_timestamp::date = t.my_date and raw_sales.city = s.store_city
group by t.time_id, date_trunc('minute', raw_sales.sale_timestamp), s.store_id;

 

Although the raw data came in as 6 records with granularity down to the second, the last two sales at the Boston store occurred within the same minute. Thus, the 6 transaction records become 5 rows when the values are stored in the minute-based fact table.
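A quick look at the fact table confirms this:

select store_id, sales_minute, minute_sales_amt
from store_sales
order by store_id, sales_minute;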

[Screenshot: store_sales fact table contents (5 minute-level rows)]

The actual sales are at the minute-level granularity, so a master time dataset at the minute grain is now required. A cross join (cartesian join) between the daily time dimension and the helper hour and minute tables will produce this master time set. It also needs to be restricted; for illustration purposes, it will cover only the interval from January 1st to 3rd. The records need to be further restricted based on the stores: the Boston store did not open until January 2nd, so no records should be produced for it on the 1st, which requires an additional time restriction.

 

select s.store_city, dateadd(minute, h.hr * 60 + m.minute, t.my_date) as master_minute, ss.minute_sales_amt
from time_dim t
cross join hours h
cross join minutes m
cross join store_dim s
left outer join store_sales ss on dateadd(minute, h.hr * 60 + m.minute, t.my_date) = ss.sales_minute
    and s.store_id = ss.store_id
where t.my_date between '1-Jan-2020' and '3-Jan-2020'  --limit minute window to just minutes on these days
and t.my_date >= s.store_opening_date  --only produce records for series (stores) that were actually open as of the date
order by s.store_city, master_minute;

 

[Screenshot: gap-filled per-minute results for both stores, with NULL minute_sales_amt where no sales occurred]

IMPORTANT: The NULL values should be replaced using a SQL ISNULL/NVL/COALESCE function with a value appropriate for the given use case. This may be a 0 for sales, an average, a prior value for sensor data, etc.
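For example, zero-filling the multiseries results above just means wrapping the sales column in COALESCE; the rest of the query is unchanged:

select s.store_city, dateadd(minute, h.hr * 60 + m.minute, t.my_date) as master_minute
, coalesce(ss.minute_sales_amt, 0) as minute_sales_amt  -- impute 0 sales for minutes with no transactions
from time_dim t
cross join hours h
cross join minutes m
cross join store_dim s
left outer join store_sales ss on dateadd(minute, h.hr * 60 + m.minute, t.my_date) = ss.sales_minute
    and s.store_id = ss.store_id
where t.my_date between '1-Jan-2020' and '3-Jan-2020'
and t.my_date >= s.store_opening_date
order by s.store_city, master_minute;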

A Note on Training vs. Prediction Imputation for Time Series Forecasting

The time series imputation outlined above will often need to be performed both when generating a training set and when creating a batch of scoring data at prediction time for machine learning models that leverage time series data. The prediction point is the time (usually the current day) from which we want to make predictions into the future. Time series models usually need recent history prior to the prediction point from which to derive the features used to make predictions (e.g., lags and aggregations over the last 30 days used to forecast what’s going to happen over the next 7 days). In DataRobot, this prior window of time is the Feature Derivation Window (FDW). In the case of forecasting, the models may also make multiple predictions across a future time horizon (e.g., 2, 3, and 4 days into the future), called the Forecast Window (FW).

[Image: feature derivation window and forecast window relative to the prediction point]

Some information about the future will be known in advance, and these features will need to be included in the prediction batch sent to the model. When creating a batch of prediction data to send to a trained time series model, it is important to perform the same imputation steps that were applied at training time, with a few minor changes. The time series master table outer-joined with the historical time series will now need to cover the length of the FDW preceding the prediction point along with the forecast horizon that follows it. This creates a dataset with a structure similar to the one below:

[Image: example structure of a prediction dataset spanning the FDW and the forecast window]
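As a rough sketch against the single-series tables above, with a hypothetical prediction point of January 10, 2020, a 30-day FDW, and a 7-day FW (illustrative values only), the prediction batch could be assembled like this:

select t.my_date, s.sales_amt  -- sales_amt stays NULL after the prediction point; the model fills in these rows with predictions
from time_dim t
left outer join sales s on t.my_date = s.sale_dt
where t.my_date between dateadd(day, -30, '2020-01-10'::date)  -- FDW: the 30 days of history before the prediction point
                    and dateadd(day,   7, '2020-01-10'::date)  -- FW: the 7 future days to be forecast
order by t.my_date;

Any known-in-advance features would be joined in for the future rows as well.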

This ensures that missing rows are imputed appropriately within the FDW with the same logic used to create the training set, and that the forecast horizon is provided for the model to “fill in” with predictions. As always, it’s important to assess what type of imputation is necessary based on your use case and what the missingness of a timestamp means in a business context (e.g., does missingness mean there was no transaction, that there was an error in the system, or that some value exceeded a threshold?). This will inform whether zero imputation, forward-filling, or some kind of constant-value imputation is appropriate.
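For instance, forward-filling in Snowflake can be done with a LAG ... IGNORE NULLS window function over the gap-filled result. The sketch below shows this against the single-series example; it is just one of the imputation options mentioned above.

select t.my_date
, coalesce(s.sales_amt,
           lag(s.sales_amt) ignore nulls over (order by t.my_date)) as sales_amt_ffill  -- carry the last known value into gap days
from time_dim t
left outer join sales s on t.my_date = s.sale_dt
where t.my_date between '1-Jan-2020' and '10-Jan-2020'
order by t.my_date;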

Conclusion

Using the methods I’ve presented here, holes in both the time and series dimensions can be plugged for any granularity of time being explored in a time series project. Only records with real values need to be available in the source system; the gaps can be filled in as necessary with a few helper tables and some SQL applied to them. For additional help in these situations, feel free to comment below or reach out to your DataRobot account team for further assistance.
