We've all been there: data for customer transactions are in one table, but the customer membership history is in another. Or, you have sensor-level data at the sub-second level in one table, machine errors in another table, and production demand in yet another table, all at different time frequencies. Electronic Medical Records (EMRs) are another common instance of this challenge. You have a use case for your business you want to explore, so you build a v0 dataset and use simple aggregations from before, perhaps in a feature store. But moving past v0 is hard.
The reality is, the hypothesis space of relevant features explodes when considering multiple data sources with multiple data types in them. By dynamically exploring the feature space across tables, you minimize the risk of missing signal by feature omission and further reduce the burden of a priori knowledge of all possible relevant features.
Event-based data is present in every vertical and is becoming more ubiquitous across industries. Building the right features can drastically improve performance. However, understanding which joins and time horizons are best suited to your data is challenging, and also time-consuming and error-prone to explore.
In this accelerator, you'll find a repeatable framework for a production pipeline from multiple tables. This code uses Snowflake as a data source, but it can be extended to any supported database. Specifically, the accelerator provides a template to: