How is the complexity factor calculated for a relationship?

Dear Community!

I'm trying to create a project based on multiple datasets to use DataRobot's Feature Discovery.

I created a "time-aware" relationship by setting a feature derivation window, but this relationship was skipped because of a complexity factor of ~38. The Feature Derivation Log says:

 

Relationship DR_PRIMARY_TABLE[key1, key2], secondary_table[key1, key2] was removed because complexity factor 38.9601854077 exceeds 30

 

My question is: how is this metric calculated, and what are the ways to decrease it?

 

Best regards,

Evgeni


3 Replies
Kenny
DataRobot Alumni

The complexity factor is an estimate of the number of rows that need to be processed for a secondary dataset, expressed as a multiple of the number of rows in that dataset.

In time-aware joins where there are multiple primary rows with the same join keys but different prediction points, the same rows in the secondary dataset need to be processed once for each prediction point. In such cases the number of rows processed can far exceed the actual dataset size, making the joins computationally expensive to execute. Having many examples in the primary dataset that derive features from overlapping windows is also generally not good data science practice, as it can sometimes result in model overfitting.

To reduce the complexity, we can reduce the number of prediction points for each unique set of join keys in the primary dataset, or reduce the size of the feature derivation window.
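A rough way to reason about the factor, following the description above, is to count how many secondary rows fall inside each prediction point's derivation window and divide that total by the secondary dataset size. The pandas sketch below only illustrates that idea and is not DataRobot's actual calculation; the column names Date and PredictionPoint, the helper name, and the 30-day default window are all assumptions for the example.

import pandas as pd

def estimated_complexity_factor(primary, secondary, keys, window_days=30):
    # Rough illustration only -- not DataRobot's formula. Counts the
    # secondary rows scanned across all prediction points (the same row
    # is counted once per window it falls into) and expresses the total
    # as a multiple of the secondary dataset size.
    window = pd.Timedelta(days=window_days)
    rows_processed = 0
    for _, row in primary.iterrows():
        # Secondary rows that match this primary row on the join keys...
        key_match = (secondary[keys] == row[keys]).all(axis=1)
        # ...and whose timestamps fall inside the derivation window
        # ending at this row's prediction point.
        in_window = secondary["Date"].between(
            row["PredictionPoint"] - window, row["PredictionPoint"]
        )
        rows_processed += (key_match & in_window).sum()
    return rows_processed / len(secondary)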

Thanks, @Kenny!

That explains how it works a bit. Would the rules change if I used a prediction point instead of a time-aware project, or is the feature engineering process the same for the two?

 

Did I get it right that there are two ways to decrease a relationship's complexity:

1. Make the derivation window smaller;

2. Decrease the number of features in the secondary dataset?

 

Best regards,

Evgeni

Yes, the feature engineering process is the same and only depends on the selected prediction point. When date partitioning is selected, the primary date used for partitioning is automatically set as the prediction point. An alternative prediction point can also be selected manually, as long as that date is always on or before the partitioning date.

You are right that reducing the derivation window size will reduce the complexity.

However, decreasing the number of features does not affect the complexity, as it does not change the rows that need to be processed. Instead, you can use prediction points that are spaced further apart for primary rows with the same join keys.

For example, with a 30-day feature derivation window on a secondary dataset, the following primary dataset derives features from overlapping time windows. The complexity can be high if the primary dataset is large.

CustomerID | Prediction Point
1          | 2016-02-01
1          | 2016-02-08
1          | 2016-02-15

 

By picking prediction points that are spaced at intervals similar to or larger than the derivation window, the overlaps can be reduced, which will reduce the complexity.

CustomerID | Prediction Point
1          | 2016-02-01
1          | 2016-03-01
1          | 2016-04-01
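To make the effect concrete, here is a small usage example of the estimated_complexity_factor sketch from earlier in this thread, run on made-up data matching the tables above (again, only an illustration, not DataRobot's calculation). The weekly prediction points re-scan most of the customer's secondary events two or three times, while the monthly prediction points scan each event about once, so the estimated factor drops.

import pandas as pd

# Made-up secondary events for one customer, one event per day.
secondary = pd.DataFrame({
    "CustomerID": 1,
    "Date": pd.date_range("2016-01-05", "2016-02-15", freq="D"),
})

weekly = pd.DataFrame({
    "CustomerID": 1,
    "PredictionPoint": pd.to_datetime(["2016-02-01", "2016-02-08", "2016-02-15"]),
})
monthly = pd.DataFrame({
    "CustomerID": 1,
    "PredictionPoint": pd.to_datetime(["2016-02-01", "2016-03-01", "2016-04-01"]),
})

# Overlapping weekly windows scan the same secondary rows repeatedly,
# so the estimated factor is noticeably higher than with monthly spacing.
print(estimated_complexity_factor(weekly, secondary, keys=["CustomerID"]))   # ~2.1
print(estimated_complexity_factor(monthly, secondary, keys=["CustomerID"]))  # ~1.0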