Feature Engineering Data Transformation

Hello, I have a primary dataset and multiple secondary datasets to join with it. How can we make DataRobot keep all the original fields (regardless of whether they are informative) and create calculated fields from two or more features before modeling starts?

For example, I have an alert date in the primary data, which can be joined to multiple account opening dates in the secondary data. I would like to calculate the difference between the account opening date and the alert date as account age; however, the account open date is discarded by DataRobot.

Please advise, thanks

7 Replies
joao
Data Scientist

Hi @Jieruide,

In the AI Catalog, you can create a custom feature list on your secondary dataset that includes the non-informative feature. You can then use that custom feature list in the relationship editor. DataRobot will then use the 'account opening' feature when searching for date differences. If the difference between 'account opening' and 'alert date' is informative, it will be available in your project before modeling. You can also inspect the feature derivation log to confirm that the features in your custom feature list have been included.

Hi joao, thanks for the answer. I did create a custom feature list and used it in the relationship editor. I also turned off supervised feature reduction. Overall I did get more features, but the date features are still discarded. To provide more info:

  • The project is time aware and alert date is the timestamp; however, open date is not set as time aware.
  • How does DataRobot treat date variables? What kinds of transformations are performed or allowed? I did see transformations such as day of week, but neither the open date itself nor the number of days from open date to alert date is shown.

Please advise, thanks

joao
Data Scientist

Hi @Jieruide, I suggest you set open date as time aware so that the date difference with the primary date (alert date) is computed. Note that you should set a feature derivation window large enough to ensure good coverage of that secondary dataset. Let me know if it works. Thanks

Hi @joao, I talked to DataRobot, and they suggested adding a secondary table that has a one-to-one relationship with the primary table, WITHOUT a time-aware setup. I had a similar thought to what you suggested, but it would then require setting the look-back period long enough to accommodate the earliest open date.

Because my secondary table has a many-to-one relationship with the primary table, I ended up deriving the feature myself and summarizing it using avg, min, or max (e.g., avg account open date).
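
For anyone landing here later, the manual derivation described above could look roughly like the sketch below. It is plain Python, not DataRobot functionality, and every name in it (`alerts`, `accounts`, `customer_id`, `open_date`, `add_account_age_features`) is hypothetical. It computes account age in days per account, then summarizes to one row per alert with min, max, and avg. Skipping accounts opened after the alert date is an extra assumption added here to avoid using future information; it was not part of the guidance in this thread.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical primary rows: one row per alert.
alerts = [
    {"alert_id": 1, "customer_id": "A", "alert_date": date(2023, 6, 1)},
    {"alert_id": 2, "customer_id": "B", "alert_date": date(2023, 6, 15)},
]

# Hypothetical secondary rows: many accounts per customer.
accounts = [
    {"customer_id": "A", "open_date": date(2020, 6, 1)},
    {"customer_id": "A", "open_date": date(2023, 5, 2)},
    {"customer_id": "B", "open_date": date(2023, 6, 5)},
    {"customer_id": "B", "open_date": date(2023, 7, 1)},  # opened after the alert
]


def add_account_age_features(alerts, accounts):
    """Return one row per alert with min/max/avg account age in days."""
    # Group account open dates by the (assumed) join key.
    opens_by_customer = defaultdict(list)
    for acct in accounts:
        opens_by_customer[acct["customer_id"]].append(acct["open_date"])

    enriched = []
    for alert in alerts:
        # Age in days of each account as of the alert date.
        # Assumption: accounts opened after the alert are ignored.
        ages = [
            (alert["alert_date"] - opened).days
            for opened in opens_by_customer.get(alert["customer_id"], [])
            if opened <= alert["alert_date"]
        ]
        row = dict(alert)
        row["account_age_days_min"] = min(ages) if ages else None
        row["account_age_days_max"] = max(ages) if ages else None
        row["account_age_days_avg"] = mean(ages) if ages else None
        enriched.append(row)
    return enriched


features = add_account_age_features(alerts, accounts)
```

The result has a one-to-one relationship with the primary table, so it could be merged into the primary dataset or supplied as a secondary table without a time-aware setup.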

Hey @Jieruide - Thanks for sharing that guidance back to the community. Hoping this means you're moving forward again to predictions!

-linda