Solved: differentiating missing at random vs. missing with... - DataRobot Community

george1986 · 12-16-2019

Can DataRobot differentiate between values that are missing at random and values that are missing for a reason while executing the algorithm? If yes, how does it go about doing that algorithmically across multiple modeling techniques?

alxw · 12-17-2019

Hey George, great question!

We handle both missing-at-random and missing-not-at-random values in our blueprints. This is not a trivial issue: applying one set of pre-processing steps to many different modeling algorithms would generally favor some algorithms over others, and would not capture both types of missing values efficiently for every algorithm. Instead, DataRobot chooses an intelligent mix of pre-processing steps and modeling algorithms for each of our modeling blueprints, which are customized for each dataset.

For example, with numeric values:

For linear models, we generally impute missing values using the median and also create a binary missing-indicator feature. The indicator lets the model capture missing-not-at-random trends, since it can get its own coefficient, while the median is a reasonable neutral fill when values are missing at random.
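To make the linear-model case concrete, here is a minimal sketch in plain Python. It is only an illustration of median imputation plus a missing indicator, not DataRobot's actual implementation; the function name `impute_median_with_indicator` and the sample `income` column are made up for this example.

```python
import math
from statistics import median


def impute_median_with_indicator(values):
    """Return (imputed_values, missing_flags) for one numeric column.

    Missing entries are assumed to be None or float('nan').
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values to take a median from")

    fill = median(observed)  # median of the observed values only
    imputed = [fill if is_missing(v) else v for v in values]
    flags = [1 if is_missing(v) else 0 for v in values]  # 1 = was missing
    return imputed, flags


# Hypothetical column with two missing entries.
income = [40.0, None, 55.0, float("nan"), 70.0]
imputed, flags = impute_median_with_indicator(income)
print(imputed)  # [40.0, 55.0, 55.0, 55.0, 70.0]
print(flags)    # [0, 1, 0, 1, 0]
```

A linear model would then see both columns. If missingness carries signal, the indicator's coefficient can pick it up; if not, that coefficient can stay near zero. In scikit-learn, `SimpleImputer(strategy="median", add_indicator=True)` produces a similar pair of outputs.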

For tree-based models, we generally replace missing values with arbitrary values like -9999, and tree-based models handle those gracefully, splitting them away from other values if there is a relevant trend there, or relying on other features more if there is not.
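The tree-based case can be sketched the same way. This is an illustration only, under the assumption that the sentinel lies well below every real value in the column; `fill_with_sentinel`, `split_below`, and the sample `age` column are hypothetical names for this example and do not come from DataRobot.

```python
import math

SENTINEL = -9999.0  # value mentioned in the post; assumed to be below all real values


def fill_with_sentinel(values, sentinel=SENTINEL):
    """Replace None/NaN entries with an out-of-range sentinel value."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [sentinel if is_missing(v) else v for v in values]


def split_below(values, threshold):
    """Mimic one decision-tree split: rows with value < threshold go left."""
    left = [i for i, v in enumerate(values) if v < threshold]
    right = [i for i, v in enumerate(values) if v >= threshold]
    return left, right


# Hypothetical column with two missing entries.
age = [23.0, None, 41.0, float("nan"), 35.0]
filled = fill_with_sentinel(age)

# Any threshold between the sentinel and the smallest real value
# isolates the missing rows in a single split.
smallest_real = min(v for v in filled if v != SENTINEL)
threshold = (SENTINEL + smallest_real) / 2
left, right = split_below(filled, threshold)
print(left)   # [1, 3]  -> rows that were missing
print(right)  # [0, 2, 4]
```

A tree only makes that split if it improves the fit, which is why the sentinel is harmless when missingness carries no signal.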

Does that answer your question?

Best,

Alex

