Can DataRobot differentiate between values that are missing at random and values that are missing for a reason while executing the algorithm? If so, how does it do that algorithmically across multiple modeling techniques?
Hey George, great question!
We definitely do handle both missing-at-random values and missing-not-at-random values with our blueprints. Using a single set of pre-processing steps for many different modeling algorithms would generally favor some algorithms over others and would not capture both types of missing values efficiently, so this is not a trivial issue. Instead, DataRobot chooses an intelligent mix of pre-processing steps and modeling algorithms for each of our modeling blueprints, which are customized for each dataset.
For example, with numeric values:
For linear models, we generally impute missing values with the median and create a binary missing-indicator feature. The indicator lets the model capture missing-not-at-random trends, while the median is a reasonable neutral fill when values are missing at random.
For tree-based models, we generally replace missing values with an arbitrary out-of-range value like -9999. Tree-based models handle this gracefully: they split those rows away from the other values if there is a relevant trend there, or rely more on other features if there is not.
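To make the two strategies concrete, here is a minimal sketch in plain Python. It is not DataRobot's actual blueprint code; the function names (`impute_for_linear`, `impute_for_trees`), the `income` column, and the choice to treat both `None` and NaN as missing are assumptions made for illustration only.

```python
import math
from statistics import median

# Arbitrary out-of-range fill value, matching the -9999 mentioned above.
SENTINEL = -9999.0


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_for_linear(column):
    """Median-impute a numeric column and build a 0/1 missing indicator."""
    observed = [v for v in column if not is_missing(v)]
    fill = median(observed)
    imputed = [fill if is_missing(v) else v for v in column]
    indicator = [1 if is_missing(v) else 0 for v in column]
    return imputed, indicator


def impute_for_trees(column, sentinel=SENTINEL):
    """Replace missing values with a sentinel a tree can split away."""
    return [sentinel if is_missing(v) else v for v in column]


# Hypothetical numeric feature with two missing entries.
income = [40.0, None, 55.0, float("nan"), 70.0]

linear_values, missing_flag = impute_for_linear(income)
tree_values = impute_for_trees(income)

print(linear_values)  # median-filled values for a linear model
print(missing_flag)   # extra feature marking which rows were missing
print(tree_values)    # sentinel-filled values for a tree-based model
```

In the linear case, the indicator column is what allows a coefficient to be learned for "this value was missing", so the median fill alone does not have to carry that signal. In the tree case, the sentinel sits far outside the observed range, so a single split can isolate the missing rows if that helps.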
Does that answer your question?
Best,
Alex