@max_roman, good question.
A few quick things on missing value imputation:
- Different blueprints handle this in different ways. Linear/coefficient-based blueprints use the median imputation you mention, while tree-based blueprints impute an arbitrary value far outside the data distribution.
- 'Missing indicator' columns/features are created but are not directly visible, because their impact is grouped into the 'core' feature from which the 'missing indicator' column was created.
- You can see which features received missing-value treatment, along with further details, under 'Describe -> Data Quality Handling Report' for the blueprint on the Leaderboard.
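To make the two strategies above concrete, here is a minimal sketch in plain Python. It is only an illustration, not the platform's actual implementation: the function names are hypothetical, and the sentinel value of -9999 is an assumption chosen just to sit far outside the example data.

```python
from statistics import median


def median_impute_with_indicator(values):
    """Fill None with the median of the observed values.

    Also returns a 0/1 'missing indicator' list, the kind of extra
    column a linear model can use to learn an effect for missingness.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


def sentinel_impute(values, sentinel=-9999):
    """Fill None with an arbitrary value far outside the data range.

    The default sentinel is an assumption for this sketch. A tree can
    isolate these rows with a single split on the feature.
    """
    return [sentinel if v is None else v for v in values]


ages = [34, None, 51, 29, None, 42]

imputed, is_missing = median_impute_with_indicator(ages)
tree_ready = sentinel_impute(ages)

print(imputed)     # [34, 38.0, 51, 29, 38.0, 42]
print(is_missing)  # [0, 1, 0, 0, 1, 0]
print(tree_ready)  # [34, -9999, 51, 29, -9999, 42]
```

The indicator list is what would get folded back into the 'core' feature when impact is reported, which is why you don't see it as a separate feature.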
These insights may or may not be particularly informative. In the above case, we see that there weren't any missing values. In the case below, we see quite the opposite: 20-50% of values are missing for many features.
Notice the feature indicated by the orange arrow. It has ~51% missing values and still ended up being used in the model. Its impact is quite small, though, likely because so many of its values are missing that much of the otherwise-relevant signal was lost.
This model was built with ~51% of this feature's values missing in the training data, so our understanding of its importance to the model reflects that context. If future data (or training data updated to find and fill those missing values) had fewer missing values, the feature's impact on the model might be different, and it could have a larger Feature Impact score. This highlights the importance of data-drift tracking (training vs. future predictions). If you saw newer data with fewer missing values, for example, that would be an indication that you should retrain your model.
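If you want a quick check of your own outside the platform, a rough sketch like the one below compares a feature's missing rate in training data against newer scoring data. This is not how the built-in drift tracking works; the function names and the 0.10 threshold are hypothetical choices for illustration only.

```python
def missing_rate(values):
    """Fraction of entries that are None (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def missing_rate_shift(train_values, scoring_values, threshold=0.10):
    """Compare missing rates between training and scoring data.

    The threshold is an arbitrary assumption for this sketch; pick one
    that makes sense for your own data.
    """
    train_rate = missing_rate(train_values)
    scoring_rate = missing_rate(scoring_values)
    shift = scoring_rate - train_rate
    return {
        "train_rate": train_rate,
        "scoring_rate": scoring_rate,
        "shift": shift,
        "flag": abs(shift) > threshold,
    }


# Toy data: 51% missing at training time, 5% missing in newer data.
train_feature = [None] * 51 + [1.0] * 49
scoring_feature = [None] * 5 + [1.0] * 95

report = missing_rate_shift(train_feature, scoring_feature)
print(report["flag"])  # True
```

A flagged feature here would be a prompt to look closer and consider retraining, not a verdict on its own.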