We found a few features that aren't useful for training our model and want to remove them. Is this something we have to do manually (in the model)? Also, would DataRobot actually recognize them as non-useful and recommend removing them during validation?
"If you want to aim for parsimonious models, you can remove features with a low feature impact score. To do this, create a new feature list (in the Feature Impact tab) that has the top features and build a new model with that feature list. You can then compare the difference in model performance and decide whether the parsimonious model is better for your use case."
Hope this helps!
We have a feature that is really just a character string project ID. It is being treated as a text feature with significant importance associated with it, and we know it shouldn't be. What is the easiest way to ask DR to ignore the feature forever? Is it better to delete it from our input spreadsheet?
Hi @mike-pell ,
The short answer is that DataRobot will recognize features that are not useful to the model. For each model, we calculate the feature impact (i.e., importance) using either SHAP or permutation importance. We then automatically build a reduced feature list from the top model's feature importances. So if, for example, your top model is M123, you'll see a new feature list called DR Reduced Features M123, and DataRobot will retrain M123 using that reduced feature list. Note that DataRobot will also exclude redundant features from the reduced list, i.e., features with a nearly identical effect on the target.
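DataRobot's internals aside, the core idea behind permutation importance is simple to sketch with scikit-learn (a minimal illustration of the concept, not DataRobot's implementation; the feature names here are invented): shuffle one column at a time and see how much the model's score drops.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500

# "signal" strongly drives the target; "noise" is pure randomness.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Shuffle each column in turn and measure the drop in accuracy:
# a large drop means the model relied on that feature.
result = permutation_importance(model, X, y, n_repeats=20, random_state=0)
print(result.importances_mean)  # signal's importance dwarfs noise's
```

A feature whose shuffling barely moves the score, like `noise` above, is a candidate for removal from the feature list.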
But more broadly, many of our modeling approaches can effectively "remove" features on their own:
- L1 regularization can shrink feature coefficients to exactly 0, effectively "removing" those features from the model.
- Tree-based blueprints may learn not to split on certain features, effectively ignoring them.
- Neural networks can likewise learn weights at or near zero, effectively ignoring certain features.
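The L1 point can be seen concretely with scikit-learn's Lasso (a toy sketch, not a DataRobot blueprint; the data and alpha value are made up): with enough regularization, coefficients on uninformative features are driven to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n = 200

# Only the first column actually influences y; the other four are noise.
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=n)

# L1 penalty (alpha) soft-thresholds coefficients toward zero.
model = Lasso(alpha=1.0).fit(X, y)
print(model.coef_)  # noise coefficients land at exactly 0.0
```

Tree and neural network blueprints achieve a similar effect implicitly, but they don't zero features out as cleanly, which is why an explicit reduced feature list is still worthwhile.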
With that said, it is still best practice to remove features that are clearly random, e.g. ID fields. And feature selection, i.e. building parsimonious models, becomes increasingly important as your dataset shrinks, because spurious correlations become more likely.
For example, say you have a dataset of 200 customers who either bought or didn't buy your product, split evenly between the two classes. Further assume you have a completely random feature, e.g. "shirt_color", which can be green, black, white, red, etc. With only 200 customers, by sheer chance it might happen that everyone wearing a red shirt (say 6 people) bought your product. That feature might then show up as impactful even though it is really just statistical noise.
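The shirt-color scenario is easy to reproduce with a quick simulation (the 20-color encoding and thresholds below are invented to mirror the example): a purely random categorical feature regularly produces a small group that is "perfectly" aligned with the target.

```python
import numpy as np

rng = np.random.default_rng(7)
n_trials = 2000
spurious = 0

for _ in range(n_trials):
    # 200 customers, 100 buyers and 100 non-buyers.
    bought = rng.permutation(np.repeat([0, 1], 100))
    # A completely random feature with 20 possible values ("shirt colors").
    color = rng.integers(0, 20, size=200)
    for c in range(20):
        wearers = bought[color == c]
        # Did some color worn by at least 5 people happen to be all buyers?
        if len(wearers) >= 5 and wearers.mean() == 1.0:
            spurious += 1
            break

print(f"{spurious / n_trials:.1%} of trials show a 'perfect' random color")
```

A nontrivial fraction of trials produce such a pattern from noise alone, which is exactly why small datasets reward aggressive feature pruning.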