I'm confused as to how feature selection and hyperparameter tuning work in a supervised setting with the DataRobot Platform. As a data scientist, I've recently encountered issues trying to split a dataset into train, validation, and test sets and using these splits to run feature selection and hyperparameter tuning. If I use my validation set to test my feature selection methods and pick my best features, can I then use the same validation set to tweak my hyperparameters? To explain further, after I pick my best features, I use those features in a model fit on the training set and then run hyperparameter tuning on the validation set. The problem is that I have already tweaked my features on the validation set, and I'm now using those features to tweak my hyperparameters on the same validation set, so I've technically already seen the validation set and optimized to it. I would then use the test set to see how my model would perform on out-of-sample data. How does DataRobot work around this problem, and what are strategies to run both feature selection and hyperparameter tuning in an ML pipeline?
I like that you are thinking about these issues. Overfitting to a validation dataset is a common problem within data science.
It's important to highlight:
DataRobot does not use the validation dataset for hyperparameter tuning. Validation scores are entirely out of sample. This helps to prevent the overfitting that can occur when tuning to a validation dataset.
Instead, we tune the model on a split within the training data. You can learn a lot more in the documentation under: Data partitioning and validation.
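To make the idea in the reply above concrete, here is a minimal sketch of tuning on a split carved out of the training partition while the validation partition is scored only once, after tuning. It is an illustration under stated assumptions, not DataRobot's implementation: the synthetic data, the toy k-nearest-neighbor model, the partition sizes, and all names (`knn_predict`, `mse`, `fit_rows`, `tune_rows`) are hypothetical.

```python
import random

def knn_predict(fit_rows, x, k):
    """Average the targets of the k fit_rows whose x is nearest to the query."""
    nearest = sorted(fit_rows, key=lambda row: abs(row[0] - x))[:k]
    return sum(target for _, target in nearest) / len(nearest)

def mse(fit_rows, score_rows, k):
    """Mean squared error of k-NN predictions on score_rows."""
    errors = [(knn_predict(fit_rows, x, k) - target) ** 2 for x, target in score_rows]
    return sum(errors) / len(errors)

rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    data.append((x, x * x + rng.gauss(0, 0.1)))  # synthetic, illustrative only

# Outer partitions: validation and holdout are set aside before any tuning.
# The holdout partition is never touched in this sketch.
train, validation, holdout = data[:120], data[120:160], data[160:]

# Inner split: carve a tuning subset out of the training partition only.
fit_rows, tune_rows = train[:90], train[90:]

# Hyperparameter search: every candidate k is judged on tune_rows, never on validation.
candidates = [1, 3, 5, 9, 15]
best_k = min(candidates, key=lambda k: mse(fit_rows, tune_rows, k))

# The validation partition is scored once, after k is fixed.
validation_mse = mse(train, validation, best_k)
print(best_k, round(validation_mse, 4))
```

Because no validation row influences the choice of `best_k`, the validation score in this sketch stays out of sample with respect to tuning.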
Thank you for the follow-up rshah. It makes sense to me that not using the validation set for hyperparameter tuning avoids overfitting to this data. What is the validation set used for then? If the validation set is not used to tune any portion of the model building process, does DataRobot just use it to pick the best final model to deploy after all models have been built and tuned? At that point, it's not much different than another holdout set? The results should be the same for validation and holdout? My confusion still lies in how DataRobot manages to perform feature selection and hyperparameter tuning on the data. Does it use different data to pick features than it uses to perform hyperparameter tuning? If not, would there not be data leakage from optimizing against the same data in both of these processes, resulting in an overfitted model? Or are features treated as hyperparameters, with feature selection and model hyperparameters all tuned in the same process? Thank you again! @rshah
Hi @rshah. After reading the data partitioning and validation documentation, I understand how inner folds are used to hyperparameter tune, and then the final model with tuned hyperparameters is picked off the normal validation set. However, I'm still confused about the feature selection process. If feature selection uses data that will be used in the hyperparameter/model selection process to tune/pick the best model, then doesn't overfitting occur since we are using the same data for both processes? For example:
Model-agnostic feature importance. Before running any algorithms, DataRobot determines the univariate importance of each feature with respect to the target variable.
Using this approach, linking features to a target requires using data that might be used later down the pipeline to tune hyperparameters and select potential models. Thank you again!
Let's dive into feature selection with DataRobot. DataRobot starts with a set of informative features from your data (removing features that don't have any value, such as duplicate features or features that have the same value). This feature list is then applied to a variety of blueprints that contain different algorithms.
DataRobot performs only one step of feature selection automatically: it creates a DR Reduced feature list, which is a subset of the most important features for the best-performing model. Because this explicit feature reduction is done only once, there should be minimal concern about overfitting. This process of feature selection is described here: https://community.datarobot.com/t5/resources/feature-lists/ta-p/1825
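On the general question of strategies for a pipeline built outside DataRobot, one common approach is to repeat any target-aware feature selection inside each cross-validation fold, so the rows used to score a fold never influence which features were picked for it. The sketch below illustrates that idea only. It does not describe DataRobot's internals, and the synthetic data, the correlation-based ranking, the nearest-centroid classifier, and all function names are assumptions made for the example.

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return abs(cov) / (vx * vy) ** 0.5 if vx and vy else 0.0

def select_top(rows, labels, n_keep):
    """Rank features by |correlation| with the label, using only the rows given."""
    n_features = len(rows[0])
    scores = [abs_corr([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:n_keep]

def centroid_accuracy(train_x, train_y, test_x, test_y, cols):
    """Accuracy of a nearest-centroid classifier restricted to the chosen columns."""
    centroids = {}
    for c in (0, 1):
        members = [r for r, label in zip(train_x, train_y) if label == c]
        centroids[c] = [sum(r[j] for r in members) / len(members) for j in cols]

    def predict(row):
        return min(centroids, key=lambda c: sum(
            (row[j] - m) ** 2 for j, m in zip(cols, centroids[c])))

    return sum(predict(r) == label for r, label in zip(test_x, test_y)) / len(test_y)

rng = random.Random(0)
X, y = [], []
for _ in range(300):
    label = rng.randint(0, 1)
    # Features 0 and 1 carry signal; features 2 to 5 are pure noise (synthetic).
    row = [rng.gauss(label, 1.0), rng.gauss(-label, 1.0)]
    row += [rng.gauss(0, 1.0) for _ in range(4)]
    X.append(row)
    y.append(label)

n_folds = 5
fold_scores = []
for fold in range(n_folds):
    test_idx = set(range(fold, len(X), n_folds))
    train_idx = [i for i in range(len(X)) if i not in test_idx]
    tr_x, tr_y = [X[i] for i in train_idx], [y[i] for i in train_idx]
    te_x, te_y = [X[i] for i in sorted(test_idx)], [y[i] for i in sorted(test_idx)]
    # Selection sees only this fold's training rows, never its scoring rows.
    cols = select_top(tr_x, tr_y, n_keep=2)
    fold_scores.append(centroid_accuracy(tr_x, tr_y, te_x, te_y, cols))

cv_accuracy = sum(fold_scores) / n_folds
print(round(cv_accuracy, 3))
```

The same pattern extends to hyperparameters: treating the number of features kept as one more value searched inside the folds keeps both decisions away from the data that produces the final score.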