Clean your Learning Dataset before you begin any feature engineering. With DataRobot Paxata, you can clean your data and then prep it to add and remove features, all in a single project.
What data should be cleaned
If you’re looking for a starting point, here are a few common cases where you’ll need to clean your data.
Deduplication: a particular value is represented in various ways and you need to standardize on one value. For example, “New York”, “NY”, “New York City”, and “NYC”.
Removing leading values: for example, you need to remove leading zeros from Eastern US zip codes.
Standardizing date formats: for example, you have a date column whose values are represented in various formats, such as “mm/dd/yyyy” and “dd/mm/yy”.
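Outside of the Paxata interface, the three cleaning cases above can be sketched in a few lines of plain Python. This is only an illustration: the alias table, the list of date formats, and the function names are all assumptions made for the example, not part of DataRobot Paxata.

```python
from datetime import datetime

# Hypothetical alias table: variant spellings mapped to one canonical value.
CITY_ALIASES = {
    "ny": "New York",
    "nyc": "New York",
    "new york": "New York",
    "new york city": "New York",
}

def standardize_city(value: str) -> str:
    """Map known variants to one canonical value; leave unknown values as-is."""
    text = value.strip()
    return CITY_ALIASES.get(text.lower(), text)

def strip_leading_zeros(value: str) -> str:
    """Remove leading zeros, keeping a single "0" if the value is all zeros."""
    text = value.strip()
    if not text:
        return text
    return text.lstrip("0") or "0"

# Assumed set of formats present in the column, tried in this order.
DATE_FORMATS = ("%m/%d/%Y", "%d/%m/%y")

def standardize_date(value: str) -> str:
    """Return the date as yyyy-mm-dd using the first format that parses."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")
```

Note that the order of DATE_FORMATS decides how a value is read when more than one format could match, so it is worth checking a sample of the column before settling on an order.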
Data Prep Protips before you begin your prep
Consider before you aggregate. When you aggregate rows in your data, you are actually losing signals from the detailed records. If you think you must aggregate, then take the opportunity to use feature engineering to represent the data in another way and restore some of those lost signals. For example, you can add sums, means, standard deviations, etc. to create new features.
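As a rough sketch of that idea, the following plain Python groups detail records by key and keeps the count, sum, mean, and standard deviation for each group instead of a single total. The record layout (customer ID, purchase amount) and the function name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical detail records: (customer_id, purchase_amount).
transactions = [
    ("c1", 20.0),
    ("c1", 35.0),
    ("c1", 50.0),
    ("c2", 10.0),
    ("c2", 10.0),
]

def aggregate_with_features(rows):
    """Aggregate rows per key, keeping several summary statistics as features."""
    grouped = defaultdict(list)
    for key, amount in rows:
        grouped[key].append(amount)

    features = {}
    for key, amounts in grouped.items():
        features[key] = {
            "count": len(amounts),
            "sum": sum(amounts),
            "mean": mean(amounts),
            # Sample standard deviation needs at least two values.
            "std": stdev(amounts) if len(amounts) > 1 else 0.0,
        }
    return features

customer_features = aggregate_with_features(transactions)
```

Two customers with the same total can have very different spreads, and the standard deviation column keeps that difference visible after the rows are collapsed.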
Consider your outliers before removing them from your data. Ask yourself whether those observations are valuable for the model to learn from. Ultimately, you want to optimize for features that are important at prediction time because this results in faster computations with lower memory consumption.
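One way to review outliers before deciding is to flag them rather than drop them. The sketch below uses the common interquartile range (IQR) rule, which is an assumption made for this example and not a method prescribed here. The function name and the multiplier k are likewise illustrative.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it falls outside the IQR fences."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [not (low <= v <= high) for v in values]

# Hypothetical column of observations with one unusually large value.
observations = [10, 12, 11, 13, 12, 11, 95]
flags = flag_outliers(observations)

# Keep the flagged rows visible for review instead of deleting them.
flagged = [v for v, is_outlier in zip(observations, flags) if is_outlier]
```

Keeping the flag as its own column leaves the original observations intact, so you can still decide later whether the model should learn from them.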