Feature engineering (the process of transforming raw data into new features that are the key drivers of a particular business problem) is a critical and time-consuming step that plays an important role in the success of enterprise AI projects. Data is rarely collected with a particular business problem in mind: it can be stored across multiple systems and represented in many different formats. This means that designing, preparing, generating, and testing new, impactful features still poses significant challenges to enterprises when going from raw data to a deployable model.
This article is the first in a series that discusses the main challenges of feature engineering in real-world problems. Most of the value from AI projects comes from having machines make or support intelligent decisions, usually about something that is going to happen. Preparing historical data in order to make accurate predictions about the future is at the core of feature engineering. However, it's critical to be mindful of target leakage, which occurs when information that shouldn't be available before the prediction time is captured in the data and the resulting predictive model. If target leakage occurs during feature engineering, the model will fail when deployed because the features that caused the leakage won't work at prediction time.
To explain the relevant concepts we will use the illustrative example of loan defaults. Our goal—to predict if an applicant is going to default on a loan or not—is not something we know at the time of application. Loan default will be our target. In addition, we have historical data about previous loan applications and whether or not they have been repaid; this tells us the outcome of the target. We also have more information from each loan application, including demographics of the applicant, maybe their credit rating history and, in some cases, even a history of past credit card transactions.
Figure 1. Information not known at Prediction time, after the FDW
All this information is known at the time of loan application, or the “Prediction time.” At that time, the predictive model is expected to return a prediction about a future event; for our example, this will indicate whether or not the loan is going to default. Figure 1 shows an illustration of prediction time and how the data (in green) is available at that time and so can be used for feature engineering.
As shown, proper feature engineering requires avoiding target leakage, and time awareness is critical for success when generating features.
This brings us to the concept of the feature derivation window (or FDW), which is also represented in Figure 1. This window has a fixed temporal length (e.g., 3 months, 1 year, etc.), defined relative to the prediction time. So, in our example, if we have applications that come at different prediction times, we use the FDW to ensure that all of the applicant data used to create features for our loan default model is available before the prediction time.
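To make the FDW concrete, here is a minimal sketch in pandas. The data, column names, and window length are all hypothetical; the point is simply that each record is kept only if it falls inside the window ending at the prediction time.

```python
import pandas as pd

# Hypothetical transaction records for one applicant.
transactions = pd.DataFrame({
    "CustomerID": [17, 17, 17],
    "date": pd.to_datetime(["2019-11-02", "2020-01-15", "2020-03-20"]),
    "amount": [120.0, 80.0, 45.0],
})

prediction_time = pd.Timestamp("2020-03-01")  # date of the loan application
fdw_length = pd.Timedelta(days=90)            # e.g., a 3-month FDW

# Keep only records inside the FDW, i.e., strictly before the prediction time.
in_fdw = transactions[
    (transactions["date"] >= prediction_time - fdw_length)
    & (transactions["date"] < prediction_time)
]
# The 2020-03-20 record falls after the prediction time and is excluded:
# using it would be exactly the target leakage the FDW guards against.
```

Because the filter is expressed relative to each application's prediction time, the same logic works even when applications arrive at different dates.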
Another important concept is the cut-off, which represents the end of the FDW, that is, the latest date from which data can be used to generate features. When the cut-off falls before the prediction time, there is an operational gap between the actual prediction date and the date at which the data usable at prediction time is available. Figure 1 shows the cut-off represented before the prediction time; in problems where this operational gap doesn't apply, the cut-off is the same as the prediction time.

Figure 2. Loan data with target defined
A properly defined FDW ensures that no data after the prediction time will be used in the model. However, it is important to mention that, when creating the target, we will need to use that post-prediction-time data. Creating the target sometimes also involves data transformation because the outcome we are trying to predict might not be stored in the available data. For example, if we only have repayment data, then we will need to infer that a client who didn't repay the loan (according to that data) should be flagged as a bad loan (BadLoan).
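As a sketch of this kind of target derivation (with hypothetical column names and values), suppose the repayment data only records how much of each loan was paid back; the BadLoan flag can then be inferred:

```python
import pandas as pd

# Hypothetical repayment records: one row per historical loan.
loans = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "LoanAmount": [1000.0, 5000.0, 2000.0],
    "AmountRepaid": [1000.0, 3200.0, 2000.0],
})

# The target is not stored directly, so we infer it from repayment
# behavior: a loan that was not repaid in full is flagged as a bad loan.
loans["BadLoan"] = loans["AmountRepaid"] < loans["LoanAmount"]
```

Note that this derivation necessarily uses post-prediction-time data (the repayment outcome), which is acceptable for the target but never for the features.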
In Figure 2, we see a concise description of our loan default problem. We have a customer identifier (CustomerID), the target (BadLoan), and the prediction time (date, in this case the date of application). Here the target has already been given to us; note that defining the target is a topic we plan to discuss further in a future article. (A recent DataRobot Community learning session also covers this topic.)
Now that our loan default problem table (or primary table) has been described (as shown in Figure 2), the next step is to add features that can help us model BadLoan. We can link other tables that have information about the customer. One simple way to incorporate additional features would be to link an existing profile table containing demographics of the customer that were collected before the prediction time. A more interesting example is to use a credit card transaction table, where a single customer might have multiple records. Figure 3 shows an example of this type of table; note that the records need to be summarized (e.g., average transaction amount, trending spend over the past 6 months, etc.) before being linked with the primary table, so that we have one record per customer.
Figure 3. Example credit card transaction table
As explained when presenting the feature derivation window concept, only records before the prediction date can be used when generating features (again, see Figure 1). It is also important to consider the length of the FDW, as different statistics will be computed depending on it: for instance, average spend in the last 14 or 30 days, most frequent transaction type, etc. Something worth noting is that we can see different feature types in our transaction table. The type of feature aggregation and transformation will have to take the feature type into account. While statistics like min, max, and average are easy to implement and understand, other data types such as categoricals, text, geo, and image data can make feature engineering more challenging. (This is another topic that we plan to expand on in future articles.)
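The summarization described above can be sketched as follows. This is a minimal illustration with hypothetical data and column names: transactions are filtered to an FDW of a given length, then aggregated to one row per customer, with the aggregation chosen per feature type (numeric amounts get an average, the categorical transaction type gets its most frequent value).

```python
import pandas as pd

# Hypothetical credit card transaction table (multiple rows per customer).
transactions = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2020-02-05", "2020-02-20", "2020-02-25", "2020-02-10", "2020-02-22"]
    ),
    "amount": [50.0, 30.0, 20.0, 200.0, 100.0],
    "type": ["grocery", "grocery", "travel", "fuel", "fuel"],
})

prediction_time = pd.Timestamp("2020-03-01")

def summarize(window_days):
    """Aggregate each customer's transactions within an FDW of `window_days`."""
    start = prediction_time - pd.Timedelta(days=window_days)
    in_window = transactions[
        (transactions["date"] >= start) & (transactions["date"] < prediction_time)
    ]
    return in_window.groupby("CustomerID").agg(
        avg_amount=("amount", "mean"),          # numeric: average spend
        n_transactions=("amount", "size"),      # numeric: transaction count
        top_type=("type", lambda s: s.mode().iloc[0]),  # categorical: most frequent
    )

# Different FDW lengths yield different feature values for the same customer.
features_14d = summarize(14)
features_30d = summarize(30)
```

Each result has one row per customer, so it can be joined directly onto the primary table by CustomerID.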
In this article, we introduced the importance of feature engineering and some of its fundamental challenges in real-world AI projects. We mainly focused on target leakage and problem formulation using loan defaults as an illustrative example.
In the coming articles we plan to discuss model deployment, data types/transformations, and a process for designing a primary dataset with a target variable.