Tip for accessing transformed data by using Composable ML - DataRobot Community

Jaume Masip · 06-14-2022

Let’s assume that you have just completed Autopilot for one of your projects in DataRobot. You are curious about the preprocessing steps that DataRobot selected automatically, and you would like to access the transformed data produced by a given step.

Alternatively, you may believe that you can improve the accuracy of your top Leaderboard model by modifying some of the existing preprocessing steps, but first you want to check that the output of a preprocessing step (i.e., the transformed data) is what you expect.

In either scenario, you can easily access the transformed data from any model on the Leaderboard by using Composable ML. The step-by-step process is described below.

Step 1. Create and upload a Custom Task from the Model Registry tab. More information on Custom Tasks can also be found in DataRobot's public GitHub repository.

In short, we need to create a custom.py file that writes the output of any preprocessing step to a CSV file. The fit hook in the following code does this. The transform hook returns its input unchanged.

from typing import List, Optional
import pickle
import pandas as pd
import numpy as np
from pathlib import Path

def fit(
    X: pd.DataFrame,
    y: pd.Series,
    output_dir: str,
    class_order: Optional[List[str]] = None,
    row_weights: Optional[np.ndarray] = None,
    **kwargs,
) -> None:
    """Dump the data that reaches this task to a CSV inside the model artifact."""
    output_dir_path = Path(output_dir)
    if output_dir_path.exists() and output_dir_path.is_dir():
        # Write all incoming training data to a CSV so that it can be
        # retrieved later through the model artifact download.
        X.to_csv(output_dir_path / "transformed_data.csv", index=False)

        # Write a placeholder artifact file to satisfy DRUM's requirements.
        with open(output_dir_path / "artifact.pkl", "wb") as fp:
            pickle.dump(0, fp)


def transform(X: pd.DataFrame, transformer):
    """Return the incoming data unchanged."""
    return X
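Before uploading the task, it can be useful to run the two hooks locally on a tiny dataset. The following is only a minimal sketch, not part of the original tip: check_hooks, X_toy and y_toy are hypothetical names, and it assumes the hooks follow the file names used above (transformed_data.csv and artifact.pkl).

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def check_hooks(fit_fn, transform_fn, X, y):
    """Run fit/transform on toy data and return the CSV written by fit."""
    with tempfile.TemporaryDirectory() as output_dir:
        fit_fn(X, y, output_dir)
        out = Path(output_dir)
        # fit is expected to have written both the CSV and the placeholder artifact
        dumped = pd.read_csv(out / "transformed_data.csv")
        with open(out / "artifact.pkl", "rb") as fp:
            artifact = pickle.load(fp)
    # transform is expected to pass the data through unchanged
    passed_through = transform_fn(X, artifact)
    assert passed_through.equals(X)
    return dumped


# Toy data, purely illustrative
X_toy = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
y_toy = pd.Series(["x", "y", "x"])

# Usage, assuming the hooks above are saved as custom.py next to this script:
#   from custom import fit, transform
#   print(check_hooks(fit, transform, X_toy, y_toy))
```

If the returned frame has the same columns and row count as the toy input, the fit hook is writing the file where the artifact download will pick it up.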

Once we have created this custom.py file, we can upload it to DataRobot as a custom task.

Note that we have used the prebuilt DataRobot environment “[DataRobot] Python 3 Scikit-Learn Drop-In” to run the custom task, but you can use your own custom environment if necessary.

Step 2. Modify a DataRobot-generated model to add this new Custom Task and retrain the model.

For this example, let’s assume that we would like to explore the output of two sequential preprocessing steps (see screenshot below) that DataRobot has automatically selected to handle a text variable (Consumer complaint narrative) in a multiclass project that aims to predict the type of consumer complaint.

First, we need to find a blueprint of interest (e.g., the 64% sample size version of the top model on the Leaderboard) and choose to copy and edit the blueprint.

Once we have copied the blueprint, we need to 1) add a new task, 2) select the custom task (either by typing its name, “Access Transform data in a Blueprint”, or by finding it in the “Custom” group of the right-hand menu), and 3) retrain the blueprint, as the following two screenshots show:

Step 3. Download the Model Artifact of the retrained model

Once the blueprint has been retrained, we can download the model artifact using the Download option on the model's Predict tab.

Step 4. Inspect the transformed data

After we have downloaded the model artifact for this modified blueprint, we can open the CSV file and explore the transformed data, as the next figure shows:
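The CSV can also be inspected programmatically rather than in a spreadsheet. The sketch below is one possible approach, assuming the downloaded artifact has been unpacked into a local folder that contains transformed_data.csv as written by the fit hook. The function name summarize_transformed and the example path are hypothetical.

```python
import pandas as pd


def summarize_transformed(csv_path):
    """Load the dumped data and collect a few quick sanity checks."""
    df = pd.read_csv(csv_path)
    summary = {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        # Columns that still contain missing values after preprocessing
        "columns_with_missing": [c for c in df.columns if df[c].isna().any()],
        # Numeric columns in which every value is zero
        "all_zero_columns": [
            c
            for c in df.columns
            if pd.api.types.is_numeric_dtype(df[c]) and (df[c] == 0).all()
        ],
    }
    return df, summary


# Usage (the path is a placeholder for wherever the artifact was unpacked):
#   df, summary = summarize_transformed("model_artifact/transformed_data.csv")
#   print(summary)
#   print(df.head())
```

For text preprocessing steps such as the ones in this example, the all-zero check can be a quick way to spot generated columns that carry no signal in the training sample.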

Let me know if you have further questions about this tip!

@Jaume Masip

shaz13 · 06-14-2022

This is awesome. Does this also work with image-based data?

Jaume Masip · 06-14-2022

Hello @shaz13
Very good question!
It should work. Please find an example below.

jenD · 06-27-2022

Thanks for this tip @Jaume Masip !
