Edit data in a project

MukeshB · ‎09-07-2023

Hi All,

Once we have created a project and loaded data into it, we can view the data. If we find data quality issues while exploring it, can we correct or edit the data and save the result in a separate project?

corykind · ‎10-02-2023

Great question! Thanks for asking.

Once data is loaded into a Project, it is generally immutable. This is by design, so users can more easily make apples-to-apples comparisons of different modeling approaches on the Leaderboard. The primary exception is feature lists: you can always create and test different feature transformations and feature lists within an existing Project.

If you do see data quality issues that you want to resolve, the best place to do that is before actually creating a Project. Here’s how you can do that with DataRobot:

Prepare data in the AI Catalog with SparkSQL - This is a good option for removing outliers or subsetting data that you already have in AI Catalog. Essentially, use SparkSQL to make the changes you want to your existing Dataset, save the result as a new Dataset, and then create new Projects to test the effects of those changes. If you want to keep iterating, you can always edit and rerun your SparkSQL code.
Build a Recipe with DataRobot Wrangler - Wrangler is a DataRobot feature that lets you visually inspect and prepare data before it is actually moved into the AI Catalog. As a user, you can do EDA, exclude rows or columns, execute transforms and joins, apply sampling, etc., and then Publish (e.g., materialize to DataRobot AI Catalog) the final version of the data that you actually want to use for modeling. The “Recipe” you create can be saved and tracked for the future. Note that unlike SparkSQL, Wrangler is currently only supported for some types of Database Connections.
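
To make the SparkSQL option a bit more concrete, here is a rough sketch of the kind of cleanup query you might write. Everything in it is an assumption for illustration: the `customers` table, the `customer_id`, `age`, and `income` columns, and the thresholds are all hypothetical, so swap in your own Dataset and rules. The query sticks to plain SQL (`BETWEEN`, `IS NOT NULL`, `ORDER BY`) that should also be valid in SparkSQL. The Python wrapper uses the standard-library `sqlite3` module purely as a local stand-in to sanity-check the filter logic on a few rows; it is not how the AI Catalog runs the query.

```python
import sqlite3

# Hypothetical cleanup query. The table name, column names, and thresholds
# are assumptions for illustration; adapt them to your own Dataset.
CLEANUP_QUERY = """
SELECT customer_id, age, income
FROM customers
WHERE age BETWEEN 18 AND 100
  AND income IS NOT NULL
ORDER BY customer_id
"""


def run_cleanup(rows):
    """Run CLEANUP_QUERY against an in-memory SQLite table.

    SQLite is only a local stand-in here (not Spark), used to check
    which rows the filter keeps before using the query elsewhere.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER, age INTEGER, income REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    cleaned = conn.execute(CLEANUP_QUERY).fetchall()
    conn.close()
    return cleaned


sample = [
    (1, 34, 52000.0),
    (2, 212, 48000.0),  # implausible age, filtered out
    (3, 45, None),      # missing income, filtered out
    (4, 29, 61000.0),
]

cleaned = run_cleanup(sample)
print(cleaned)
```

Once the filter keeps the rows you expect, the same `SELECT` statement is what you would adapt in the SparkSQL editor and save as a new Dataset, leaving the original Dataset untouched for comparison.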
