The express goal of the DataRobot platform is to empower anyone to build AI applications quickly and easily, and to maintain them over time. The quality of the data you use to create those applications, and then feed into them for predictions, is critical to your overall success.
DataRobot’s AI Catalog comprises three key functions:
Ingest—How data gets into DataRobot and is sanitized for use throughout the platform.
Storage—The AI Catalog is where all reusable data assets can be found and understood.
Data Preparation—All the capabilities you need to clean, blend, transform, and enrich your data to maximize the effectiveness of your application.
Review the results of the Exploratory Data Analysis (EDA) that is performed when the data is ingested. (If you are a licensed DataRobot customer, you can search the in-app Platform Documentation for Overview and EDA to learn more.)
Data can be ingested into DataRobot from your local system, from URLs, from Hadoop (if deployed in a Hadoop environment), and via Data Connections to common databases and data lakes. A critical part of the data ingestion process is Exploratory Data Analysis (EDA). EDA actually happens twice within DataRobot: once when data is ingested, and again once a target has been selected and modeling has begun. (For information about EDA and supported file types, see the community article Importing Data Overview. If you are a licensed customer, you can find more information in the in-app Platform Documentation by searching for Data connections, Overview and EDA, or Dataset requirements.)
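To make the EDA step above concrete, here is a rough, stdlib-only analogue of the kind of per-feature summary EDA surfaces at ingest (inferred type, missing count, basic statistics). This is an illustrative sketch, not DataRobot code; the column names and dataset are invented for the example.

```python
# Illustrative sketch only -- not DataRobot code. At ingest, EDA surfaces
# per-feature statistics (type, missing count, min/max/mean, unique values).
# This stdlib-only analogue profiles a tiny inline CSV the same way.
import csv
import io
import statistics

raw = io.StringIO(
    "age,income,city\n"
    "34,72000,Boston\n"
    "29,,Denver\n"
    "41,95000,Boston\n"
)

rows = list(csv.DictReader(raw))

def profile(column):
    """Return a small EDA-style summary for one column."""
    values = [r[column] for r in rows]
    missing = sum(1 for v in values if v == "")
    numeric = []
    for v in values:
        try:
            numeric.append(float(v))
        except ValueError:
            pass
    summary = {"missing": missing, "unique": len(set(values))}
    if numeric and len(numeric) == len(values) - missing:
        # All non-missing values parsed as numbers: treat as numeric.
        summary.update(type="numeric", mean=statistics.mean(numeric),
                       min=min(numeric), max=max(numeric))
    else:
        summary["type"] = "categorical"
    return summary

for col in rows[0]:
    print(col, profile(col))
```

The real EDA pass computes far richer statistics (and runs again, target-aware, once modeling begins), but the shape of the output is similar: one summary per feature.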
DataRobot’s AI Catalog is a centralized collaboration hub for working with data and related assets. The AI Catalog allows you to seamlessly find, understand, share, tag, and reuse data. Data assets within the AI Catalog can either be materialized “snapshots” of tables/views or be “dynamic,” meaning that the whole dataset is ingested from your data source only when you create a modeling project from it, allowing you to work with the most up-to-date data. If the data is snapshotted, those snapshots can be refreshed automatically on a schedule, and they are also versioned automatically to preserve dataset lineage and enhance DataRobot’s overall governance capabilities. (If you are a licensed customer, you can find more information in the in-app Platform Documentation by searching for AI Catalog or Load data/create projects.)
Data preparation plays a critical role in any AI/ML project. Raw data straight from the source is rarely clean enough, at the right unit of analysis, or sufficiently enriched to be useful as-is. DataRobot, in the true spirit of automated machine learning, automates as much of the data cleaning and feature engineering as possible, and does so in ways that are specific to each model type. Data enrichment is also easy within DataRobot: Feature Discovery can automatically join datasets and create new features for you.
However, there is a point where manual data preparation is needed. DataRobot currently offers two types of data preparation:
DataRobot Paxata data prep allows data analysts and citizen data scientists to visually and interactively explore, clean, combine, and shape data for training and deploying machine learning models and production data pipelines to accelerate innovation with AI. Data science teams can collaborate, reuse, and share data sources, datasets, and recipes with full enterprise governance and security to ensure compliance with organizational policies. Click HERE to request a 14-day trial of DataRobot Paxata.
Spark SQL code prep allows any DataRobot user to enrich, transform, shape, and blend together datasets using Spark SQL queries right within the AI Catalog.
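To illustrate the kind of blending Spark SQL code prep enables, the sketch below runs a join-and-aggregate query of the shape you might write in the AI Catalog. In the AI Catalog this would be Spark SQL over registered catalog datasets; here Python's stdlib sqlite3 stands in so the example is self-contained, and the table and column names are invented for the example.

```python
# Illustrative sketch only. In the AI Catalog the query below would be
# Spark SQL against registered catalog datasets; sqlite3 (stdlib) stands in
# here so the same join-and-aggregate query shape can run anywhere.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (1, 25.0), (2, 10.0)])
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "East"), (2, "West")])

# Blend the two datasets and derive per-region features.
query = """
    SELECT c.region,
           COUNT(*)     AS n_orders,
           SUM(o.total) AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
for region, n_orders, revenue in conn.execute(query):
    print(region, n_orders, revenue)
```

The result of such a query can be saved back to the AI Catalog as a new dataset and reused across projects, which is what makes SQL-based prep convenient for shaping modeling data in place.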