
🗓 ASK THE EXPERT: Let's discuss Data Prep - Feb 24

DataRobot Alumni

Welcome to this Ask the Expert discussion

Data scientists spend over 80% of their time collecting, cleansing, and preparing data for machine learning. You can significantly simplify this with DataRobot Paxata. Using "clicks instead of code" reduces your data prep time from months to minutes and gets you to reliable predictions faster.

In this Ask the Expert event, you can chat with Krupa and ask your questions about data prep. Krupa is available to help clarify and answer your questions on this interesting and important topic.

 


Krupa Natarajan is a Product Management leader at DataRobot. Krupa has spent over a decade leading multiple Data Management products and has deep expertise in the space. She has a passion for product innovations that deliver customer value and a proven track record of driving vision to execution.

 

=============================================================================

Hi Everyone,

This Ask the Expert event is now closed. 

Thank you Krupa for being a terrific event host!

Let us know your feedback on this event and suggestions for future events, and look for our next Ask the Expert event coming soon.

Thanks everyone!

28 Replies
Image Sensor

Hi @knat ,

Thank you for taking my question. My question is: what's the difference between data prep for business intelligence / data warehousing and data prep for machine learning / AI?

Thank you, Nicole

Image Sensor

Hi Krupa - thanks for taking my question!

I'm very interested in the capabilities around data prep, as it's a critical step in the process for everyone. My question is, can I run a real-time prediction pipeline in Paxata?

DataRobot Alumni

Hi Nicole! A number of steps are common, though there are some key differences.

Both BI and ML/AI use cases require that the user has access to data from a variety of data sources and the ability to work with a variety of data formats, join datasets together, cleanse and standardize the data (this step is very important to ensure prediction quality), and perform transformations, aggregations, and such.

In addition, data prep exercises for ML/AI can be split into two distinct life cycles: (a) Training Dataset Preparation and (b) Inference/Prediction Dataset Preparation.

For Training datasets, the Data Scientist/Business Analyst preparing data should address the following critical aspects based on the business value they are trying to achieve:

  • "What do I want to Predict?" - This helps the data scientist/business analyst clearly define the Target Variable, determine the type of Target variable (for example, is this a binary classification problem requiring the Target to be binary, or a regression problem requiring the Target variable to be numeric), and define computations to create targets. Definition of the Target Variable could require that the user enrich the current dataset with additional variables (capabilities such as automatic Join detection can greatly augment this)
  • "For whom/what?" - This helps the data scientist/business analyst identify the unit of analysis and helps structure the unit of observation in the data leading towards that outcome. Data prep tools like Pivot, De-Pivot, Transpose help achieve the right Primary Table definition for the training exercise
  • "At what time?" - This defines when Predictions will be run and helps data scientist/business analyst identify and remove any variable that may not be available at the time of prediction. Tools like Paxata's column lineage can help surface the variables that contributed to Target variable definition and help the user quickly identify if there are scenarios that can potentially lead to Target overfitting or Target leakage

For Prediction-time data prep, you will need the data prep tool to operationalize and potentially automate as many of the data acquisition, data merging, cleansing, and transformation steps as possible before the data can be sent to deployed models for generating prediction scores. In many cases, after scores are returned, more data prep steps may be applied.
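The "at what time?" check above can be made concrete with a small sketch. This is plain Python, not Paxata (which does this with clicks rather than code), and the feature names are hypothetical: the idea is simply to drop, before training, any column that will not exist at prediction time.

```python
# Hypothetical feature names for illustration only.
KNOWN_AT_PREDICTION_TIME = {"customer_id", "tenure_months", "plan_type"}

def leakage_safe(rows):
    """Keep only the columns that will exist when predictions are run."""
    return [
        {k: v for k, v in row.items() if k in KNOWN_AT_PREDICTION_TIME}
        for row in rows
    ]

rows = [{"customer_id": 1, "tenure_months": 12, "plan_type": "pro",
         "cancel_date": "2021-02-01"}]  # cancel_date leaks the target
print(leakage_safe(rows))
# → [{'customer_id': 1, 'tenure_months': 12, 'plan_type': 'pro'}]
```

A lineage tool like Paxata's surfaces which columns fed the target definition so such leaky columns are easier to spot.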

 

DataRobot Alumni

Thank you for your interest. 

Paxata has a new 'Predict Tool' that allows DataRobot deployments to be invoked directly from Data Prep projects. The data acquisition + data prep steps + prediction scoring can all be operationalized using Paxata's Intelligent Automation capability and scheduled to run automatically or on-demand.

This workflow is accessible through the REST API as well, enabling near real-time predictions.
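As a rough sketch of what triggering such a run over REST could look like: the code below only builds the HTTP request with Python's standard library, and the host, route, payload shape, and auth token are placeholders, not the actual Paxata API (consult your deployment's REST documentation for the real endpoints).

```python
import json
import urllib.request

# Placeholders -- the real host, route, and auth scheme depend on your
# DataRobot Paxata deployment; see its REST API documentation.
BASE_URL = "https://paxata.example.com/rest/projects/run"
API_TOKEN = "YOUR-API-TOKEN"

def build_run_request(project_id):
    """Build (but do not send) a POST request that would trigger a run."""
    payload = json.dumps({"projectId": project_id}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_TOKEN,
        },
        method="POST",
    )

req = build_run_request("prep-123")
# urllib.request.urlopen(req) would actually submit the job.
```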

Image Sensor

Thank you Krupa!

Image Sensor

Krupa, how is DataRobot Paxata Data Prep different from other data prep tools on the market with regard to data prep for AI?

Thanks!

DataRobot Alumni

Great question! Paxata has been the leader in the Data Prep market (as recognized by major analyst reports such as the Gartner Magic Quadrant), and now with the merger of DataRobot and Paxata, DataRobot combines best-in-class Data Prep with a best-in-class Enterprise AI platform.

DataRobot Paxata is the only Data Prep offering that enables Data Scientists and Business Analysts to interact with their full scale of data without being limited to small samples. This is a key differentiator when it comes to enabling users to identify data quality issues and cleanse the data for ML exercises. 

DataRobot Paxata also has unique intelligence capabilities such as the patented Join detection: DataRobot Paxata automatically identifies how datasets join together for Feature Enrichment. Algorithmic Fuzzy Join is supported for scenarios where enrichment data coming from different systems and applications may be represented in different ways, making exact matches nearly impossible. In such common scenarios, DataRobot Paxata's fuzzy matching allows for Feature Enrichment regardless of the variation in the data.

Another important capability is DataRobot Paxata's algorithmic standardization: with a single click, Paxata will identify similar values (for example, misspellings in city names) in categorical variables and standardize them, leading to better training data and hence better prediction quality.

DataRobot Paxata is closely integrated with the DataRobot core, allowing exploration of the AI Catalog from within the Data Prep experience, invocation of deployed models for prediction scoring from within a Data Prep project using the Predict tool, and exploration of prediction results, including Prediction Explanations, in the Data Prep project for better conversion of predictions to value.
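To give a feel for the value-standardization idea (Paxata's own algorithms are more sophisticated and this sketch is not them), here is a minimal Python version using the standard library's difflib: each rare spelling is folded into the most frequent close match.

```python
from collections import Counter
import difflib

def standardize(values, cutoff=0.8):
    """Map each value to the most frequent spelling among close matches.

    Sketch only: a real tool also handles ties between equally common
    spellings and clusters values transitively.
    """
    counts = Counter(values)
    ordered = [v for v, _ in counts.most_common()]  # most frequent first
    mapping = {}
    for v in ordered:
        # Only fold a value into spellings that are strictly more common.
        candidates = [c for c in ordered if counts[c] > counts[v]]
        match = difflib.get_close_matches(v, candidates, n=1, cutoff=cutoff)
        mapping[v] = match[0] if match else v
    return [mapping[v] for v in values]

print(standardize(["Boston", "Boston", "Bostan", "New York"]))
# → ['Boston', 'Boston', 'Boston', 'New York']
```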

Image Sensor

Thanks for the response @knat . 

Could you please explain in a bit more detail how fuzzy join works?

Blue LED

Hi, 

Regarding the data prep files, are there "any" file size limitations or file type limitations?

Mounting Hub

Hi Krupa ,

Can we create new DB tables and write them back with Paxata data prep, or is it read-only?

thanks!

DataRobot Alumni

There is no hard technical limit on dataset/file sizes in DataRobot Paxata. There are, however, guardrails that are configurable. For an ideal interactive user experience, you (the admin in this case) will typically configure the number of Spark cores needed to support your dataset sizes; DataRobot Paxata customer success teams can help determine the sizing. It is possible to configure a limit on the number of rows that users interact with in their Project when creating Data Prep steps (typically in the tens of millions), and to set a different limit on the number of rows that can be processed in a Batch job when the data prep steps are applied, with the ability to dynamically scale resources for completing the batch jobs.

DataRobot Alumni

In most Data Preparation exercises, Business Analysts and Data Scientists are working with raw Data from more than one Data Source (such as Database tables, Cloud Storage files, Cloud application data etc). Once the data preparation steps are applied, the prepared data (referred to as an 'Answerset' in DataRobot Paxata) is used in ML platforms for training models or running predictions. 

Although DataRobot Paxata supports a variety of DataSources to which you can write the data back, typically the prepared data is written to AI Catalog, Cloud Storage or Data Warehouses

DataRobot Alumni

Fuzzy matching helps in scenarios where you need to join data from different data sources and the data may not be represented in exactly the same way. For example, the customer name may be 'Danny Pool Service' in one dataset and 'Danny's Pool Service & Repair' in another.

DataRobot Paxata uses a number of algorithmic techniques, such as application of Jaro-Winkler similarity and automatic detection of stop words (such as 'and', 'Inc', 'Jr', etc.), to determine matches.
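For the curious, here is a textbook Jaro-Winkler implementation plus a simple stop-word normalization pass in Python. This is a sketch of the general technique, not Paxata's actual (more sophisticated) implementation, and the stop-word list is illustrative only.

```python
import re

STOP_WORDS = {"and", "inc", "jr"}  # illustrative list only

def normalize(name):
    """Lowercase, strip punctuation, and drop stop words before matching."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def jaro(s1, s2):
    """Textbook Jaro similarity, from 0.0 (no match) to 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):  # find matching chars within the window
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0     # count matched chars that are out of order
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # → 0.961
a = normalize("Danny Pool Service")
b = normalize("Danny's Pool Service & Repair")
print(round(jaro_winkler(a, b), 2))  # well above a typical match cutoff
```

After normalization, the two 'Danny' variants score high enough to be treated as the same business, which is exactly the kind of pair an exact join would miss.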

Image Sensor

It is good to know there are no hard limits on file sizes. Looking forward to using Paxata soon!

Image Sensor

Hello Krupa,

Thank you for taking my question.

What are the most relevant and popular Paxata transforms for Data Scientists?

Cheers,

Chris

 

DataRobot Alumni

Hi @c_stauder ! 

While there are a number of relevant transformations, the most important and heavily used ones would be:

  • Joins (Ability to combine Datasets for Feature enrichment is a critical part of the Data prep exercise for Data Scientists. Richness in capability such as automatic Join detection, Support for various types of Joins and option to Fuzzy match augment the Feature enrichment exercise)
  • Cluster and Edit (Algorithmically cleansing categorical variables to standard values is important in ensuring Prediction accuracy)
  • Computed columns (Paxata supports close to 100 transformations that span DateTime, Text, Mathematical, Statistical function types that help in transforming data and defining Target Variables)
  • Predict Tool (Ability to run Predictions against a deployed model in DataRobot and retrieve prediction scores and prediction explanations that can be explored and prepped further for feeding into subsequent models or downstream applications)

A number of other transformations deserve mention: the Remove Rows tool (for removing unwanted observations), Filtergrams (which aid visual exploration of data and selection of criteria for Remove Rows and other transformations), Aggregate operations such as fixed/sliding windowed aggregates, Imputation functions such as linear/average fill up and down, Shaping operations such as Pivot/De-Pivot, and so on.

All of these transformations are automatically captured in the Step Editor for replay and/or sharing. DataRobot Paxata also allows multiple users to collaborate on a single Project while defining transformations
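As a small illustration of the 'fill down' imputation mentioned above: in Paxata this is a click, but the underlying logic is a simple forward fill, sketched here in plain Python.

```python
def fill_down(values):
    """Forward-fill: replace each missing value (None) with the most
    recent non-missing value above it in the column."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

print(fill_down([3, None, None, 7, None]))  # → [3, 3, 3, 7, 7]
```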

Image Sensor

Thank you very much Krupa!

NiCd Battery

Hello

I'm reading all your responses and realized I have a question of my own. Can you explain how Paxata (with DataRobot) works with traditional data catalogs?

appreciate your time!

 

NiCd Battery

@knat, Can I schedule my workflows to feed into DataRobot modeling?

Image Sensor

Hi Krupa, 

Thanks for taking my question: Can you please provide some context on the difference between data prep and ETL?

Thanks, Krista

DataRobot Alumni

Hi @sallyS ! 

Paxata and DataRobot complement traditional data catalogs. Users typically leverage a traditional catalog to locate the data they are looking for; once found, they can bring that data into DataRobot Paxata for preparation and then leverage the prepared data in their AutoML exercise.

DataRobot Alumni

Yes, you absolutely can. You can schedule a DataRobot Paxata Automatic Project Flow (APF) to go from ingestion of data, to data prep, to scoring, to post-scoring data prep steps, to export. This end-to-end workflow can be run as a single Job, either on a schedule or on-demand through the UI/REST API.

DataRobot Alumni

Hi Krista!

Great question. Traditional ETL typically caters to Data Engineers and IT developers who are very technical. IT developers receive requirements from business counterparts and implement them into data pipelines. This is a waterfall model, with a lifecycle involving requirements gathering, implementation, testing, and delivery/acceptance by the business. Any further changes that the business needs start back at the top of that life cycle.

Data Preparation tools, on the other hand, are built for Business Analysts, who can interact with their data and interactively apply data cleansing and data transformation steps. To enable Business Analysts to achieve this, Data Preparation tools often embed intelligence and recommendations. For example, DataRobot Paxata can automatically detect Joins across datasets and bring datasets together, while this would traditionally have been achieved through SQL scripts written by an IT developer in an ETL tool.

Another key difference is the nature of the use cases. ETL tools have been very successful in loading data into enterprise warehouses where the structure of the data rarely changes. Data Preparation tools are helpful when businesses need to work with frequently changing and/or new data, as Business Analysts can explore the data and create transformations in an adaptive way.

Image Sensor

Hi Krupa,

Ok, great! Thank you for your super response!

Regards,

Krista
