
🗓 ASK THE EXPERT: Let's discuss Data Prep - Feb 24

BobF
DataRobot Alumni


Welcome to this Ask the Expert discussion

Data scientists spend over 80% of their time collecting, cleansing, and preparing data for machine learning. You can significantly simplify this with DataRobot Paxata. Using "clicks instead of code" reduces your data prep time from months to minutes and gets you to reliable predictions faster.

In this Ask the Expert event, you can chat directly with Krupa and ask her your questions about data prep. She is available to clarify and answer your questions on this important topic.

 

[Photo: Krupa Natarajan]

Krupa Natarajan is a Product Management leader at DataRobot. Krupa has spent over a decade leading multiple Data Management products and has deep expertise in the space. She has a passion for product innovations that deliver customer value and a proven track record of driving vision to execution.

 

=============================================================================

Hi Everyone,

This Ask the Expert event is now closed. 

Thank you, Krupa, for being a terrific event host!

Let us know your feedback on this event and your suggestions for future ones, and look for our next Ask the Expert event coming soon.

Thanks everyone!


Hi Krupa,

Can we create new DB tables and write them back with Paxata data prep, or is it read-only?

thanks!


There is no hard technical limit on dataset or file sizes in DataRobot Paxata. There are, however, configurable guardrails. For an ideal interactive user experience, an admin typically configures the number of Spark cores needed to support the expected dataset sizes; the DataRobot Paxata customer success team can help determine the sizing. You can configure a limit on the number of rows a user interacts with in their Project when creating data prep steps (typically in the tens of millions), and set a different limit on the number of rows processed in a batch job when the data prep steps are applied, with the ability to dynamically scale resources to complete the batch jobs.
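To make the interactive-versus-batch distinction concrete, here is an illustrative sketch in Python. The setting names and values are hypothetical, not actual Paxata configuration keys; the real guardrails are set by an admin inside the product.

```python
# Hypothetical guardrail values for illustration only; real limits are
# admin-configured inside DataRobot Paxata, not set in code.
INTERACTIVE_ROW_LIMIT = 20_000_000  # rows a user sees while building prep steps
                                    # ("typically in 10s of millions")

def rows_processed(total_rows: int, interactive: bool) -> int:
    """How many rows a session touches under these guardrails."""
    if interactive:
        # Interactive projects work on at most the configured row limit.
        return min(total_rows, INTERACTIVE_ROW_LIMIT)
    # Batch jobs apply the same steps to the full dataset, scaling
    # Spark resources dynamically to complete the run.
    return total_rows

print(rows_processed(55_000_000, interactive=True))   # 20000000
print(rows_processed(55_000_000, interactive=False))  # 55000000
```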

In most data preparation exercises, business analysts and data scientists work with raw data from more than one data source (such as database tables, cloud storage files, cloud application data, etc.). Once the data preparation steps are applied, the prepared data (referred to as an 'Answerset' in DataRobot Paxata) is used in ML platforms for training models or running predictions.

Although DataRobot Paxata supports writing the data back to a variety of data sources, the prepared data is typically written to the AI Catalog, cloud storage, or a data warehouse.
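To answer the write-back question directly in code terms, here is a minimal sketch of the equivalent code-based flow, assuming pandas and SQLAlchemy; the connection string, table, and column names are hypothetical. Paxata accomplishes the same through its connector UI rather than code.

```python
# A minimal code-based analogue of Paxata's write-back, assuming pandas and
# SQLAlchemy; the connection string and table names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@warehouse:5432/analytics")

raw = pd.read_sql_table("customers_raw", con=engine)   # read a source table
answerset = raw.dropna(subset=["customer_id"])         # ...data prep steps...

# Write the prepared data back as a NEW table, so write-back is not read-only.
answerset.to_sql("customers_prepared", con=engine, index=False,
                 if_exists="replace")
```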

Fuzzy matching helps in scenarios where you need to join data from different data sources and the data may not be represented in exactly the same way. For example, a customer name may be 'Danny Pool Service' in one dataset and 'Danny's Pool Service & Repair' in another.

DataRobot Paxata uses a number of algorithmic techniques, such as Jaro-Winkler similarity and automatic detection of stop words (such as 'and', 'Inc', 'Jr', etc.), to determine matches.
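As a rough illustration of that idea (not Paxata's actual implementation), here is a sketch assuming the third-party jellyfish library; the stop-word list is made up for the example.

```python
# Sketch of stop-word removal + Jaro-Winkler matching, assuming the
# third-party "jellyfish" library (the function was named jaro_winkler in
# some older releases). Paxata's real algorithm and thresholds are internal.
import re
import jellyfish

STOP_WORDS = {"and", "inc", "jr", "service", "repair"}  # illustrative list

def normalize(name: str) -> str:
    # Lowercase, strip punctuation, and drop stop words and stray letters.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS and len(t) > 1)

a = "Danny Pool Service"
b = "Danny's Pool Service & Repair"

score = jellyfish.jaro_winkler_similarity(normalize(a), normalize(b))
print(f"{score:.2f}")  # 1.00 here, so the two names fuzzy-join as a match
```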

It is good to know there are no hard limits on file sizes. Looking forward to using Paxata soon!

Hello Krupa,

Thank you for taking my question.

What are the most relevant and popular Paxata transforms for Data Scientists?

Cheers,

Chris

 


Hi @c_stauder!

While there are a number of relevant transformations, the most important and most heavily used are:

  • Joins (the ability to combine datasets for feature enrichment is a critical part of the data prep exercise for data scientists; capabilities such as automatic join detection, support for various join types, and the option to fuzzy match all augment the feature enrichment exercise)
  • Cluster and Edit (algorithmically cleansing categorical variables to standard values is important for ensuring prediction accuracy)
  • Computed columns (Paxata supports close to 100 transformations spanning DateTime, Text, Mathematical, and Statistical function types that help in transforming data and defining target variables)
  • Predict Tool (the ability to run predictions against a deployed model in DataRobot and retrieve prediction scores and prediction explanations, which can be explored and prepped further for feeding into subsequent models or downstream applications; see the sketch after this answer)

A number of other transformations deserve mention: the Remove Rows tool (for removing unwanted observations), Filtergrams (which aid visual exploration of data and selection of criteria for Remove Rows and other transformations), aggregate operations such as fixed/sliding windowed aggregates, imputation functions such as linear/average fill up and down, shaping operations such as Pivot/De-Pivot, and so on; a few pandas equivalents are sketched below.
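As promised above, here is a hedged sketch of a few code-based equivalents (join, fill-down imputation, sliding-window aggregate, pivot), using pandas with made-up column names; in Paxata these are click-driven steps rather than code.

```python
# Hypothetical data illustrating pandas equivalents of a few Paxata steps.
import pandas as pd

orders = pd.DataFrame({"cust": ["a", "a", "b"], "day": [1, 2, 1],
                       "amt": [10.0, None, 7.0]})
names = pd.DataFrame({"cust": ["a", "b"], "name": ["Danny", "Rita"]})

enriched = orders.merge(names, on="cust", how="left")  # Join for enrichment
enriched["amt"] = enriched["amt"].ffill()              # "fill down" imputation
enriched["amt_sum3"] = enriched["amt"].rolling(3, min_periods=1).sum()
                                                       # sliding-window aggregate
wide = enriched.pivot_table(index="cust", columns="day", values="amt")
                                                       # Pivot (shaping)
print(wide)
```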

All of these transformations are automatically captured in the Step Editor for replay and/or sharing. DataRobot Paxata also allows multiple users to collaborate on a single Project while defining transformations.
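For the Predict Tool item above, the underlying call is DataRobot's deployment Prediction API; the sketch below uses Python's requests library with a hypothetical host, deployment ID, and token (exact headers, such as a DataRobot-Key, vary by installation).

```python
# Hedged sketch of scoring an answerset against a deployed DataRobot model.
# Host, deployment ID, token, and file name are hypothetical placeholders.
import requests

PRED_SERVER = "https://example-prediction-server.datarobot.com"  # hypothetical
DEPLOYMENT_ID = "your-deployment-id"                             # hypothetical
API_TOKEN = "your-api-token"                                     # hypothetical

with open("answerset.csv", "rb") as f:
    resp = requests.post(
        f"{PRED_SERVER}/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictions",
        params={"maxExplanations": 3},  # also return prediction explanations
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "text/csv; charset=UTF-8"},
        data=f,
    )
resp.raise_for_status()
print(resp.json()["data"][0])  # first row's prediction plus explanations
```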

Thank you very much, Krupa!


Hello

I've been reading all your responses and realized I have a question of my own. Can you explain how Paxata (with DataRobot) works with traditional data catalogs?

appreciate your time!

 


@knat, can I schedule my workflows to feed into DataRobot modeling?
