Data scientists spend over 80% of their time collecting, cleansing, and preparing data for machine learning. You can significantly simplify this with DataRobot Paxata. Using "clicks instead of code" reduces your data prep time from months to minutes and gets you to reliable predictions faster.
In this Ask the Expert event, you will be able to chat with Krupa and ask your questions about data prep. On this interesting and important topic, Krupa is available to help clarify and answer your questions.
Krupa Natarajan is a Product Management leader at DataRobot. Krupa has spent over a decade leading multiple Data Management products and has deep expertise in the space. She has a passion for product innovations that deliver customer value and a proven track record of driving vision to execution.
This Ask the Expert event is now closed.
Thank you Krupa for being a terrific event host!
Let us know your feedback on this event and your suggestions for future events, and look for our next Ask the Expert event coming soon.
Thanks for taking my question: Can you please provide some context on the difference between data prep and ETL?
Hi @sallyS !
Paxata and DataRobot complement traditional data catalogs. Users typically use a traditional catalog to locate the data they are looking for. Once they find that data, they can bring it into DataRobot Paxata for preparation and then use the prepared data in their AutoML exercise.
Yes, you absolutely can. You can schedule a DataRobot Paxata Automatic Project Flow (APF) to go from data ingestion to data prep, to scoring, to post-scoring data prep steps, to export. This end-to-end workflow can be run as a single Job, either on a schedule or on demand through the UI or REST API.
Great question. Traditional ETL typically caters to Data Engineers and IT developers who are very technical. IT developers receive requirements from business counterparts and implement those requirements in data pipelines. This is a waterfall model, with a lifecycle of requirements gathering, implementation, testing, and delivery/acceptance by the business. Any further changes that the business needs start back at the top of that lifecycle.
Data Preparation tools, on the other hand, are built for Business Analysts. Business Analysts can work with their data interactively, applying data cleansing and data transformation steps as they go. To make this possible, Data Preparation tools often embed intelligence and recommendations. For example, DataRobot Paxata can automatically detect Joins across datasets and bring the datasets together, whereas traditionally an IT developer would have achieved this with SQL scripts in an ETL tool.
Another key difference is the nature of the use cases. ETL tools have been very successful in loading data into enterprise warehouses where the structure of the data rarely changes. Data Preparation tools are helpful when businesses need to work with frequently changing data and/or new data, as Business Analysts can explore the data and create transformations in an adaptive way.
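To make the Join detection idea a bit more concrete, here is a minimal sketch of one way a tool could suggest join keys: compare the values in every pair of columns and rank the pairs by overlap. This is only an illustration and not how DataRobot Paxata implements it; the function names, the `min_overlap` threshold, and the sample tables are all made up for the example.

```python
from itertools import product


def column_values(rows, column):
    """Return the set of non-empty values in one column of a list-of-dicts table."""
    return {row[column] for row in rows if row.get(column) not in (None, "")}


def suggest_join_keys(left, right, min_overlap=0.5):
    """Rank column pairs by the share of the smaller column's values found in the other.

    Hypothetical heuristic for illustration only.
    """
    left_cols = list(left[0].keys()) if left else []
    right_cols = list(right[0].keys()) if right else []
    suggestions = []
    for lcol, rcol in product(left_cols, right_cols):
        lvals = column_values(left, lcol)
        rvals = column_values(right, rcol)
        if not lvals or not rvals:
            continue
        overlap = len(lvals & rvals) / min(len(lvals), len(rvals))
        if overlap >= min_overlap:
            suggestions.append((lcol, rcol, round(overlap, 2)))
    # Best candidates first
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


# Made-up sample data: the key columns have different names in each table
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
    {"customer_id": 3, "name": "Cy"},
]
orders = [
    {"order_id": 101, "cust": 1, "amount": 20.0},
    {"order_id": 102, "cust": 3, "amount": 35.5},
    {"order_id": 103, "cust": 1, "amount": 12.0},
]

print(suggest_join_keys(customers, orders))
# → [('customer_id', 'cust', 1.0)]
```

The point of the sketch is that the suggestion comes from the data itself, so the analyst does not need to know up front that `customer_id` and `cust` refer to the same thing.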
Ok, great! Thank you for your super response!
So how does the integration between data prep and DataRobot modeling actually work?
Hi @knat ,
Why is it important to work on your full dataset at prep time instead of a sample?
Hi @annapeters0n !
DataRobot Paxata has a tool named the 'Predict tool' on the tool panel, alongside other tools such as Joins, Aggregates, and Remove Rows. At any point in a Data Prep Project you can add the Predict tool to your Project steps. You will be asked to provide your DataRobot API token; with just that, the tool fetches all Deployments from DataRobot along with a description of each Deployment. You can choose a Deployment from the list and tell the tool whether you need prediction explanations returned along with the scores. And that's pretty much all you need to do. You will see the prediction scores come back into the Data Prep Project, in the rows of data. At this point you can add further Data Prep steps on the data that now includes the prediction scores and explanations. You can also spin up Filtergrams (interactive histograms) to explore the prediction scores alongside other columns in the data.
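As a rough picture of what exploring the returned scores can look like, here is a small sketch that bins prediction scores into a histogram and then filters the rows by one bin, which is the basic idea behind an interactive histogram. It does not call DataRobot or Paxata at all; the rows, column names, and helper functions are invented for the example.

```python
from collections import Counter

# Made-up rows, as they might look after prediction scores are added
rows = [
    {"customer": "A", "region": "East", "score": 0.91},
    {"customer": "B", "region": "West", "score": 0.12},
    {"customer": "C", "region": "East", "score": 0.78},
    {"customer": "D", "region": "West", "score": 0.35},
    {"customer": "E", "region": "East", "score": 0.88},
]


def score_bin(score, width=0.25):
    """Return the lower edge of the histogram bin that contains the score."""
    # min() keeps a score of exactly 1.0 in the top bin
    return min(int(score / width), int(1 / width) - 1) * width


def histogram(rows):
    """Count rows per score bin."""
    return Counter(score_bin(r["score"]) for r in rows)


def filter_by_bin(rows, lower_edge):
    """Keep only the rows whose score falls in the selected bin."""
    return [r for r in rows if score_bin(r["score"]) == lower_edge]


print(histogram(rows)[0.75])
# → 3

high = filter_by_bin(rows, 0.75)
print(Counter(r["region"] for r in high))
# → Counter({'East': 3})
```

Selecting the top bin and then looking at another column (here `region`) is the kind of follow-up question that scores sitting next to the rest of the data make easy to ask.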
Hi @DaveTheMaster !
Great question. But my question would be 'why not?' If you can explore your entire dataset and derive insights instantly, why would you want to be limited to samples? Working on the full dataset is especially helpful if your data has anomalies or characteristics that might be missed in a sample.
Also, the key difference between workflow-driven and data-driven data preparation is that in the former the requirements or logic guide your work, while in the latter your data prep steps are guided by the actual data. In that case, it helps to be guided by the entire dataset rather than led by a sample.
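To put a number on the point about anomalies, here is a minimal sketch with synthetic data: 100,000 values with three extreme outliers planted in them, checked once on the full dataset and once on a 1% random sample. The data, the threshold, and the function name are all assumptions for illustration.

```python
import random


def find_anomalies(values, limit):
    """Return the values that exceed a simple threshold."""
    return [v for v in values if v > limit]


rng = random.Random(0)

# Synthetic "full dataset": ordinary values between 10 and 100
full = [rng.uniform(10, 100) for _ in range(100_000)]

# Plant three anomalies at arbitrary positions
for idx in (5, 50_000, 99_999):
    full[idx] = 1_000_000.0

# A 1% random sample of the same data
sample = rng.sample(full, 1_000)

print("anomalies in full data:", len(find_anomalies(full, 1_000)))
print("anomalies in sample:   ", len(find_anomalies(sample, 1_000)))
```

The full dataset always surfaces all three anomalies. For a 1% sample, the chance of containing none of the three is roughly 0.99 × 0.99 × 0.99, or about 97%, so in most runs the sample would give no hint that the anomalies exist.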