EDA & Data Cleaning

DREnthusiast
Image Sensor

Assume a dataset has the following issues when it is uploaded to DataRobot (D.R):

  • Disguised Missing Values
  • Excess Zeros
  • Inliers

Does the first pass of EDA (i.e. EDA1) not only detect, but also remedy, the aforementioned issues?

If I click on "Data Quality Assessment", I see that these issues are reported as having "been handled automatically". Does "automatically" mean during EDA1, or does it mean that D.R has recorded the issue in the back end and will take care of it during modeling?

The reason I'm suspicious is that if D.R does remedy these issues during the first pass of EDA, then where are the extra binary columns that flag the rows where inliers exist, etc.? Are they in the back end? Do they exist at all?

While I'm on the topic, does D.R take care of all these issues while modeling (i.e. during EDA2), or does it take care of them during the EDA phase?
 
3 Replies
Linda
Community Team

Hey @DREnthusiast! First, I love your username! Great choice.


Re: your questions as to how/when DataRobot handles issues "automatically" - I'm going to give you a quick answer to this question in the hopes that other community members chime in as needed. 

During the modeling step, the blueprints handle the detected issues. Have a look at this product documentation page: Data Quality Assessment.

 

Please let me know if that helps or if you have more questions!

Linda

DREnthusiast
Image Sensor

Hey Linda,

 

Why, thank you! I found it very appropriate, given the amount of gratitude and excitement I have for D.R.

As for the answer, yes, I suppose you're right. I read the documentation you linked a bit more carefully. For those who are curious, it mentions that it

"...adds a binary column inside of a blueprint to ..."
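
To make the idea in that quote concrete, here is a minimal sketch in plain Python of a preprocessing step that adds a binary flag column and imputes the flagged values. This is only an illustration, not DataRobot's actual implementation: the function name `add_flag_and_impute`, the sentinel value -999, the median fill, and the sample data are all assumptions made up for the example.

```python
from statistics import median


def add_flag_and_impute(values, sentinel):
    """Return (flags, imputed) for one numeric column.

    flags[i] is 1 where values[i] equals the sentinel (a disguised
    missing value such as -999), otherwise 0. imputed replaces each
    sentinel with the median of the remaining values.
    """
    real = [v for v in values if v != sentinel]
    if not real:
        raise ValueError("column contains only sentinel values")
    fill = median(real)
    flags = [1 if v == sentinel else 0 for v in values]
    imputed = [fill if v == sentinel else v for v in values]
    return flags, imputed


# Hypothetical column where -999 stands in for "unknown".
income = [52000, -999, 61000, 48000, -999]
income_flag, income_clean = add_flag_and_impute(income, sentinel=-999)
```

Because a step like this lives inside the modeling pipeline rather than in the uploaded dataset, the flag column would not show up as a new feature in the project data, which would be consistent with not seeing any extra columns after EDA1.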

 

Thank you for the clarification Linda!

jas0n
Data Scientist

Excellent answer from Linda. The only thing I would add is that, as a rule, DR only detects problems in the Data Quality Assessment. It's true that most blueprints have some remedies (especially for non-disguised or poorly disguised missing values); however, if you want to make sure that all blueprints remain viable, I'd recommend cleaning any problematic data first. You could use our Data Prep tool, a Zepl or Jupyter notebook, or any sort of pre-processing code you want to run before moving the data to DR.
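
As a rough sketch of what such pre-processing code could look like in a notebook, the snippet below uses only the Python standard library. The sentinel list, the column names, and the helper names `clean_rows` and `zero_share` are hypothetical; you would need to adapt them to your own data, and in practice a library such as pandas would likely be more convenient. It assumes blank cells are read as ordinary missing values after upload.

```python
import csv
import io

# Hypothetical markers that disguise a missing value in this dataset.
SENTINELS = {"-999", "N/A", "unknown", ""}


def clean_rows(rows, sentinels=SENTINELS):
    """Replace disguised missing markers with an empty cell in every row."""
    return [
        {key: ("" if value.strip() in sentinels else value)
         for key, value in row.items()}
        for row in rows
    ]


def zero_share(rows, column):
    """Fraction of non-missing cells in `column` that are numerically zero."""
    present = [row[column] for row in rows if row[column] != ""]
    if not present:
        return 0.0
    return sum(1 for value in present if float(value) == 0.0) / len(present)


# Small in-memory stand-in for a raw CSV file.
raw = "age,spend\n34,0\n-999,12.5\n51,0\n29,N/A\n"
rows = list(csv.DictReader(io.StringIO(raw)))

cleaned = clean_rows(rows)
spend_zero_share = zero_share(cleaned, "spend")  # helps spot excess zeros

# Write the cleaned rows back out as CSV text, ready to upload.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["age", "spend"])
writer.writeheader()
writer.writerows(cleaned)
cleaned_csv = out.getvalue()
```

Turning sentinels into blank cells means they are no longer disguised, and the zero share gives you a quick number to look at before deciding whether a column needs separate handling.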

Also, if you're on our Cloud, then once Composable ML becomes GA (or if your company is enrolled in the Beta), you could use it to edit any blueprints (BPs) that don't clean your data the way you need, and add those feature engineering steps to the BP yourself.