Solved: Re: Working with Data larger than 5gb - DataRobot Community

IraWatt · ‎02-15-2022

I'm working on a classification problem and my dataset is larger than 5 GB, which is the limit for my local import to DataRobot. What is the best approach to this problem?

For instance, can I train a model on a 5 GB import and then train it further on a separate import?

Any thoughts are appreciated even if the answer is that this is a hard limit.

Thanks,

Ira

dalilaB · ‎02-15-2022

In my experience with datasets larger than 5 GB and with over 10,000 features, you can simply downsample and still get results as good as, or better than, using the full dataset. This comes down to basic statistics. The types of features (numeric, categorical, text, etc.), their interactions, and of course the cleanliness of the data drive how much data you need. Independent features require less data. In fact, if you have 10 numerical features, a dataset of 1,000 rows will be more than enough.
If you are afraid of losing something, one approach is to divide your dataset into 5 randomly chosen datasets, create 4 projects from the first four, and score the 5th dataset with each. Then combine the results from the best models. However, I suspect the performance among them will be similar.
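
A minimal sketch of the two ideas above (plain random downsampling, and a random split into 5 parts), using only row indices so it works with any file format. The function names `split_rows` and `downsample` and the fixed seed are illustrative assumptions, not DataRobot functionality.

```python
import random


def split_rows(n_rows, n_parts=5, seed=0):
    """Randomly partition row indices 0..n_rows-1 into n_parts near-equal parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives parts whose sizes differ by at most 1.
    return [indices[i::n_parts] for i in range(n_parts)]


def downsample(n_rows, fraction, seed=0):
    """Return a sorted random sample of row indices covering `fraction` of the rows."""
    k = int(n_rows * fraction)
    return sorted(random.Random(seed).sample(range(n_rows), k))


# Example: four parts become four projects, the fifth is held back for scoring.
parts = split_rows(1000)
project_parts, scoring_part = parts[:4], parts[4]
```

The index lists can then be used to write out the smaller files for upload. For a heavily imbalanced classification target you may want to sample within each class instead, which this sketch does not do.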

IraWatt · ‎02-16-2022

@dalilaB thanks, I can definitely try downsampling. The reason for my large dataset is that it contains a lot of image data. Are there any preprocessing steps you would recommend for image data to reduce its size before import? By combining the results from the best models, do you mean I can make an ensemble model from the 4 projects?

Inactive · ‎02-16-2022

Here are two image pre-processing steps that will reduce your data size without any impact on DataRobot's Visual AI accuracy:
1) resize images so that they are no larger than 224x224 pixels
2) save image files as either png or jpg
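
The two steps above can be sketched as follows. This assumes the third-party Pillow library is installed; `fitted_size` and `shrink_image` are hypothetical helper names, and keeping the aspect ratio while fitting inside 224x224 is my own choice rather than something the thread specifies.

```python
MAX_SIDE = 224  # target from the post above: no larger than 224x224


def fitted_size(width, height, max_side=MAX_SIDE):
    """Scale (width, height) down to fit inside max_side x max_side.

    Keeps the aspect ratio and never upscales a smaller image.
    """
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def shrink_image(src_path, dst_path, max_side=MAX_SIDE):
    """Resize one image file and save it as JPEG (requires Pillow)."""
    # Imported here so fitted_size stays usable without Pillow installed.
    from PIL import Image

    with Image.open(src_path) as img:
        rgb = img.convert("RGB")  # JPEG cannot store an alpha channel
        small = rgb.resize(fitted_size(*rgb.size, max_side))
        small.save(dst_path, format="JPEG")
```

For example, `shrink_image("scan_001.tiff", "scan_001.jpg")` would write a JPEG whose longer side is at most 224 pixels (the file names are placeholders).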

IraWatt · ‎02-16-2022

Thanks @Inactive, I'll definitely try resizing. I could lose information on what I want to identify by just scaling the image down. Would be great to be able to use the activation maps from DR to crop out the most likely useless pixels.

Inactive · ‎02-16-2022

Hi Ira,

When using DataRobot's Visual AI, you don't lose any information by resizing images down to 224x224 versus using the full size for data ingestion, because DataRobot automatically resizes all images to 224x224 as soon as it processes the image file when training or scoring.

Colin

Inactive · ‎02-16-2022

But if you can find heuristics to crop the image before ingestion, that could help. For example, I once had to ingest MRI images as part of a data science competition, and I was able to crop them automatically by cutting away the sides that were pure black, i.e. the regions outside the body part being scanned (human bodies appear grey in MRI images).

Colin
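
A rough sketch of that kind of border-cropping heuristic, written against a plain 2D list of grayscale values so it has no dependencies. The names `content_bbox` and `crop_to_content` and the `threshold` parameter are illustrative assumptions; the original competition code is not shown in this thread.

```python
def content_bbox(pixels, threshold=0):
    """Find the bounding box of all pixels brighter than `threshold`.

    `pixels` is a list of equal-length rows of grayscale ints.
    Returns (top, bottom, left, right) with bottom/right exclusive,
    or None if every pixel is at or below the threshold.
    """
    rows = [r for r, row in enumerate(pixels) if any(v > threshold for v in row)]
    if not rows:
        return None
    width = len(pixels[0])
    cols = [c for c in range(width) if any(row[c] > threshold for row in pixels)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_to_content(pixels, threshold=0):
    """Cut away the dark borders; an all-dark image is returned unchanged."""
    bbox = content_bbox(pixels, threshold)
    if bbox is None:
        return pixels
    top, bottom, left, right = bbox
    return [row[left:right] for row in pixels[top:bottom]]
```

A threshold slightly above 0 can help when the "black" background contains a little noise. With Pillow, `Image.getbbox()` plus `Image.crop()` gives a similar result for exactly-zero borders.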

IraWatt · ‎02-16-2022

Ahhh I see, that's great information @Inactive, thanks. I'll try to find a heuristic, or else accept the resizing.

dalilaB · ‎02-17-2022

If color is not important for identification, you can change the images to grayscale.
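
As an illustration of what the grayscale conversion does, here is a small sketch using the common ITU-R 601 luma weights on a 2D list of RGB tuples. `to_grayscale` is a hypothetical helper; in practice an image library call such as Pillow's `Image.convert("L")` does the same job on real files.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of single 0-255 luma values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]
```

Each pixel goes from three channel values to one, which is where the size reduction comes from.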
