I'm working on a classification problem and my dataset is larger than 5GB, which is the limit for a local import into DataRobot. What is the best approach to this problem?
Can I train a model on a 5GB import and then train it further on a separate import, for instance?
Any thoughts are appreciated even if the answer is that this is a hard limit.
If color is not important for identification, you can change the images to grayscale.
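As a minimal sketch of that idea, assuming Pillow is available (the image format, sizes, and function name here are illustrative, not part of any DataRobot API):

```python
from PIL import Image

def to_grayscale(img: Image.Image) -> Image.Image:
    """Convert an RGB image to 8-bit grayscale: one channel instead of three."""
    return img.convert("L")

# Demo on a synthetic image; in practice you would Image.open(...) your files
# and save the grayscale copies before building the ingestion dataset.
rgb = Image.new("RGB", (640, 480), color=(200, 120, 40))
gray = to_grayscale(rgb)
print(gray.mode, gray.size)  # L (640, 480)
```

Dropping from three channels to one roughly cuts the uncompressed pixel data to a third, at the cost of any color signal.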
Ahhh I see, that's great information @Anonymous, thanks. I'll try to find a heuristic, or else accept the resizing.
But if you can find heuristics to crop the image before ingestion, that could help. For example, I once had to ingest MRI images as part of a data science competition, and I was able to automatically crop the image by cutting out where the sides were pure black i.e. outside the range of the human body parts (human bodies appear grey in MRI images) being scanned.
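A minimal sketch of that kind of crop, assuming NumPy and grayscale arrays where the background is (near-)black; the function name and threshold are illustrative:

```python
import numpy as np

def crop_black_borders(img: np.ndarray, threshold: int = 10) -> np.ndarray:
    """Crop away border rows/columns whose pixels are all at or below
    `threshold` (i.e. near-black background outside the scanned subject)."""
    mask = img > threshold                       # True where a pixel carries signal
    rows = np.flatnonzero(mask.any(axis=1))      # row indices with any signal
    cols = np.flatnonzero(mask.any(axis=0))      # column indices with any signal
    if rows.size == 0 or cols.size == 0:
        return img                               # all background; leave untouched
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Demo: a 6x6 "scan" that is black except for a bright 2x2 region.
scan = np.zeros((6, 6), dtype=np.uint8)
scan[2:4, 2:4] = 180
print(crop_black_borders(scan).shape)  # (2, 2)
```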
When using DataRobot's Visual AI, you don't lose any information by resizing images down to 224x224 versus using full size for data ingestion, because DataRobot automatically resizes all images to 224x224 as soon as it processes the image file when training or scoring.
Thanks @Anonymous, I'll definitely try resizing, though I could lose information about what I want to identify by just scaling the image down. It would be great to be able to use the activation maps from DR to crop out the most likely useless pixels.
Here are two image pre-processing steps that will reduce your data size without any impact on DataRobot's Visual AI accuracy:
1) resize images so that they are no larger than 224x224 pixels
2) save image files as either png or jpg
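A minimal sketch of step 1, assuming Pillow; `thumbnail` preserves aspect ratio and never upscales, so no side ends up larger than 224 pixels:

```python
from io import BytesIO
from PIL import Image

def shrink_for_ingest(img: Image.Image, max_side: int = 224) -> Image.Image:
    """Downscale so that neither side exceeds max_side, keeping aspect ratio."""
    out = img.copy()
    out.thumbnail((max_side, max_side))  # in-place resize; never upscales
    return out

# Demo on a synthetic image; in practice open your files, shrink them,
# and save as png or jpg (step 2) before assembling the dataset.
big = Image.new("RGB", (1024, 768))
small = shrink_for_ingest(big)
buf = BytesIO()
small.save(buf, format="JPEG")
print(small.size)  # (224, 168)
```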
@dalilaB thanks, I can definitely try downsampling. The reason for my large dataset is that it contains a lot of image data; are there any pre-processing steps you would recommend for image data to reduce its size before import? By "combine them from the best models", do you mean I can make an ensemble model from the 4 projects?
In my experience with datasets larger than 5GB and over 10,000 features, you can simply downsample and still get results as good as or better than using the full dataset. This is standard statistical wisdom: the types of features (numeric, categorical, text, etc.), their interactions, and of course the cleanliness of the data drive how much data you need. Independent features require less data; in fact, if you have 10 numeric features, a dataset of 1,000 rows will be more than enough.
If you are afraid of losing something, one approach is to divide your dataset randomly into 5 datasets, create 4 projects, and then score the 5th dataset, combining the predictions from the best model of each project. However, I suspect the performance among them will be similar.
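The random split described above can be sketched like this, assuming the rows fit in a list (the fold count, seed, and function name are illustrative):

```python
import random

def split_into_folds(rows: list, k: int = 5, seed: int = 42) -> list:
    """Randomly shuffle rows and partition them into k roughly equal folds."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

folds = split_into_folds(list(range(1000)))
train_folds, holdout = folds[:4], folds[4]  # 4 project datasets + 1 scoring set
print([len(f) for f in folds])  # [200, 200, 200, 200, 200]
```

Each of the first four folds would seed its own project, and the fifth would be held out for scoring and comparing the resulting models.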