A common question clients ask me is what to do about imbalanced datasets. Organizations newer to machine learning often ask which accuracy metric to use; organizations with established data science teams often want my opinion on under-sampling, over-sampling, or synthetic data (SMOTE). I've struggled to convey all of my thoughts on these topics concisely, which is why I was thrilled to read the blog below. In a detailed yet visual way, it takes the reader from the original problem (imbalanced datasets) through accuracy metrics, the realities of modeling imbalanced data with probabilities, and potential fixes & their pitfalls, all the way to a discussion of how cost (profit) curves can help. A few highlights:
- what F1 score & ROC/AUC mean, explained in an intuitive but thorough manner
- the important point that sometimes, if you can't reliably distinguish between imbalanced classes, the best thing to do is predict the majority class (beautifully explained visually with some 1-D Gaussians)
- some of the issues with under-sampling, over-sampling, and synthetic sampling (e.g. SMOTE)
- the power of using a cost-based approach to distinguish between classes
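To make the first highlight concrete, here is a minimal sketch (mine, not from the linked post) of why accuracy misleads on imbalanced data while F1 does not: a classifier that always predicts the majority class on a 95/5 split scores 95% accuracy but an F1 of zero. All numbers here are made up for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 95% majority class; predicting all-majority looks great on accuracy
# but scores zero on F1 (it never finds a single minority case).
y_true = [0] * 95 + [1] * 5
all_majority = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, all_majority)) / len(y_true)
print(accuracy)                          # 0.95
print(f1_score(y_true, all_majority))    # 0.0
```

In practice you'd use a library implementation (e.g. scikit-learn's `f1_score`), but the from-scratch version shows exactly which errors each metric punishes.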
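The second highlight — that the Bayes-optimal move can be to always predict the majority class — can be sketched with two overlapping 1-D Gaussians. The parameters below (means 0.0 and 0.5, shared std 1.0, a 95/5 prior) are my own illustrative assumptions, not taken from the linked post: with this much overlap, the prior-weighted minority density never wins anywhere the data actually lives.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bayes_predict(x, prior_minority=0.05, mean_majority=0.0,
                  mean_minority=0.5, std=1.0):
    """Return 1 (minority) iff the prior-weighted minority density is larger."""
    score_majority = (1 - prior_minority) * gaussian_pdf(x, mean_majority, std)
    score_minority = prior_minority * gaussian_pdf(x, mean_minority, std)
    return int(score_minority > score_majority)

# Sweep x over [-4, 4], which covers essentially all of both densities:
# the optimal prediction is the majority class at every point.
preds = [bayes_predict(x / 10) for x in range(-40, 41)]
print(set(preds))  # {0}
```

Only far out in the minority tail (e.g. `bayes_predict(10.0)` returns 1) does the minority class ever win — which is exactly the "just predict the majority" regime the post illustrates.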
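For the third highlight, here is a stripped-down 1-D sketch of SMOTE's core mechanism (my simplification, not the full algorithm): synthetic minority points are interpolated between a minority sample and one of its nearest minority neighbors. That makes the pitfall concrete — every synthetic point lies on a segment between existing minority samples, so if majority data occupies the gap between minority clusters, synthetic samples land in majority territory.

```python
import random

def smote_sample(minority_points, k=2, rng=None):
    """Generate one synthetic 1-D point between a random minority point
    and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    base = rng.choice(minority_points)
    neighbors = sorted((p for p in minority_points if p != base),
                       key=lambda p: abs(p - base))[:k]
    neighbor = rng.choice(neighbors)
    # New point at a random fraction along the segment base -> neighbor.
    return base + rng.random() * (neighbor - base)

# Two minority points near 1-2 and one at 10: interpolation can drop
# synthetic points anywhere in the (2, 10) gap, which in a real dataset
# the majority class might occupy.
minority = [1.0, 2.0, 10.0]
rng = random.Random(42)
synthetic = [smote_sample(minority, rng=rng) for _ in range(100)]
print(min(synthetic) >= 1.0 and max(synthetic) <= 10.0)  # True
```

Real implementations (e.g. imbalanced-learn's `SMOTE`) work in higher dimensions with proper k-NN, but the interpolation step — and its failure mode — is the same.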
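And for the last highlight, a minimal sketch of the cost-based idea (assumed costs and toy probabilities of my own, not figures from the linked post): instead of the default 0.5 cutoff on predicted probabilities, pick the threshold that minimizes total expected cost when a missed minority case (false negative) costs far more than a false alarm (false positive).

```python
COST_FP = 1.0    # cost of flagging a majority case by mistake
COST_FN = 20.0   # cost of missing a minority case

def total_cost(y_true, probs, threshold):
    """Total cost of thresholding predicted probabilities at `threshold`."""
    cost = 0.0
    for t, p in zip(y_true, probs):
        pred = int(p >= threshold)
        if pred == 1 and t == 0:
            cost += COST_FP
        elif pred == 0 and t == 1:
            cost += COST_FN
    return cost

def best_threshold(y_true, probs):
    """Pick the candidate threshold with the lowest total cost."""
    candidates = sorted(set(probs)) + [1.01]  # 1.01 = "never predict minority"
    return min(candidates, key=lambda th: total_cost(y_true, probs, th))

# Toy scores: 8 majority cases, 2 minority cases (one scored only 0.35).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs  = [0.05, 0.1, 0.1, 0.2, 0.3, 0.3, 0.4, 0.6, 0.35, 0.7]
print(best_threshold(y_true, probs))  # 0.35 -- well below the default 0.5
```

With these costs, the optimal cutoff drops to 0.35 so both minority cases are caught at the price of two false alarms; the default 0.5 cutoff would miss the 0.35-scored minority case and cost roughly ten times more.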