A common question clients ask me is what to do about imbalanced datasets. Organizations newer to machine learning often ask which accuracy metric to use; organizations with established data science teams often want my opinion on under-sampling, over-sampling, or synthetic data (SMOTE). I've struggled to convey all of my thoughts on these topics concisely, which is why I was thrilled to read the blog below. In a detailed yet visual way, it takes the reader from the original problem (imbalanced datasets) through accuracy metrics, the realities of modeling imbalanced data with probabilities, and potential fixes & their pitfalls, all the way to a discussion of how cost (profit) curves can help. A few highlights:
- what F1 score & ROC/AUC mean, explained in an intuitive but thorough manner
- the important point that sometimes, if you can't reliably distinguish between imbalanced classes, the best thing to do is predict the majority class (beautifully explained visually with some 1-D Gaussians)
- some of the issues with under-sampling, over-sampling, and synthetic sampling (e.g. SMOTE)
- the power of using a cost-based approach to distinguish between classes
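To make the first highlight concrete, here is a minimal sketch (mine, not from the linked post) of why accuracy misleads on imbalanced data while F1 does not: a classifier that always predicts the majority class on a 95/5 split scores 95% accuracy but an F1 of zero. All numbers here are made up for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 95% majority class; predicting all-majority looks great on accuracy
# but scores zero on F1 (it never finds a single minority case).
y_true = [0] * 95 + [1] * 5
all_majority = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, all_majority)) / len(y_true)
print(accuracy)                          # 0.95
print(f1_score(y_true, all_majority))    # 0.0
```

In practice you'd use a library implementation (e.g. scikit-learn's `f1_score`), but the from-scratch version shows exactly which errors each metric punishes.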
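The second highlight — that the Bayes-optimal move can be to always predict the majority class — can be sketched with two overlapping 1-D Gaussians. The parameters below (means 0.0 and 0.5, shared std 1.0, a 95/5 prior) are my own illustrative assumptions, not taken from the linked post: with this much overlap, the prior-weighted minority density never wins anywhere the data actually lives.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bayes_predict(x, prior_minority=0.05, mean_majority=0.0,
                  mean_minority=0.5, std=1.0):
    """Return 1 (minority) iff the prior-weighted minority density is larger."""
    score_majority = (1 - prior_minority) * gaussian_pdf(x, mean_majority, std)
    score_minority = prior_minority * gaussian_pdf(x, mean_minority, std)
    return int(score_minority > score_majority)

# Sweep x over [-4, 4], which covers essentially all of both densities:
# the optimal prediction is the majority class at every point.
preds = [bayes_predict(x / 10) for x in range(-40, 41)]
print(set(preds))  # {0}
```

Only far out in the minority tail (e.g. `bayes_predict(10.0)` returns 1) does the minority class ever win — which is exactly the "just predict the majority" regime the post illustrates.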
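For the third highlight, here is a stripped-down 1-D sketch of SMOTE's core mechanism (my simplification, not the full algorithm): synthetic minority points are interpolated between a minority sample and one of its nearest minority neighbors. That makes the pitfall concrete — every synthetic point lies on a segment between existing minority samples, so if majority data occupies the gap between minority clusters, synthetic samples land in majority territory.

```python
import random

def smote_sample(minority_points, k=2, rng=None):
    """Generate one synthetic 1-D point between a random minority point
    and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    base = rng.choice(minority_points)
    neighbors = sorted((p for p in minority_points if p != base),
                       key=lambda p: abs(p - base))[:k]
    neighbor = rng.choice(neighbors)
    # New point at a random fraction along the segment base -> neighbor.
    return base + rng.random() * (neighbor - base)

# Two minority points near 1-2 and one at 10: interpolation can drop
# synthetic points anywhere in the (2, 10) gap, which in a real dataset
# the majority class might occupy.
minority = [1.0, 2.0, 10.0]
rng = random.Random(42)
synthetic = [smote_sample(minority, rng=rng) for _ in range(100)]
print(min(synthetic) >= 1.0 and max(synthetic) <= 10.0)  # True
```

Real implementations (e.g. imbalanced-learn's `SMOTE`) work in higher dimensions with proper k-NN, but the interpolation step — and its failure mode — is the same.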
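And for the last highlight, a minimal sketch of the cost-based idea (assumed costs and toy probabilities of my own, not figures from the linked post): instead of the default 0.5 cutoff on predicted probabilities, pick the threshold that minimizes total expected cost when a missed minority case (false negative) costs far more than a false alarm (false positive).

```python
COST_FP = 1.0    # cost of flagging a majority case by mistake
COST_FN = 20.0   # cost of missing a minority case

def total_cost(y_true, probs, threshold):
    """Total cost of thresholding predicted probabilities at `threshold`."""
    cost = 0.0
    for t, p in zip(y_true, probs):
        pred = int(p >= threshold)
        if pred == 1 and t == 0:
            cost += COST_FP
        elif pred == 0 and t == 1:
            cost += COST_FN
    return cost

def best_threshold(y_true, probs):
    """Pick the candidate threshold with the lowest total cost."""
    candidates = sorted(set(probs)) + [1.01]  # 1.01 = "never predict minority"
    return min(candidates, key=lambda th: total_cost(y_true, probs, th))

# Toy scores: 8 majority cases, 2 minority cases (one scored only 0.35).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs  = [0.05, 0.1, 0.1, 0.2, 0.3, 0.3, 0.4, 0.6, 0.35, 0.7]
print(best_threshold(y_true, probs))  # 0.35 -- well below the default 0.5
```

With these costs, the optimal cutoff drops to 0.35 so both minority cases are caught at the price of two false alarms; the default 0.5 cutoff would miss the 0.35-scored minority case and cost roughly ten times more.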