⚖️ Imbalanced Datasets in Machine Learning – A Complete Guide
In real-world machine learning, data is rarely perfect. One of the most common and tricky problems is dealing with imbalanced datasets.
📌 Table of Contents
- What is Imbalanced Data?
- Why It’s a Problem
- Evaluation Metrics (with Math)
- Handling Techniques
- Code Example
- CLI Output
- Fraud Detection Case
- Key Takeaways
📊 What is an Imbalanced Dataset?
An imbalanced dataset occurs when class distribution is uneven.
| Class | Percentage |
|---|---|
| Non-Fraud | 95% |
| Fraud | 5% |
This makes learning difficult because the model sees very few examples of the important class.
🚨 Why It’s a Problem
A model can cheat:
\[ Accuracy = \frac{Correct\ Predictions}{Total\ Predictions} \]
If it predicts everything as the majority class:
\[ Accuracy = 95\% \]
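This cheat is easy to demonstrate with scikit-learn. The labels below are a hypothetical 95/5 split, not real data:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5

# A "model" that predicts the majority class for every sample
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95, yet not a single positive is caught
```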
📐 Evaluation Metrics (Simple Math)
1. Precision
\[ Precision = \frac{TP}{TP + FP} \]
How many predicted positives are correct.
2. Recall
\[ Recall = \frac{TP}{TP + FN} \]
How many real positives are detected.
3. F1 Score
\[ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \]
Balance between precision and recall.
4. ROC-AUC
Measures how well the model ranks positives above negatives across all classification thresholds.
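All of these metrics are available in scikit-learn. With hypothetical predictions containing 3 true positives, 1 false positive, and 1 false negative:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels and predictions: TP=3, FP=1, FN=1, TN=3
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 0.75
```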
🛠️ Techniques to Handle Imbalance
1. Resampling
- Oversampling → duplicate minority-class samples
- Undersampling → remove majority-class samples
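Oversampling by duplication can be sketched with scikit-learn's `resample`. The 5-row `minority` list is a toy stand-in for real minority-class records:

```python
from sklearn.utils import resample

# Toy minority class: 5 rows (hypothetical stand-in for real fraud records)
minority = [[1], [2], [3], [4], [5]]

# Sample with replacement until the minority matches the majority count (95)
oversampled = resample(minority, replace=True, n_samples=95, random_state=42)
print(len(oversampled))  # 95
```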
2. SMOTE
Creates synthetic samples:
\[ New\ Sample = x_i + \lambda(x_{neighbor} - x_i) \]
Where \( \lambda \) is random between 0 and 1.
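The interpolation step can be sketched directly in NumPy. The two points here are hypothetical; a full SMOTE implementation (e.g. in the imbalanced-learn library) also performs a nearest-neighbor search to pick `x_neighbor`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class points in 2-D feature space
x_i = np.array([1.0, 2.0])
x_neighbor = np.array([3.0, 4.0])

# lambda is drawn uniformly from [0, 1]
lam = rng.uniform(0.0, 1.0)

# The synthetic sample lies on the segment between the two originals
new_sample = x_i + lam * (x_neighbor - x_i)
print(new_sample)
```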
3. Class Weights
Modify loss:
\[ Loss = Weight \times Error \]
The minority class gets a higher weight, so its errors are penalized more heavily during training.
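scikit-learn's `compute_class_weight` shows what `'balanced'` weighting does to the 95/5 split from the table above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 95 non-fraud (0) vs 5 fraud (1), matching the table above
y = np.array([0] * 95 + [1] * 5)

# 'balanced' sets weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # fraud class gets weight 10.0, non-fraud about 0.53
```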
4. Better Algorithms
- Random Forest 🌳
- Gradient Boosting 🚀
- Weighted Decision Trees
5. Anomaly Detection
Model the normal class and flag rare events as outliers.
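One common choice is scikit-learn's `IsolationForest`. The data below is synthetic, purely for illustration: 95 "normal" points near the origin plus 5 obvious outliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 95 normal points near the origin, 5 far-away outliers
normal = rng.normal(0.0, 1.0, size=(95, 2))
outliers = rng.normal(8.0, 1.0, size=(5, 2))
X = np.vstack([normal, outliers])

# contamination tells the model what fraction of points to flag
clf = IsolationForest(contamination=0.05, random_state=0).fit(X)
pred = clf.predict(X)  # -1 = anomaly, 1 = normal
print((pred == -1).sum())
```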
💻 Code Example
```python
from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely to its frequency in y_train
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
```
🖥️ CLI Output
```
Precision: 0.78
Recall:    0.82
F1 Score:  0.80
ROC-AUC:   0.91
```
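Output in this format can be generated end to end on a synthetic imbalanced dataset. Everything below, including the dataset and the resulting numbers, is illustrative rather than the article's actual experiment:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalanced dataset
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

print(f"Precision: {precision_score(y_test, pred):.2f}")
print(f"Recall:    {recall_score(y_test, pred):.2f}")
print(f"F1 Score:  {f1_score(y_test, pred):.2f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, proba):.2f}")
```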
💳 Real Example – Fraud Detection
Without handling imbalance:
- Accuracy: 95%
- Fraud detected: 0%
After applying SMOTE + weighting:
- Accuracy: 92%
- Fraud detected: 85%
💡 Key Takeaways
- Accuracy is misleading in imbalanced data
- Use precision, recall, F1
- SMOTE improves minority learning
- Class weighting is powerful
- Always evaluate real-world impact
🎯 Final Thoughts
Handling imbalanced datasets isn’t optional—it’s essential.
Because in most real-world problems, the rare cases are the ones that matter the most.