Scaling Before vs After Train-Test Split in Machine Learning
Introduction
In machine learning, preprocessing plays a critical role in model performance. One of the most debated topics is whether to scale data before or after splitting into training and test sets.
Understanding Scaling
Scaling transforms features so they have similar ranges. Common methods include:
- Standardization (mean = 0, std = 1)
- Normalization (scaling between 0 and 1)
The most common standardization formula is:
$$ Z = \frac{X - \mu}{\sigma} $$
where $\mu$ is the mean and $\sigma$ is the standard deviation.
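As a quick illustration, here is a minimal sketch of this transformation using scikit-learn's StandardScaler on a single toy feature (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature column with mean 30 and population std ~14.14
X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.mean_)      # [30.]
print(scaler.scale_)     # [~14.142], the population standard deviation
print(X_scaled.ravel())  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```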
➗ Mathematical Foundation Behind Scaling
Scaling is based on a simple statistical transformation called standardization. It converts raw values into comparable units using mean and standard deviation.
Standardization Formula
$$ Z = \frac{X - \mu}{\sigma} $$
Where:
- $X$ = original value
- $\mu$ = mean of the dataset
- $\sigma$ = standard deviation of the dataset
- $Z$ = standardized value
Why This Works
This formula transforms data so that it has:
- Mean = 0
- Standard deviation = 1
This ensures that features with larger numeric ranges do not dominate machine learning models, as the sketch below shows.
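A small sketch of this effect: with two features on very different scales, a Euclidean distance is dominated by the large-range feature until both are standardized (all numbers below are made up for illustration):

```python
import numpy as np

# Two samples with features on very different scales:
# feature 0 is income-like (thousands), feature 1 is age-like (years)
a = np.array([50_000.0, 25.0])
b = np.array([51_000.0, 60.0])

# Without scaling, the distance is driven almost entirely by feature 0
print(np.linalg.norm(a - b))  # ~1000.6

# Standardize each feature with illustrative means and stds
mu = np.array([50_500.0, 42.5])
sigma = np.array([500.0, 17.5])
a_z = (a - mu) / sigma  # [-1.0, -1.0]
b_z = (b - mu) / sigma  # [ 1.0,  1.0]

# Now both features contribute equally to the distance
print(np.linalg.norm(a_z - b_z))  # ~2.83
```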
Mean Formula
$$ \mu = \frac{1}{n} \sum_{i=1}^{n} X_i $$
Standard Deviation Formula
$$ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2} $$
Intuition
Standard deviation measures how spread out the data is. A higher value means more variability in features.
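To tie the three formulas together, here is a minimal NumPy sketch that computes $\mu$, $\sigma$, and $Z$ by hand (note the population form with $1/n$, which is also what scikit-learn's StandardScaler uses; the input values are illustrative):

```python
import numpy as np

X = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# mu = (1/n) * sum(X_i)
mu = X.mean()                            # 5.0

# sigma = sqrt((1/n) * sum((X_i - mu)^2)), the population form above
sigma = np.sqrt(np.mean((X - mu) ** 2))  # 2.0

# Z = (X - mu) / sigma
Z = (X - mu) / sigma
print(Z.mean(), Z.std())                 # ~0.0 and 1.0, as claimed
```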
⚠️ Scaling Before Train-Test Split
This method computes scaling parameters from the entire dataset before splitting.
Advantages
- Consistent scaling statistics across all data
- Useful for very small datasets, where per-split statistics are noisy
- Reduces variance mismatch between the splits
Disadvantages
- Risk of data leakage
- Not realistic for production systems
- Test data influences training preprocessing
✅ Scaling After Train-Test Split
Here, scaling parameters are computed only from the training set and applied to both train and test data.
Advantages
- No data leakage
- Mirrors real-world inference, where new data arrives unseen
- Safer for supervised learning
Disadvantages
- Possible distribution mismatch: test values may fall outside the range learned from the training data (see the sketch after this list)
- More preprocessing steps to manage
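To make the distribution-mismatch point concrete, here is a small sketch (with made-up numbers) using MinMaxScaler: a test value outside the training range is mapped outside the nominal [0, 1] interval, which downstream code must tolerate:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[35.0]])  # larger than anything seen in training

scaler = MinMaxScaler()
scaler.fit(X_train)          # learns min=10, max=30 from training data only

print(scaler.transform(X_test))  # [[1.25]] -- outside the nominal [0, 1] range
```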
Comparison Table
| Factor | Before Split | After Split |
|---|---|---|
| Data Leakage | Risk exists | None |
| Real-world alignment | Poor | Excellent |
| Small dataset handling | Good | Statistics may be noisy |
| Production safety | Low | High |
When to Use Each Approach
Use Scaling Before Split When:
- Dataset is extremely small
- Doing unsupervised learning
- Quick exploratory analysis
Use Scaling After Split When:
- Building production ML models
- Working with supervised learning
- Avoiding data leakage is critical
Code Example (Python, scikit-learn)
❌ Incorrect Approach (Before Split)
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
# Leakage risk: mean and std are computed on the full dataset,
# so test rows influence the training preprocessing
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
```
✅ Correct Approach (After Split)
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # reuse training statistics on the test set
```
Best Practices
- Always split before scaling for supervised learning
- Use sklearn pipelines to avoid mistakes (see the sketch after this list)
- Validate preprocessing on unseen data
- Keep feature transformations consistent between training and inference
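As one way to follow the pipeline guideline, here is a minimal sketch: wrapping the scaler and model in a Pipeline means cross-validation refits the scaler on each training fold only, so leakage is ruled out by construction (the iris dataset and LogisticRegression are just stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is refit inside every training fold, never on held-out data
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```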
Key Takeaways
- Scaling affects model performance significantly
- Before-split scaling can cause data leakage
- After-split scaling is safer and production-ready
- Use context to decide, not habit
Final Insight
The safest default in machine learning is to split first, then scale. While exceptions exist, disciplined preprocessing ensures your model behaves reliably in real-world environments.