
Thursday, September 12, 2024

When to Apply StandardScaler: Before or After Splitting Data?


📖 Introduction

In machine learning, preprocessing plays a critical role in model performance. One of the most debated topics is whether to scale data before or after splitting into training and test sets.

💡 Key Idea: Whether you compute scaling statistics before or after the split can directly affect model accuracy and generalization.

🔍 Understanding Scaling

Scaling transforms features so they have similar ranges. Common methods include:

  • Standardization (rescale to mean = 0, std = 1)
  • Normalization (min-max scaling to the range [0, 1])
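As a quick sketch of the difference between the two methods, here is a minimal example (with made-up values) using scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Illustrative feature column with one large value (made-up data)
X = np.array([[1.0], [2.0], [3.0], [100.0]])

# Standardization: result has mean 0 and std 1
standardized = StandardScaler().fit_transform(X)

# Normalization: result is squeezed into [0, 1]
normalized = MinMaxScaler().fit_transform(X)

print(standardized.ravel())
print(normalized.ravel())
```

Both are linear rescalings; which one to use depends on the model and the data, but the before-vs-after-split question below applies to either.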

The most commonly used formula for standardization is:

Z = (X - μ) / σ

where μ is the mean and σ is the standard deviation.

➗ Mathematical Foundation Behind Scaling

Standardization is a simple statistical transformation that converts raw values into comparable units using the mean and standard deviation.

📌 Standardization Formula

$$ Z = \frac{X - \mu}{\sigma} $$

Where:

  • $X$ = original value
  • $\mu$ = mean of dataset
  • $\sigma$ = standard deviation
  • $Z$ = standardized value

🔽 Why this works

This formula transforms data so that it has:

  • Mean = 0
  • Standard deviation = 1

This ensures that features with larger numeric ranges do not dominate machine learning models.

📌 Mean Formula

$$ \mu = \frac{1}{n} \sum_{i=1}^{n} X_i $$

📌 Standard Deviation Formula

$$ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2} $$

🔽 Intuition

Standard deviation measures how spread out the data is; a higher value means more variability in the feature. (The 1/n form above is the population standard deviation, which is also what scikit-learn's StandardScaler uses.)

💡 Key Insight: Scaling does not change the shape of the data's distribution; it only rescales values for fair comparison.
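To make the formulas above concrete, here is a tiny worked example (with made-up numbers) that applies them by hand:

```python
import numpy as np

X = np.array([4.0, 8.0, 6.0, 2.0])  # made-up sample values

mu = X.sum() / len(X)                    # mean formula: (4+8+6+2)/4 = 5.0
sigma = np.sqrt(((X - mu) ** 2).mean())  # population std (1/n form above)
Z = (X - mu) / sigma                     # standardization formula

print(mu, sigma)                         # 5.0 and sqrt(5) ≈ 2.236
print(Z.mean().round(10), Z.std().round(10))  # 0.0 1.0
```

After the transformation the values are in "standard deviations from the mean", so features measured in dollars, years, or kilograms all land on the same scale.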

⚠️ Scaling Before Train-Test Split

This method computes scaling parameters from the entire dataset before splitting.

🔽 Advantages
  • Consistent scaling across all data
  • Useful for very small datasets
  • Reduces variance mismatch

🔽 Disadvantages
  • Risk of data leakage
  • Not realistic for production systems
  • Test data influences training preprocessing

✅ Scaling After Train-Test Split

Here, scaling parameters are computed only from the training set and applied to both train and test data.

🔽 Advantages
  • No data leakage
  • Real-world simulation
  • Safer for supervised learning

🔽 Disadvantages
  • Possible distribution mismatch
  • More preprocessing steps

📊 Comparison Table

| Factor                 | Before Split | After Split |
| ---------------------- | ------------ | ----------- |
| Data leakage           | Risk exists  | None        |
| Real-world alignment   | Poor         | Excellent   |
| Small dataset handling | Good         | May vary    |
| Production safety      | Low          | High        |
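The leakage row can be demonstrated directly: a scaler fitted on the full dataset learns different statistics than one fitted on the training split alone, because the test rows influenced it. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 1))  # synthetic feature

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: statistics computed on the full dataset (test rows included)
leaky = StandardScaler().fit(X)
# Clean: statistics computed on the training split only
clean = StandardScaler().fit(X_train)

# The fitted means differ because the leaky scaler "saw" the test rows
print(float(leaky.mean_[0]), float(clean.mean_[0]))
```

The difference is often small, but it means the leaky pipeline's evaluation no longer reflects truly unseen data.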

🎯 When to Use Each Approach

Use Scaling Before Split When:

  • Dataset is extremely small
  • Doing unsupervised learning
  • Quick exploratory analysis

Use Scaling After Split When:

  • Building production ML models
  • Working with supervised learning
  • Avoiding data leakage is critical

💻 Code Example (Python, scikit-learn)

❌ Incorrect Approach (Before Split)

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# X, y: feature matrix and labels, assumed already loaded
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # statistics computed on ALL rows: leakage

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

✅ Correct Approach (After Split)

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # reuse the training statistics

🧠 Best Practices

🔽 Guidelines
  • Always split before scaling for supervised learning
  • Use pipelines in sklearn to avoid mistakes
  • Validate preprocessing on unseen data
  • Be consistent in feature transformations
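The pipeline guideline above can be sketched as follows: a scikit-learn `Pipeline` re-fits the scaler on each training fold during cross-validation, so fold-level test data never leaks into preprocessing (the built-in iris dataset is used here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling and modeling bundled together: the scaler is fitted
# inside each CV fold on that fold's training portion only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean().round(3))
```

Because the scaler lives inside the pipeline, you cannot accidentally call `fit_transform` on the full dataset: the split-then-scale order is enforced automatically.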

🎯 Key Takeaways

  • Scaling affects model performance significantly
  • Before-split scaling can cause data leakage
  • After-split scaling is safer and production-ready
  • Use context to decide, not habit

📌 Final Insight

The safest default in machine learning is to split first, then scale. While exceptions exist, disciplined preprocessing ensures your model behaves reliably in real-world environments.
