L1 vs L2 Regularization: A Deep Interactive Guide
Table of Contents
- Introduction
- Understanding Overfitting
- What is Regularization?
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
- Mathematical Explanation
- Key Differences
- Code Examples
- CLI Output
- Key Takeaways
- Related Articles
Introduction
In machine learning, building a model that performs well on unseen data is the ultimate goal. However, models often become too complex and start memorizing training data instead of learning patterns.
⚠️ Understanding Overfitting
Overfitting occurs when a model captures noise along with the underlying pattern.
- High accuracy on training data
- Poor performance on test data
Why does overfitting happen?
When models have too many parameters, they can perfectly fit training data—even random noise. This reduces their ability to generalize.
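This can be reproduced with a tiny experiment (an illustrative sketch; the data, noise level, and polynomial degrees are made up for demonstration). A degree-9 polynomial fit to ten noisy points drives training error to nearly zero, but its error on noise-free test points does not improve accordingly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy samples from a simple linear trend: y = 2x + noise
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.3, size=10)
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test  # noise-free ground truth

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-9 model has ten parameters for ten points, so it can interpolate the noise exactly; that memorization is precisely what regularization is meant to prevent.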
What is Regularization?
Regularization is a technique used to reduce model complexity by penalizing large weights.
It modifies the loss function:
Loss = Original Loss + Penalty Term
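The modified loss can be sketched directly in code (the function name, toy data, and λ values here are illustrative, not from the article):

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    """Squared-error loss plus a weight penalty (bias term omitted for brevity)."""
    residual = X @ w - y
    original = np.sum(residual ** 2)
    if penalty == "l1":
        return original + lam * np.sum(np.abs(w))
    return original + lam * np.sum(w ** 2)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])

print(regularized_loss(w, X, y, lam=0.0))                # perfect fit, no penalty: 0
print(regularized_loss(w, X, y, lam=0.1, penalty="l1"))  # 0 + 0.1 * (|1| + |2|)
print(regularized_loss(w, X, y, lam=0.1))                # 0 + 0.1 * (1 + 4)
```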
L1 Regularization (Lasso)
L1 adds a penalty based on absolute values of weights.
Formula
L1 = λ * Σ |wi|
Effect
- Pushes weights to zero
- Performs feature selection
- Creates sparse models
Deep Insight
L1 regularization creates sharp corners in optimization space, causing some coefficients to become exactly zero.
L2 Regularization (Ridge)
L2 adds a penalty based on squared weights.
Formula
L2 = λ * Σ (wi²)
Effect
- Shrinks weights smoothly
- Keeps all features
- Improves stability
Deep Insight
L2 creates a smooth penalty surface, leading to balanced weight distribution instead of elimination.
Mathematical Intuition
The full loss function becomes:
L = Σ (yi - ŷi)² + λ * penalty
For L1:
L = Σ (yi - ŷi)² + λ Σ |wi|
For L2:
L = Σ (yi - ŷi)² + λ Σ (wi²)
Why does this work?
Adding penalties discourages large weights, preventing the model from relying too heavily on specific features.
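One way to see this: when two features are perfectly correlated, many weight vectors fit the data equally well, and the penalty term is what breaks the tie. A small illustrative sketch (all values made up):

```python
import numpy as np

# Two identical features, so different weight vectors give the same predictions
X = np.array([[1.0, 1.0], [2.0, 2.0]])
y = np.array([2.0, 4.0])

w_sparse = np.array([2.0, 0.0])  # all weight on one feature
w_spread = np.array([1.0, 1.0])  # weight split evenly

def parts(w):
    fit = np.sum((X @ w - y) ** 2)  # data-fit term: identical (0) for both
    return fit, np.sum(np.abs(w)), np.sum(w ** 2)

print(parts(w_sparse))  # fit 0, L1 penalty 2, L2 penalty 4
print(parts(w_spread))  # fit 0, L1 penalty 2, L2 penalty 2
```

The L2 penalty strictly prefers the spread-out solution (2 < 4), while the L1 penalty is indifferent (2 = 2), which is why L1 can afford to drop a feature entirely.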
Deep Mathematical Explanation
To truly understand regularization, we need to look at how it changes the optimization problem.
Base Loss Function (Without Regularization)
L = Σ (yi - ŷi)²
This objective tries to minimize prediction error. However, it does not restrict model complexity.
L1 Regularization (Lasso)
L = Σ (yi - ŷi)² + λ Σ |wi|
L1 adds a penalty proportional to the absolute values of weights.
- Encourages sparsity
- Creates sharp optimization boundaries
- Forces some weights exactly to zero
Geometric Intuition
L1 regularization creates a diamond-shaped constraint region. The corners of this shape align with axes, which is why optimization often lands exactly on zero values for some weights.
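This corner behavior can be made concrete with the soft-thresholding operator, the per-coordinate update used inside coordinate-descent Lasso solvers (a minimal sketch; the function name and values are illustrative):

```python
def soft_threshold(w, lam):
    """Proximal step for the L1 penalty: shrink by lam, and clip to zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # anything within [-lam, lam] lands exactly on zero

print(soft_threshold(3.0, 1.0))   # 2.0
print(soft_threshold(0.4, 1.0))   # 0.0  <- exact zero, not just small
print(soft_threshold(-2.5, 1.0))  # -1.5
```

Every weight whose magnitude falls below λ is mapped to exactly zero; this is the algebraic counterpart of the optimum landing on a corner of the diamond.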
L2 Regularization (Ridge)
L = Σ (yi - ŷi)² + λ Σ (wi²)
L2 penalizes squared weights, leading to smoother optimization.
- Shrinks weights continuously
- No exact zero values
- Distributes importance across features
Geometric Intuition
L2 creates a circular constraint region. Since there are no sharp corners, weights rarely become zero; they are just reduced proportionally.
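For comparison, minimizing (v - w)² + λv² for a single weight has the closed-form solution v = w / (1 + λ), a purely proportional shrink (a small sketch; the function name and values are illustrative):

```python
def l2_shrink(w, lam):
    """Exact minimizer of (v - w)**2 + lam * v**2: a proportional shrink."""
    return w / (1 + lam)

print(l2_shrink(3.0, 1.0))   # 1.5
print(l2_shrink(0.4, 1.0))   # 0.2  <- small, but never exactly zero
```

Unlike soft-thresholding, a nonzero input always gives a nonzero output: the weight is scaled down, never eliminated.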
Gradient Perspective
During training, weights are updated using gradients.
L1 Gradient:
∂L/∂wi = error_gradient + λ * sign(wi)
L2 Gradient:
∂L/∂wi = error_gradient + 2λwi
Why This Matters
L1 applies a constant-magnitude push toward zero (λ, regardless of the weight's size), while L2's push is proportional to the weight and fades as the weight shrinks. This is the key reason why L1 creates sparsity and L2 does not.
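The difference can be simulated with plain gradient steps on a single weight, ignoring the data term (the values of lam and lr are illustrative assumptions, and the zero-crossing clamp stands in for the proximal updates real L1 solvers use):

```python
lam, lr = 0.5, 0.1
w_l1 = w_l2 = 1.0

for _ in range(100):
    # L1: constant-magnitude step lam * sign(w); clamp at zero crossings,
    # as proximal / coordinate-descent solvers effectively do.
    sign = (w_l1 > 0) - (w_l1 < 0)
    stepped = w_l1 - lr * lam * sign
    w_l1 = 0.0 if stepped * w_l1 < 0 else stepped
    # L2: proportional step 2 * lam * w -- shrinks but never reaches zero.
    w_l2 = w_l2 - lr * 2 * lam * w_l2

print(w_l1)  # exactly 0.0
print(w_l2)  # tiny but still nonzero (0.9 ** 100)
```

The L1 weight hits zero in a finite number of steps and stays there; the L2 weight only decays geometrically toward zero.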
Choosing Lambda (λ)
- λ = 0 → No regularization
- Small λ → Mild penalty; weights shrink only slightly
- Large λ → Strong shrinkage; a simpler model, but with a risk of underfitting
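The effect of λ can be read off directly from the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy (a sketch with synthetic data; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([3.0, -2.0, 1.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

norms = {}
for lam in (0.0, 1.0, 100.0):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    norms[lam] = np.linalg.norm(w)
    print(f"lam={lam:>5}: w = {np.round(w, 3)}")
```

At λ = 0 the solution is ordinary least squares and roughly recovers true_w; as λ grows, every weight is pulled toward zero and the norm of w shrinks monotonically.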
⚖️ Key Differences
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Sparsity | Yes | No |
| Feature Selection | Yes | No |
| Stability | Less stable | More stable |
| Best Use Case | High-dimensional data | General purpose |
Code Example
A runnable version of the snippet (X and y are generated here with scikit-learn's make_regression so the example is self-contained):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# L1 Regularization
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# L2 Regularization
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)
```
CLI Output Sample

```
Training Model...
Epoch 1/5  Loss: 12.45
Epoch 5/5  Loss: 4.32
L1 Weights: [0.0, 1.2, 0.0, 3.4]
L2 Weights: [0.5, 1.1, 0.8, 2.9]
```
Notice how L1 forces some weights to zero, while L2 keeps all weights but reduces their magnitude.
Key Takeaways
- L1 = Feature Selection
- L2 = Weight Shrinking
- Both reduce overfitting
- Lambda controls penalty strength
Final Thoughts
Regularization is essential for building robust machine learning models. Choosing between L1 and L2 depends on your problem, data size, and feature characteristics.
Mastering these techniques ensures your models perform well not just in training—but in the real world.