Monday, September 16, 2024

Pre-Pruning vs Post-Pruning in Decision Trees

🌳 Pruning Decision Trees: Complete Guide (Theory + Math + Code)

📌 Decision Tree Intuition (Deep Understanding)

A decision tree is not just a flowchart: it is a recursive partitioning algorithm. It keeps splitting the data into smaller groups until each group becomes "pure" (contains only one class).

🔍 Example (Intuition)

Suppose we classify emails as spam or not spam:
Step 1 → Contains "offer"?
Step 2 → Contains "urgent"?
Step 3 → Sender known?

Each split reduces uncertainty.

💡 Goal: Reduce uncertainty at every split.
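
To make "recursive partitioning" concrete, here is a hand-written sketch of those three questions as nested rules. The feature names (has_offer, has_urgent, sender_known) and the spam / not-spam outcomes are made up for illustration; a real tree learns which questions to ask, and in which order, from the data.

# Hypothetical, hand-coded version of the three splits above
def classify_email(has_offer, has_urgent, sender_known):
    if has_offer:                    # Step 1: contains "offer"?
        if has_urgent:               # Step 2: contains "urgent"?
            return "spam"
        return "not spam" if sender_known else "spam"   # Step 3: sender known?
    return "not spam"

print(classify_email(has_offer=True, has_urgent=True, sender_known=False))  # spam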

📊 How Trees Make Decisions (Math Made Easy)

Decision trees use impurity measures to score candidate splits.

1. Gini Impurity

Gini = 1 - (p1² + p2² + ... + pn²)

👉 Simple meaning:
  • If all samples belong to one class → Gini = 0 (perfect)
  • If mixed → higher value

📌 Example

Class A = 50%, Class B = 50%
Gini = 1 - (0.5² + 0.5²) = 0.5
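
A quick sketch to check this number. The gini helper below is written just for this post:

# Gini = 1 - sum of squared class proportions
def gini(proportions):
    return 1 - sum(p ** 2 for p in proportions)

print(gini([0.5, 0.5]))   # 0.5 -> maximally mixed for two classes
print(gini([1.0, 0.0]))   # 0.0 -> perfectly pure node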

---

2. Entropy (Information Gain)

Entropy = - Σ p log2(p)

👉 Measures randomness
👉 Higher entropy = more disorder

📌 Example

If both classes are equally common → entropy is high. If one class dominates → entropy is low.

💡 Trees choose splits that REDUCE impurity the most.
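
The same idea in code, including how a candidate split is scored by information gain. The entropy helper and the toy label lists below are made up for this illustration:

from math import log2

# Entropy = - sum of p * log2(p) over the classes present in a node
def entropy(labels):
    probs = [labels.count(c) / len(labels) for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

parent = ["A", "A", "B", "B"]           # 50/50 mix -> entropy = 1.0
left, right = ["A", "A"], ["B", "B"]    # a split that separates the classes

# Information gain = parent entropy - weighted entropy of the children
gain = entropy(parent) \
       - (len(left) / len(parent)) * entropy(left) \
       - (len(right) / len(parent)) * entropy(right)
print(gain)  # 1.0 -> this split removes all uncertainty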

⚠️ Why Overfitting Happens (Core Concept)

A decision tree keeps splitting until:

  • Each leaf is pure
  • Or no further gain exists

Problem → It starts learning noise.

📉 Real Insight

If one rare data point exists, the tree may create a full branch just for it.

💡 High variance model = fits the training data too perfectly but fails on new, real-world data.
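
One quick way to see this effect is to compare training and test accuracy of a fully grown (unpruned) tree. The exact numbers depend on the dataset and the random split, but the training score is typically perfect while the test score lags behind:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# No limits on depth or leaf size -> the tree grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("Train accuracy:", full_tree.score(X_train, y_train))  # usually 1.0
print("Test accuracy: ", full_tree.score(X_test, y_test))    # usually lower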

✂️ What is Pruning?

Pruning removes weak splits that do not generalize well.

Instead of asking "Does this split improve training accuracy?", we ask "Does this split help on future, unseen data?"


🌿 Types of Pruning

1. Pre-Pruning

Stop the tree from growing too large while it is being built, by setting limits such as:
  • max_depth
  • min_samples_split
  • min_samples_leaf

2. Post-Pruning

Grow full tree → Remove unnecessary nodes later.


🧮 Math Behind Pruning (Very Important)

Cost Complexity Pruning Formula:

Rα(T) = R(T) + α|T|

Where:

  • R(T) → Error of tree
  • |T| → Number of leaf nodes
  • α → Complexity penalty

📌 Easy Explanation

Think of α as a "penalty for complexity".

If the tree is too big → the penalty increases. If the tree is small → the penalty decreases.

💡 Model tries to balance: Accuracy vs Simplicity

📉 Practical Meaning

If adding a branch improves accuracy slightly but increases complexity a lot → it gets removed.
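
In scikit-learn, cost complexity pruning is exposed through the ccp_alpha parameter of DecisionTreeClassifier: larger values of α prune more aggressively. Below is a minimal sketch on the Iris data; the value 0.02 is just an illustrative choice, and in practice you would pick α from cost_complexity_pruning_path using cross-validation:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Candidate alpha values computed from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print("Effective alphas:", path.ccp_alphas)

# Post-pruning: scikit-learn grows the full tree, then removes branches
# whose improvement does not justify the complexity penalty alpha
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
print("Leaf nodes after pruning:", pruned.get_n_leaves())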


💻 Code Example

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Hold out a test set so the reported accuracy reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: limit tree growth while it is being built
model = DecisionTreeClassifier(
    max_depth=3,           # tree can be at most 3 levels deep
    min_samples_split=4    # a node needs at least 4 samples to be split
)

model.fit(X_train, y_train)

print("Training complete")
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.0%}")

🖥️ CLI Output

$ python train.py

Loading dataset...
Splitting data...
Training model...

Applying pruning:
- max_depth = 3
- min_samples_split = 4

Training complete
Accuracy: 95%

🎯 Key Takeaways

  • Decision trees reduce uncertainty step by step
  • Overfitting occurs due to excessive branching
  • Pruning removes unnecessary complexity
  • Cost complexity pruning balances accuracy and simplicity
  • Simpler models generalize better


📌 Designed for deep understanding + practical application.
