Monday, September 16, 2024

Pruning Decision Trees: Simplifying Machine Learning Models for Better Accuracy

🌳 Pre-Pruning vs Post-Pruning in Decision Trees

Imagine you're a gardener growing a tree. As it grows, branches spread everywhere. Some are strong and useful, while others are weak and unnecessary. To ensure healthy growth, you prune the weak branches.

In Machine Learning, a Decision Tree behaves exactly like this. It grows branches (decisions), but without control, it becomes overly complex.


🌿 Understanding Through Analogy

A growing decision tree splits data repeatedly. However:

  • Too many branches → Overfitting
  • Too few branches → Underfitting

Pruning ensures the model grows in a balanced and meaningful way.

📖 Deep Explanation

A decision tree recursively partitions data based on features. Each split increases model complexity. Without constraints, the model memorizes training data instead of learning patterns.
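To see the recursive partitioning concretely, here is a minimal sketch. The iris dataset is an illustrative stand-in, not part of the original example:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# With no constraints, the tree keeps splitting until every leaf is pure
tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Each indentation level in the printout is one more recursive partition
print(export_text(tree, feature_names=load_iris().feature_names))
print("Depth:", tree.get_depth(), "| Leaves:", tree.get_n_leaves())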


⚠️ Why Do We Prune Decision Trees?

  • Avoid Overfitting: Prevent memorizing noise
  • Improve Interpretability: Simpler trees are easier to understand
  • Enhance Efficiency: Faster predictions

📖 Expanded Explanation

Overfitting happens when a model captures random fluctuations instead of actual patterns. Pruning removes such noise-driven splits.
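A quick way to observe this is to compare training and validation accuracy of an unconstrained tree. The synthetic dataset below is an assumption for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label noise (flip_y) makes the memorization gap easy to see
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Train accuracy:     ", tree.score(X_train, y_train))  # near 1.0: noise memorized
print("Validation accuracy:", tree.score(X_val, y_val))      # noticeably lower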


✂️ Pre-Pruning (Early Stopping)

Stops the tree from growing too large during training.

When to Use?

  • Limited data
  • Need faster training
  • Simple model preferred

Common Criteria

  • Max Depth
  • Min Samples Split
  • Min Samples Leaf

📖 Detailed Theory

Pre-pruning applies constraints before splits happen. If conditions aren't met, the split is not created. This reduces complexity early but may miss important patterns.
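Rather than guessing these thresholds, one common approach (sketched below with illustrative parameter ranges and a stand-in dataset) is to cross-validate over them:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# Candidate pre-pruning constraints; these ranges are assumptions, not recommendations
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

# Cross-validation picks the constraint combination that generalizes best
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("Best constraints:", search.best_params_)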


🌳 Post-Pruning (Cost Complexity Pruning)

First grow the full tree → then remove weak branches.

When to Use?

  • Large datasets
  • Need best performance
  • Want detailed analysis

📖 How It Works

Each branch is evaluated against a complexity penalty (alpha). Branches that contribute too little accuracy to justify their complexity are removed, balancing accuracy against tree size.
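Concretely, cost complexity pruning minimizes R_alpha(T) = R(T) + alpha * |leaves(T)|, so a larger alpha favors smaller trees. One common workflow, sketched here with an assumed hold-out split and stand-in dataset, scores a pruned tree at every candidate alpha:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset and split, assumed for illustration
X_train, X_val, y_train, y_val = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0
)

# Each alpha on the path corresponds to collapsing the next-weakest branch
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Keep the alpha whose pruned tree scores best on held-out data
scores = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train).score(X_val, y_val)
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[scores.index(max(scores))]
print("Best alpha:", best_alpha)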


⚖️ Pre vs Post Pruning

Feature  | Pre-Pruning        | Post-Pruning
Timing   | Before full growth | After full growth
Speed    | Fast               | Slower
Accuracy | May miss patterns  | Better generalization

💻 Code Example (Python - Scikit-Learn)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data so the snippet runs end to end
X_train, X_val, y_train, y_val = train_test_split(
    *load_iris(return_X_y=True), random_state=42
)

# Pre-Pruning: constrain growth while the tree is built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=5)
pre_pruned.fit(X_train, y_train)

# Post-Pruning: grow an unconstrained tree first...
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

# ...then compute the candidate penalties (one alpha per prunable branch)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Refit with a chosen penalty (0.01 here; in practice pick alpha by validation)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
post_pruned.fit(X_train, y_train)

🖥️ CLI Output Example

Training Decision Tree...
Initial Depth: 12
After Pre-Pruning Depth: 3
After Post-Pruning Depth: 5

Accuracy:
Training: 98%
Validation: 91%
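The numbers above are illustrative. A report in that shape could be generated from the models in the code example; the snippet below continues that example and reuses its names:

# Continues the code example above (same models and train/validation split)
print("Training Decision Tree...")
print("Initial Depth:", full_tree.get_depth())
print("After Pre-Pruning Depth:", pre_pruned.get_depth())
print("After Post-Pruning Depth:", post_pruned.get_depth())
print(f"Training: {post_pruned.score(X_train, y_train):.0%}")
print(f"Validation: {post_pruned.score(X_val, y_val):.0%}")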

💡 Key Takeaways

  • Pruning prevents overfitting
  • Pre-pruning is faster but less flexible
  • Post-pruning usually generalizes better, at the cost of extra training time
  • Simpler models generalize better


📌 Final Thought

Pruning is not just optimization — it's discipline in modeling. The goal is not the most complex tree, but the most reliable and generalizable one.
