Monday, September 16, 2024

A Beginner’s Guide to Decision Trees: Understanding ID3, CART, and C4.5


📖 Introduction

Decision trees are powerful yet intuitive machine learning models that mimic human decision-making.

They split data step by step, forming a tree-like structure.

🌿 What is a Decision Tree?

Think of it like a flowchart:

  • Start with a question
  • Split into branches
  • Reach a final decision

Real Life Example

Should you go outside?

  • Is it raining?
  • Do you have time?
  • Are you tired?

๐Ÿ“ Mathematics Behind Decision Trees (Detailed Explanation)

1. Entropy (Measure of Uncertainty)

Entropy tells us how mixed or impure a dataset is.

Entropy(S) = - Σ p(x) log₂ p(x)

Explanation:

  • p(x) = probability of each class
  • If all data belongs to one class → Entropy = 0 (perfectly pure)
  • If data is evenly split → Entropy is maximum

Example:

Dataset: 50% Yes, 50% No
Entropy = -0.5 log₂(0.5) - 0.5 log₂(0.5)
= 1 (maximum uncertainty)

Key Idea: Higher entropy = more disorder; a good split produces children with lower entropy.
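The entropy formula can be turned into a few lines of Python (a minimal sketch; the function name is ours):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# 50/50 split -> maximum uncertainty
print(entropy(["Yes", "No", "Yes", "No"]))  # 1.0
# 9 Yes / 5 No (the Play Tennis dataset used later in this post)
print(round(entropy(["Yes"] * 9 + ["No"] * 5), 3))  # 0.94
```

A dataset where every sample has the same label gives entropy 0 (perfectly pure).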

2. Information Gain (Reduction in Uncertainty)

Gain = Entropy(parent) - Weighted Entropy(children)

Explanation:

  • Measures how much uncertainty is reduced after a split
  • Higher gain = better feature

Simple Intuition:

If a question (feature) gives you clear answers → it has high information gain.

3. Gini Impurity (Used in CART)

Gini = 1 - Σ (p²)

Explanation:

  • Measures probability of misclassification
  • If all samples belong to one class → Gini = 0
  • Lower Gini = better split

Example:

p(Yes)=0.7, p(No)=0.3
Gini = 1 - (0.7² + 0.3²)
= 1 - (0.49 + 0.09)
= 0.42
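The same calculation in code (a small sketch, using our own function name, that reproduces the 0.7/0.3 example):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the probability that a randomly chosen sample
    is misclassified if labeled according to the class distribution."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

# 70% Yes, 30% No -> matches the worked example above
labels = ["Yes"] * 7 + ["No"] * 3
print(round(gini(labels), 2))  # 0.42
```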

4. Gain Ratio (Improvement over ID3)

Gain Ratio = Information Gain / Split Info

Why needed?

  • ID3 favors features with many categories
  • Gain Ratio penalizes such features
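A sketch of how gain ratio can be computed (helper names are ours; C4.5's actual implementation differs in details such as missing-value handling):

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain normalised by split info, C4.5-style.
    feature_values[i] is the feature's value for sample i."""
    n = len(labels)
    parent = _entropy(labels)
    children = 0.0    # weighted entropy of the children after the split
    split_info = 0.0  # entropy of the split itself (penalises many branches)
    for value in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == value]
        w = len(subset) / n
        children += w * _entropy(subset)
        split_info -= w * math.log2(w)
    gain = parent - children
    return gain / split_info if split_info > 0 else 0.0

outlook = ["Sunny", "Sunny", "Overcast", "Rain"]
play = ["No", "No", "Yes", "Yes"]
print(round(gain_ratio(outlook, play), 3))  # 0.667
```

Here the split has information gain 1.0, but the three-way split carries split info 1.5, so the gain ratio drops to about 0.67.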

5. Mean Squared Error (For Regression in CART)

MSE = Σ (Actual - Predicted)² / n

Explanation:

  • Measures how far predictions are from actual values
  • Lower MSE = better split
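A short sketch (function name ours) showing how a CART regression leaf is scored, where each leaf predicts the mean of its samples:

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

# A regression leaf predicts the mean of the samples it contains
values = [3.0, 5.0, 7.0]
leaf_prediction = sum(values) / len(values)                 # 5.0
print(round(mse(values, [leaf_prediction] * len(values)), 3))  # 2.667
```
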

Why These Math Concepts Matter

All decision tree algorithms rely on these formulas to decide the best question to ask at each step.

They ensure the model becomes more accurate as it grows.

🧠 ID3 Algorithm

ID3 (Iterative Dichotomiser 3) is one of the earliest and simplest decision tree algorithms. It works by selecting the feature that provides the highest Information Gain at each step.

Step-by-Step Working of ID3
  • Start with the full dataset
  • Calculate entropy for the target variable
  • For each feature, calculate information gain
  • Select the feature with the highest gain
  • Split the dataset based on that feature
  • Repeat recursively until all data is pure

Example: Suppose we want to predict "Play Tennis" based on weather. ID3 will choose the feature (Outlook, Humidity, etc.) that best reduces uncertainty.

Key Insight: ID3 prefers features that create the most "pure" splits (least randomness).
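The steps above can be sketched in Python on a toy 4-row subset of the Play Tennis data (helper names are ours; a full ID3 would recurse on the chosen feature):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the parent minus the weighted entropy of the children."""
    labels = [r[target] for r in rows]
    parent = entropy(labels)
    weighted = 0.0
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return parent - weighted

def best_feature(rows, features, target="Play"):
    """ID3's core step: pick the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))

rows = [
    {"Outlook": "Sunny",    "Wind": "Weak",   "Play": "No"},
    {"Outlook": "Sunny",    "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Wind": "Weak",   "Play": "Yes"},
]
print(best_feature(rows, ["Outlook", "Wind"]))  # Outlook
```

On this subset, splitting on Outlook yields perfectly pure children (gain 1.0), while Wind leaves a mixed branch (gain ≈ 0.31), so ID3 picks Outlook.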

Limitations Explained

  • Biased toward features with many categories
  • No handling of missing values
  • No pruning → can overfit

⚙️ CART Algorithm

CART (Classification and Regression Trees) is one of the most widely used decision tree algorithms in real-world applications.

How CART Works Step-by-Step
  • Start with full dataset
  • Try all possible splits for all features
  • Calculate Gini Impurity (classification) or MSE (regression)
  • Select the split with lowest impurity
  • Repeat recursively
  • Stop when stopping criteria met

Understanding Gini Intuitively

Gini measures how often a randomly chosen element would be incorrectly labeled.

Gini = 1 - (p₁² + p₂² + ...)

Lower Gini = Better Split

Why Binary Splits?

CART always splits into two branches, making it computationally efficient and easier to optimize.

Pruning in CART

CART uses cost complexity pruning to remove unnecessary branches and prevent overfitting.
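For example, scikit-learn's decision trees (an optimized CART implementation) expose cost complexity pruning through the `ccp_alpha` parameter; a larger alpha prunes more aggressively. A quick sketch on the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A fully grown tree vs. a cost-complexity-pruned tree
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print("Full tree leaves:  ", full.get_n_leaves())
print("Pruned tree leaves:", pruned.get_n_leaves())
```

The pruned tree has fewer leaves; branches whose impurity reduction does not justify their complexity are cut away.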

🚀 C4.5 Algorithm

C4.5 improves ID3 by solving its key weaknesses.

Step-by-Step Working
  • Compute information gain for each feature
  • Normalize using Gain Ratio
  • Select best feature
  • Handle continuous data by finding thresholds
  • Handle missing values
  • Apply pruning after tree construction

Why Gain Ratio?

ID3 favors features with many values. Gain Ratio fixes this by penalizing such splits.

Gain Ratio = Gain / Split Info

Handling Continuous Data

C4.5 converts continuous features into binary splits using thresholds like:

Age > 30 ?
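The threshold search can be sketched as follows (names are ours): sort the samples by the feature, try the midpoint between each pair of consecutive distinct values, and keep the one with the highest gain.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Find the binary threshold on a continuous feature that
    maximises information gain, checking candidate midpoints."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = entropy([lab for _, lab in pairs])
    best = (0.0, None)  # (gain, threshold)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal feature values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = parent - (len(left) / n * entropy(left)
                         + len(right) / n * entropy(right))
        if gain > best[0]:
            best = (gain, t)
    return best

ages = [22, 25, 28, 34, 40, 45]
play = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_threshold(ages, play))  # (1.0, 31.0)
```

Here the midpoint between 28 and 34 separates the classes perfectly, so "Age > 31?" becomes the binary test.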

Handling Missing Values

C4.5 assigns probabilities instead of discarding data.

Pruning Advantage

Post-pruning reduces overfitting and improves generalization.

📊 Comparison

Feature   | ID3         | CART   | C4.5
Data Type | Categorical | Both   | Both
Metric    | Entropy     | Gini   | Gain Ratio
Tree Type | Multi-way   | Binary | Multi-way

💻 Code Example

from sklearn.tree import DecisionTreeClassifier

# Two training samples, two features each
X = [[0, 0], [1, 1]]
y = [0, 1]

# Defaults to Gini impurity, i.e. CART-style binary splits
model = DecisionTreeClassifier()
model.fit(X, y)

# Classify an unseen point
print(model.predict([[2, 2]]))

🖥️ CLI Output

$ python tree.py
[1]

💡 Key Takeaways

  • ID3 = Simple but limited
  • CART = Most practical
  • C4.5 = Advanced ID3
  • Math drives decision splits

📘 Full Example: Play Tennis Dataset

Let’s walk through a classic dataset used to understand decision trees.

Dataset Overview (first four rows of the classic 14-row dataset)
Outlook | Temp | Humidity | Wind | Play
Sunny   | Hot  | High     | Weak | No
Sunny   | Hot  | High     | Strong | No
Overcast| Hot  | High     | Weak | Yes
Rain    | Mild | High     | Weak | Yes

ID3 will calculate entropy and choose the best feature (usually Outlook).

Tree Structure (Simplified)

        Outlook
       /   |   \
   Sunny Overcast Rain
    No      Yes    ?

🌳 Visual Understanding

A decision tree grows top-down. Each split reduces uncertainty.

How Tree Grows
  • Root node = best feature
  • Branches = decisions
  • Leaves = final prediction

🧠 Quick Quiz

Question 1

Which algorithm uses Gini impurity?

Answer: CART

Question 2

Which algorithm handles continuous data better than ID3?

Answer: C4.5

Question 3

Which algorithm supports regression?

Answer: CART

🎯 Interview Questions

  • What is entropy in decision trees?
  • Difference between Gini and entropy?
  • Why does overfitting happen in trees?
  • Explain pruning techniques

📊 Step-by-Step Entropy & Information Gain (Worked Example)

Consider a small dataset with target Play having 9 Yes and 5 No.

Entropy of Dataset

Entropy(S) = - (9/14) log₂(9/14) - (5/14) log₂(5/14)
           ≈ 0.410 + 0.530
           ≈ 0.940

This represents the uncertainty before any split.

Split by Outlook

Sunny: 2 Yes, 3 No
Overcast: 4 Yes, 0 No
Rain: 3 Yes, 2 No

Entropy After Split

Entropy(Sunny) ≈ 0.97
Entropy(Overcast) = 0
Entropy(Rain) ≈ 0.97

Information Gain

Gain = 0.940 - [(5/14)(0.971) + (4/14)(0) + (5/14)(0.971)] = 0.940 - 0.694 ≈ 0.246

Conclusion: Outlook gives a strong split → chosen as root.
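The arithmetic above can be checked in a few lines of Python (the helper name `H` is ours):

```python
import math

def H(pos, neg):
    """Binary entropy of a node with `pos` positives and `neg` negatives."""
    total = pos + neg
    result = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            result -= p * math.log2(p)
    return result

parent = H(9, 5)  # entropy before any split
weighted = (5/14) * H(2, 3) + (4/14) * H(4, 0) + (5/14) * H(3, 2)
gain = parent - weighted
print(round(parent, 3), round(gain, 3))  # 0.94 0.247
```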

📈 Gini vs Entropy (Visual Intuition)

Both measure impurity but behave slightly differently:

  • Entropy uses a logarithm → slightly more sensitive to changes in class probabilities
  • Gini is cheaper to compute → often preferred in practice

Key Insight

Both aim to create pure nodes. In practice they usually choose the same splits, so the choice often comes down to computational speed.
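A quick numeric comparison of the two impurity curves for a two-class node, where p is the fraction of positives (function names are ours):

```python
import math

def entropy2(p):
    """Binary entropy as a function of the positive-class probability."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini2(p):
    """Binary Gini impurity as a function of the positive-class probability."""
    return 1 - p**2 - (1 - p)**2

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p={p:.1f}  entropy={entropy2(p):.3f}  gini={gini2(p):.3f}")
```

Both curves are 0 at the pure extremes and peak at p = 0.5 (entropy at 1.0, Gini at 0.5); they rank candidate splits almost identically.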
