
Wednesday, September 18, 2024

How the Similarity Formula Works in XGBoost Trees

XGBoost Similarity Formula Explained Simply (With Intuition & Example)

馃 Core Idea

XGBoost builds trees step by step. At each step, it asks:

💡 “If I split the data here, does my prediction improve?”

The similarity formula helps answer this question.


❓ Why Do We Need It?

In normal decision trees:

  • We use Gini impurity or entropy to measure split quality

But in XGBoost:

  • We optimize a differentiable loss function (such as squared error or log loss)
  • We need a way to measure improvement directly in terms of that loss
💡 So instead of “purity”, XGBoost measures “how much the error reduces”

📐 The Similarity Formula

Similarity = G² / (H + λ)

Where:

  • G = sum of the gradients of all samples in the node
  • H = sum of the Hessians of all samples in the node
  • λ = regularization parameter (controls overfitting)
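
Before unpacking G and H below, here is the formula as a minimal Python sketch (the function name similarity_score and its arguments are mine for illustration, not part of the XGBoost API):

def similarity_score(G, H, lam):
    # Similarity = G^2 / (H + lambda)
    # G   = sum of gradients in the node
    # H   = sum of Hessians in the node
    # lam = L2 regularization strength (lambda)
    return G**2 / (H + lam)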

🔍 What are G and H?

Gradient (G)

Tells how wrong the current prediction is (the first derivative of the loss).

Hessian (H)

Tells how confident we are about that error (the second derivative, i.e. the curvature of the loss).

💡 In short:
Gradient = direction to improve
Hessian = confidence in that direction
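
To make g and h concrete, here is a small sketch for two common losses. For squared error the gradient is the residual and the Hessian is 1; for binary log loss (with predicted probability p) the gradient is p - y and the Hessian is p * (1 - p):

import numpy as np

y = np.array([1.0, 0.0, 1.0])      # true labels
p = np.array([0.7, 0.4, 0.9])      # current predicted probabilities

# Squared error loss: L = 0.5 * (p - y)^2
g_sq = p - y                       # gradient = residual
h_sq = np.ones_like(y)             # Hessian = 1 per sample

# Binary log loss, derivatives taken w.r.t. the raw score
g_log = p - y                      # gradient = p - y
h_log = p * (1 - p)                # Hessian = p * (1 - p)

# G and H for a node containing all three samples
print(g_log.sum(), h_log.sum())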

💡 Intuition Behind the Formula

Think of similarity as:

💡 “How strong is this group of data?”
  • High G → big total error → big potential improvement
  • High H → more samples and more certainty (H sits in the denominator and tempers the score)
  • λ → shrinks the score and discourages splits on noise

🌲 How XGBoost Chooses Splits

  1. Try a candidate split
  2. Calculate the similarity of the left node
  3. Calculate the similarity of the right node
  4. Compare with the parent node

Gain formula:

Gain = Similarity(left) + Similarity(right) - Similarity(parent)
💡 Highest gain = best split
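
Putting the pieces together, here is a sketch of how the gain of one candidate split could be computed (the helper names are illustrative, not XGBoost internals):

import numpy as np

def similarity_score(G, H, lam):
    return G**2 / (H + lam)

def split_gain(g, h, left_mask, lam=1.0):
    # g, h: per-sample gradients and Hessians in the parent node
    # left_mask: True for samples that go to the left child
    G, H = g.sum(), h.sum()
    G_left, H_left = g[left_mask].sum(), h[left_mask].sum()
    G_right, H_right = G - G_left, H - H_left
    return (similarity_score(G_left, H_left, lam)
            + similarity_score(G_right, H_right, lam)
            - similarity_score(G, H, lam))

g = np.array([2.0, 3.0, -1.0, 4.0])   # residuals
h = np.ones(4)                        # squared error: Hessian = 1
left_mask = g > 0                     # group similar residuals together
print(split_gain(g, h, left_mask))    # positive gain: this split helps

Note how separating the one negative residual from the positives raises the children’s similarity above the parent’s, which is exactly the intuition behind the formula.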

📊 Simple Example

Assume:

G = 10
H = 5
λ = 1

Similarity:

= 10² / (5 + 1)
= 100 / 6
≈ 16.67

Higher value = better node
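
The same number drops out of the similarity_score sketch from earlier (redefined here so the snippet runs on its own):

def similarity_score(G, H, lam):
    return G**2 / (H + lam)

print(similarity_score(10, 5, 1))   # 16.666...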


💻 Code Example

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load a binary classification dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# reg_lambda is the λ from the similarity formula (default 1.0)
model = xgb.XGBClassifier(reg_lambda=1.0)
model.fit(X_train, y_train)

print(f"Accuracy: {model.score(X_test, y_test):.2f}")

🖥️ CLI Output

Accuracy: 0.96
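
Because λ sits in the denominator of every similarity score, raising it makes all splits look weaker and produces more conservative trees. In the scikit-learn wrapper λ is exposed as the reg_lambda parameter; a quick comparison sketch:

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Same model, two very different regularization strengths
for lam in (1.0, 100.0):
    model = xgb.XGBClassifier(reg_lambda=lam)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"reg_lambda={lam}: mean CV accuracy = {score:.3f}")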

🎯 Key Takeaways

✔ XGBoost uses gradients instead of simple counts
✔ Similarity measures node quality
✔ Gain decides the best split
✔ λ prevents overfitting
✔ Better splits → better predictions


🚀 Final Thought

The similarity formula is the brain of XGBoost trees. Once you understand it, the whole algorithm becomes much easier to follow.
