XGBoost Similarity Formula
📚 Table of Contents
- Core Idea
- Why Do We Need It?
- The Similarity Formula
- What are G and H?
- Intuition Behind the Formula
- How XGBoost Chooses Splits
- Simple Example
- Code Example
- CLI Output
- Key Takeaways
- Final Thought
🧠 Core Idea
XGBoost builds trees step by step. At each step, it asks: which split of the data will reduce the loss the most?
The similarity formula helps answer this question.
❓ Why Do We Need It?
In normal decision trees:
- Split quality is measured with Gini impurity or entropy
But in XGBoost:
- We optimize a differentiable loss function (such as squared error or log loss)
- We need a way to measure improvement that comes straight from that loss
📏 The Similarity Formula
Similarity = G² / (H + λ)
Where:
- G = sum of the gradients of all samples in the node
- H = sum of the hessians of all samples in the node
- λ = regularization parameter (controls overfitting)
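A minimal sketch of this computation in Python (the function name and the `g`/`h` inputs are illustrative, not part of XGBoost's API):

```python
# Similarity score of one tree node, given the per-sample
# gradients g and hessians h of the examples that fall into it.
def similarity_score(g, h, lam=1.0):
    G = sum(g)               # total gradient in the node
    H = sum(h)               # total hessian in the node
    return G**2 / (H + lam)  # the similarity formula above

# Example: three samples with gradients 1.0, 2.0, 3.0 and hessian 1 each
print(similarity_score([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))  # 36 / 4 = 9.0
```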
🔍 What are G and H?
Gradient (G)
The first derivative of the loss with respect to the prediction. It tells how wrong the current prediction is, and in which direction to correct it.
Hessian (H)
The second derivative of the loss. It tells how confident we are about that error (how curved the loss is around the prediction).
Gradient = direction to improve
Hessian = confidence in that direction
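As a concrete sketch, here is what the per-sample gradient and hessian look like for two common losses (the derivations are standard; the function names are illustrative):

```python
# Per-sample gradient and hessian, evaluated at the current prediction.

def squared_error_grad_hess(y, y_pred):
    # Loss: 0.5 * (y - y_pred)^2
    g = y_pred - y       # first derivative: the (negated) residual
    h = 1.0              # second derivative: constant curvature
    return g, h

def log_loss_grad_hess(y, p):
    # Binary cross-entropy loss; p is the predicted probability
    g = p - y            # first derivative w.r.t. the raw score
    h = p * (1.0 - p)    # second derivative: small when p is near 0 or 1
    return g, h
```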
💡 Intuition Behind the Formula
Think of similarity as a measure of how much a node could be improved (a small sketch follows this list):
- High |G| → the residuals agree in direction → a big, correctable error
- High H → more evidence behind the estimate, which tempers the score (the denominator grows)
- λ → shrinks every score, preventing overfitting to small or noisy nodes
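This is also why it is called a similarity score: residuals that point the same way add up to a large G, while mixed residuals cancel out. A tiny sketch, assuming squared-error loss (so h = 1 per sample):

```python
lam = 1.0
similar = [0.5, 0.4, 0.6]     # residuals that agree in sign
mixed   = [0.5, -0.4, 0.6]    # one residual points the other way

for g in (similar, mixed):
    G, H = sum(g), float(len(g))   # h = 1 per sample under squared error
    print(round(G**2 / (H + lam), 4))
# 0.5625 for the similar group, 0.1225 for the mixed one:
# nodes with similar residuals score higher.
```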
🌲 How XGBoost Chooses Splits
- Try a candidate split
- Calculate the similarity of the left node
- Calculate the similarity of the right node
- Compare the children with the parent node
Gain formula:
Gain = Similarity_left + Similarity_right − Similarity_parent
A positive gain means the children describe the data better than the parent, so the split is worth keeping (see the sketch below).
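A small sketch with made-up numbers (the helper and values are illustrative): separating positive from negative gradients produces a clearly positive gain.

```python
def node_similarity(G, H, lam=1.0):
    return G**2 / (H + lam)

# A split that separates positive from negative gradients
G_left,  H_left  = 8.0, 3.0
G_right, H_right = -3.0, 2.0
G_parent, H_parent = G_left + G_right, H_left + H_right  # parent pools both sides

gain = (node_similarity(G_left, H_left)
        + node_similarity(G_right, H_right)
        - node_similarity(G_parent, H_parent))
print(round(gain, 2))  # 16.0 + 3.0 - 4.17 ≈ 14.83 → worth splitting
```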
📊 Simple Example
Assume:
G = 10, H = 5, λ = 1
Similarity:
= 10² / (5 + 1) = 100 / 6 ≈ 16.67
A higher value means a better, purer node.
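The same arithmetic, checked in two lines of Python:

```python
G, H, lam = 10.0, 5.0, 1.0
print(G**2 / (H + lam))   # 16.666... ≈ 16.67
```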
💻 Code Example
```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load a small binary classification dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2
)

# Default XGBClassifier; its reg_lambda=1 is the λ in the similarity formula
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Accuracy on the held-out test set
print(f"Accuracy: {model.score(X_test, y_test):.2f}")
```
🖥️ CLI Output
```
Accuracy: 0.96
```
The exact value varies with the random train/test split.
🎯 Key Takeaways
- Similarity = G² / (H + λ) scores how good a node is
- G sums the gradients and H sums the hessians of the samples in the node
- λ regularizes the score and guards against overfitting
- Gain = Similarity_left + Similarity_right − Similarity_parent decides whether a split is kept
🚀 Final Thought
The similarity formula is the brain of XGBoost trees. Once you understand it, the whole algorithm becomes much easier to follow.