XGBoost Similarity Formula
📚 Table of Contents
- Core Idea
- Why Do We Need It?
- The Similarity Formula
- What are G and H?
- Intuition Behind the Formula
- How XGBoost Chooses Splits
- Simple Example
- Code Example
- CLI Output
- Key Takeaways
- Final Thought
🤔 Core Idea
XGBoost builds trees step by step. At each step, it asks: which split improves the model's predictions the most?
The similarity formula helps answer this question.
❓ Why Do We Need It?
In normal decision trees:
- We use Gini impurity or entropy to measure how good a split is
But in XGBoost:
- We optimize a differentiable loss function (like squared error or log loss)
- We need a smarter way to measure how much a split improves that loss
📏 The Similarity Formula
Similarity = G² / (H + λ)
Where:
- G = sum of the gradients of all samples in the node
- H = sum of the hessians of all samples in the node
- λ = regularization term (controls overfitting)
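To make the formula concrete, here is a minimal sketch in Python. This is not XGBoost's internal code; the gradient and hessian arrays are made-up inputs.

```python
import numpy as np

def similarity_score(grads, hess, lam=1.0):
    """Similarity = G^2 / (H + lambda), with G and H the sums
    of the per-sample gradients and hessians in a node."""
    G = grads.sum()
    H = hess.sum()
    return G**2 / (H + lam)

# Toy node with five samples (made-up values)
grads = np.array([2.0, 3.0, 1.0, 2.5, 1.5])    # G = 10
hess = np.array([1.0, 1.0, 1.0, 1.0, 1.0])     # H = 5
print(similarity_score(grads, hess, lam=1.0))  # 100 / 6 ≈ 16.67
```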
🔍 What are G and H?
Gradient (G)
The first derivative of the loss: it tells how wrong the current prediction is, and in which direction to correct it.
Hessian (H)
The second derivative of the loss: it tells how confident we can be about that error.
Gradient = direction to improve
Hessian = confidence in that direction
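For squared-error loss, the per-sample gradient is just the residual (prediction minus target) and the hessian is 1, so the similarity score reduces to (sum of residuals)² / (number of samples + λ). A short sketch with made-up numbers:

```python
import numpy as np

# Squared-error loss: gradient = prediction - target, hessian = 1
y = np.array([3.0, 5.0, 4.0])
y_pred = np.array([4.0, 4.0, 4.0])  # current model output

grads = y_pred - y       # residuals: [1, -1, 0]
hess = np.ones_like(y)   # all ones for squared error

G, H = grads.sum(), hess.sum()
print(G**2 / (H + 1.0))  # 0 / 4 = 0.0
```

Note how residuals that cancel out give a similarity of zero: the node mixes over- and under-predictions, so it holds no consistent signal to act on.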
💡 Intuition Behind the Formula
Think of similarity as a measure of how much consistent signal a node holds:
- High G → residuals point in the same direction → big potential improvement
- High H → more (or more confident) samples in the denominator → the score is tempered
- λ → shrinks the score, which prevents overfitting (see the short experiment below)
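A quick illustration of the λ point, reusing the toy totals G = 10 and H = 5:

```python
G, H = 10.0, 5.0
for lam in (0.0, 1.0, 10.0):
    print(lam, G**2 / (H + lam))
# 0.0  -> 20.0
# 1.0  -> 16.67
# 10.0 -> 6.67
```

The larger λ is, the smaller every similarity score becomes, so a split must offer more evidence before it looks attractive.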
🌲 How XGBoost Chooses Splits
- Try a candidate split
- Calculate the similarity of the left node
- Calculate the similarity of the right node
- Compare with the parent node
Gain formula:
Gain = Left + Right − Parent
(The full algorithm also subtracts a complexity penalty γ per new leaf, so a split is kept only if its gain is large enough.)
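A minimal sketch of this computation, in the same illustrative spirit as the earlier snippets (the node statistics are made up):

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain = similarity(left) + similarity(right) - similarity(parent)."""
    sim = lambda G, H: G**2 / (H + lam)
    parent = sim(g_left + g_right, h_left + h_right)
    return sim(g_left, h_left) + sim(g_right, h_right) - parent

# Parent has G = 10, H = 5; the split sends (8, 3) left and (2, 2) right
print(split_gain(8.0, 3.0, 2.0, 2.0))  # 16.0 + 1.33 - 16.67 ≈ 0.67
```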
📊 Simple Example
Assume:
G = 10, H = 5, λ = 1
Similarity:
= 10² / (5 + 1) = 100 / 6 ≈ 16.67
A higher value means the node contains more consistent, usable signal.
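To extend this into a full gain calculation, suppose a candidate split divides this parent into a left child with G = 7, H = 2 and a right child with G = 3, H = 3 (numbers chosen purely for illustration):
Left = 7² / (2 + 1) = 49 / 3 ≈ 16.33
Right = 3² / (3 + 1) = 9 / 4 = 2.25
Parent = 100 / 6 ≈ 16.67
Gain = 16.33 + 2.25 − 16.67 ≈ 1.92
The gain is positive, so the children are purer than the parent and the split is worth making.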
💻 Code Example
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load a small binary-classification dataset
data = load_breast_cancer()

# Fixed random_state so the split (and the printed score) is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train a default XGBoost classifier and report test accuracy
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2f}")
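The λ from the similarity formula is exposed directly as the estimator's reg_lambda parameter, and the minimum gain a split must achieve as gamma. A sketch with illustrative, untuned values:

```python
# Stronger L2 regularization and a minimum gain per split
# (illustrative values, not tuned for this dataset)
model = xgb.XGBClassifier(reg_lambda=10.0, gamma=1.0)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2f}")
```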
🖥 CLI Output
Accuracy: 0.96
🎯 Key Takeaways
- Similarity = G² / (H + λ) scores how much consistent signal a node holds
- G is the sum of gradients, H the sum of hessians, λ the regularization term
- Splits are chosen by Gain = Left + Right − Parent
- Larger λ shrinks similarity scores and discourages overfitting
🚀 Final Thought
The similarity formula is the brain of XGBoost trees. Once you understand it, the whole algorithm becomes much easier to follow.