
Monday, September 30, 2024

Hierarchical Clustering Explained: How It Works



🌳 Hierarchical Clustering: A Complete Interactive Guide



🚀 Introduction

Humans naturally group things. When you see animals, you categorize them: flying, swimming, walking. Machines do something very similar using clustering algorithms.

💡 Core Idea: Hierarchical clustering builds a tree of relationships between data points.

🧠 What is Hierarchical Clustering?

Hierarchical clustering is a method of grouping data into a hierarchy of nested clusters. It creates a tree-like structure showing the relationships between data points.

Instead of requiring a fixed number of clusters upfront, it lets you explore multiple grouping possibilities.


🔀 Types of Hierarchical Clustering

1. Agglomerative (Bottom-Up)

  • Start with individual points
  • Merge closest clusters step-by-step

2. Divisive (Top-Down)

  • Start with one cluster
  • Split repeatedly
💡 Most real-world use cases rely on agglomerative clustering.

⚙️ Step-by-Step Process

  1. Calculate the distance between every pair of points
  2. Merge the two closest points
  3. Recalculate the distances between clusters
  4. Repeat until only one cluster remains
📖 Intuition

Think of this like forming friend groups. Initially, everyone is alone. Gradually, closest people form small groups, and those groups merge into bigger ones.
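To make the four steps above concrete, here is a minimal single-linkage sketch in plain Python. The points and the cluster_distance helper are invented for illustration; real projects would use SciPy or scikit-learn instead.

from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical 2-D points, purely for illustration
points = [(1, 2), (2, 2), (8, 8), (9, 8), (20, 20)]
clusters = [[i] for i in range(len(points))]  # step 0: every point is its own cluster

def cluster_distance(a, b):
    # single linkage: distance of the closest pair of points across the two clusters
    return min(dist(points[i], points[j]) for i in a for j in b)

while len(clusters) > 1:
    # find the two closest clusters...
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]))
    print("merging", clusters[i], "and", clusters[j])
    # ...merge them, then repeat until only one cluster remains
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]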



📐 Mathematical Foundations of Hierarchical Clustering

To truly understand hierarchical clustering, we need to look at the mathematics behind how similarity between data points is measured and how clusters are formed.

1. Distance Metrics

The most commonly used distance metric is Euclidean Distance:

d(x, y) = √( Σ (xi - yi)² )

Where:

  • xi = i-th coordinate of point x
  • yi = i-th coordinate of point y

Example:

Point A = (1, 2)
Point B = (4, 6)

d(A,B) = √((4-1)² + (6-2)²)
       = √(9 + 16)
       = √25
       = 5
📖 Why Distance Matters

Distance determines similarity. Smaller distance → higher similarity. This directly controls which clusters merge first.
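A quick NumPy sketch reproduces the arithmetic above (point values taken from the example):

import numpy as np

A = np.array([1, 2])
B = np.array([4, 6])

# square the coordinate differences, sum them, take the square root
d = np.sqrt(np.sum((A - B) ** 2))
print(d)                       # 5.0
print(np.linalg.norm(A - B))   # same result with NumPy's built-in norm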


2. Distance Between Clusters

Once clusters are formed, we calculate distances between clusters using linkage methods:

Single Linkage (Nearest Neighbor)

d(A, B) = min d(a, b) over all a in A, b in B

Complete Linkage (Farthest Neighbor)

d(A, B) = max d(a, b) over all a in A, b in B

Average Linkage

d(A, B) = (1 / (|A| · |B|)) Σ Σ d(a, b), summed over all a in A, b in B
📊 Interpretation

Single linkage can create long chains. Complete linkage creates compact clusters. Average linkage balances both approaches.
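As a rough sketch, all three linkage rules can be read off the pairwise distance matrix between two clusters. The clusters A and B below are made-up example data; SciPy's cdist is used only to compute every pairwise Euclidean distance.

import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
A = np.array([[1, 2], [2, 3]])
B = np.array([[8, 8], [9, 10]])

D = cdist(A, B)                 # |A| x |B| matrix of pairwise Euclidean distances

print("single  :", D.min())    # nearest pair of points
print("complete:", D.max())    # farthest pair of points
print("average :", D.mean())   # mean over every pair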


3. Cluster Merge Criterion

At each step, hierarchical clustering merges the two clusters whose distance is smallest:

(A*, B*) = argmin over all cluster pairs (A, B) of d(A, B)

This greedy strategy always merges the closest pair of clusters first.
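A minimal sketch of this greedy selection, assuming a hypothetical symmetric matrix D of distances between the current clusters:

import numpy as np

# Hypothetical distances between three current clusters
D = np.array([[0.0, 2.0, 6.0],
              [2.0, 0.0, 5.0],
              [6.0, 5.0, 0.0]])

np.fill_diagonal(D, np.inf)            # a cluster is never merged with itself
i, j = np.unravel_index(np.argmin(D), D.shape)
print("merge clusters", i, "and", j)   # the argmin pair: clusters 0 and 1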


4. Dendrogram Height Meaning

The height at which two clusters merge represents the distance between them:

Height ∝ Dissimilarity

Larger height → clusters are very different
Smaller height → clusters are very similar

💡 Key Insight: Cutting the dendrogram at a certain height determines the final number of clusters.
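A small SciPy sketch of "cutting" the tree at a chosen height; the data points and the cut-off distance of 5 are invented for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: three loose groups of 2-D points
X = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [20, 20], [21, 19]])

Z = linkage(X, method="average")   # full merge history (one row per merge)

# cut the tree at height 5: merges that happen above that distance are undone
labels = fcluster(Z, t=5, criterion="distance")
print(labels)                      # e.g. [1 1 2 2 3 3]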



🌳 Understanding the Dendrogram

A dendrogram is a tree diagram showing how clusters merge.

  • Bottom → individual points
  • Top → one big cluster
  • Height → distance of merging
💡 Cutting the dendrogram at different heights gives different cluster counts.
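To actually draw a dendrogram, a common approach is SciPy's dendrogram function together with Matplotlib; the sample points below are hypothetical:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [20, 20], [21, 19]])

Z = linkage(X, method="average")   # merge history of the points in X
dendrogram(Z)                      # individual points at the bottom, one root at the top
plt.ylabel("merge distance")       # height = dissimilarity at which clusters join
plt.show()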

💻 Code Example

A minimal runnable sketch using scikit-learn; the sample data array here is made up for illustration:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical feature matrix: six points forming three size groups
data = np.array([[1], [2], [10], [11], [50], [52]])

# Agglomerative (bottom-up) clustering into 3 clusters (default 'ward' linkage)
model = AgglomerativeClustering(n_clusters=3)
model.fit(data)

print(model.labels_)

🖥️ CLI Output Sample

Cluster Labels:
[0 0 1 1 2 2]

Cluster 0 → Similar small animals
Cluster 1 → Medium animals
Cluster 2 → Large animals
📂 CLI Explanation

Each number represents a cluster assignment. Points with the same label belong to the same group.


🌍 Applications

  • Customer Segmentation
  • Gene Analysis
  • Document Clustering
  • Market Research

⚖️ Pros & Cons

Advantages

  • No need to predefine clusters
  • Flexible distance metrics
  • Easy visualization

Disadvantages

  • Slow for large datasets
  • Sensitive to noise
  • Cannot undo merges

🎯 Key Takeaways

  • Builds a hierarchy of clusters
  • Uses distance to measure similarity
  • Dendrogram helps visualize structure
  • Flexible but computationally expensive

📌 Final Thoughts

Hierarchical clustering is like building a family tree of data. It helps you understand relationships step-by-step rather than forcing a rigid grouping.

If you're exploring data and want flexibility with strong visual interpretation, this method is incredibly powerful.
