
Monday, September 30, 2024

Hierarchical Clustering Explained: How It Works



🌳 Hierarchical Clustering: A Complete Interactive Guide


🚀 Introduction

Humans naturally group things. You see animals—you categorize them: flying, swimming, walking. Machines do something very similar using clustering algorithms.

💡 Core Idea: Hierarchical clustering builds a tree of relationships between data points.

🧠 What is Hierarchical Clustering?

Hierarchical clustering is a method of grouping data into nested clusters: smaller clusters sit inside larger ones, forming a tree-like structure that shows the relationships between data points.

Instead of giving a fixed number of clusters upfront, it allows you to explore multiple grouping possibilities.


🔀 Types of Hierarchical Clustering

1. Agglomerative (Bottom-Up)

  • Start with individual points
  • Merge closest clusters step-by-step

2. Divisive (Top-Down)

  • Start with one cluster
  • Split repeatedly
💡 Most real-world use cases rely on agglomerative clustering.

⚙️ Step-by-Step Process

  1. Calculate distance between all points
  2. Merge closest points
  3. Recalculate cluster distances
  4. Repeat until one cluster remains
📖 Intuition

Think of this like forming friend groups. Initially, everyone is alone. Gradually, closest people form small groups, and those groups merge into bigger ones.
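
The four steps above can be sketched as a naive agglomerative loop in plain Python. This is an illustrative sketch, not a library implementation; the sample points and the single-linkage cluster distance are assumptions made for the demo:

```python
import math

def single_linkage(c1, c2):
    """Distance between two clusters = closest pair of points (single linkage)."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points):
    """Naive agglomerative clustering: returns the list of merges performed."""
    clusters = [[p] for p in points]      # Step 1: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 3: merge them and record the step
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # Step 4: repeat until one cluster remains (the while condition)
    return merges

steps = agglomerative([(0, 0), (0, 1), (5, 0), (5, 1)])
for left, right, dist in steps:
    print(left, "+", right, "at distance", dist)
```

For four points, three merges happen: the two close pairs first, then the two resulting clusters.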


๐Ÿ“ Mathematical Explanation

Euclidean Distance Formula

Distance = √((x2 - x1)² + (y2 - y1)²)

Example:

Point A (1,2), Point B (4,6)

Distance = √((4-1)² + (6-2)²)
         = √(9 + 16)
         = √25
         = 5
๐Ÿ“˜ Why This Matters

Distance determines similarity. Smaller distance = more similar points. This directly affects how clusters are formed.


๐Ÿ“ Mathematical Foundations of Hierarchical Clustering

To truly understand hierarchical clustering, we need to look at the mathematics behind how similarity between data points is measured and how clusters are formed.

1. Distance Metrics

The most commonly used distance metric is Euclidean Distance:

d(x, y) = √(Σ (xi - yi)²)

Where:

  • xi = coordinate of point x
  • yi = coordinate of point y

Example:

Point A = (1, 2)
Point B = (4, 6)

d(A,B) = √((4-1)² + (6-2)²)
       = √(9 + 16)
       = √25
       = 5
📖 Why Distance Matters

Distance determines similarity. Smaller distance → higher similarity. This directly controls which clusters merge first.
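
The worked example above is easy to verify in Python:

```python
import math

# Point A = (1, 2), Point B = (4, 6), as in the example above
a = (1, 2)
b = (4, 6)

distance = math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)
print(distance)  # 5.0
```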


2. Distance Between Clusters

Once clusters are formed, we calculate distances between clusters using linkage methods:

Single Linkage (Nearest Neighbor)

d(A, B) = min { d(a, b) : a ∈ A, b ∈ B }

Complete Linkage (Farthest Neighbor)

d(A, B) = max { d(a, b) : a ∈ A, b ∈ B }

Average Linkage

d(A, B) = (1 / (|A|·|B|)) Σ_{a∈A} Σ_{b∈B} d(a, b)
📊 Interpretation

Single linkage can create long chains. Complete linkage creates compact clusters. Average linkage balances both approaches.
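
One way to see the difference is to run SciPy's `linkage` with each method on the same small dataset and compare the height of the final merge. The points here are an arbitrary toy example chosen for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Two tight pairs plus one distant point
X = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.], [10., 0.]])

final_merge = {}
for method in ('single', 'complete', 'average'):
    Z = linkage(pdist(X), method=method)
    final_merge[method] = Z[-1, 2]  # height of the last (topmost) merge
    print(method, final_merge[method])
```

Single linkage tops out at the nearest-neighbor gap, complete linkage at the farthest pair, and average linkage in between.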


3. Cluster Merge Criterion

At each step, hierarchical clustering selects two clusters that minimize the distance:

(A*, B*) = argmin over all cluster pairs (A, B) of d(A, B)

This greedy strategy ensures the closest clusters are merged first.


4. Dendrogram Height Meaning

The height at which two clusters merge represents the distance between them:

Height ∝ Dissimilarity

Larger height → clusters are very different. Smaller height → clusters are very similar.

💡 Key Insight: Cutting the dendrogram at a certain height determines the final number of clusters.



🌳 Understanding the Dendrogram

A dendrogram is a tree diagram showing how clusters merge.

  • Bottom → individual points
  • Top → one big cluster
  • Height → distance of merging
💡 Cutting the dendrogram at different heights gives different cluster counts.
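
In SciPy, cutting the tree at a given height is done with `fcluster` and a distance threshold. The points below are an illustrative assumption, chosen so a low cut and a high cut give different cluster counts:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two tight pairs plus one distant point
X = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.], [10., 0.]])
Z = linkage(pdist(X), method='single')

low = fcluster(Z, t=2.0, criterion='distance')   # cut low on the tree -> more clusters
high = fcluster(Z, t=6.0, criterion='distance')  # cut high on the tree -> fewer clusters
print(low, high)
```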

💻 Code Example

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy dataset: three well-separated groups of 2-D points
data = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])

model = AgglomerativeClustering(n_clusters=3)
model.fit(data)

print(model.labels_)

🖥 CLI Output Sample

Cluster Labels:
[0, 0, 1, 1, 2, 2]

Cluster 0 → Similar small animals
Cluster 1 → Medium animals
Cluster 2 → Large animals
📂 CLI Explanation

Each number represents a cluster assignment. Points with the same label belong to the same group.


๐ŸŒ Applications

  • Customer Segmentation
  • Gene Analysis
  • Document Clustering
  • Market Research

⚖️ Pros & Cons

Advantages

  • No need to predefine clusters
  • Flexible distance metrics
  • Easy visualization

Disadvantages

  • Slow for large datasets
  • Sensitive to noise
  • Cannot undo merges

🎯 Key Takeaways

  • Builds a hierarchy of clusters
  • Uses distance to measure similarity
  • Dendrogram helps visualize structure
  • Flexible but computationally expensive

📌 Final Thoughts

Hierarchical clustering is like building a family tree of data. It helps you understand relationships step-by-step rather than forcing a rigid grouping.

If you're exploring data and want flexibility with strong visual interpretation, this method is incredibly powerful.

A Beginner's Guide to Dendrograms: Visualizing Data Clustering


If you’ve ever tried to organize a messy room, grouping similar things together—like books, clothes, or toys—you’ve already got a basic sense of what clustering is. In the world of data science, we do something similar, but instead of books and clothes, we’re grouping data points. One tool that helps us visualize this process is called a **dendrogram**. 

Don’t worry—it’s not as complicated as it sounds! By the end of this post, you’ll not only understand what a dendrogram is but also how it helps in clustering, using a simple example.

### What Is a Dendrogram?

A dendrogram is a **tree-like diagram** that shows the relationships between different items in a dataset. Picture it as a family tree, but instead of showing how people are related, it shows how data points are similar or different.

At the bottom of the dendrogram, each item (or data point) starts out as its own individual “leaf.” As we move up, the leaves start merging into branches based on their similarity. The closer two data points are, the earlier (or lower) they get grouped together on the tree.

### Why Do We Use Dendrograms in Clustering?

Clustering is all about grouping data points that are similar to each other. A dendrogram is a way to visualize this process. It helps us see which data points are grouped together and how far apart different groups are. Essentially, it tells the story of how we could split or group our data at different levels.

This type of clustering is called **hierarchical clustering**, and the dendrogram is its visual representation.

### Simple Example: Grouping Books

Let’s imagine we have five books, and we want to group them based on their genres:

- Book 1: Fiction
- Book 2: Fiction
- Book 3: History
- Book 4: History
- Book 5: Science

Here’s how we can think about clustering them and how a dendrogram would help.

1. **Step 1: Start with individual books.**
   At the beginning, every book is on its own. This would be the bottom of the dendrogram. So, we have five separate leaves, one for each book.

2. **Step 2: Group similar books first.**
   The next step is to find the most similar books. In this case, we can first group the two fiction books (Book 1 and Book 2) because they are the same genre. Similarly, we can group the two history books (Book 3 and Book 4). Now, we have three groups: two fiction books, two history books, and one science book (Book 5 is still by itself).

3. **Step 3: Combine larger groups.**
   Now we can start looking at the bigger picture. The two history books are closer to the science book than to the fiction books (because history and science are both non-fiction). So, we can group the history books and the science book next.

4. **Step 4: Final grouping.**
   Finally, we combine everything together. The fiction books get merged with the non-fiction group at the very top of the dendrogram.

The final dendrogram will look like a tree where the fiction books are on one branch, and the history and science books are on another. The height of the branches shows how similar or different the books are. Books that are closer in genre (fiction vs. non-fiction) will be joined at lower points in the dendrogram.

### How to Read a Dendrogram

The height at which two data points (or clusters) merge in a dendrogram tells us how similar they are. A low height means they are very similar, while a higher height means they are more different. 

In our book example:
- The two fiction books are grouped at a very low level (because they are almost identical).
- The history and science books are grouped higher up, showing they are less similar than the fiction books are to each other but more similar to each other than to the fiction books.
  
This is helpful when you’re trying to figure out how many clusters or groups you want to keep. If you only care about the biggest differences, you might cut the dendrogram at a higher level, resulting in just two groups: fiction and non-fiction. If you care about more detail, you might cut it lower and get three groups: fiction, history, and science.

### Benefits of Using a Dendrogram

1. **Visual clarity:** Dendrograms give you a clear picture of how your data is organized and how similar different items are.
2. **Flexibility:** You can decide how many clusters you want by cutting the dendrogram at different levels. This is great because you can adjust based on how detailed you want your clustering to be.
3. **Hierarchy:** Dendrograms help you understand the relationships between clusters. For example, you can see if two small clusters should be combined into a larger one.

### Final Thoughts

Dendrograms might seem intimidating at first, but they’re essentially just a visual way of showing how things are grouped together based on similarity. Whether you’re organizing books or analyzing complex datasets, they can be a powerful tool for making sense of the data.

Next time you’re faced with a clustering problem, think of a dendrogram as your guide, showing you how different data points come together and helping you decide the best way to group them. It’s like building a family tree for your data!

### **Efficiently Extracting Dendrogram Ordering from Linkage Matrix**  

Dendrograms are an essential tool in **hierarchical clustering**, visually representing how data points are merged into clusters. In Python, the `dendrogram()` function from `scipy.cluster.hierarchy` is often used to extract the ordering of points after clustering. However, as datasets grow larger, using `dendrogram()` can become computationally expensive and even lead to errors.  



### **Understanding the Linkage Matrix**  
Let's start by generating a **random dataset** with five points and performing hierarchical clustering:


import numpy as np
from scipy.cluster.hierarchy import linkage

# Generate 5 random points with 3 features each
x = np.random.rand(5, 3)

# Compute the pairwise distances and linkage matrix
Z = linkage(x, method='single')
print(Z)


The **linkage matrix (Z)** stores the hierarchical clustering information in four columns:
1. **Index of the first merged cluster**
2. **Index of the second merged cluster**
3. **Distance between the merged clusters**
4. **Number of points in the new cluster**

The last row corresponds to the final merging step that joins all points into one cluster.
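
With a fixed one-dimensional toy input (an assumption for illustration; the article's own example uses random data), the four columns are easy to read off:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three 1-D points chosen so the merge order is predictable
x = np.array([[0.0], [1.0], [10.0]])
Z = linkage(x, method='single')
print(Z)
# First row: leaves 0 and 1 merge at distance 1.0 into a 2-point cluster (index 3);
# second row: that cluster merges with leaf 2 at distance 9.0 into a 3-point cluster.
```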

---

### **Extracting Dendrogram Ordering Without `dendrogram()`**  
We need to **traverse the linkage matrix** and **recover the order of points** as they appear in the final hierarchical clustering.


def get_dendrogram_order(Z, num_points):
    clusters = {i: [i] for i in range(num_points)} # Initialize clusters
    
    for row in Z:
        c1, c2 = int(row[0]), int(row[1]) # Extract cluster indices
        clusters[num_points] = clusters[c1] + clusters[c2] # Merge clusters
        del clusters[c1], clusters[c2] # Remove old clusters
        num_points += 1 # Increment cluster index
    
    return [str(i) for i in clusters[max(clusters)]]

# Number of initial points
num_points = 5

# Compute ordering from linkage matrix
ordering = get_dendrogram_order(Z, num_points)
print(ordering)


This function **iteratively merges clusters** following the linkage matrix and returns the final left-to-right order of leaves. The output should match `dendrogram(Z)['ivl']`.
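
A quick sanity check is to compare the function's output against the leaf order SciPy itself computes; `dendrogram(..., no_plot=True)` returns the layout (including `'ivl'`) without drawing anything. The function is repeated here so the snippet runs standalone, with a seeded random dataset as an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

def get_dendrogram_order(Z, num_points):
    clusters = {i: [i] for i in range(num_points)}
    for row in Z:
        c1, c2 = int(row[0]), int(row[1])
        clusters[num_points] = clusters[c1] + clusters[c2]
        del clusters[c1], clusters[c2]
        num_points += 1
    return [str(i) for i in clusters[max(clusters)]]

np.random.seed(0)                      # seeded so the run is reproducible
x = np.random.rand(5, 3)
Z = linkage(x, method='single')

order = get_dendrogram_order(Z, 5)
reference = dendrogram(Z, no_plot=True)['ivl']  # leaf order without plotting
print(order, reference)
```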

---

### **How It Works**  
1. We initialize **each point as its own cluster** in a dictionary.
2. We **iterate over the linkage matrix**, merging clusters step by step.
3. Each **new cluster** is stored in the dictionary with a new index.
4. Finally, we extract the **final cluster’s leaf order**.

This method **avoids calling `dendrogram()`**, making it faster and more efficient for large datasets.

---

### **Conclusion**  
If you're dealing with **large datasets** and need dendrogram ordering efficiently, this approach provides a scalable alternative to `dendrogram()`. It directly processes the **linkage matrix** to recover the **correct order** of points. By bypassing the plotting function, it improves **performance** and **avoids errors** in large-scale clustering problems.

### **Reordering a Dendrogram: Moving an Outlier Branch**  

Dendrograms are a fantastic way to visualize hierarchical clustering, but sometimes their default order does not match our expectations. A common challenge is dealing with **outlier branches**—clusters that appear on one side of the dendrogram but would make more sense if placed elsewhere.  

One way to address this is by manually **cutting and merging** branches, but this can be tedious. A better approach is using **reorder functions**, which allow for a more structured way of reordering the dendrogram without breaking its hierarchical structure.  

---

### **Understanding `reorder.dendrogram`**  

The R function `reorder.dendrogram` (from the base `stats` package) is designed to **rearrange the leaves of a dendrogram** based on a given criterion. This could be anything from cluster size to a specific numeric value associated with each branch. The challenge is knowing **which parameters to set** to achieve the desired reordering.

If you want to move an **outlier branch** to the other side, you need to define a reordering criterion that reflects its desired position.  

---

### **How to Reorder the Dendrogram**  

Here’s a structured way to think about the problem:  

1. **Identify the Outlier Branch**  
   - Which cluster appears in the "wrong" place?  
   - Does it have an unusually high or low distance compared to other branches?  

2. **Define the Reordering Criterion**  
   - Do you want to order by cluster size?  
   - Do you have a predefined sequence in mind?  
   - Should the reordering follow a particular numeric variable?  

3. **Apply Reordering**  
   - Instead of manually cutting and merging, apply the reorder function using the chosen criterion.  

4. **Verify the New Order**  
   - Once reordered, check if the outlier branch has moved to the expected location.  
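
For readers working in Python rather than R, a related (though not identical) facility is SciPy's `optimal_leaf_ordering`, which reorders leaves to minimize the distance between adjacent leaves instead of following a user-supplied criterion. A minimal sketch with made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import pdist

np.random.seed(42)
X = np.random.rand(8, 2)          # made-up 2-D points
D = pdist(X)                      # condensed distance matrix

Z = linkage(D, method='average')
Z_ordered = optimal_leaf_ordering(Z, D)  # reorder leaves, keep the hierarchy

order = leaves_list(Z_ordered)    # left-to-right leaf order after reordering
print(order)
```

The hierarchy itself is untouched; only the left/right arrangement of branches changes, which is exactly the property that makes reordering safer than cutting and re-merging by hand.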

---

### **Why Use `reorder.dendrogram` Instead of Manual Cutting & Merging?**  

- **Preserves the hierarchical structure** of the dendrogram.  
- **Less error-prone**, as manually cutting and reattaching branches can disrupt clustering relationships.  
- **More flexible**, allowing for different reordering strategies beyond just moving a single branch.  

---

### **Final Thoughts**  

If you’re struggling to get the exact order you want, consider experimenting with different reordering criteria. Sometimes, a minor tweak in how clusters are ranked can make a big difference. Would you like suggestions on alternative ways to visualize your clusters beyond a dendrogram? Let me know!
