Understanding Hamming Distance in k-Nearest Neighbors (k-NN)
Table of Contents
- What is Hamming Distance?
- Why Use Hamming Distance?
- How Hamming Distance Works
- Comparison with Other Distance Metrics
- Hamming Distance in k-NN
- Example: Classifying a New Point
- Code Example & CLI Simulation
- Key Takeaways
- Related Articles
What is Hamming Distance?
Hamming distance measures the number of differences between two strings of equal length. If two strings have the same characters in the same positions, their Hamming distance is zero. If every character differs, the distance equals the string length.
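This definition maps directly onto a few lines of Python (a minimal sketch; the helper name `hamming_distance` is our own choice). Because the metric is only defined for strings of equal length, the sketch rejects mismatched inputs:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("abcde", "abcde"))  # identical strings -> 0
print(hamming_distance("10101", "01010"))  # every position differs -> 5
```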
Why Use Hamming Distance?
In machine learning:
- Euclidean distance is not meaningful for binary or categorical features, whose values have no numeric magnitude or ordering.
- Hamming distance provides a simple way to quantify dissimilarity between such features.
- It allows classification algorithms like k-NN to work with non-numeric data effectively.
How Hamming Distance Works
Consider two binary strings:
String A: 101010
String B: 100100
Step-by-step comparison:
- Position 1: 1 vs 1 → same
- Position 2: 0 vs 0 → same
- Position 3: 1 vs 0 → different
- Position 4: 0 vs 1 → different
- Position 5: 1 vs 0 → different
- Position 6: 0 vs 0 → same
Total differences = 3 → Hamming distance = 3
Aligned, an X marks each differing position:

```
101010
100100
  XXX
```
Comparison with Other Distance Metrics
Different metrics suit different types of data:
| Metric | Use Case | Notes |
|---|---|---|
| Hamming Distance | Binary/Categorical | Counts differing positions exactly |
| Euclidean Distance | Continuous numeric | Geometric straight-line distance |
| Manhattan Distance | Numeric, grid-like | Sum of absolute differences |
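The three metrics in the table can be compared side by side on the same pair of vectors (a plain-Python sketch):

```python
import math

a = [1, 0, 1, 0]
b = [1, 1, 0, 0]

# Hamming: count of positions that differ (binary/categorical features)
hamming = sum(x != y for x, y in zip(a, b))
# Euclidean: straight-line distance (continuous numeric features)
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
# Manhattan: sum of absolute coordinate differences (grid-like numeric data)
manhattan = sum(abs(x - y) for x, y in zip(a, b))

print(hamming, euclidean, manhattan)  # 2 1.4142135623730951 2
```

Note that on strictly binary vectors, Hamming and Manhattan distances coincide and Euclidean distance is their square root, so the choice of metric matters most once features become categorical or continuous.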
Hamming Distance in k-Nearest Neighbors (k-NN)
The k-NN algorithm classifies a new point by comparing it to existing data points. Hamming distance identifies which points are most similar.
1. Represent each object as a binary string, e.g., "101010". Each string encodes features such as Yes/No flags or categorical attributes.
2. Compute the Hamming distance from the new point to each point in the dataset.
3. Sort the distances and select the k smallest values to identify the nearest neighbors.
4. Assign the new point the class held by the majority of those nearest neighbors.
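The steps above can also be sketched with scikit-learn rather than a hand-rolled loop (an assumption: scikit-learn is installed). `KNeighborsClassifier` accepts `metric="hamming"`, which scipy computes as the fraction of differing positions; for equal-length vectors this scaling leaves the neighbor ordering unchanged. We use k=3 here so the majority vote cannot tie:

```python
from sklearn.neighbors import KNeighborsClassifier

# Same toy dataset as the worked example below, as 0/1 feature vectors
X = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1], [1, 0, 0, 1]]
y = ["A", "B", "A", "B"]

# k=3 (odd) avoids a tied vote on this two-class dataset
clf = KNeighborsClassifier(n_neighbors=3, metric="hamming")
clf.fit(X, y)
print(clf.predict([[1, 0, 1, 1]]))  # -> ['B']
```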
Example: Classifying a New Point
Dataset:
| Data Point | Class |
|---|---|
| 1100 | A |
| 1010 | B |
| 1111 | A |
| 1001 | B |
New point: 1011
Hamming distances:
- To 1100: 2
- To 1010: 1
- To 1111: 3
- To 1001: 2
If k=2, the nearest neighbors are "1010" (Class B, distance 1) and "1100" (Class A, distance 2). The vote is tied one-to-one, so we break the tie in favor of the closer neighbor: "1010" at distance 1 wins, and the new point is classified as Class B.
Code Example & CLI Simulation
Python implementation:
```python
from collections import Counter

def hamming_distance(a, b):
    """Compute the Hamming distance between two equal-length strings."""
    return sum(el1 != el2 for el1, el2 in zip(a, b))

data = [("1100", "A"), ("1010", "B"), ("1111", "A"), ("1001", "B")]
new_point = "1011"

# Compute the distance from the new point to every data point
distances = [(point, cls, hamming_distance(new_point, point)) for point, cls in data]
distances.sort(key=lambda x: x[2])

# Choose the k nearest neighbors
k = 2
nearest_neighbors = distances[:k]
print("Nearest Neighbors:", nearest_neighbors)

# Majority vote; a tie resolves in favor of the closer neighbor,
# because Counter preserves the distance-sorted insertion order
vote = Counter(cls for _, cls, _ in nearest_neighbors)
classification = vote.most_common(1)[0][0]
print("Classified as:", classification)
```
CLI Output Simulation:

```
Nearest Neighbors: [('1010', 'B', 1), ('1100', 'A', 2)]
Classified as: B
```
Key Takeaways
- Hamming distance counts differing positions between equal-length strings.
- It is ideal for binary or categorical data, unlike Euclidean or Manhattan distance.
- In k-NN, it identifies the nearest neighbors to classify new points effectively.
- Majority voting among neighbors determines the predicted class.
- Understanding distance metrics is critical for selecting the right algorithm for your data type.