Thursday, September 19, 2024

How Hamming Distance Works in k-Nearest Neighbors Algorithms


What is Hamming Distance?

Hamming distance is the number of positions at which two strings of equal length differ. If the strings match at every position, the distance is zero; if they differ at every position, the distance equals the string length.

💡 Key Point: Hamming distance is ideal for binary (0/1) and categorical data since it counts exact mismatches instead of computing geometric distances.

Why Use Hamming Distance?

In machine learning:

  • Binary or categorical features cannot be meaningfully measured with Euclidean distance.
  • Hamming distance provides a simple way to quantify dissimilarity between such features.
  • It allows classification algorithms like k-NN to work with non-numeric data effectively.
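To see why Euclidean distance misleads on categorical data, consider a quick sketch (the color encoding red=0, green=1, blue=2 is an assumed example, not from the original):

```python
# Assumed label encoding for illustration: red=0, green=1, blue=2.
red, green, blue = [0], [1], [2]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Euclidean invents an ordering: blue looks "twice as different" from red
# as green does, even though all three colors are simply distinct.
print(euclidean(red, green), euclidean(red, blue))  # 1.0 2.0

# Hamming treats every mismatch equally, which matches the semantics.
print(hamming(red, green), hamming(red, blue))  # 1 1
```

The artificial ordering introduced by label encoding is exactly what Hamming distance avoids.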

How Hamming Distance Works

Consider two binary strings:

String A: 101010
String B: 100100

Step-by-step comparison:

  • Position 1: 1 vs 1 → same
  • Position 2: 0 vs 0 → same
  • Position 3: 1 vs 0 → different
  • Position 4: 0 vs 1 → different
  • Position 5: 1 vs 0 → different
  • Position 6: 0 vs 0 → same

Total differences = 3 → Hamming distance = 3

Tip: You can visualize Hamming distance by marking differences with X:
101010
100100
  X  X X
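The comparison above can be reproduced with a few lines of Python (a minimal sketch; the length check is an addition for safety):

```python
def hamming_distance(a, b):
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must be of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("101010", "100100"))  # 3
```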

Comparison with Other Distance Metrics

Different metrics suit different types of data:

Metric              Use Case             Notes
Hamming Distance    Binary/categorical   Counts differing positions exactly
Euclidean Distance  Continuous numeric   Geometric straight-line distance
Manhattan Distance  Numeric, grid-like   Sum of absolute differences
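The three metrics can be compared side by side on a small pair of vectors (a sketch with assumed sample data; note that for 0/1 vectors, Hamming and Manhattan distances coincide):

```python
import math

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Two binary vectors differing in positions 1 and 4
p, q = [1, 0, 1, 0], [0, 0, 1, 1]

print(hamming(p, q))    # 2
print(manhattan(p, q))  # 2
print(euclidean(p, q))  # sqrt(2), about 1.414
```

On binary data, Euclidean distance is just the square root of the Hamming distance, which is why the simpler metric is preferred there.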

Hamming Distance in k-Nearest Neighbors (k-NN)

The k-NN algorithm classifies a new point by comparing it to existing data points. Hamming distance identifies which points are most similar.

Step 1: Define the Data

Represent each object as a binary string, e.g., "101010". Each string encodes features such as Yes/No flags or categorical attributes.

Step 2: Calculate Distances

Compute Hamming distance from the new point to each dataset point.

Step 3: Find Nearest Neighbors

Sort all distances. Select the k smallest values to identify the nearest neighbors.

Step 4: Classify

Assign the class of the new point based on majority vote among the nearest neighbors.

Example: Classifying a New Point

Dataset:

Data Point   Class
1100         A
1010         B
1111         A
1001         B

New point: 1011

Hamming distances:

  • To 1100: 2
  • To 1010: 1
  • To 1111: 3
  • To 1001: 2

If k=2, the nearest neighbors are "1010" (distance 1, Class B) and one of the two points at distance 2. The vote is then tied at one A and one B; a common tie-breaker favors the closer neighbor, so the new point is classified as Class B. Choosing an odd k avoids such ties.

Code Example & CLI Simulation

Python implementation:

from collections import Counter

def hamming_distance(a, b):
    """Compute Hamming distance between two equal-length strings."""
    return sum(el1 != el2 for el1, el2 in zip(a, b))

data = [("1100", "A"), ("1010", "B"), ("1111", "A"), ("1001", "B")]
new_point = "1011"

# Compute the distance from the new point to every data point
distances = [(point, cls, hamming_distance(new_point, point)) for point, cls in data]
distances.sort(key=lambda x: x[2])

# Choose the k nearest neighbors
k = 2
nearest_neighbors = distances[:k]
print("Nearest Neighbors:", nearest_neighbors)

# Majority vote among the neighbors' classes
vote = Counter(cls for _, cls, _ in nearest_neighbors)
classification = vote.most_common(1)[0][0]
print("Classified as:", classification)

CLI Output Simulation:

Nearest Neighbors: [('1010', 'B', 1), ('1100', 'A', 2)]
Classified as: B
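One caveat in the run above: with k=2 the vote is tied at one A and one B, and the result depends on tie-breaking. A simple fix is to use an odd k, sketched here on the same dataset:

```python
from collections import Counter

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

data = [("1100", "A"), ("1010", "B"), ("1111", "A"), ("1001", "B")]
new_point = "1011"

# With k = 3 (odd) a two-class vote cannot tie.
k = 3
distances = sorted((hamming_distance(new_point, p), cls) for p, cls in data)
votes = Counter(cls for _, cls in distances[:k])
print(votes.most_common(1)[0][0])  # B (wins 2 votes to 1)
```

Another common option is to weight each neighbor's vote by the inverse of its distance, so closer points count for more.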

Key Takeaways

  • Hamming distance counts differing positions between equal-length strings.
  • It is ideal for binary or categorical data, unlike Euclidean or Manhattan distance.
  • In k-NN, it identifies the nearest neighbors to classify new points effectively.
  • Majority voting among neighbors determines the predicted class.
  • Understanding distance metrics is critical for selecting the right algorithm for your data type.
