Showing posts with label Text Similarity. Show all posts
Showing posts with label Text Similarity. Show all posts

Sunday, March 16, 2025

The Levenshtein Algorithm: How Computers Fix Your Typos




Levenshtein Algorithm Explained – Beginner to Advanced Guide

✏️ Levenshtein Algorithm – Complete Beginner Friendly Guide

Have you ever typed something wrong and still got the correct suggestion? That’s the magic of the Levenshtein Algorithm.

This guide will take you from basic understanding → math → implementation → real-world usage in a very simple and practical way.


๐Ÿ“š Table of Contents


๐Ÿง  What is the Levenshtein Algorithm?

The Levenshtein algorithm measures how different two words are.

๐Ÿ‘‰ It counts the minimum number of edits needed to convert one word into another.

Allowed operations:

  • Insert a character
  • Delete a character
  • Replace a character

๐Ÿ“Œ Simple Example

"kitten" → "sitting"

  • k → s (replace)
  • e → i (replace)
  • add g (insert)

Distance = 3


๐Ÿ“ Mathematical Explanation (Easy)

The Levenshtein distance is calculated using this formula:

\[ D(i, j) = \begin{cases} i & \text{if } j = 0 \\ j & \text{if } i = 0 \\ \min \begin{cases} D(i-1, j) + 1 \\ D(i, j-1) + 1 \\ D(i-1, j-1) + cost \end{cases} \end{cases} \]

Simple Meaning:

  • If one word is empty → distance = length of other word
  • Otherwise → take minimum of:
    • Delete
    • Insert
    • Replace
๐Ÿ’ก cost = 0 if letters match, otherwise 1

๐Ÿ“Š Matrix Method Explained

We use a table (matrix) to compute distances step-by-step.

cut
0123
c1012
a2112
t3221

Final answer is bottom-right cell → 1


๐Ÿ’ป Code Example (Python)

def levenshtein(a, b): dp = [[0]*(len(b)+1) for _ in range(len(a)+1)] ``` for i in range(len(a)+1): dp[i][0] = i for j in range(len(b)+1): dp[0][j] = j for i in range(1, len(a)+1): for j in range(1, len(b)+1): cost = 0 if a[i-1] == b[j-1] else 1 dp[i][j] = min( dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost ) return dp[-1][-1] ``` print(levenshtein("kitten", "sitting"))

๐Ÿ–ฅ️ CLI Output

Click to View Output
Input: kitten, sitting
Output: 3

๐ŸŒ Real-World Applications

  • Spell Check – Suggest correct words
  • Search Engines – Handle typos
  • DNA Analysis – Compare sequences
  • Plagiarism Detection – Find similarity

⚠️ Limitations

  • Does not consider typo frequency
  • All edits treated equally
  • Slow for large datasets

๐Ÿ’ก Key Takeaways

  • Measures difference between strings
  • Uses insert, delete, replace operations
  • Based on dynamic programming
  • Widely used in real-world systems

๐ŸŽฏ Final Thoughts

The Levenshtein algorithm is simple but incredibly powerful. It helps machines understand human errors and fix them intelligently.

From autocorrect to search engines—it plays a key role in making technology more user-friendly.

Monday, December 2, 2024

Matching Products to Advertisements Using Machine Learning


Matching Advertisements to Products Using Machine Learning

๐Ÿง  Matching Advertisements to Products Using Machine Learning

๐Ÿ“‘ Table of Contents


๐Ÿš€ Introduction

Imagine handling thousands of product listings and advertisements daily. The ability to automatically match ads to the correct products is not just convenient—it’s essential for modern digital platforms.

๐Ÿ’ก Goal: Match each advertisement to the most relevant product—or return no match if similarity is low.

๐Ÿ” Understanding the Problem

Inputs

  • Product Data (title, description, images)
  • Advertisement Data (title, description, images)

Outputs

(ad_id, product_id)
(ad_id, None)
๐Ÿ“– Expand Explanation

If no product crosses the similarity threshold, we explicitly return None to avoid incorrect mapping.


⚠️ Challenges

  • Data inconsistency – informal vs formal descriptions
  • Multimodal data – text + images
  • Scalability – large datasets
  • Threshold tuning – subjective similarity cutoffs
๐Ÿ’ก Real-world systems fail more due to bad thresholds than bad models.

๐Ÿงฉ Solution Overview

  1. Preprocess data
  2. Compute similarity
  3. Match and filter

๐Ÿ›  Data Preprocessing

Text Processing

  • Tokenization
  • Lowercasing
  • Stopword removal
  • Lemmatization

Embedding Techniques

  • TF-IDF
  • Word2Vec
  • BERT / Sentence-BERT

Image Processing

  • Use CNN models (ResNet, EfficientNet)
  • Extract feature vectors

๐Ÿ“ Similarity Measurement

Cosine Similarity Formula

cos(ฮธ) = (A · B) / (||A|| ||B||)

This measures how similar two vectors are based on angle rather than magnitude.

Euclidean Distance

d = √(ฮฃ (xi - yi)^2)

Multimodal Combination

Final Score = w1 * text_similarity + w2 * image_similarity
๐Ÿ“– Why Combine Modalities?

Text alone may miss visual similarity. Images alone may miss context. Together, they give better accuracy.


⚙️ Machine Learning Pipeline

  1. Feature extraction (text + image)
  2. Embedding storage
  3. Similarity computation
  4. Ranking + threshold filtering
๐Ÿ’ก Precompute embeddings to improve speed drastically.

๐Ÿ’ป Code Example

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

ad_embedding = model.encode(ad_text)
product_embedding = model.encode(product_text)

score = cosine_similarity([ad_embedding], [product_embedding])
print(score)

๐Ÿ–ฅ CLI Output Example

Processing Ads...
Embedding Generated ✔
Calculating Similarity...

Ad 101 → Product 55 (Score: 0.87)
Ad 102 → None (Score: 0.32)
๐Ÿ“‚ Expand CLI Explanation

Scores above threshold (e.g., 0.75) are accepted. Others are rejected to avoid false matches.


๐Ÿšง Common Issues & Solutions

1. Data Imbalance

Use precision/recall instead of accuracy.

2. Noisy Data

Apply spell correction and filtering.

3. Performance

Use FAISS for fast nearest neighbor search.

4. Threshold Problems

Use validation data for tuning.


๐ŸŽฏ Key Takeaways

  • Multimodal learning improves accuracy
  • Embeddings are the foundation
  • Threshold tuning is critical
  • Scalability requires smart indexing

๐Ÿ“Œ Final Thoughts

Matching advertisements to products is not just a machine learning task—it’s a system design challenge. The best solutions combine strong modeling, efficient computation, and continuous evaluation.

If implemented correctly, this approach can significantly improve automation, reduce manual effort, and enhance user experience in any data-driven platform.

Featured Post

How HMT Watches Lost the Time: A Deep Dive into Disruptive Innovation Blindness in Indian Manufacturing

The Rise and Fall of HMT Watches: A Story of Brand Dominance and Disruptive Innovation Blindness The Rise and Fal...

Popular Posts