Sunday, March 16, 2025

The Levenshtein Algorithm: How Computers Fix Your Typos

✏️ Levenshtein Algorithm – Complete Beginner Friendly Guide

Have you ever typed something wrong and still got the correct suggestion? That’s the magic of the Levenshtein Algorithm.

This guide will take you from basic understanding → math → implementation → real-world usage in a very simple and practical way.

🧠 What is the Levenshtein Algorithm?

The Levenshtein algorithm measures how different two words are.

👉 It counts the minimum number of edits needed to convert one word into another.

Allowed operations:

Insert a character
Delete a character
Replace a character

📌 Simple Example

"kitten" → "sitting"

k → s (replace)
e → i (replace)
add g (insert)

Distance = 3

📐 Mathematical Explanation (Easy)

The Levenshtein distance is calculated using this formula:

\[ D(i, j) = \begin{cases} i & \text{if } j = 0 \\ j & \text{if } i = 0 \\ \min\bigl\{ D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + \text{cost} \bigr\} & \text{otherwise} \end{cases} \]

Simple Meaning:

If one word is empty → distance = length of other word
Otherwise → take minimum of:
- Delete
- Insert
- Replace

💡 cost = 0 if letters match, otherwise 1
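
The formula can be transcribed almost line for line into a memoized recursive function. This is only a minimal sketch to show how the three cases map to code: the name `lev_recursive` is made up for this example, and `functools.lru_cache` is used so each (i, j) pair is computed once. For long strings, Python's recursion limit makes the table-based version a better fit.

```python
from functools import lru_cache


def lev_recursive(a: str, b: str) -> int:
    """Levenshtein distance, written to mirror the recurrence D(i, j)."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # Base cases: one prefix is empty, so the distance is the other's length.
        if j == 0:
            return i
        if i == 0:
            return j
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            d(i - 1, j) + 1,         # delete a[i-1]
            d(i, j - 1) + 1,         # insert b[j-1]
            d(i - 1, j - 1) + cost,  # replace (or keep, when cost is 0)
        )

    return d(len(a), len(b))


print(lev_recursive("kitten", "sitting"))  # 3
```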

📊 Matrix Method Explained

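We fill in a table (matrix) cell by cell. The example below compares "cat" (rows) with "cut" (columns); each cell holds the distance between the prefixes that end at that row and column.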

		c	u	t
	0	1	2	3
c	1	0	1	2
a	2	1	1	2
t	3	2	2	1

Final answer is bottom-right cell → 1
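
To check the table by hand, it helps to print the whole matrix instead of only the final cell. The sketch below assumes the table compares "cat" (rows) with "cut" (columns), as the row and column labels suggest; the helper name `levenshtein_matrix` is chosen just for this example.

```python
def levenshtein_matrix(a: str, b: str) -> list[list[int]]:
    """Return the full DP table; rows follow `a`, columns follow `b`."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = i  # deleting i characters to reach the empty string
    for j in range(cols):
        dp[0][j] = j  # inserting j characters starting from the empty string
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,
                dp[i][j - 1] + 1,
                dp[i - 1][j - 1] + cost,
            )
    return dp


table = levenshtein_matrix("cat", "cut")
# The leading space labels the row for the empty prefix.
for label, row in zip(" cat", table):
    print(label, row)
```

The last value of the last row is the distance between the full words.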

💻 Code Example (Python)


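```python
def levenshtein(a, b):
    # dp[i][j] = distance between the first i chars of a and the first j chars of b
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]

    # Base cases: converting to or from an empty string
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # delete
                dp[i][j - 1] + 1,        # insert
                dp[i - 1][j - 1] + cost  # replace (or match)
            )
    return dp[-1][-1]


print(levenshtein("kitten", "sitting"))
```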

🖥️ CLI Output

Input: kitten, sitting
Output: 3

🌍 Real-World Applications

Spell Check – Suggest correct words
Search Engines – Handle typos
DNA Analysis – Compare sequences
Plagiarism Detection – Find similarity
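
As a rough illustration of the spell-check use case, the sketch below ranks words from a small vocabulary by their distance to a misspelled input. The vocabulary, the `suggest` helper, and the `max_distance` cutoff of 2 are all assumptions made for this example, not how any particular spell checker works. The distance function keeps only two rows of the table at a time, which gives the same result with less memory.

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row Levenshtein distance (same result as the full table)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(word: str, vocabulary: list[str], max_distance: int = 2) -> list[str]:
    """Return vocabulary words within max_distance edits, closest first."""
    scored = [(levenshtein(word, v), v) for v in vocabulary]
    return [v for dist, v in sorted(scored) if dist <= max_distance]


# Hypothetical toy vocabulary, for illustration only.
vocab = ["kitten", "sitting", "mitten", "bitten", "written"]
print(suggest("kiten", vocab))
```

Ties are broken alphabetically here because the (distance, word) pairs are sorted as tuples.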

⚠️ Limitations

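Does not account for how likely a given typo is (for example, adjacent-key slips)
Treats every edit as equally costly
Needs O(m × n) time per pair of strings, so comparing against a large dictionary is slow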
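
One common way around the "all edits are equal" limitation is to give each operation its own cost. The sketch below is a weighted variant, not the standard Levenshtein distance: the function name, the cost parameters, and the tiny `NEIGHBORS` set of adjacent keys are hypothetical choices for illustration.

```python
from typing import Callable, Optional


def weighted_levenshtein(
    a: str,
    b: str,
    insert_cost: float = 1.0,
    delete_cost: float = 1.0,
    substitute_cost: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Edit distance where each operation can carry its own cost."""
    if substitute_cost is None:
        # Default behaviour matches plain Levenshtein.
        substitute_cost = lambda x, y: 0.0 if x == y else 1.0
    prev = [j * insert_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * delete_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + delete_cost,
                curr[j - 1] + insert_cost,
                prev[j - 1] + substitute_cost(ca, cb),
            ))
        prev = curr
    return prev[-1]


# Hypothetical example: adjacent-key substitutions count as half an edit.
NEIGHBORS = {("a", "s"), ("s", "a"), ("n", "m"), ("m", "n")}


def keyboard_cost(x: str, y: str) -> float:
    if x == y:
        return 0.0
    return 0.5 if (x, y) in NEIGHBORS else 1.0


print(weighted_levenshtein("cat", "cst", substitute_cost=keyboard_cost))
print(weighted_levenshtein("cat", "cut", substitute_cost=keyboard_cost))
```

With these made-up weights, "cst" ends up closer to "cat" than "cut" does, because "a" and "s" are listed as neighbouring keys.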

💡 Key Takeaways

Measures difference between strings
Uses insert, delete, replace operations
Based on dynamic programming
Widely used in real-world systems

🎯 Final Thoughts

The Levenshtein algorithm is simple but incredibly powerful. It helps machines understand human errors and fix them intelligently.

From autocorrect to search engines—it plays a key role in making technology more user-friendly.

Monday, December 2, 2024

Matching Products to Advertisements Using Machine Learning

🧠 Matching Advertisements to Products Using Machine Learning

📑 Table of Contents

Introduction
Understanding the Problem
Challenges
Solution Overview
Data Preprocessing
Similarity Measurement
ML Pipeline
CLI Output & Examples
Common Issues & Fixes
Key Takeaways
Related Articles

🚀 Introduction

Imagine handling thousands of product listings and advertisements daily. The ability to automatically match ads to the correct products is not just convenient—it’s essential for modern digital platforms.

💡 Goal: Match each advertisement to the most relevant product—or return no match if similarity is low.

🔍 Understanding the Problem

Inputs

Product Data (title, description, images)
Advertisement Data (title, description, images)

Outputs

(ad_id, product_id)
(ad_id, None)

If no product crosses the similarity threshold, we explicitly return None to avoid incorrect mapping.
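
The output contract can be sketched as a small function that takes the similarity scores for one ad and returns either `(ad_id, product_id)` or `(ad_id, None)`. The name `pick_match`, the dictionary input, and the default threshold of 0.75 are assumptions for illustration; the right threshold depends on your data.

```python
from typing import Optional


def pick_match(
    ad_id: int,
    scores: dict[int, float],
    threshold: float = 0.75,
) -> tuple[int, Optional[int]]:
    """Return (ad_id, product_id) for the best product, or (ad_id, None)."""
    if not scores:
        return (ad_id, None)
    best_product = max(scores, key=scores.get)
    if scores[best_product] >= threshold:
        return (ad_id, best_product)
    # No product is similar enough, so report an explicit non-match.
    return (ad_id, None)


# Hypothetical scores keyed by product_id.
print(pick_match(101, {55: 0.87, 12: 0.41}))
print(pick_match(102, {55: 0.32, 12: 0.20}))
```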

⚠️ Challenges

Data inconsistency – informal vs formal descriptions
Multimodal data – text + images
Scalability – large datasets
Threshold tuning – subjective similarity cutoffs

💡 In practice, a poorly chosen threshold can hurt results as much as a weak model.

🧩 Solution Overview

Preprocess data
Compute similarity
Match and filter

🛠 Data Preprocessing

Text Processing

Tokenization
Lowercasing
Stopword removal
Lemmatization
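
A minimal sketch of the first three steps, using only the standard library. The stopword list is a tiny made-up sample, and lemmatization is left out on purpose because it needs a language resource such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "and", "or", "for", "with", "of", "in", "to"}


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on letters and digits, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("The NEW Wireless Headphones with Noise-Cancelling!"))
```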

Embedding Techniques

TF-IDF
Word2Vec
BERT / Sentence-BERT
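
To give a feel for the simplest option, here is a from-scratch TF-IDF sketch on already tokenized text. It uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is one common variant and an assumption here; library implementations differ in smoothing and normalization, so treat the numbers as illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Build simple TF-IDF vectors from tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF, so terms present in every document keep a small weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero for empty documents
        vectors.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, vectors


# Hypothetical product titles, already tokenized.
docs = [
    ["red", "running", "shoes"],
    ["blue", "running", "shoes"],
    ["red", "coffee", "mug"],
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(w, 3) for w in vectors[2]])
```

Rare terms such as "coffee" get a higher weight than terms shared across documents such as "red".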

Image Processing

Use CNN models (ResNet, EfficientNet)
Extract feature vectors

📐 Similarity Measurement

Cosine Similarity Formula

cos(θ) = (A · B) / (||A|| ||B||)

This measures how similar two vectors are based on angle rather than magnitude.
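
The formula translates directly into a few lines of plain Python. Returning 0.0 when either vector is all zeros is a convention chosen for this sketch, since the formula itself is undefined in that case.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (A . B) / (||A|| ||B||); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal, 0.0
```

The first pair differs only in length, not direction, which is why the score is (up to floating-point rounding) 1.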

Euclidean Distance

d = √(Σ (xi - yi)^2)
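
For comparison, the same kind of sketch for Euclidean distance. The second call shows the practical difference from cosine similarity: scaling a vector changes the distance even though the direction stays the same.

```python
import math


def euclidean_distance(x: list[float], y: list[float]) -> float:
    """d = sqrt(sum((x_i - y_i)^2))."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
# Same direction, different length: the distance is not zero.
print(euclidean_distance([1.0, 2.0], [2.0, 4.0]))
```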

Multimodal Combination

Final Score = w1 * text_similarity + w2 * image_similarity

📖 Why Combine Modalities?

Text alone may miss visual similarity. Images alone may miss context. Together, they give better accuracy.
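
The weighted combination can be sketched as below. The default weights of 0.6 and 0.4 are placeholders, not recommended values, and dividing by the sum of the weights is an extra step added here so the result stays on the same scale as the inputs.

```python
def combined_score(
    text_similarity: float,
    image_similarity: float,
    w_text: float = 0.6,
    w_image: float = 0.4,
) -> float:
    """Final Score = w1 * text_similarity + w2 * image_similarity."""
    total = w_text + w_image
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize so the score stays comparable when the weights change.
    return (w_text * text_similarity + w_image * image_similarity) / total


print(round(combined_score(0.9, 0.5), 2))
```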

⚙️ Machine Learning Pipeline

Feature extraction (text + image)
Embedding storage
Similarity computation
Ranking + threshold filtering

💡 Precompute embeddings to improve speed drastically.
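
The four pipeline steps can be sketched end to end with plain lists standing in for real embeddings. Everything here is hypothetical: the 3-dimensional vectors, the IDs, the `match_ads` name, and the 0.75 threshold. The point is the shape of the loop: product vectors are normalized once up front, then reused for every ad.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def match_ads(ad_vectors: dict, product_vectors: dict, threshold: float = 0.75) -> list:
    # Precompute unit-length product vectors once; reuse them for every ad.
    products = {pid: normalize(v) for pid, v in product_vectors.items()}
    results = []
    for ad_id, ad_vec in ad_vectors.items():
        ad_unit = normalize(ad_vec)
        # For unit vectors, the dot product equals cosine similarity.
        ranked = sorted(
            ((sum(a * p for a, p in zip(ad_unit, vec)), pid)
             for pid, vec in products.items()),
            reverse=True,
        )
        best_score, best_pid = ranked[0] if ranked else (0.0, None)
        results.append((ad_id, best_pid if best_score >= threshold else None))
    return results


# Hypothetical 3-dimensional embeddings, for illustration only.
products = {55: [0.9, 0.1, 0.0], 56: [0.0, 1.0, 0.2]}
ads = {101: [1.0, 0.2, 0.0], 102: [0.5, 0.5, 0.7]}
print(match_ads(ads, products))
```

In a real system the embeddings would come from the text and image models and would be stored rather than recomputed on each run.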

💻 Code Example

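from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Example inputs (placeholders; replace with real ad and product text)
ad_text = "Lightweight running shoes, breathable mesh, on sale now"
product_text = "Men's breathable mesh running shoes"

# Load a pretrained sentence-embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode each text into a fixed-length vector
ad_embedding = model.encode(ad_text)
product_embedding = model.encode(product_text)

# cosine_similarity expects 2D inputs and returns a 1x1 matrix here
score = cosine_similarity([ad_embedding], [product_embedding])[0][0]
print(f"Similarity: {score:.2f}")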

🖥 CLI Output Example

Processing Ads...
Embedding Generated ✔
Calculating Similarity...

Ad 101 → Product 55 (Score: 0.87)
Ad 102 → None (Score: 0.32)

Scores above threshold (e.g., 0.75) are accepted. Others are rejected to avoid false matches.

🚧 Common Issues & Solutions

1. Data Imbalance

Use precision/recall instead of accuracy.
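
A small sketch of why this matters: when most ads have no matching product, always answering "None" already scores well on accuracy. The labels below are invented, and the `precision_recall` helper counts a prediction as correct only when the predicted product equals the true one.

```python
def precision_recall(predicted: dict, actual: dict) -> tuple[float, float]:
    """Precision and recall over ad -> product matches (None means no match)."""
    true_pos = sum(
        1 for ad, p in predicted.items() if p is not None and actual.get(ad) == p
    )
    predicted_pos = sum(1 for p in predicted.values() if p is not None)
    actual_pos = sum(1 for p in actual.values() if p is not None)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical labels: most ads have no matching product (imbalanced data).
actual = {1: 55, 2: None, 3: None, 4: None, 5: 60}
predicted = {1: 55, 2: None, 3: 57, 4: None, 5: None}
print(precision_recall(predicted, actual))
```

Here three of five predictions are correct, yet only half of the proposed matches are right and only half of the true matches are found.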

2. Noisy Data

Apply spell correction and filtering.

3. Performance

Use FAISS for fast nearest neighbor search.

4. Threshold Problems

Use validation data for tuning.
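
One simple way to do this is to try a handful of candidate thresholds on labeled validation pairs and keep the one with the best F1 score. The scores, labels, and candidate list below are invented for illustration, and F1 is only one possible selection criterion; if false matches are costlier than missed ones, you may prefer to optimize precision instead.

```python
def best_threshold(scores: list, labels: list, candidates: list) -> tuple:
    """Pick the candidate threshold with the highest F1 on validation pairs."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical validation data: similarity score and whether the pair truly matches.
val_scores = [0.91, 0.87, 0.78, 0.70, 0.55, 0.32]
val_labels = [True, True, True, False, True, False]
print(best_threshold(val_scores, val_labels, [0.5, 0.6, 0.75, 0.9]))
```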

🎯 Key Takeaways

Multimodal learning improves accuracy
Embeddings are the foundation
Threshold tuning is critical
Scalability requires smart indexing

📌 Final Thoughts

Matching advertisements to products is not just a machine learning task—it’s a system design challenge. The best solutions combine strong modeling, efficient computation, and continuous evaluation.

If implemented correctly, this approach can significantly improve automation, reduce manual effort, and enhance user experience in any data-driven platform.
