
Monday, December 2, 2024

🧠 Matching Advertisements to Products Using Machine Learning

🚀 Introduction

Imagine handling thousands of product listings and advertisements daily. The ability to automatically match ads to the correct products is not just convenient—it’s essential for modern digital platforms.

💡 Goal: Match each advertisement to the most relevant product—or return no match if similarity is low.

๐Ÿ” Understanding the Problem

Inputs

  • Product Data (title, description, images)
  • Advertisement Data (title, description, images)

Outputs

  • (ad_id, product_id) – the ad is matched to a product
  • (ad_id, None) – no product clears the similarity threshold

If no product crosses the similarity threshold, we explicitly return None to avoid incorrect mapping.


⚠️ Challenges

  • Data inconsistency – informal vs formal descriptions
  • Multimodal data – text + images
  • Scalability – large datasets
  • Threshold tuning – subjective similarity cutoffs
💡 Real-world systems fail more often because of bad thresholds than bad models.

🧩 Solution Overview

  1. Preprocess data
  2. Compute similarity
  3. Match and filter

🛠 Data Preprocessing

Text Processing

  • Tokenization
  • Lowercasing
  • Stopword removal
  • Lemmatization
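
The snippet below is a minimal sketch of these cleaning steps using NLTK; the sample sentence and the preprocess helper are illustrative, not part of the original pipeline.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer, stopword, and lemmatizer resources.
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, tokenize, drop stopwords and non-alphabetic tokens, lemmatize.
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("Brand new RUNNING shoes with free shipping!"))
# e.g. ['brand', 'new', 'running', 'shoe', 'free', 'shipping']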

Embedding Techniques

  • TF-IDF
  • Word2Vec
  • BERT / Sentence-BERT

Image Processing

  • Use CNN models (ResNet, EfficientNet)
  • Extract feature vectors
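
As a sketch of the image side, the code below runs a pretrained ResNet-50 from torchvision as a feature extractor; the image path is a placeholder.

import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained ResNet-50 and replace its classifier head with identity,
# so the forward pass returns a 2048-dimensional feature vector.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("product_image.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    image_vector = resnet(preprocess(image).unsqueeze(0)).squeeze(0)  # shape: (2048,)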

๐Ÿ“ Similarity Measurement

Cosine Similarity Formula

cos(θ) = (A · B) / (||A|| ||B||)

This measures how similar two vectors are based on angle rather than magnitude.

Euclidean Distance

d = √(Σ (xi - yi)^2)

Multimodal Combination

Final Score = w1 * text_similarity + w2 * image_similarity
📖 Why Combine Modalities?

Text alone may miss visual similarity. Images alone may miss context. Together, they give better accuracy.
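
A small sketch of that weighted combination, assuming the text and image embeddings are already available as NumPy vectors; the weights are illustrative and would normally be tuned on validation data.

import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product of the vectors divided by their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_score(ad_text_vec, prod_text_vec, ad_img_vec, prod_img_vec,
                   w_text=0.6, w_image=0.4):
    text_similarity = cosine(ad_text_vec, prod_text_vec)
    image_similarity = cosine(ad_img_vec, prod_img_vec)
    return w_text * text_similarity + w_image * image_similarity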


⚙️ Machine Learning Pipeline

  1. Feature extraction (text + image)
  2. Embedding storage
  3. Similarity computation
  4. Ranking + threshold filtering
💡 Precompute embeddings to improve speed drastically.
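
One way to apply that tip, sketched below: encode the product catalog once and reuse the matrix for every incoming ad. The product and ad texts are made-up examples.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the catalog once; normalized embeddings turn cosine similarity
# into a plain matrix product.
product_texts = ["Red running shoes, size 42", "Stainless steel water bottle 1L"]
product_matrix = model.encode(product_texts, normalize_embeddings=True)

ad_vec = model.encode(["Lightweight red trainers for jogging"],
                      normalize_embeddings=True)
scores = (ad_vec @ product_matrix.T)[0]   # one score per product
print(scores.argmax(), scores.max())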

💻 Code Example

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Placeholder inputs; in practice these come from the ad and product records.
ad_text = "Lightweight red running shoes on sale"
product_text = "Red running shoes with breathable mesh, size 42"

ad_embedding = model.encode(ad_text)
product_embedding = model.encode(product_text)

# cosine_similarity expects 2D inputs and returns a 1x1 matrix here.
score = cosine_similarity([ad_embedding], [product_embedding])[0][0]
print(score)

🖥 CLI Output Example

Processing Ads...
Embedding Generated ✔
Calculating Similarity...

Ad 101 → Product 55 (Score: 0.87)
Ad 102 → None (Score: 0.32)

Scores above threshold (e.g., 0.75) are accepted. Others are rejected to avoid false matches.
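
A sketch of that decision rule, assuming normalized embeddings so the dot product equals cosine similarity; the 0.75 default mirrors the example threshold above and the function name is illustrative.

import numpy as np

def match_ad(ad_vec, product_matrix, product_ids, threshold=0.75):
    # Score the ad against every product, keep the best one only if it
    # clears the threshold; otherwise report no match (None).
    scores = product_matrix @ ad_vec
    best = int(np.argmax(scores))
    if scores[best] >= threshold:
        return product_ids[best], float(scores[best])
    return None, float(scores[best])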


🚧 Common Issues & Solutions

1. Data Imbalance

Use precision/recall instead of accuracy.

2. Noisy Data

Apply spell correction and filtering.

3. Performance

Use FAISS for fast nearest neighbor search (a minimal sketch follows at the end of this list).

4. Threshold Problems

Use validation data for tuning.
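
For issue 3, a minimal FAISS sketch: index normalized product embeddings and search by inner product, which equals cosine similarity for unit vectors. It assumes the product_matrix and ad_vec arrays from the precompute sketch earlier.

import faiss
import numpy as np

dim = product_matrix.shape[1]
index = faiss.IndexFlatIP(dim)                  # exact inner-product (cosine) search
index.add(product_matrix.astype(np.float32))    # FAISS expects float32

# Retrieve the top-5 products for each ad embedding.
scores, ids = index.search(ad_vec.astype(np.float32), 5)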


🎯 Key Takeaways

  • Multimodal learning improves accuracy
  • Embeddings are the foundation
  • Threshold tuning is critical
  • Scalability requires smart indexing

📌 Final Thoughts

Matching advertisements to products is not just a machine learning task—it’s a system design challenge. The best solutions combine strong modeling, efficient computation, and continuous evaluation.

If implemented correctly, this approach can significantly improve automation, reduce manual effort, and enhance user experience in any data-driven platform.

Monday, October 14, 2024

Automated Financial News Summarization and Evaluation Using BLEU Score

The task is to generate a summary of financial news related to specific stock symbols by pulling recent news articles. After generating the summary, it compares the summary to a reference summary using the **BLEU score**, which is a common metric for evaluating the quality of text summaries or translations.

The key objectives are:
1. **Fetch financial news articles**: Gather recent news articles related to stocks and combine the article content into a single document.
2. **Summarize the articles**: Automatically generate a summary from the combined news content using text clustering.
3. **Evaluate the summary**: Compare the generated summary with a provided reference summary using the **BLEU score** to measure how close the generated summary is to the reference.

### Solution

1. **Fetching Financial News Articles**:
   - The script uses the `NewsAPI` to fetch news articles related to stock symbols. These symbols are retrieved by the function `get_stocks_with_news`.
   - Articles are filtered to keep only those with valid titles and descriptions, and their content is combined into a single document. The text is pulled from articles published between August 17, 2023, and September 1, 2023.

2. **Generating a Summary**:
   - The script then breaks the document into sentences using **sentence tokenization** and cleans the sentences by tokenizing words and removing stopwords (common words like "the", "and").
   - A **similarity matrix** is built, which calculates the similarity between every pair of sentences using cosine distance. This helps in clustering similar sentences together.
   - The sentences are grouped into clusters using **KMeans clustering**, and from each cluster, representative sentences are chosen to form the summary.
   - The summary is composed of the key sentences from these clusters, attempting to cover the most important points from the news articles (a condensed sketch of this step follows the list).

3. **Evaluating the Summary**:
   - A **reference summary** is provided (manually written or taken from reliable sources).
   - The generated summary is compared to the reference summary using the **BLEU score**. This score measures how well the generated summary matches the reference by looking at the overlap of words and phrases between the two summaries.
   - A BLEU score is then calculated and printed, which provides a numerical evaluation of the quality of the generated summary.

4. **Results**:
   - The generated summary is printed, followed by the reference summary and the **BLEU score**.
   - A higher BLEU score would indicate that the generated summary closely matches the reference, while a lower score would suggest that the generated summary deviates significantly from the expected content.
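
Below is a condensed sketch of the summarization step (step 2 above), using TF-IDF sentence vectors and KMeans; the function name and cluster count are illustrative rather than taken from the original script.

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(document, n_clusters=5):
    # Split the combined news text into sentences and vectorize them.
    sentences = sent_tokenize(document)
    vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)

    # Group similar sentences, then keep the sentence nearest each centroid.
    km = KMeans(n_clusters=min(n_clusters, len(sentences)), n_init=10,
                random_state=0).fit(vectors)
    summary = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        sims = cosine_similarity(vectors[members],
                                 km.cluster_centers_[c].reshape(1, -1))
        summary.append(sentences[members[int(sims.argmax())]])
    return " ".join(summary)
```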

### Interpretation of the BLEU Score

- The **BLEU score** ranges from 0 to 1, where:
  - 1 means the generated summary is a perfect match with the reference summary.
  - 0 means there is no n-gram overlap between the generated summary and the reference.
- In this case, the BLEU score helps assess how accurately the summarization model captures the key points compared to a human-generated or reference summary.
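
As a sketch of how such a score can be computed with NLTK (the two summaries below are placeholders, and smoothing is added because short texts often lack higher-order n-gram overlaps):

```python
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference_summary = "Tech stocks rallied after strong quarterly earnings reports."
generated_summary = "Tech stocks rose on strong quarterly earnings reports."

reference = [word_tokenize(reference_summary.lower())]  # BLEU allows multiple references
candidate = word_tokenize(generated_summary.lower())

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {score:.3f}")
```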

This process offers a systematic approach to summarizing financial news and evaluating the quality of the summaries in a measurable way.
