
Friday, November 15, 2024

How TF-IDF Works in Natural Language Processing



TF-IDF Explained: The Ultimate Guide to Text Importance



Introduction

When working with text data, one of the biggest challenges is identifying which words actually matter. Not all words carry equal importance — common words like “the”, “is”, and “and” appear everywhere.

TF-IDF solves this by assigning weights to words based on their importance.

💡 Core Idea: Important words appear frequently in a document but rarely across all documents.

What is TF-IDF?

TF-IDF stands for Term Frequency – Inverse Document Frequency. It is a numerical statistic used in information retrieval and machine learning.

  • Highlights important words
  • Reduces noise from common terms
  • Improves search and classification

TF-IDF balances two forces:

  • Term importance inside a document
  • Term rarity across documents

Term Frequency (TF)

Term Frequency measures how often a word appears in a document.

$$ TF(t) = \frac{\text{Count of term t in document}}{\text{Total words in document}} $$

Example: if "data" appears 3 times in a 100-word document:

$$ TF("data") = \frac{3}{100} = 0.03 $$
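As a quick sanity check, here is a minimal sketch of this TF computation in plain Python (the whitespace tokenizer is deliberately naive; real pipelines handle punctuation and casing):

def term_frequency(term, document):
    # Raw count of the term divided by the total number of tokens
    tokens = document.lower().split()
    return tokens.count(term) / len(tokens)

doc = "data science is fun and data is everywhere"
print(term_frequency("data", doc))  # 2 of 8 tokens -> 0.25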


Inverse Document Frequency (IDF)

IDF measures how rare a word is across all documents.

$$ IDF(t) = \log \left(\frac{N}{df}\right) $$

  • N = total documents
  • df = documents containing the term

Example:

$$ IDF("data") = \log_{10}\left(\frac{1000}{200}\right) = \log_{10}(5) \approx 0.7 $$

The logarithm dampens the scale so that extremely rare words do not dominate the scores. (This example uses base 10; some libraries use the natural log, which changes the scale of the scores but not the ranking.)
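A matching sketch for IDF, plugging in the same figures as the example above (base-10 log to match the numbers):

import math

def inverse_document_frequency(total_docs, docs_with_term):
    # log10(N / df): rarer terms get larger weights
    return math.log10(total_docs / docs_with_term)

print(inverse_document_frequency(1000, 200))  # log10(5) ~ 0.699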


📊 Mathematical Insight

TF-IDF has roots in information theory. The IDF term resembles the self-information of observing a word:

$$ Information = -\log(P(t)) $$

where $P(t) \approx \frac{df}{N}$ is the probability that a randomly chosen document contains the term, so $IDF(t) = -\log\left(\frac{df}{N}\right) = \log\left(\frac{N}{df}\right)$.

Rare words carry more information than common words.
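To make that concrete, a quick comparison using made-up illustrative probabilities:

import math

p_common = 0.05    # hypothetical probability of a common word like "the"
p_rare = 0.0005    # hypothetical probability of a rare word like "entropy"

print(-math.log(p_common))  # ~3.0: little surprise, little information
print(-math.log(p_rare))    # ~7.6: the rare word carries far more information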


TF-IDF Formula

$$ TFIDF(t) = TF(t) \times IDF(t) $$

This gives a weighted importance score.


Worked Example

Let’s calculate TF-IDF step by step:

  • TF = 0.03
  • IDF = 0.7

$$ TFIDF = 0.03 \times 0.7 = 0.021 $$
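The same arithmetic in code, reusing the hypothetical counts from above (3 occurrences in a 100-word document, with 200 of 1,000 documents containing the term):

import math

tf = 3 / 100                   # TF("data") from the example above
idf = math.log10(1000 / 200)   # IDF("data"), base-10 log as above

print(tf * idf)  # ~0.021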

💡 Higher TF-IDF = More important word

💻 Code Example (Python)

Example Code

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data science is fun",
    "machine learning uses data",
    "data is important",
]

# Learn the vocabulary and IDF weights, then transform the documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())

Output

['data' 'fun' 'important' 'is' 'learning' 'machine' 'science' 'uses']
[[0.345 0.584 0.    0.445 0.    0.    0.584 0.   ]
 [0.323 0.    0.    0.    0.546 0.546 0.    0.546]
 [0.425 0.    0.72  0.548 0.    0.    0.    0.   ]]

(Values rounded to three decimals. Note that the default tokenizer keeps "is", since it only drops single-character tokens.)

Real-World Applications

  • Search Engines
  • Spam Detection
  • Document Clustering
  • Recommendation Systems
  • Sentiment Analysis

Search engines rank documents based on TF-IDF scores of query terms.
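As a rough sketch of that idea, documents can be ranked against a query by cosine similarity in TF-IDF space (reusing the docs from the code example above; cosine_similarity is scikit-learn's pairwise utility):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "data science is fun",
    "machine learning uses data",
    "data is important",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Project the query into the same TF-IDF space and score every document
query_vec = vectorizer.transform(["machine learning"])
scores = cosine_similarity(query_vec, X).ravel()
print(scores.argsort()[::-1])  # document indices, best match first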


🎯 Key Takeaways

  • TF measures frequency in a document
  • IDF measures rarity across documents
  • TF-IDF balances both
  • Widely used in NLP and search systems

Conclusion

TF-IDF remains one of the most powerful yet simple techniques in text analysis. It bridges the gap between raw text and meaningful data representation.

Whether you're building a search engine or training ML models, understanding TF-IDF is essential.

Thursday, November 7, 2024

Getting Started with Gensim for NLP: A Guide to Topic Modeling, Word Embeddings, and Document Similarity


Gensim Tutorial: Theory + Practical Guide 🚀




1. TF-IDF

📘 Theory

TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document relative to a corpus.

  • TF: Frequency of word in document
  • IDF: Rarity of word across documents

🔍 Explanation

TF-IDF helps filter out common words like "the" and highlight meaningful ones. For example, in news articles, words like "economy" or "election" will have higher importance.

💻 Code

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

docs = [["hello", "world", "hello"], ["goodbye", "world"]]

# Map each token to an integer id, then convert each document to bag-of-words
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train the TF-IDF model; iterate to print each document's weighted vector
tfidf = TfidfModel(corpus)
for doc_vector in tfidf[corpus]:
    print(doc_vector)

🖥 CLI Output

[(0, 1.0)]
[(2, 1.0)]

Here "world" appears in every document, so its IDF, and therefore its TF-IDF weight, is zero and it is dropped from both vectors.

2. LDA Topic Modeling

📘 Theory

LDA (Latent Dirichlet Allocation) assumes documents are mixtures of topics, and topics are distributions over words.

🔍 Explanation

Imagine a news dataset:

  • Politics topic → election, government
  • Sports topic → match, score

💻 Code

from gensim.models import LdaModel

# Reuses `corpus` and `dictionary` from the TF-IDF section above
lda = LdaModel(corpus, num_topics=2, id2word=dictionary)

print(lda.print_topics())

🖥 CLI Output (simplified)

Topic 0: hello, world
Topic 1: goodbye, world

The actual print_topics() output lists weighted terms per topic (in the form weight*"word" + ...) and varies between runs, because LDA training is stochastic.

3. Word2Vec

📘 Theory

Word2Vec converts words into vectors based on context using neural networks.

🔍 Explanation

Words with similar meaning appear close in vector space:

  • king ≈ queen
  • cat ≈ dog

💻 Code

from gensim.models import Word2Vec

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

# min_count=1 keeps every word; needed for such a tiny corpus
model = Word2Vec(sentences, min_count=1)

print(model.wv["cat"])

🖥 CLI Output

[ 0.1 -0.2  0.3 ...]

(Illustrative values; by default the vector has 100 dimensions, and the numbers vary between runs.)
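To see the "similar words are close" idea in code, two queries against the model trained above (on such a tiny corpus the scores are essentially noise; a meaningful result needs a real corpus):

# Cosine similarity between two word vectors
print(model.wv.similarity("cat", "dog"))

# Nearest neighbours of "cat" in the vector space
print(model.wv.most_similar("cat", topn=2))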

4. Document Similarity

📘 Theory

Documents can be compared using vector similarity metrics like cosine similarity.
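For two document vectors $A$ and $B$, cosine similarity measures the angle between them:

$$ \text{sim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} $$

A value of 1 means the documents point in the same direction (very similar); 0 means they share no weighted terms.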

🔍 Explanation

Used in:

  • Recommendation systems
  • Search engines
  • Plagiarism detection

💻 Code

from gensim.similarities import MatrixSimilarity

# Build a dense similarity index over the TF-IDF corpus from section 1
index = MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

# Similarity of document 0 against every indexed document
print(index[tfidf[corpus][0]])

🖥 CLI Output

[1. 0.]

Document 0 is identical to itself (similarity 1.0) and shares no weighted terms with document 1, because "world" carries zero TF-IDF weight here.
