
Friday, November 15, 2024

How TF-IDF Works in Natural Language Processing



TF-IDF Explained: The Ultimate Guide to Text Importance



Introduction

When working with text data, one of the biggest challenges is identifying which words actually matter. Not all words carry equal importance — common words like “the”, “is”, and “and” appear everywhere.

TF-IDF solves this by assigning weights to words based on their importance.

💡 Core Idea: Important words appear frequently in a document but rarely across all documents.

What is TF-IDF?

TF-IDF stands for Term Frequency – Inverse Document Frequency. It is a numerical statistic used in information retrieval and machine learning.

  • Highlights important words
  • Reduces noise from common terms
  • Improves search and classification

TF-IDF balances two forces:

  • Term importance inside a document
  • Term rarity across documents

Term Frequency (TF)

Term Frequency measures how often a word appears in a document.

$$ TF(t) = \frac{\text{Count of term t in document}}{\text{Total words in document}} $$

Example: if "data" appears 3 times in a 100-word document:

$$ TF("data") = \frac{3}{100} = 0.03 $$
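As a quick sanity check, here is a minimal sketch of this TF computation in plain Python (the whitespace tokenizer is deliberately naive; real pipelines handle punctuation and casing):

def term_frequency(term, document):
    # Raw count of the term divided by the total number of tokens
    tokens = document.lower().split()
    return tokens.count(term) / len(tokens)

doc = "data science is fun and data is everywhere"
print(term_frequency("data", doc))  # 2 of 8 tokens -> 0.25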


Inverse Document Frequency (IDF)

IDF measures how rare a word is across all documents.

$$ IDF(t) = \log \left(\frac{N}{df}\right) $$

  • N = total documents
  • df = documents containing the term

Example:

$$ IDF("data") = \log_{10}\left(\frac{1000}{200}\right) = \log_{10}(5) \approx 0.7 $$

The logarithm dampens the scale so that extremely rare words do not dominate the scores. (This example uses base 10; some libraries use the natural log, which changes the scale of the scores but not the ranking.)
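A matching sketch for IDF, plugging in the same figures as the example above (base-10 log to match the numbers):

import math

def inverse_document_frequency(total_docs, docs_with_term):
    # log10(N / df): rarer terms get larger weights
    return math.log10(total_docs / docs_with_term)

print(inverse_document_frequency(1000, 200))  # log10(5) ~ 0.699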


📊 Mathematical Insight

TF-IDF has roots in information theory. The IDF term resembles the self-information of observing a word:

$$ Information = -\log(P(t)) $$

where $P(t) \approx \frac{df}{N}$ is the probability that a randomly chosen document contains the term, so $IDF(t) = -\log\left(\frac{df}{N}\right) = \log\left(\frac{N}{df}\right)$.

Rare words carry more information than common words.
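To make that concrete, a quick comparison using made-up illustrative probabilities:

import math

p_common = 0.05    # hypothetical probability of a common word like "the"
p_rare = 0.0005    # hypothetical probability of a rare word like "entropy"

print(-math.log(p_common))  # ~3.0: little surprise, little information
print(-math.log(p_rare))    # ~7.6: the rare word carries far more information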


TF-IDF Formula

$$ TFIDF(t) = TF(t) \times IDF(t) $$

This gives a weighted importance score.


Worked Example

Let’s calculate TF-IDF step by step:

  • TF = 0.03
  • IDF = 0.7

$$ TFIDF = 0.03 \times 0.7 = 0.021 $$
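The same arithmetic in code, reusing the hypothetical counts from above (3 occurrences in a 100-word document, with 200 of 1,000 documents containing the term):

import math

tf = 3 / 100                   # TF("data") from the example above
idf = math.log10(1000 / 200)   # IDF("data"), base-10 log as above

print(tf * idf)  # ~0.021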

💡 Higher TF-IDF = More important word

💻 Code Example (Python)

Example Code

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data science is fun",
    "machine learning uses data",
    "data is important",
]

# Learn the vocabulary and IDF weights, then transform the documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())

Output

['data' 'fun' 'important' 'is' 'learning' 'machine' 'science' 'uses']
[[0.345 0.584 0.    0.445 0.    0.    0.584 0.   ]
 [0.323 0.    0.    0.    0.546 0.546 0.    0.546]
 [0.425 0.    0.72  0.548 0.    0.    0.    0.   ]]

(Values rounded to three decimals. Note that the default tokenizer keeps "is", since it only drops single-character tokens.)

Real-World Applications

  • Search Engines
  • Spam Detection
  • Document Clustering
  • Recommendation Systems
  • Sentiment Analysis

Search engines rank documents based on TF-IDF scores of query terms.
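As a rough sketch of that idea, documents can be ranked against a query by cosine similarity in TF-IDF space (reusing the docs from the code example above; cosine_similarity is scikit-learn's pairwise utility):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "data science is fun",
    "machine learning uses data",
    "data is important",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Project the query into the same TF-IDF space and score every document
query_vec = vectorizer.transform(["machine learning"])
scores = cosine_similarity(query_vec, X).ravel()
print(scores.argsort()[::-1])  # document indices, best match first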


🎯 Key Takeaways

  • TF measures frequency in a document
  • IDF measures rarity across documents
  • TF-IDF balances both
  • Widely used in NLP and search systems

Conclusion

TF-IDF remains one of the most powerful yet simple techniques in text analysis. It bridges the gap between raw text and meaningful data representation.

Whether you're building a search engine or training ML models, understanding TF-IDF is essential.

Thursday, November 7, 2024

Getting Started with Gensim for NLP: A Guide to Topic Modeling, Word Embeddings, and Document Similarity


Gensim Tutorial: Theory + Practical Guide 🚀




1. TF-IDF

📘 Theory

TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document relative to a corpus.

  • TF: Frequency of word in document
  • IDF: Rarity of word across documents

🔍 Explanation

TF-IDF helps filter out common words like "the" and highlight meaningful ones. For example, in news articles, words like "economy" or "election" will have higher importance.

💻 Code

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

docs = [["hello", "world", "hello"], ["goodbye", "world"]]

# Map each token to an integer id, then convert each document to bag-of-words
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train the TF-IDF model; iterate to print each document's weighted vector
tfidf = TfidfModel(corpus)
for doc_vector in tfidf[corpus]:
    print(doc_vector)

🖥 CLI Output

[(0, 1.0)]
[(2, 1.0)]

Here "world" appears in every document, so its IDF, and therefore its TF-IDF weight, is zero and it is dropped from both vectors.

2. LDA Topic Modeling

📘 Theory

LDA (Latent Dirichlet Allocation) assumes documents are mixtures of topics, and topics are distributions over words.

🔍 Explanation

Imagine a news dataset:

  • Politics topic → election, government
  • Sports topic → match, score

💻 Code

from gensim.models import LdaModel

# Reuses `corpus` and `dictionary` from the TF-IDF section above
lda = LdaModel(corpus, num_topics=2, id2word=dictionary)

print(lda.print_topics())

🖥 CLI Output (simplified)

Topic 0: hello, world
Topic 1: goodbye, world

The actual print_topics() output lists weighted terms per topic (in the form weight*"word" + ...) and varies between runs, because LDA training is stochastic.

3. Word2Vec

📘 Theory

Word2Vec converts words into vectors based on context using neural networks.

🔍 Explanation

Words with similar meaning appear close in vector space:

  • king ≈ queen
  • cat ≈ dog

💻 Code

from gensim.models import Word2Vec

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

# min_count=1 keeps every word; needed for such a tiny corpus
model = Word2Vec(sentences, min_count=1)

print(model.wv["cat"])

🖥 CLI Output

[ 0.1 -0.2  0.3 ...]

(Illustrative values; by default the vector has 100 dimensions, and the numbers vary between runs.)
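To see the "similar words are close" idea in code, two queries against the model trained above (on such a tiny corpus the scores are essentially noise; a meaningful result needs a real corpus):

# Cosine similarity between two word vectors
print(model.wv.similarity("cat", "dog"))

# Nearest neighbours of "cat" in the vector space
print(model.wv.most_similar("cat", topn=2))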

4. Document Similarity

📘 Theory

Documents can be compared using vector similarity metrics like cosine similarity.
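For two document vectors $A$ and $B$, cosine similarity measures the angle between them:

$$ \text{sim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} $$

A value of 1 means the documents point in the same direction (very similar); 0 means they share no weighted terms.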

🔍 Explanation

Used in:

  • Recommendation systems
  • Search engines
  • Plagiarism detection

💻 Code

from gensim.similarities import MatrixSimilarity

# Build a dense similarity index over the TF-IDF corpus from section 1
index = MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

# Similarity of document 0 against every indexed document
print(index[tfidf[corpus][0]])

🖥 CLI Output

[1. 0.]

Document 0 is identical to itself (similarity 1.0) and shares no weighted terms with document 1, because "world" carries zero TF-IDF weight here.
