Gensim Tutorial: Theory + Practical Guide
Table of Contents
1. TF-IDF
2. LDA Topic Modeling
3. Word2Vec
4. Document Similarity

1. TF-IDF
Theory
TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to the rest of the corpus.
- TF: how often the word appears in the document
- IDF: how rare the word is across all documents
Explanation
TF-IDF helps filter out common words like "the" and highlight meaningful ones. For example, in news articles, words like "economy" or "election" will receive higher weights.
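The weighting can be sketched by hand. Below is a minimal example using the common `tf * log2(N / df)` form (gensim's default also length-normalizes each document vector, which this sketch skips):

```python
import math

docs = [["hello", "world", "hello"], ["goodbye", "world"]]
N = len(docs)  # number of documents in the corpus

def tf_idf(term, doc):
    tf = doc.count(term)               # term frequency in this document
    df = sum(term in d for d in docs)  # number of documents containing the term
    return tf * math.log2(N / df)      # high for locally frequent, globally rare terms

print(tf_idf("hello", docs[0]))  # 2 * log2(2/1) = 2.0
print(tf_idf("world", docs[0]))  # 1 * log2(2/2) = 0.0
```

Note how a word that occurs in every document ("world") gets a weight of zero: its IDF factor vanishes.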
Code

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

docs = [["hello", "world", "hello"], ["goodbye", "world"]]
dictionary = Dictionary(docs)                       # maps each token to an integer id
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors
tfidf = TfidfModel(corpus)

for doc in tfidf[corpus]:  # iterate to print the weights per document
    print(doc)
```

CLI Output
[(0, 1.0)]
[(2, 1.0)]

"world" appears in every document, so its IDF (and therefore its TF-IDF weight) is zero and it is dropped from the output.
2. LDA Topic Modeling
Theory
LDA (Latent Dirichlet Allocation) assumes each document is a mixture of topics, and each topic is a probability distribution over words.
Explanation
Imagine a news dataset:
- Politics topic → election, government
- Sports topic → match, score
Code

```python
from gensim.models import LdaModel

# Reuses `corpus` and `dictionary` from the TF-IDF section;
# random_state fixes the seed so the topics are reproducible
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=42)
print(lda.print_topics())
```

CLI Output
Topic 0: hello, world
Topic 1: goodbye, world
3. Word2Vec
Theory
Word2Vec learns a dense vector for each word from the contexts it appears in, using a shallow neural network (the CBOW or skip-gram architecture).
Explanation
Words with similar meanings end up close together in vector space:
- king ≈ queen
- cat ≈ dog
Code

```python
from gensim.models import Word2Vec

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)  # min_count=1 keeps every word in this tiny corpus
print(model.wv["cat"])                    # the learned vector for "cat"
```

CLI Output
[ 0.1 -0.2  0.3 ...]  (a 100-dimensional vector by default)
4. Document Similarity
Theory
Documents can be compared using vector similarity metrics such as cosine similarity.
Explanation
Used in:
- Recommendation systems
- Search engines
- Plagiarism detection
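Cosine similarity itself is just a normalized dot product. A minimal NumPy sketch, independent of gensim:

```python
import numpy as np

def cosine_sim(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # dot product divided by the product of the vector lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim([1, 0], [1, 0]))  # same direction -> 1.0
print(cosine_sim([1, 0], [0, 1]))  # orthogonal -> 0.0
```

Gensim's similarity indexes compute exactly this, but over every document at once.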
Code

```python
from gensim.similarities import MatrixSimilarity

# Reuses `tfidf`, `corpus`, and `dictionary` from the TF-IDF section;
# num_features tells the index how many dimensions the vectors have
index = MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
print(index[tfidf[corpus][0]])  # similarity of document 0 to every document
```

CLI Output
[1. 0.]

Document 0 has similarity 1.0 to itself; the two documents share no nonzero TF-IDF terms, so their similarity is 0.