Word embeddings were the first breakthrough in representing words as dense vectors that capture semantic meaning. Understanding these foundational models helps build intuition for modern embedding techniques.
Computers process numbers, not words. How do we represent words numerically while preserving their meaning?
Vocabulary: [cat, dog, bird, fish]
cat = [1, 0, 0, 0]
dog = [0, 1, 0, 0]
bird = [0, 0, 1, 0]
fish = [0, 0, 0, 1]
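These vectors are easy to write down in NumPy, and a quick check shows that every pair of distinct words sits at exactly the same distance:

```python
import numpy as np

# One-hot vectors for the 4-word vocabulary above
vocab = ["cat", "dog", "bird", "fish"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Euclidean distance between any two distinct words is identical
d_cat_dog = np.linalg.norm(one_hot["cat"] - one_hot["dog"])
d_cat_fish = np.linalg.norm(one_hot["cat"] - one_hot["fish"])
print(d_cat_dog, d_cat_fish)  # both sqrt(2) ≈ 1.414
```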
Problems:
1. High dimensionality (vocab size = 50K+ dimensions)
2. Sparse (mostly zeros)
3. No semantic information: cat and dog are equidistant
4. "cat" is as similar to "dog" as it is to "democracy"

"You shall know a word by the company it keeps" - J.R. Firth (1957)
Words that appear in similar contexts tend to have similar meanings. This is the foundation of all word embedding methods.
Context examples:
"The ___ chased the mouse" → cat, dog, ferret
"I love my pet ___" → cat, dog, hamster
"The ___ barked loudly" → dog (only)
"The ___ purred softly" → cat (only)
Words that fill similar blanks → similar meanings

Word2Vec, introduced by Mikolov et al. (2013) at Google, was the breakthrough that made dense word embeddings practical. It uses a shallow neural network to learn word vectors from large text corpora.
Given: "The quick brown fox jumps over the lazy dog"
Target word: "fox"
Window size: 2
Predict: [quick, brown] and [jumps, over]
Training pairs:
(fox, brown), (fox, quick), (fox, jumps), (fox, over)
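Generating these (target, context) pairs is a single sliding-window pass over the tokens; a minimal sketch:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within a fixed window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
print([c for t, c in skipgram_pairs(tokens) if t == "fox"])
# ['quick', 'brown', 'jumps', 'over']
```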
Model learns: words that appear in similar contexts
→ similar embeddings

CBOW (Continuous Bag of Words) does the reverse. Given context: [brown, quick, jumps, over]
Predict: "fox"
CBOW is faster to train, but Skip-gram works better for rare words.

Input (one-hot) → Hidden (embedding) → Output (softmax)
[V x 1] → [D x 1] → [V x 1]
"fox" → [0.2, -0.5, ...] → P(brown|fox)
V = vocabulary size (e.g., 50,000)
D = embedding dimension (e.g., 300)
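Concretely, multiplying a one-hot input by the input weight matrix just selects one row, so the "hidden layer" is really a table lookup (toy sizes and random weights for illustration):

```python
import numpy as np

V, D = 10, 4                 # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))  # input weight matrix, one row per word

word_id = 3
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

hidden = one_hot @ W         # matrix multiply...
print(np.allclose(hidden, W[word_id]))  # ...equals row lookup: True
```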
The hidden layer IS the word embedding!

# Maximize probability of context words given target word
# For each (target, context) pair:
P(context | target) = softmax(W_context · W_target)
# Problem: softmax over 50K vocabulary is expensive!
# Solutions:
# 1. Negative Sampling: Sample k random "negative" words
# 2. Hierarchical Softmax: Use binary tree structure

# Instead of computing softmax over entire vocabulary,
# sample k negative examples
# For pair (fox, brown):
# Positive: maximize P(brown | fox)
# Negatives: minimize P(democracy | fox), P(quantum | fox), ...
loss = -log(σ(v_context · v_target))
       - Σ log(σ(-v_negative · v_target))
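This loss is just a few dot products and sigmoids; a toy NumPy sketch with made-up random vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
D, k = 8, 5                            # embedding dim, negative samples
v_target = rng.normal(size=D)          # e.g. "fox"
v_context = rng.normal(size=D)         # e.g. "brown" (positive pair)
v_negatives = rng.normal(size=(k, D))  # k randomly sampled words

# Negative-sampling loss for one (target, context) pair
loss = -np.log(sigmoid(v_context @ v_target)) \
       - np.sum(np.log(sigmoid(-v_negatives @ v_target)))
print(loss > 0)  # True: a positive scalar that training minimizes
```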
# Typical k = 5-20 negative samples

from gensim.models import Word2Vec
# Training from scratch
sentences = [
    ["the", "quick", "brown", "fox"],
    ["machine", "learning", "is", "fun"],
    # ... more sentences
]

model = Word2Vec(
    sentences,
    vector_size=300,  # Embedding dimensions
    window=5,         # Context window size
    min_count=1,      # Keep all words (use 5+ on real corpora to drop rare words)
    workers=4,        # Parallel threads
    sg=1,             # 1 for Skip-gram, 0 for CBOW
    epochs=10
)
# Get embedding for a word
fox_embedding = model.wv['fox'] # Shape: (300,)
# Find similar words
similar = model.wv.most_similar('king', topn=5)
# [('queen', 0.85), ('prince', 0.78), ('monarch', 0.75), ...]
# Word analogies
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)
# [('queen', 0.89), ...]

GloVe (Global Vectors) from Stanford combines the strengths of count-based methods and prediction-based methods like Word2Vec.
Word co-occurrence statistics contain rich semantic information. GloVe directly factorizes the co-occurrence matrix.
Count how often words appear together in a context window:
the cat sat on mat
the - 5 2 8 1
cat 5 - 7 3 4
sat 2 7 - 2 3
on 8 3 2 - 1
mat 1 4 3 1 -
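A symmetric matrix like this can be accumulated in one pass over the corpus; a minimal sketch using a window of 2:

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count how often word pairs appear within `window` positions of each other."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

X = cooccurrence([["the", "cat", "sat", "on", "the", "mat"]])
print(X[("cat", "sat")])  # 1, and symmetric: X[("sat", "cat")] is also 1
```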
X[i,j] = count of word i appearing near word j

# Learn vectors such that:
# w_i · w_j ≈ log(X[i,j])
# Objective function:
J = Σ f(X_ij) * (w_i · w_j + b_i + b_j - log(X_ij))²
# f(x) is a weighting function that:
# - Downweights very frequent pairs (the, a, is)
# - Doesn't over-weight rare pairs
# Standard choice: f(x) = (x/x_max)^α for x < x_max, else 1 (α = 0.75)

import numpy as np
def load_glove(path: str) -> dict:
    """Load pre-trained GloVe embeddings."""
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings
# Load GloVe (download from: nlp.stanford.edu/projects/glove/)
glove = load_glove('glove.6B.300d.txt')
# Get embedding
king = glove['king'] # Shape: (300,)
# Similarity (cosine)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sim = cosine_similarity(glove['king'], glove['queen'])
print(f"Similarity: {sim:.3f}")  # ~0.75

FastText from Facebook extends Word2Vec by using subword information.
Word: "unhappiness"
Character n-grams (n=3): <un, unh, nha, hap, app, ppi, pin, ine, nes, ess, ss>
Plus the word itself: <unhappiness>
Word embedding = sum of all subword embeddings
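Extracting those character n-grams is straightforward; a sketch that reproduces the n=3 list above:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("unhappiness"))
# ['<un', 'unh', 'nha', 'hap', 'app', 'ppi', 'pin', 'ine', 'nes', 'ess', 'ss>']
```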
Benefits:
1. Handle out-of-vocabulary words
2. Better for morphologically rich languages
3. Share information between related words: "happy", "happier", "happiness"

from gensim.models import FastText
model = FastText(
    sentences,
    vector_size=300,
    window=5,
    min_count=1,  # Keep all words in this toy example
    min_n=3,      # Minimum n-gram length
    max_n=6       # Maximum n-gram length
)
# Can get embeddings for OOV words!
oov_embedding = model.wv['asdfghjkl']  # Works by summing subword vectors

Similar words are close in vector space:
- happy, joyful, pleased (cluster together)
- sad, unhappy, miserable (cluster together)
- These clusters are far apart

Famous examples:
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walked - walk + swim ≈ swam
The relationship is captured as a vector offset!
vec(king) - vec(man) ≈ vec(queen) - vec(woman)

| Aspect | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Method | Prediction-based | Count + Prediction | Prediction + Subwords |
| OOV Handling | ❌ No | ❌ No | ✅ Yes (subwords) |
| Speed | Fast | Fast | Slower (more params) |
| Best For | General NLP | General NLP | Morphology-rich langs |
# Explore word embeddings interactively
import gensim.downloader as api
# Download pre-trained Word2Vec (Google News, 3M words, 300d)
model = api.load("word2vec-google-news-300")
# Explore relationships
print(model.most_similar('king'))
print(model.most_similar(positive=['paris', 'germany'], negative=['france']))
print(model.most_similar(positive=['walking', 'swam'], negative=['walked']))
# Doesn't match (find the odd one out)
print(model.doesnt_match(['breakfast', 'lunch', 'dinner', 'computer']))
# Similarity
print(model.similarity('cat', 'dog'))
print(model.similarity('cat', 'democracy'))

In the next lesson, we'll explore sentence and document embeddings - the type we actually use in RAG systems.