Word embeddings were the first breakthrough in representing words as dense vectors that capture semantic meaning. Understanding these foundational models helps build intuition for modern embedding techniques.
Computers process numbers, not words. How do we represent words numerically while preserving their meaning?
Vocabulary: [cat, dog, bird, fish]
cat = [1, 0, 0, 0]
dog = [0, 1, 0, 0]
bird = [0, 0, 1, 0]
fish = [0, 0, 0, 1]
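These vectors are easy to write down in NumPy, and a quick check shows that every pair of distinct words sits at exactly the same distance:

```python
import numpy as np

# One-hot vectors for the 4-word vocabulary above
vocab = ["cat", "dog", "bird", "fish"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Euclidean distance between any two distinct words is identical
d_cat_dog = np.linalg.norm(one_hot["cat"] - one_hot["dog"])
d_cat_fish = np.linalg.norm(one_hot["cat"] - one_hot["fish"])
print(d_cat_dog, d_cat_fish)  # both sqrt(2) ≈ 1.414
```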
Problems:
1. High dimensionality (vocab size = 50K+ dimensions)
2. Sparse (mostly zeros)
3. No semantic information: cat and dog are equidistant
4. "cat" is as similar to "dog" as it is to "democracy"

"You shall know a word by the company it keeps" - J.R. Firth (1957)
Words that appear in similar contexts tend to have similar meanings. This is the foundation of all word embedding methods.
Context examples:
"The ___ chased the mouse" → cat, dog, ferret
"I love my pet ___" → cat, dog, hamster
"The ___ barked loudly" → dog (only)
"The ___ purred softly" → cat (only)
Words that fill similar blanks → similar meanings

Word2Vec, introduced by Mikolov et al. (2013) at Google, was the breakthrough that made dense word embeddings practical. It uses a shallow neural network to learn word vectors from large text corpora.
Given: "The quick brown fox jumps over the lazy dog"
Target word: "fox"
Window size: 2
Predict: [quick, brown] and [jumps, over]
Training pairs:
(fox, brown), (fox, quick), (fox, jumps), (fox, over)
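Generating these (target, context) pairs is a single sliding-window pass over the tokens; a minimal sketch:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within a fixed window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
print([c for t, c in skipgram_pairs(tokens) if t == "fox"])
# ['quick', 'brown', 'jumps', 'over']
```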
Model learns: words that appear in similar contexts
→ similar embeddings

CBOW (Continuous Bag of Words) does the reverse. Given context: [brown, quick, jumps, over]
Predict: "fox"
CBOW is faster to train, but Skip-gram works better for rare words.

Input (one-hot) → Hidden (embedding) → Output (softmax)
[V x 1] → [D x 1] → [V x 1]
"fox" → [0.2, -0.5, ...] → P(brown|fox)
V = vocabulary size (e.g., 50,000)
D = embedding dimension (e.g., 300)
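Concretely, multiplying a one-hot input by the input weight matrix just selects one row, so the "hidden layer" is really a table lookup (toy sizes and random weights for illustration):

```python
import numpy as np

V, D = 10, 4                 # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))  # input weight matrix, one row per word

word_id = 3
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

hidden = one_hot @ W         # matrix multiply...
print(np.allclose(hidden, W[word_id]))  # ...equals row lookup: True
```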
The hidden layer IS the word embedding!

# Maximize probability of context words given target word
# For each (target, context) pair:
P(context | target) = softmax(W_context · W_target)
# Problem: softmax over 50K vocabulary is expensive!
# Solutions:
# 1. Negative Sampling: Sample k random "negative" words
# 2. Hierarchical Softmax: Use binary tree structure

# Instead of computing softmax over entire vocabulary,
# sample k negative examples
# For pair (fox, brown):
# Positive: maximize P(brown | fox)
# Negatives: minimize P(democracy | fox), P(quantum | fox), ...
loss = -log(σ(v_context · v_target))
       - Σ log(σ(-v_negative · v_target))
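This loss is just a few dot products and sigmoids; a toy NumPy sketch with made-up random vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
D, k = 8, 5                            # embedding dim, negative samples
v_target = rng.normal(size=D)          # e.g. "fox"
v_context = rng.normal(size=D)         # e.g. "brown" (positive pair)
v_negatives = rng.normal(size=(k, D))  # k randomly sampled words

# Negative-sampling loss for one (target, context) pair
loss = -np.log(sigmoid(v_context @ v_target)) \
       - np.sum(np.log(sigmoid(-v_negatives @ v_target)))
print(loss > 0)  # True: a positive scalar that training minimizes
```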
# Typical k = 5-20 negative samples

from gensim.models import Word2Vec
# Training from scratch
sentences = [
    ["the", "quick", "brown", "fox"],
    ["machine", "learning", "is", "fun"],
    # ... more sentences
]

model = Word2Vec(
    sentences,
    vector_size=300,  # Embedding dimensions
    window=5,         # Context window size
    min_count=1,      # Keep all words (use 5+ on real corpora to drop rare words)
    workers=4,        # Parallel threads
    sg=1,             # 1 for Skip-gram, 0 for CBOW
    epochs=10
)
# Get embedding for a word
fox_embedding = model.wv['fox'] # Shape: (300,)
# Find similar words
similar = model.wv.most_similar('king', topn=5)
# [('queen', 0.85), ('prince', 0.78), ('monarch', 0.75), ...]
# Word analogies
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)
# [('queen', 0.89), ...]

GloVe (Global Vectors) from Stanford combines the strengths of count-based methods and prediction-based methods like Word2Vec.
Word co-occurrence statistics contain rich semantic information. GloVe directly factorizes the co-occurrence matrix.
Count how often words appear together in a context window:
the cat sat on mat
the - 5 2 8 1
cat 5 - 7 3 4
sat 2 7 - 2 3
on 8 3 2 - 1
mat 1 4 3 1 -
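A symmetric matrix like this can be accumulated in one pass over the corpus; a minimal sketch using a window of 2:

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count how often word pairs appear within `window` positions of each other."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

X = cooccurrence([["the", "cat", "sat", "on", "the", "mat"]])
print(X[("cat", "sat")])  # 1, and symmetric: X[("sat", "cat")] is also 1
```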
X[i,j] = count of word i appearing near word j

# Learn vectors such that:
# w_i · w_j ≈ log(X[i,j])
# Objective function:
J = Σ f(X_ij) * (w_i · w_j + b_i + b_j - log(X_ij))²
# f(x) is a weighting function that:
# - Downweights very frequent pairs (the, a, is)
# - Doesn't over-weight rare pairs
# Standard choice: f(x) = (x/x_max)^α for x < x_max, else 1 (α = 0.75)

import numpy as np
def load_glove(path: str) -> dict:
    """Load pre-trained GloVe embeddings."""
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings
# Load GloVe (download from: nlp.stanford.edu/projects/glove/)
glove = load_glove('glove.6B.300d.txt')
# Get embedding
king = glove['king'] # Shape: (300,)
# Similarity (cosine)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sim = cosine_similarity(glove['king'], glove['queen'])
print(f"Similarity: {sim:.3f}")  # ~0.75

FastText from Facebook extends Word2Vec by using subword information.
Word: "unhappiness"
Character n-grams (n=3): <un, unh, nha, hap, app, ppi, pin, ine, nes, ess, ss>
Plus the word itself: <unhappiness>
Word embedding = sum of all subword embeddings
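Extracting those character n-grams is straightforward; a sketch that reproduces the n=3 list above:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("unhappiness"))
# ['<un', 'unh', 'nha', 'hap', 'app', 'ppi', 'pin', 'ine', 'nes', 'ess', 'ss>']
```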
Benefits:
1. Handle out-of-vocabulary words
2. Better for morphologically rich languages
3. Share information between related words: "happy", "happier", "happiness"

from gensim.models import FastText
model = FastText(
    sentences,
    vector_size=300,
    window=5,
    min_count=1,  # Keep all words in this toy example
    min_n=3,      # Minimum n-gram length
    max_n=6       # Maximum n-gram length
)
# Can get embeddings for OOV words!
oov_embedding = model.wv['asdfghjkl']  # Works by summing subword vectors

Similar words are close in vector space:
- happy, joyful, pleased (cluster together)
- sad, unhappy, miserable (cluster together)
- These clusters are far apart

Famous examples:
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walked - walk + swim ≈ swam
The relationship is captured as a vector offset!
vec(king) - vec(man) ≈ vec(queen) - vec(woman)

| Aspect | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Method | Prediction-based | Count + Prediction | Prediction + Subwords |
| OOV Handling | ❌ No | ❌ No | ✅ Yes (subwords) |
| Speed | Fast | Fast | Slower (more params) |
| Best For | General NLP | General NLP | Morphology-rich langs |
# Explore word embeddings interactively
import gensim.downloader as api
# Download pre-trained Word2Vec (Google News, 3M words, 300d)
model = api.load("word2vec-google-news-300")
# Explore relationships
print(model.most_similar('king'))
print(model.most_similar(positive=['paris', 'germany'], negative=['france']))
print(model.most_similar(positive=['walking', 'swam'], negative=['walked']))
# Doesn't match (find the odd one out)
print(model.doesnt_match(['breakfast', 'lunch', 'dinner', 'computer']))
# Similarity
print(model.similarity('cat', 'dog'))
print(model.similarity('cat', 'democracy'))

In the next lesson, we'll explore sentence and document embeddings - the type we actually use in RAG systems.