Vector Databases & Embeddings
Word Embeddings: Word2Vec and GloVe

Word embeddings were the first breakthrough in representing words as dense vectors that capture semantic meaning. Understanding these foundational models helps build intuition for modern embedding techniques.

The Problem: Representing Words for Machines

Computers process numbers, not words. How do we represent words numerically while preserving their meaning?

One-Hot Encoding: The Naive Approach

Vocabulary: [cat, dog, bird, fish]

cat  = [1, 0, 0, 0]
dog  = [0, 1, 0, 0]
bird = [0, 0, 1, 0]
fish = [0, 0, 0, 1]

Problems:
1. High dimensionality (vocab size = 50K+ dimensions)
2. Sparse (mostly zeros)
3. No semantic information: cat and dog are equidistant
4. "cat" is as similar to "dog" as it is to "democracy"
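
The equidistance problem is easy to verify. A minimal NumPy sketch over the toy 4-word vocabulary above shows that the dot product between any two distinct one-hot vectors is always zero, so the representation carries no notion of similarity:

```python
import numpy as np

# One-hot vectors for a 4-word vocabulary
vocab = ["cat", "dog", "bird", "fish"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every pair of distinct words has dot product 0:
# "cat" is exactly as (dis)similar to "dog" as to "fish"
print(one_hot["cat"] @ one_hot["dog"])   # 0.0
print(one_hot["cat"] @ one_hot["fish"])  # 0.0
```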

The Distributional Hypothesis

"You shall know a word by the company it keeps" - J.R. Firth (1957)

Words that appear in similar contexts tend to have similar meanings. This is the foundation of all word embedding methods.

Context examples:
"The ___ chased the mouse"     → cat, dog, ferret
"I love my pet ___"            → cat, dog, hamster
"The ___ barked loudly"        → dog (only)
"The ___ purred softly"        → cat (only)

Words that fill similar blanks → similar meanings

Word2Vec (2013)

Word2Vec, introduced by Mikolov et al. at Google, was the breakthrough that made dense word embeddings practical. It uses neural networks to learn word vectors from large text corpora.

Two Architectures

1. Skip-gram: Predict Context from Word

Given: "The quick brown fox jumps over"
Target word: "fox"
Window size: 2

Predict: [quick, brown] and [jumps, over]

Training pairs:
(fox, quick), (fox, brown), (fox, jumps), (fox, over)

Model learns: words that appear in similar contexts 
              → similar embeddings
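
Generating these training pairs is simple to sketch. The following is a minimal, illustrative implementation (the function name `skipgram_pairs` is ours, not from any library):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as in Skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        # Every word within `window` positions of the target is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over".split()
fox_pairs = [p for p in skipgram_pairs(tokens) if p[0] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```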

2. CBOW (Continuous Bag of Words): Predict Word from Context

Given context: [brown, quick, jumps, over]
Predict: "fox"

CBOW is faster to train, but Skip-gram works better for rare words.

Skip-gram Architecture

Input (one-hot)    Hidden (embedding)    Output (softmax)
   [V x 1]         →     [D x 1]       →     [V x 1]
   
   "fox"           →   [0.2, -0.5, ...]  →  P(brown|fox)
   
V = vocabulary size (e.g., 50,000)
D = embedding dimension (e.g., 300)

The hidden layer IS the word embedding!
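
This "lookup" view is easy to demonstrate: multiplying a one-hot vector by the input weight matrix simply selects one row of it. A toy NumPy sketch (dimensions and seed chosen arbitrarily for illustration):

```python
import numpy as np

V, D = 6, 4  # tiny vocabulary and embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))  # input-to-hidden weight matrix

# One-hot input for the word at vocabulary index 3
x = np.zeros(V)
x[3] = 1.0

# The matrix product reduces to selecting row 3 of W_in:
# the hidden activation is just a table lookup of the embedding
hidden = x @ W_in
print(np.allclose(hidden, W_in[3]))  # True
```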

Training Objective

# Maximize probability of context words given target word
# For each (target, context) pair:

P(context | target) = softmax(W_context · W_target)

# Problem: softmax over 50K vocabulary is expensive!
# Solutions:
# 1. Negative Sampling: Sample k random "negative" words
# 2. Hierarchical Softmax: Use binary tree structure

Negative Sampling

# Instead of computing softmax over entire vocabulary,
# sample k negative examples

# For pair (fox, brown):
# Positive: maximize P(brown | fox) 
# Negatives: minimize P(democracy | fox), P(quantum | fox), ...

loss = -log(σ(v_context · v_target)) 
       - Σ log(σ(-v_negative · v_target))

# Typical k = 5-20 negative samples
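
The loss above can be computed directly in NumPy. This is an illustrative sketch with random vectors, not a training loop (the helper names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(v_target, v_context, v_negatives):
    """Negative sampling loss for one (target, context) pair."""
    # Pull the true context vector toward the target...
    pos = -np.log(sigmoid(v_context @ v_target))
    # ...and push k sampled negative vectors away from it
    neg = -sum(np.log(sigmoid(-v_n @ v_target)) for v_n in v_negatives)
    return pos + neg

rng = np.random.default_rng(0)
d = 8
loss = neg_sampling_loss(rng.normal(size=d), rng.normal(size=d),
                         rng.normal(size=(5, d)))  # k = 5 negatives
print(loss)  # always positive; training adjusts vectors to reduce it
```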

Using Word2Vec

from gensim.models import Word2Vec

# Training from scratch
sentences = [
    ["the", "quick", "brown", "fox"],
    ["machine", "learning", "is", "fun"],
    # ... more sentences
]

model = Word2Vec(
    sentences,
    vector_size=300,    # Embedding dimensions
    window=5,           # Context window size
    min_count=5,        # Ignore rare words
    workers=4,          # Parallel threads
    sg=1,               # 1 for Skip-gram, 0 for CBOW
    epochs=10
)

# Get embedding for a word
fox_embedding = model.wv['fox']  # Shape: (300,)

# Find similar words
similar = model.wv.most_similar('king', topn=5)
# [('queen', 0.85), ('prince', 0.78), ('monarch', 0.75), ...]

# Word analogies
result = model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)
# [('queen', 0.89), ...]

GloVe (2014)

GloVe (Global Vectors) from Stanford combines the strengths of count-based methods and prediction-based methods like Word2Vec.

Key Insight

Word co-occurrence statistics contain rich semantic information. GloVe directly factorizes the co-occurrence matrix.

Co-occurrence Matrix

Count how often words appear together in a context window:

        the   cat   sat   on    mat
the      -     5     2     8     1
cat      5     -     7     3     4
sat      2     7     -     2     3
on       8     3     2     -     1
mat      1     4     3     1     -

X[i,j] = count of word i appearing near word j
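
Building such a matrix is straightforward. A minimal sketch that stores the counts sparsely in a `Counter` of word pairs (the window size and function name are our choices):

```python
from collections import Counter

def cooccurrence(sentences, window=2):
    """Count co-occurrences of word pairs within a context window."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[(w, sent[j])] += 1
    return counts

X = cooccurrence([["the", "cat", "sat", "on", "the", "mat"]])
print(X[("cat", "sat")])  # 1
print(X[("the", "cat")])  # 1
```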

GloVe Objective

# Learn vectors such that:
# w_i · w_j ≈ log(X[i,j])

# Objective function:
J = Σ f(X_ij) * (w_i · w_j + b_i + b_j - log(X_ij))²

# f(x) is a weighting function that:
# - Downweights very frequent pairs (the, a, is)
# - Doesn't over-weight rare pairs

Using Pre-trained GloVe

import numpy as np

def load_glove(path: str) -> dict:
    """Load pre-trained GloVe embeddings."""
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Load GloVe (download from: nlp.stanford.edu/projects/glove/)
glove = load_glove('glove.6B.300d.txt')

# Get embedding
king = glove['king']  # Shape: (300,)

# Similarity (cosine)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(glove['king'], glove['queen'])
print(f"Similarity: {sim:.3f}")  # ~0.75

FastText (2016)

FastText from Facebook extends Word2Vec by using subword information.

Subword Embeddings

Word: "unhappiness"
Character n-grams (n=3): <un, unh, nha, hap, app, ppi, pin, ine, nes, ess, ss>
Plus the word itself: <unhappiness>

Word embedding = sum of all subword embeddings
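
The n-gram extraction can be sketched in a few lines (with boundary markers `<` and `>`, matching the example above; the function name is ours):

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, as FastText uses."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("unhappiness"))
# ['<un', 'unh', 'nha', 'hap', 'app', 'ppi', 'pin', 'ine', 'nes', 'ess', 'ss>']
```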

Benefits:
1. Handle out-of-vocabulary words
2. Better for morphologically rich languages
3. Share information between related words: "happy", "happier", "happiness"

Using FastText

from gensim.models import FastText

model = FastText(
    sentences,
    vector_size=300,
    window=5,
    min_count=5,
    min_n=3,  # Minimum n-gram length
    max_n=6   # Maximum n-gram length
)

# Can get embeddings for OOV words!
oov_embedding = model.wv['asdfghjkl']  # Works by summing subword vectors

Word Embedding Properties

Semantic Relationships

Similar words are close in vector space:
- happy, joyful, pleased (cluster together)
- sad, unhappy, miserable (cluster together)
- These clusters are far apart

Linear Analogies

Famous examples:
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walked - walk + swim ≈ swam

The relationship is captured as a vector offset!
vec(king) - vec(man) ≈ vec(queen) - vec(woman)
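
The offset idea can be illustrated with hand-picked toy vectors (the values below are invented for illustration, not learned embeddings):

```python
import numpy as np

# Toy 3-d embeddings: the (king, man) offset roughly matches (queen, woman)
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def analogy(a, b, c):
    """Return the word whose vector is closest (cosine) to a - b + c."""
    target = emb[a] - emb[b] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the query words themselves, as gensim does
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: cos(emb[w], target))

print(analogy("king", "man", "woman"))  # queen
```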

Limitations of Word Embeddings

  • Context-Independent - Same embedding for "bank" whether it's a river bank or money bank
  • Word-Level Only - No direct way to get sentence or document embeddings
  • Static - Embeddings don't change based on surrounding text
  • OOV Problem - Word2Vec/GloVe can't handle words not in vocabulary (FastText solves this)

Comparison: Word2Vec vs GloVe vs FastText

Aspect          Word2Vec           GloVe                FastText
Method          Prediction-based   Count + Prediction   Prediction + Subwords
OOV Handling    ❌ No              ❌ No                ✅ Yes (subwords)
Speed           Fast               Fast                 Slower (more params)
Best For        General NLP        General NLP          Morphology-rich langs

Practical Exercise

# Explore word embeddings interactively
import gensim.downloader as api

# Download pre-trained Word2Vec (Google News, 3M words, 300d)
model = api.load("word2vec-google-news-300")

# Explore relationships
print(model.most_similar('king'))
print(model.most_similar(positive=['paris', 'germany'], negative=['france']))
print(model.most_similar(positive=['walking', 'swam'], negative=['walked']))

# Find the odd one out
print(model.doesnt_match(['breakfast', 'lunch', 'dinner', 'computer']))

# Similarity
print(model.similarity('cat', 'dog'))
print(model.similarity('cat', 'democracy'))

Key Takeaways

  • Word embeddings represent words as dense vectors capturing semantic meaning
  • Word2Vec learns from local context windows using prediction
  • GloVe combines global co-occurrence statistics with prediction
  • FastText adds subword information for OOV handling
  • All have a key limitation: context-independent (same embedding regardless of usage)
  • For RAG, we need sentence/document embeddings - covered next lesson

In the next lesson, we'll explore sentence and document embeddings - the type we actually use in RAG systems.