Approximate Nearest Neighbor (ANN) algorithms are the secret sauce that makes vector search practical at scale. This lesson explores the major indexing algorithms: HNSW, IVF, LSH, and Product Quantization.
Exact search: Check every vector
┌─────────────────────────────────────────────┐
│ Query → Compare with ALL 10M vectors → Sort │
│ Time: O(n × d) = O(10M × 768) ≈ 1 second │
└─────────────────────────────────────────────┘
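That exact baseline is only a few lines of NumPy. The sketch below uses toy sizes (1,000 × 8); the same O(n × d) scan is what costs ~1 second at 10M × 768:

```python
import numpy as np

# Brute-force k-NN: compare the query against every stored vector.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8)).astype(np.float32)
query = vectors[42] + 0.001  # a query almost identical to vector 42

dists = np.linalg.norm(vectors - query, axis=1)  # O(n * d) distance pass
top_k = np.argsort(dists)[:3]                    # sort, keep the 3 nearest
print(top_k[0])  # vector 42 comes back first
```

Every ANN structure below is a way to avoid that full distance pass.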
ANN: Smart data structures
┌─────────────────────────────────────────────┐
│ Query → Navigate index → Check ~1000 vectors│
│ Time: O(log n × d) ≈ 1-10 ms │
└─────────────────────────────────────────────┘
100-1000x speedup with 95-99% accuracy!

## HNSW (Hierarchical Navigable Small World)

HNSW is the most popular ANN algorithm, used by Pinecone, Weaviate, Qdrant, and others. It combines ideas from skip lists and navigable small-world graphs.
In a small-world graph, you can reach any node from any other node in just a few hops (like "six degrees of separation").
Multi-layer graph structure:
Layer 2 (sparse):  A ←──────────────────→ F
                    \                    /
                     \                  /
Layer 1 (medium):    A ← B ← C ← D ← E ← F
                      \   |   |   |   /
                       \  |   |   |  /
Layer 0 (dense):     A-B-C-D-E-F-G-H-I-J-K-L
- Higher layers: Long-distance connections (express lanes)
- Lower layers: Short-distance connections (local roads)
- Layer 0: All nodes with local connections

```python
def hnsw_search(query, graph, k, ef):
    """
    Simplified HNSW search (illustrative pseudocode).

    Args:
        query: Query vector
        graph: HNSW graph structure
        k:     Number of results
        ef:    Search breadth (exploration factor)
    """
    # Start at the entry point on the top layer
    current_node = graph.entry_point

    # Greedy descent through the upper layers (top layer down to layer 1)
    for layer in range(graph.max_layer, 0, -1):
        while True:
            # Move to the closest neighbor until none is closer
            neighbors = graph.get_neighbors(current_node, layer)
            closest = min(neighbors, key=lambda n: distance(query, n))
            if distance(query, closest) >= distance(query, current_node):
                break  # Local minimum on this layer
            current_node = closest

    # At layer 0: breadth-limited search over up to ef candidates
    candidates = [current_node]
    visited = {current_node}
    results = [(distance(query, current_node), current_node)]
    while candidates and len(visited) < ef:
        node = candidates.pop(0)
        for neighbor in graph.get_neighbors(node, 0):
            if neighbor not in visited:
                visited.add(neighbor)
                candidates.append(neighbor)
                results.append((distance(query, neighbor), neighbor))

    # Return the k closest of the explored nodes
    return sorted(results)[:k]
```

| Parameter | Effect | Typical Values |
|---|---|---|
| M | Max connections per node. Higher = better recall, more memory | 16-64 |
| ef_construction | Search breadth during index building. Higher = better quality, slower build | 100-400 |
| ef_search | Search breadth at query time. Higher = better recall, slower query | 50-200 |
```python
# HNSW in FAISS
import faiss

# Build index
dimension = 768
M = 32  # Connections per node
index = faiss.IndexHNSWFlat(dimension, M)

# Set construction parameter
index.hnsw.efConstruction = 200

# Add vectors (a float32 array of shape [n, dimension])
index.add(embeddings)

# Set search parameter (can change at query time)
index.hnsw.efSearch = 100

# Search (query is a float32 array of shape [num_queries, dimension])
D, I = index.search(query, k=10)
```

## IVF (Inverted File Index)

IVF clusters vectors and only searches within relevant clusters.
1. Training Phase: Cluster vectors using k-means
┌─────────────────────────────────────────────┐
│ 10M vectors → k-means with 1000 centroids │
│ Each vector assigned to nearest centroid │
└─────────────────────────────────────────────┘
2. Indexing: Store vectors in cluster buckets
Cluster 1: [v23, v145, v892, ...]
Cluster 2: [v7, v89, v234, ...]
...
Cluster 1000: [v12, v567, ...]
3. Search: Find closest centroids, search only those clusters
Query → Find 10 closest centroids → Search ~10,000 vectors
   (Instead of all 10M vectors)

```python
# IVF in FAISS
import faiss

dimension = 768
nlist = 1000  # Number of clusters

# Create IVF index
quantizer = faiss.IndexFlatL2(dimension)  # Used to find the closest centroids
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Must train before adding vectors
index.train(embeddings)  # Learn cluster centroids
index.add(embeddings)

# Set search parameter
index.nprobe = 50  # Search the 50 closest clusters
D, I = index.search(query, k=10)
```

## Product Quantization (PQ)

PQ compresses vectors to reduce memory while maintaining search quality.
Original vector: 768 dimensions × 4 bytes = 3072 bytes
PQ Compression:
1. Split vector into M subspaces (e.g., M=8)
[d1...d96] [d97...d192] ... [d673...d768]
2. Learn codebook for each subspace (256 codes each)
Subspace 1: 256 representative vectors
Subspace 2: 256 representative vectors
...
3. Replace each subspace with code ID (1 byte)
[code_1] [code_2] ... [code_8] = 8 bytes
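At toy scale, the steps above can be sketched with NumPy. This is an illustrative mini-PQ with made-up sizes (8-dim vectors, M=2 subspaces, a naive k-means), not FAISS's optimized implementation:

```python
import numpy as np

# Toy product quantization: 8-dim vectors, M=2 subspaces of 4 dims,
# K=256 centroids per subspace.
rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 8)).astype(np.float32)
M, K, sub = 2, 256, 4

def kmeans(x, k, iters=10):
    """A few rounds of plain k-means (illustration only)."""
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centroids) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids

# Step 2: learn one codebook per subspace
codebooks = [kmeans(data[:, m * sub:(m + 1) * sub], K) for m in range(M)]

def pq_encode(v):
    # Step 3: replace each subspace with the ID of its nearest centroid
    return np.array(
        [np.argmin(((codebooks[m] - v[m * sub:(m + 1) * sub]) ** 2).sum(-1))
         for m in range(M)],
        dtype=np.uint8,
    )

def pq_decode(codes):
    # Approximate reconstruction by looking codes up in the codebooks
    return np.concatenate([codebooks[m][codes[m]] for m in range(M)])

v = data[0]
codes = pq_encode(v)      # 32 bytes of float32 -> 2 bytes of codes
approx = pq_decode(codes)
print(codes.nbytes)  # → 2
```

Real PQ search doesn't reconstruct vectors like `pq_decode` does; it compares the query against the codebooks directly (asymmetric distance computation), which keeps queries fast on the compressed codes.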
Compression ratio: 3072 / 8 = 384x!

```python
# IVF with PQ in FAISS
import faiss

dimension = 768
nlist = 1000
m = 8      # Number of subquantizers
nbits = 8  # Bits per code (256 codes per subquantizer)

quantizer = faiss.IndexFlatL2(dimension)  # Coarse quantizer for IVF
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
index.train(embeddings)
index.add(embeddings)

# Much smaller memory footprint for the vectors themselves:
# 10M vectors: 768 dims × 4 bytes × 10M = ~30 GB → 8 bytes × 10M = ~80 MB
```

## LSH (Locality-Sensitive Hashing)

LSH uses hash functions that map similar vectors to the same buckets.
Key property: Similar vectors → Same hash bucket (with high probability)
1. Create random hyperplanes
2. Hash vector based on which side of each hyperplane it falls
3. Similar vectors likely on same side → same bucket
Example:
v1 = [0.5, 0.8, 0.3] → hash = "10110" → bucket 22
v2 = [0.6, 0.75, 0.35] → hash = "10110" → bucket 22 (similar!)
v3 = [-0.5, -0.8, 0.9] → hash = "01001" → bucket 9 (different)

LSH is less accurate than HNSW/IVF but very fast for certain use cases.
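A minimal version of this hashing scheme is a few lines of NumPy. This sketch computes random-hyperplane signatures only; production LSH systems use many hash tables to trade memory for recall:

```python
import numpy as np

def lsh_signature(v, hyperplanes):
    """One bit per hyperplane: 1 if the vector is on its positive side."""
    return ''.join('1' if np.dot(h, v) >= 0 else '0' for h in hyperplanes)

rng = np.random.default_rng(0)
hyperplanes = rng.normal(size=(5, 3))  # 5 random hyperplanes in 3-D

v1 = np.array([0.5, 0.8, 0.3])
v2 = np.array([0.6, 0.75, 0.35])   # points almost the same way as v1
v3 = np.array([-0.5, -0.8, 0.9])   # points a very different way

sig1, sig2, sig3 = (lsh_signature(v, hyperplanes) for v in (v1, v2, v3))
print(sig1, sig2, sig3)
# Each bit differs between two vectors with probability angle/pi,
# so the near-duplicate pair shares far more bits (and hence a
# bucket) than the dissimilar pair.
```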
## Algorithm Comparison

| Algorithm | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|
| HNSW | Very Fast | Excellent | High | General purpose, production |
| IVF | Fast | Good | Medium | Large datasets, memory-constrained |
| IVF-PQ | Fast | Moderate | Very Low | Billion+ scale, limited memory |
| LSH | Very Fast | Moderate | Low | Duplicate detection, streaming |
## Tuning HNSW Parameters

```python
# Guidelines for HNSW tuning
def recommend_hnsw_params(
    num_vectors: int,
    target_recall: float = 0.95,
    query_latency_budget_ms: float = 10,
) -> dict:
    """Recommend HNSW parameters based on requirements."""
    if num_vectors < 100_000:
        return {"M": 16, "ef_construction": 100, "ef_search": 50}
    elif num_vectors < 1_000_000:
        if target_recall > 0.98:
            return {"M": 32, "ef_construction": 200, "ef_search": 100}
        else:
            return {"M": 16, "ef_construction": 100, "ef_search": 50}
    else:  # 1M+ vectors
        if query_latency_budget_ms > 50:
            # Quality priority
            return {"M": 48, "ef_construction": 400, "ef_search": 200}
        else:
            # Speed priority
            return {"M": 24, "ef_construction": 200, "ef_search": 100}

# Rules of thumb:
# - ef_search should be >= k (the number of results you want)
# - Higher M = more memory but better recall
# - ef_construction only matters at build time
```

## Measuring Recall

```python
import numpy as np

def evaluate_ann_recall(
    index_approx,
    index_exact,
    test_queries: np.ndarray,
    k: int = 10,
) -> float:
    """Calculate recall: what fraction of the true top-k are found?"""
    total_recall = 0.0
    for query in test_queries:
        # Ground truth from exact search
        _, true_neighbors = index_exact.search(query.reshape(1, -1), k)
        true_set = set(true_neighbors[0])

        # Approximate search results
        _, approx_neighbors = index_approx.search(query.reshape(1, -1), k)
        approx_set = set(approx_neighbors[0])

        # Recall for this query
        total_recall += len(true_set & approx_set) / k
    return total_recall / len(test_queries)

# Typical results:
# HNSW with good params: 95-99% recall
# IVF with nprobe=50:    90-95% recall
# IVF-PQ:                85-95% recall
```

## Estimating Memory Usage

```python
def estimate_memory_gb(
    num_vectors: int,
    dimensions: int,
    index_type: str = "hnsw",
    M: int = 16,
) -> float:
    """Estimate memory usage for different index types."""
    # Base vector storage (float32)
    vector_bytes = num_vectors * dimensions * 4
    if index_type == "flat":
        # Just the raw vectors
        return vector_bytes / (1024**3)
    elif index_type == "hnsw":
        # Vectors + graph links: each node has ~M connections on layer 0
        # and ~M/2 on higher layers, so roughly 2*M links per node
        # at 8 bytes per link
        graph_bytes = num_vectors * 2 * M * 8
        return (vector_bytes + graph_bytes) / (1024**3)
    elif index_type == "ivf_pq":
        # Compressed vectors (e.g., 64 bytes per vector with m=64)
        m = 64  # subquantizers
        compressed_bytes = num_vectors * m
        # Plus the codebooks
        codebook_bytes = 256 * (dimensions // m) * 4 * m
        return (compressed_bytes + codebook_bytes) / (1024**3)
    raise ValueError(f"unknown index_type: {index_type}")

# Example: 10M vectors, 768 dimensions
print(estimate_memory_gb(10_000_000, 768, "flat"))      # ~28.6 GB
print(estimate_memory_gb(10_000_000, 768, "hnsw", 32))  # ~33.4 GB
print(estimate_memory_gb(10_000_000, 768, "ivf_pq"))    # ~0.6 GB
```

In the next lesson, we'll explore distance metrics in more detail and when to use each.