Vector Databases & Embeddings

Indexing Algorithms: ANN and HNSW

Approximate Nearest Neighbor (ANN) algorithms are the secret sauce that makes vector search practical at scale. This lesson explores the major indexing algorithms: HNSW, IVF, LSH, and Product Quantization.

Why We Need ANN Algorithms

Exact search: Check every vector
┌─────────────────────────────────────────────┐
│ Query → Compare with ALL 10M vectors → Sort │
│ Time: O(n × d) = O(10M × 768) ≈ 1 second   │
└─────────────────────────────────────────────┘

ANN: Smart data structures
┌─────────────────────────────────────────────┐
│ Query → Navigate index → Check ~1000 vectors│
│ Time: O(log n × d) ≈ 1-10 ms               │
└─────────────────────────────────────────────┘

100-1000x speedup with 95-99% accuracy!
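For intuition, exact search really is just a full scan: compute the distance to every stored vector, then sort. A minimal NumPy sketch (corpus and dimensions shrunk for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((100_000, 64)).astype(np.float32)  # stand-in corpus
query = rng.standard_normal(64).astype(np.float32)

# Exact search: distance to ALL vectors, then sort.
# This is the O(n * d) scan that ANN indexes avoid.
distances = np.linalg.norm(vectors - query, axis=1)
top_k = np.argsort(distances)[:10]  # indices of the 10 nearest vectors
```

Every ANN structure in this lesson is a way to avoid touching most of those 100,000 rows.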

HNSW (Hierarchical Navigable Small World)

HNSW is the most popular ANN algorithm, used by Pinecone, Weaviate, Qdrant, and others. It combines ideas from skip lists and navigable small-world graphs.

Key Insight: Small World Networks

In a small-world graph, you can reach any node from any other node in just a few hops (like "six degrees of separation").

HNSW Structure

Multi-layer graph structure:

Layer 2 (sparse):    A ←──────→ F
                      \         /
                       \       /
Layer 1 (medium):    A ← B ← C ← D ← E ← F
                      \   |   |   |   /
                       \  |   |   |  /
Layer 0 (dense):     A-B-C-D-E-F-G-H-I-J-K-L

- Higher layers: Long-distance connections (express lanes)
- Lower layers: Short-distance connections (local roads)
- Layer 0: All nodes with local connections

Search Algorithm

import heapq

def hnsw_search(query, graph, k, ef):
    """
    HNSW search algorithm (simplified; `distance` is an assumed helper
    comparing the query against a node's vector).

    Args:
        query: Query vector
        graph: HNSW graph structure
        k: Number of results
        ef: Search breadth (exploration factor); should be >= k
    """
    # Start at the entry point on the top layer
    current_node = graph.entry_point

    # Greedy descent through the upper layers
    for layer in range(graph.max_layer, 0, -1):
        # Greedy search: move to the closest neighbor until stuck
        while True:
            neighbors = graph.get_neighbors(current_node, layer)
            closest = min(neighbors, key=lambda n: distance(query, n))

            if distance(query, closest) >= distance(query, current_node):
                break  # Local minimum on this layer; drop down one layer
            current_node = closest

    # At layer 0: best-first search over up to ef candidates
    start = (distance(query, current_node), current_node)
    candidates = [start]   # min-heap ordered by distance to the query
    visited = {current_node}
    results = [start]

    while candidates:
        dist, node = heapq.heappop(candidates)

        # Stop once the nearest unexplored candidate cannot improve
        # on the worst of the best ef results found so far
        if len(results) >= ef and dist > max(r[0] for r in results):
            break

        for neighbor in graph.get_neighbors(node, 0):
            if neighbor not in visited:
                visited.add(neighbor)
                d = distance(query, neighbor)
                heapq.heappush(candidates, (d, neighbor))
                results.append((d, neighbor))

        results = sorted(results)[:ef]  # keep only the ef closest

    # Return the top k
    return sorted(results)[:k]

HNSW Parameters

Parameter        Effect                                                                       Typical Values
M                Max connections per node. Higher = better recall, more memory                16-64
ef_construction  Search breadth during index building. Higher = better quality, slower build  100-400
ef_search        Search breadth at query time. Higher = better recall, slower query           50-200

# HNSW in FAISS
import faiss

# Build index
dimension = 768
M = 32  # Connections per node
index = faiss.IndexHNSWFlat(dimension, M)

# Set construction parameter
index.hnsw.efConstruction = 200

# Add vectors
index.add(embeddings)

# Set search parameter (can change at query time)
index.hnsw.efSearch = 100

# Search (query must be a 2D array of shape (num_queries, dimension))
D, I = index.search(query, 10)  # D: distances, I: indices

IVF (Inverted File Index)

IVF clusters vectors and only searches within relevant clusters.

How IVF Works

1. Training Phase: Cluster vectors using k-means
   ┌─────────────────────────────────────────────┐
   │ 10M vectors → k-means with 1000 centroids   │
   │ Each vector assigned to nearest centroid    │
   └─────────────────────────────────────────────┘

2. Indexing: Store vectors in cluster buckets
   Cluster 1: [v23, v145, v892, ...]
   Cluster 2: [v7, v89, v234, ...]
   ...
   Cluster 1000: [v12, v567, ...]

3. Search: Find closest centroids, search only those clusters
   Query → Find 10 closest centroids → Search ~10,000 vectors
   (Instead of all 10M vectors)
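The three phases above can be sketched with plain NumPy (toy sizes; randomly sampled vectors stand in for the k-means centroids a real index would train):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 32)).astype(np.float32)

# 1. "Training": pick centroids (a real index runs k-means here)
nlist = 50
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]

# 2. Indexing: assign every vector to its nearest centroid (inverted lists)
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)
inverted_lists = {c: np.where(assignments == c)[0] for c in range(nlist)}

# 3. Search: probe only the nprobe clusters whose centroids are closest
query = rng.standard_normal(32).astype(np.float32)
nprobe = 5
closest_clusters = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
candidates = np.concatenate([inverted_lists[c] for c in closest_clusters])

# Exact distances, but only over the probed candidates
dists = np.linalg.norm(vectors[candidates] - query, axis=1)
top_10 = candidates[np.argsort(dists)[:10]]
```

With nprobe=5 of 50 clusters, the final distance computation touches roughly a tenth of the corpus.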

IVF Parameters

  • nlist - Number of clusters. More = finer partitioning, slower build
  • nprobe - Number of clusters to search. More = better recall, slower search

# IVF in FAISS
import faiss

dimension = 768
nlist = 1000  # Number of clusters

# Create IVF index
quantizer = faiss.IndexFlatL2(dimension)  # Used to find closest clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Must train before adding vectors
index.train(embeddings)  # Learn cluster centroids
index.add(embeddings)

# Set search parameter
index.nprobe = 50  # Search 50 clusters

D, I = index.search(query, k=10)

Product Quantization (PQ)

PQ compresses vectors to reduce memory while maintaining search quality.

How PQ Works

Original vector: 768 dimensions × 4 bytes = 3072 bytes

PQ Compression:
1. Split vector into M subspaces (e.g., M=8)
   [d1...d96] [d97...d192] ... [d673...d768]
   
2. Learn codebook for each subspace (256 codes each)
   Subspace 1: 256 representative vectors
   Subspace 2: 256 representative vectors
   ...

3. Replace each subspace with code ID (1 byte)
   [code_1] [code_2] ... [code_8] = 8 bytes

Compression ratio: 3072 / 8 = 384x!
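The encode/decode steps above can be sketched in NumPy (toy sizes; a random sample of training vectors stands in for the k-means codebook learning):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 64, 8, 256           # dimensions, subspaces, codes per subspace
sub_d = d // m                 # 8 dimensions per subspace
train = rng.standard_normal((5_000, d)).astype(np.float32)

# Learn one codebook per subspace (random sample stands in for k-means)
codebooks = [
    train[rng.choice(len(train), k, replace=False), i * sub_d:(i + 1) * sub_d]
    for i in range(m)
]

def pq_encode(v):
    # Replace each subvector with the id of its nearest codebook entry
    return np.array([
        np.argmin(np.linalg.norm(codebooks[i] - v[i * sub_d:(i + 1) * sub_d], axis=1))
        for i in range(m)
    ], dtype=np.uint8)

def pq_decode(codes):
    # Approximate reconstruction: concatenate the chosen codebook entries
    return np.concatenate([codebooks[i][codes[i]] for i in range(m)])

v = rng.standard_normal(d).astype(np.float32)
codes = pq_encode(v)    # 8 one-byte codes instead of 64 * 4 = 256 bytes
approx = pq_decode(codes)
```

The reconstruction is lossy; search quality depends on how well the codebooks cover the data distribution.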

# IVF with PQ in FAISS
import faiss

dimension = 768
nlist = 1000
m = 8  # Number of subquantizers
nbits = 8  # Bits per code (256 codes per subquantizer)

quantizer = faiss.IndexFlatL2(dimension)  # Coarse quantizer for the IVF step
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
index.train(embeddings)
index.add(embeddings)

# Much smaller memory footprint!
# 10M vectors: 768 × 4 bytes × 10M ≈ 30 GB → 8 bytes × 10M = 80 MB

LSH (Locality Sensitive Hashing)

LSH uses hash functions that map similar vectors to the same buckets.

How LSH Works

Key property: Similar vectors → Same hash bucket (with high probability)

1. Create random hyperplanes
2. Hash vector based on which side of each hyperplane it falls
3. Similar vectors likely on same side → same bucket

Example:
v1 = [0.5, 0.8, 0.3] → hash = "10110" → bucket 22
v2 = [0.6, 0.75, 0.35] → hash = "10110" → bucket 22  (similar!)
v3 = [-0.5, -0.8, 0.9] → hash = "01001" → bucket 9   (different)
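The hyperplane hashing above can be sketched as follows (the specific bit strings will differ from the illustration, since the hyperplanes are random):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 3, 5

# Random hyperplanes through the origin: one normal vector per plane
planes = rng.standard_normal((n_planes, dim))

def lsh_hash(v):
    # One bit per hyperplane: which side of the plane does v fall on?
    bits = (planes @ v) >= 0
    return "".join("1" if b else "0" for b in bits)

v1 = np.array([0.5, 0.8, 0.3])
v2 = np.array([0.6, 0.75, 0.35])   # close to v1
v3 = np.array([-0.5, -0.8, 0.9])   # far from v1

# Nearby vectors usually land in the same bucket; distant ones usually don't
print(lsh_hash(v1), lsh_hash(v2), lsh_hash(v3))
```

In practice many independent hash tables are combined, so that similar vectors collide in at least one of them with high probability.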

LSH is less accurate than HNSW/IVF but very fast for certain use cases.

Comparing ANN Algorithms

Algorithm  Speed      Accuracy   Memory    Best For
HNSW       Very Fast  Excellent  High      General purpose, production
IVF        Fast       Good       Medium    Large datasets, memory-constrained
IVF-PQ     Fast       Moderate   Very Low  Billion+ scale, limited memory
LSH        Very Fast  Moderate   Low       Duplicate detection, streaming

Tuning for Your Use Case

# Guidelines for HNSW tuning

def recommend_hnsw_params(
    num_vectors: int,
    target_recall: float = 0.95,
    query_latency_budget_ms: float = 10
) -> dict:
    """Recommend HNSW parameters based on requirements."""
    
    if num_vectors < 100_000:
        return {"M": 16, "ef_construction": 100, "ef_search": 50}
    
    elif num_vectors < 1_000_000:
        if target_recall > 0.98:
            return {"M": 32, "ef_construction": 200, "ef_search": 100}
        else:
            return {"M": 16, "ef_construction": 100, "ef_search": 50}
    
    else:  # 1M+ vectors
        if query_latency_budget_ms > 50:
            # Quality priority
            return {"M": 48, "ef_construction": 400, "ef_search": 200}
        else:
            # Speed priority
            return {"M": 24, "ef_construction": 200, "ef_search": 100}

# Rule of thumb:
# ef_search should be >= k (number of results you want)
# Higher M = more memory but better recall
# ef_construction only matters at build time

Evaluating ANN Quality

import numpy as np

def evaluate_ann_recall(
    index_approx,
    index_exact,
    test_queries: np.ndarray,
    k: int = 10
) -> float:
    """
    Calculate recall: what fraction of true top-k are found?
    """
    total_recall = 0
    
    for query in test_queries:
        # Ground truth from exact search
        _, true_neighbors = index_exact.search(query.reshape(1, -1), k)
        true_set = set(true_neighbors[0])
        
        # Approximate search results
        _, approx_neighbors = index_approx.search(query.reshape(1, -1), k)
        approx_set = set(approx_neighbors[0])
        
        # Recall for this query
        recall = len(true_set & approx_set) / k
        total_recall += recall
    
    return total_recall / len(test_queries)

# Typical results:
# HNSW with good params: 95-99% recall
# IVF with nprobe=50: 90-95% recall
# IVF-PQ: 85-95% recall

Memory Estimation

def estimate_memory_gb(
    num_vectors: int,
    dimensions: int,
    index_type: str = "hnsw",
    M: int = 16
) -> float:
    """Estimate memory usage for different index types."""
    
    # Base vector storage (float32)
    vector_bytes = num_vectors * dimensions * 4
    
    if index_type == "flat":
        # Just vectors
        return vector_bytes / (1024**3)
    
    elif index_type == "hnsw":
        # Vectors + graph links
        # Each node has M connections on layer 0, M/2 on higher layers
        # Roughly 2*M links per node, 4 bytes (int32 id) per link
        graph_bytes = num_vectors * 2 * M * 4
        return (vector_bytes + graph_bytes) / (1024**3)
    
    elif index_type == "ivf_pq":
        # Compressed vectors (e.g., 64 bytes per vector with m=64)
        m = 64  # subquantizers
        compressed_bytes = num_vectors * m
        # Plus codebooks (256 centroids per subquantizer)
        codebook_bytes = 256 * (dimensions // m) * 4 * m
        return (compressed_bytes + codebook_bytes) / (1024**3)

    raise ValueError(f"Unknown index_type: {index_type}")

# Example: 10M vectors, 768 dimensions
print(estimate_memory_gb(10_000_000, 768, "flat"))     # ~28.6 GB
print(estimate_memory_gb(10_000_000, 768, "hnsw", 32)) # ~31.0 GB
print(estimate_memory_gb(10_000_000, 768, "ivf_pq"))   # ~0.6 GB

Key Takeaways

  • HNSW is the go-to algorithm for most production use cases
  • IVF is good when you need to balance accuracy and memory
  • PQ enables billion-scale search with limited memory
  • M and ef_search are the key parameters to tune for HNSW
  • Higher recall costs more compute; tune based on your requirements
  • Measure recall against exact search to validate quality

In the next lesson, we'll explore distance metrics in more detail and when to use each.