Vector Databases & Embeddings

Vector Search and Similarity

Vector search finds the most similar vectors to a query vector. This lesson covers the fundamentals of similarity search, different distance metrics, and when to use each.

The Nearest Neighbor Problem

Given a query vector q and a set of vectors V, find the k vectors in V that are most similar to q.

Query: q = [0.2, 0.5, -0.1, 0.8]

Database vectors:
v1 = [0.3, 0.4, -0.2, 0.7]  → distance = 0.20 ✓ (close)
v2 = [0.9, -0.3, 0.5, 0.1]  → distance = 1.41
v3 = [0.25, 0.48, -0.12, 0.78] → distance = 0.06 ✓ (closest)
...

Return top-k closest vectors
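
The worked example above can be reproduced with a few lines of NumPy, using brute-force L2 distance (the vector values are the ones from the example):

```python
import numpy as np

# Query and database vectors from the example above
q = np.array([0.2, 0.5, -0.1, 0.8])
V = np.array([
    [0.3, 0.4, -0.2, 0.7],     # v1
    [0.9, -0.3, 0.5, 0.1],     # v2
    [0.25, 0.48, -0.12, 0.78], # v3
])

# Euclidean distance from q to every database vector
distances = np.linalg.norm(V - q, axis=1)
print(distances.round(2))  # ~[0.20 1.41 0.06]

# Indices of the top-k closest vectors (smallest distance first)
k = 2
top_k = np.argsort(distances)[:k]
print(top_k)  # [2 0] -> v3 is closest, then v1
```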

Distance and Similarity Metrics

1. Euclidean Distance (L2)

Straight-line distance between two points in space.

import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    L2 distance: sqrt(Σ(a_i - b_i)²)
    
    - Range: [0, ∞)
    - Lower is more similar
    """
    return np.sqrt(np.sum((a - b) ** 2))

# Example
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(euclidean_distance(a, b))  # 5.196

# Alternative using numpy
distance = np.linalg.norm(a - b)

When to use: When absolute magnitude matters. Good for un-normalized embeddings.

2. Cosine Similarity

Measures the angle between two vectors, ignoring magnitude.

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    cos(θ) = (a · b) / (||a|| × ||b||)
    
    - Range: [-1, 1]
    - 1 = identical direction
    - 0 = orthogonal (unrelated)
    - -1 = opposite direction
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance = 1 - cosine_similarity"""
    return 1 - cosine_similarity(a, b)

# Example
a = np.array([1, 2, 3])
b = np.array([2, 4, 6])  # Same direction, different magnitude
print(cosine_similarity(a, b))  # 1.0 (identical direction)

c = np.array([1, 0, 0])
d = np.array([0, 1, 0])  # Orthogonal
print(cosine_similarity(c, d))  # 0.0

When to use: Most common for text embeddings. Recommended default for RAG.

3. Dot Product (Inner Product)

Simple multiplication of corresponding elements, summed.

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """
    a · b = Σ(a_i × b_i)
    
    - Range: (-∞, ∞)
    - Higher is more similar
    - Affected by magnitude
    """
    return np.dot(a, b)

# Key insight: For NORMALIZED vectors:
# cosine_similarity(a, b) = dot_product(a, b)

a_normalized = a / np.linalg.norm(a)
b_normalized = b / np.linalg.norm(b)

print(cosine_similarity(a, b))  # Same as...
print(dot_product(a_normalized, b_normalized))  # ...this!

When to use: When vectors are pre-normalized. Faster than cosine (no normalization needed at query time).

4. Manhattan Distance (L1)

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    L1 distance: Σ|a_i - b_i|
    
    - Range: [0, ∞)
    - Also called "taxicab" or "city block" distance
    """
    return np.sum(np.abs(a - b))

When to use: High-dimensional sparse data. Less common for embeddings.

Comparison of Distance Metrics

Metric          Range     Magnitude Sensitive   Best For
Cosine          [-1, 1]   No                    Text embeddings (default)
Dot Product     (-∞, ∞)   Yes                   Normalized vectors, speed
Euclidean (L2)  [0, ∞)    Yes                   Clustering, image features
Manhattan (L1)  [0, ∞)    Yes                   Sparse, high-dimensional
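
A quick sketch of how the metrics can disagree: two vectors pointing in the same direction but with different magnitudes are identical under cosine similarity, yet far apart under Euclidean distance (toy values chosen purely for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, 10x the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(round(cosine, 4))     # 1.0   -> identical under cosine
print(round(euclidean, 2))  # 33.67 -> far apart under L2
```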

Normalization for Efficient Search

def normalize_vector(v: np.ndarray) -> np.ndarray:
    """L2 normalize a vector to unit length."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm

# Pre-normalize all vectors at indexing time
normalized_embeddings = np.array([normalize_vector(e) for e in embeddings])

# At query time, only need dot product (faster!)
query = normalize_vector(query_embedding)
similarities = np.dot(normalized_embeddings, query)

# Top-k
top_k_indices = np.argsort(similarities)[-k:][::-1]

Exact vs Approximate Nearest Neighbor Search

Exact Search (Brute Force)

def exact_knn(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """
    Compute distance to ALL vectors, return top k.
    
    Complexity: O(n × d)
    n = number of vectors
    d = dimensionality
    """
    distances = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(distances)[:k]

# For 1M vectors, 768 dimensions:
# 1M × 768 = 768M operations per query
# ~100-500ms per query (not scalable!)

Approximate Nearest Neighbor (ANN)

Trade exact accuracy for speed. Returns neighbors that are "probably" in the true top-k.

Trade-off:
- Exact KNN: 100% recall, O(n) time
- ANN: 95-99% recall, O(log n) time

For 1M vectors:
- Exact: ~200ms
- ANN (HNSW): ~1-5ms

That's a 40-200x speedup!
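
Recall@k can be measured by comparing an approximate result set against the exact one. A minimal sketch follows; the "approximate" search here is simulated by scanning a random 50% subset of the data, just to illustrate the metric (a real ANN index would use a proper structure like HNSW or IVF):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype("float32")
query = rng.standard_normal(64).astype("float32")
k = 10

# Exact top-k by brute-force L2 distance (ground truth)
exact = set(np.argsort(np.linalg.norm(vectors - query, axis=1))[:k])

# Simulated ANN: only scan a random 50% subset of the vectors
subset = rng.choice(len(vectors), size=500, replace=False)
sub_dist = np.linalg.norm(vectors[subset] - query, axis=1)
approx = set(subset[np.argsort(sub_dist)[:k]])

# Recall@k = fraction of the true neighbors the approximate search found
recall = len(exact & approx) / k
print(recall)  # typically around 0.5 for a 50% scan
```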

Similarity Search in Practice

# Using FAISS (Facebook AI Similarity Search)
import faiss

# Create index
dimension = 768
index = faiss.IndexFlatL2(dimension)  # Exact search

# Add vectors (FAISS requires float32)
index.add(embeddings.astype('float32'))  # shape (n, dimension)

# Search
k = 5
query = np.array([query_embedding], dtype='float32')
distances, indices = index.search(query, k)

# indices: [[23, 142, 7, 891, 45]]  # Top-k indices
# distances: [[0.12, 0.18, 0.23, 0.31, 0.35]]  # Distances

# For approximate search (much faster)
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index.train(embeddings)  # Required for IVF
index.add(embeddings)

# Set search parameters
index.nprobe = 10  # Number of clusters to search
distances, indices = index.search(query, k)

Maximum Marginal Relevance (MMR)

MMR balances relevance with diversity to avoid redundant results.

def mmr(
    query_embedding: np.ndarray,
    candidate_embeddings: np.ndarray,
    k: int = 5,
    lambda_param: float = 0.5
) -> list[int]:
    """
    Maximum Marginal Relevance.
    
    MMR = λ × sim(query, doc) - (1-λ) × max(sim(doc, selected))
    
    λ = 1: pure relevance
    λ = 0: pure diversity
    λ = 0.5: balanced (recommended)
    """
    # Normalize for cosine similarity
    query = query_embedding / np.linalg.norm(query_embedding)
    candidates = candidate_embeddings / np.linalg.norm(
        candidate_embeddings, axis=1, keepdims=True
    )
    
    # Query similarities
    query_sims = np.dot(candidates, query)
    
    selected = []
    remaining = list(range(len(candidates)))
    
    for _ in range(k):
        if not remaining:
            break
        
        if not selected:
            # First: most similar to query
            idx = remaining[np.argmax(query_sims[remaining])]
        else:
            # MMR selection
            scores = []
            for idx in remaining:
                relevance = query_sims[idx]
                # Max similarity to already selected
                redundancy = max(np.dot(candidates[idx], candidates[s]) for s in selected)
                mmr_score = lambda_param * relevance - (1 - lambda_param) * redundancy
                scores.append((idx, mmr_score))
            
            idx = max(scores, key=lambda x: x[1])[0]
        
        selected.append(idx)
        remaining.remove(idx)
    
    return selected
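
To see the effect of the formula, consider candidates where two documents are near-duplicates. Once the most relevant one is picked, the redundancy term collapses its duplicate's score, so the next pick is the more diverse document. A single-step sketch with toy 2D vectors (λ = 0.3 here to tilt toward diversity):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

query = normalize(np.array([1.0, 0.0]))
# doc0 and doc1 are near-duplicates; doc2 is less relevant but diverse
docs = normalize(np.array([[1.0, 0.10], [1.0, 0.12], [0.6, 0.8]]))

lam = 0.3  # tilt toward diversity
query_sims = docs @ query

# Step 1: first pick is the document most similar to the query
first = int(np.argmax(query_sims))

# Step 2: MMR = lam * relevance - (1 - lam) * redundancy
scores = {
    i: lam * query_sims[i] - (1 - lam) * float(docs[i] @ docs[first])
    for i in range(len(docs)) if i != first
}
second = max(scores, key=scores.get)
print(first, second)  # 0 2 -> the near-duplicate doc1 is skipped
```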

Filtered Search

Combine vector similarity with metadata filters.

# Two approaches:

# 1. Pre-filtering: Filter first, then search
#    - Faster when filter is very selective
#    - May return fewer than k results

# 2. Post-filtering: Search first, then filter
#    - Always returns results
#    - May need to search more initially
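
The two approaches can be sketched against a toy in-memory index (`categories` is a made-up metadata array for illustration; real vector DBs handle this internally):

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.standard_normal((100, 8))
categories = np.array(["technical", "general"] * 50)  # toy metadata
query = rng.standard_normal(8)
k = 3

def top_k(vecs, idx, k):
    """Return the k indices (into the full set) closest to the query."""
    dists = np.linalg.norm(vecs - query, axis=1)
    return idx[np.argsort(dists)[:k]]

# Pre-filtering: restrict to matching vectors, then search
mask = categories == "technical"
pre = top_k(vectors[mask], np.flatnonzero(mask), k)

# Post-filtering: search everything, then drop non-matching results
candidates = top_k(vectors, np.arange(len(vectors)), k * 5)  # over-fetch
post = [i for i in candidates if categories[i] == "technical"][:k]

print(pre, post)  # both contain only "technical" results; may differ
```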

# Most vector DBs support pre-filtering:
results = collection.query(
    query_embedding=query,
    top_k=10,
    where={
        "category": "technical",
        "date": {"$gte": "2024-01-01"}
    }
)

# Hybrid approach (Qdrant example):
from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name="documents",
    query_vector=query,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="technical"))
        ]
    ),
    limit=10
)

Batch Search

# Search multiple queries at once (more efficient)

# FAISS
queries = np.array([q1, q2, q3], dtype='float32')  # Shape: (3, 768)
distances, indices = index.search(queries, k=5)
# indices shape: (3, 5) - 5 results for each of 3 queries

# Pinecone
results = index.query(
    queries=[
        {"id": "q1", "values": q1},
        {"id": "q2", "values": q2}
    ],
    top_k=5
)

Key Takeaways

  • Cosine similarity is the default for text embeddings
  • Normalize vectors at index time for faster search via dot product
  • ANN algorithms provide 95-99% recall at 40-200x speed improvement
  • MMR balances relevance with diversity
  • Metadata filtering combines similarity with structured queries
  • Batch queries are more efficient than individual queries

In the next lesson, we'll dive deep into indexing algorithms (HNSW, IVF, LSH) that make fast approximate search possible.