Vector Databases & Embeddings

Vector Search and Similarity

Vector search finds the most similar vectors to a query vector. This lesson covers the fundamentals of similarity search, different distance metrics, and when to use each.

The Nearest Neighbor Problem

Given a query vector q and a set of vectors V, find the k vectors in V that are most similar to q.

Query: q = [0.2, 0.5, -0.1, 0.8]

Database vectors:
v1 = [0.3, 0.4, -0.2, 0.7]  → distance = 0.20 ✓ (close)
v2 = [0.9, -0.3, 0.5, 0.1]  → distance = 1.41
v3 = [0.25, 0.48, -0.12, 0.78] → distance = 0.06 ✓ (closest)
...

Return top-k closest vectors
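
The worked example above can be reproduced with a few lines of NumPy, using brute-force L2 distance (the vector values are the ones from the example):

```python
import numpy as np

# Query and database vectors from the example above
q = np.array([0.2, 0.5, -0.1, 0.8])
V = np.array([
    [0.3, 0.4, -0.2, 0.7],     # v1
    [0.9, -0.3, 0.5, 0.1],     # v2
    [0.25, 0.48, -0.12, 0.78], # v3
])

# Euclidean distance from q to every database vector
distances = np.linalg.norm(V - q, axis=1)
print(distances.round(2))  # ~[0.20 1.41 0.06]

# Indices of the top-k closest vectors (smallest distance first)
k = 2
top_k = np.argsort(distances)[:k]
print(top_k)  # [2 0] -> v3 is closest, then v1
```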

Distance and Similarity Metrics

1. Euclidean Distance (L2)

Straight-line distance between two points in space.

import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    L2 distance: sqrt(Σ(a_i - b_i)²)
    
    - Range: [0, ∞)
    - Lower is more similar
    """
    return np.sqrt(np.sum((a - b) ** 2))

# Example
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(euclidean_distance(a, b))  # 5.196

# Alternative using numpy
distance = np.linalg.norm(a - b)

When to use: When absolute magnitude matters. Good for un-normalized embeddings.

2. Cosine Similarity

Measures the angle between two vectors, ignoring magnitude.

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    cos(θ) = (a · b) / (||a|| × ||b||)
    
    - Range: [-1, 1]
    - 1 = identical direction
    - 0 = orthogonal (unrelated)
    - -1 = opposite direction
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance = 1 - cosine_similarity"""
    return 1 - cosine_similarity(a, b)

# Example
a = np.array([1, 2, 3])
b = np.array([2, 4, 6])  # Same direction, different magnitude
print(cosine_similarity(a, b))  # 1.0 (identical direction)

c = np.array([1, 0, 0])
d = np.array([0, 1, 0])  # Orthogonal
print(cosine_similarity(c, d))  # 0.0

When to use: Most common for text embeddings. Recommended default for RAG.

3. Dot Product (Inner Product)

Simple multiplication of corresponding elements, summed.

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """
    a · b = Σ(a_i × b_i)
    
    - Range: (-∞, ∞)
    - Higher is more similar
    - Affected by magnitude
    """
    return np.dot(a, b)

# Key insight: For NORMALIZED vectors:
# cosine_similarity(a, b) = dot_product(a, b)

a_normalized = a / np.linalg.norm(a)
b_normalized = b / np.linalg.norm(b)

print(cosine_similarity(a, b))  # Same as...
print(dot_product(a_normalized, b_normalized))  # ...this!

When to use: When vectors are pre-normalized. Faster than cosine (no normalization needed at query time).

4. Manhattan Distance (L1)

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    L1 distance: Σ|a_i - b_i|
    
    - Range: [0, ∞)
    - Also called "taxicab" or "city block" distance
    """
    return np.sum(np.abs(a - b))

When to use: High-dimensional sparse data. Less common for embeddings.

Comparison of Distance Metrics

Metric          Range     Magnitude Sensitive   Best For
Cosine          [-1, 1]   No                    Text embeddings (default)
Dot Product     (-∞, ∞)   Yes                   Normalized vectors, speed
Euclidean (L2)  [0, ∞)    Yes                   Clustering, image features
Manhattan (L1)  [0, ∞)    Yes                   Sparse, high-dimensional
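
A quick sketch of how the metrics can disagree: two vectors pointing in the same direction but with different magnitudes are identical under cosine similarity, yet far apart under Euclidean distance (toy values chosen purely for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, 10x the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(round(cosine, 4))     # 1.0   -> identical under cosine
print(round(euclidean, 2))  # 33.67 -> far apart under L2
```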

Normalization for Efficient Search

def normalize_vector(v: np.ndarray) -> np.ndarray:
    """L2 normalize a vector to unit length."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm

# Pre-normalize all vectors at indexing time
normalized_embeddings = np.array([normalize_vector(e) for e in embeddings])

# At query time, only need dot product (faster!)
query = normalize_vector(query_embedding)
similarities = np.dot(normalized_embeddings, query)

# Top-k
top_k_indices = np.argsort(similarities)[-k:][::-1]

Exact vs Approximate Nearest Neighbor Search

Exact Search (Brute Force)

def exact_knn(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """
    Compute distance to ALL vectors, return top k.
    
    Complexity: O(n × d)
    n = number of vectors
    d = dimensionality
    """
    distances = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(distances)[:k]

# For 1M vectors, 768 dimensions:
# 1M × 768 = 768M operations per query
# ~100-500ms per query (not scalable!)

Approximate Nearest Neighbor (ANN)

Trade exact accuracy for speed. Returns neighbors that are "probably" in the true top-k.

Trade-off:
- Exact KNN: 100% recall, O(n) time
- ANN: 95-99% recall, O(log n) time

For 1M vectors:
- Exact: ~200ms
- ANN (HNSW): ~1-5ms

That's a 40-200x speedup!
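
Recall@k can be measured by comparing an approximate result set against the exact one. A minimal sketch follows; the "approximate" search here is simulated by scanning a random 50% subset of the data, just to illustrate the metric (a real ANN index would use a proper structure like HNSW or IVF):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype("float32")
query = rng.standard_normal(64).astype("float32")
k = 10

# Exact top-k by brute-force L2 distance (ground truth)
exact = set(np.argsort(np.linalg.norm(vectors - query, axis=1))[:k])

# Simulated ANN: only scan a random 50% subset of the vectors
subset = rng.choice(len(vectors), size=500, replace=False)
sub_dist = np.linalg.norm(vectors[subset] - query, axis=1)
approx = set(subset[np.argsort(sub_dist)[:k]])

# Recall@k = fraction of the true neighbors the approximate search found
recall = len(exact & approx) / k
print(recall)  # typically around 0.5 for a 50% scan
```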

Similarity Search in Practice

# Using FAISS (Facebook AI Similarity Search)
import faiss

# Create index
dimension = 768
index = faiss.IndexFlatL2(dimension)  # Exact search

# Add vectors (FAISS requires float32)
index.add(embeddings.astype('float32'))  # shape (n, dimension)

# Search
k = 5
query = np.array([query_embedding], dtype='float32')
distances, indices = index.search(query, k)

# indices: [[23, 142, 7, 891, 45]]  # Top-k indices
# distances: [[0.12, 0.18, 0.23, 0.31, 0.35]]  # Distances

# For approximate search (much faster)
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index.train(embeddings)  # Required for IVF
index.add(embeddings)

# Set search parameters
index.nprobe = 10  # Number of clusters to search
distances, indices = index.search(query, k)

Maximum Marginal Relevance (MMR)

MMR balances relevance with diversity to avoid redundant results.

def mmr(
    query_embedding: np.ndarray,
    candidate_embeddings: np.ndarray,
    k: int = 5,
    lambda_param: float = 0.5
) -> list[int]:
    """
    Maximum Marginal Relevance.
    
    MMR = λ × sim(query, doc) - (1-λ) × max(sim(doc, selected))
    
    λ = 1: pure relevance
    λ = 0: pure diversity
    λ = 0.5: balanced (recommended)
    """
    # Normalize for cosine similarity
    query = query_embedding / np.linalg.norm(query_embedding)
    candidates = candidate_embeddings / np.linalg.norm(
        candidate_embeddings, axis=1, keepdims=True
    )
    
    # Query similarities
    query_sims = np.dot(candidates, query)
    
    selected = []
    remaining = list(range(len(candidates)))
    
    for _ in range(k):
        if not remaining:
            break
        
        if not selected:
            # First: most similar to query
            idx = remaining[np.argmax(query_sims[remaining])]
        else:
            # MMR selection
            scores = []
            for idx in remaining:
                relevance = query_sims[idx]
                # Max similarity to already selected
                redundancy = max(np.dot(candidates[idx], candidates[s]) for s in selected)
                mmr_score = lambda_param * relevance - (1 - lambda_param) * redundancy
                scores.append((idx, mmr_score))
            
            idx = max(scores, key=lambda x: x[1])[0]
        
        selected.append(idx)
        remaining.remove(idx)
    
    return selected
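
To see the effect of the formula, consider candidates where two documents are near-duplicates. Once the most relevant one is picked, the redundancy term collapses its duplicate's score, so the next pick is the more diverse document. A single-step sketch with toy 2D vectors (λ = 0.3 here to tilt toward diversity):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

query = normalize(np.array([1.0, 0.0]))
# doc0 and doc1 are near-duplicates; doc2 is less relevant but diverse
docs = normalize(np.array([[1.0, 0.10], [1.0, 0.12], [0.6, 0.8]]))

lam = 0.3  # tilt toward diversity
query_sims = docs @ query

# Step 1: first pick is the document most similar to the query
first = int(np.argmax(query_sims))

# Step 2: MMR = lam * relevance - (1 - lam) * redundancy
scores = {
    i: lam * query_sims[i] - (1 - lam) * float(docs[i] @ docs[first])
    for i in range(len(docs)) if i != first
}
second = max(scores, key=scores.get)
print(first, second)  # 0 2 -> the near-duplicate doc1 is skipped
```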

Filtered Search

Combine vector similarity with metadata filters.

# Two approaches:

# 1. Pre-filtering: Filter first, then search
#    - Faster when filter is very selective
#    - May return fewer than k results

# 2. Post-filtering: Search first, then filter
#    - Always returns results
#    - May need to search more initially
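
The two approaches can be sketched against a toy in-memory index (`categories` is a made-up metadata array for illustration; real vector DBs handle this internally):

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.standard_normal((100, 8))
categories = np.array(["technical", "general"] * 50)  # toy metadata
query = rng.standard_normal(8)
k = 3

def top_k(vecs, idx, k):
    """Return the k indices (into the full set) closest to the query."""
    dists = np.linalg.norm(vecs - query, axis=1)
    return idx[np.argsort(dists)[:k]]

# Pre-filtering: restrict to matching vectors, then search
mask = categories == "technical"
pre = top_k(vectors[mask], np.flatnonzero(mask), k)

# Post-filtering: search everything, then drop non-matching results
candidates = top_k(vectors, np.arange(len(vectors)), k * 5)  # over-fetch
post = [i for i in candidates if categories[i] == "technical"][:k]

print(pre, post)  # both contain only "technical" results; may differ
```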

# Most vector DBs support pre-filtering:
results = collection.query(
    query_embedding=query,
    top_k=10,
    where={
        "category": "technical",
        "date": {"$gte": "2024-01-01"}
    }
)

# Hybrid approach (Qdrant example):
from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name="documents",
    query_vector=query,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="technical"))
        ]
    ),
    limit=10
)

Batch Search

# Search multiple queries at once (more efficient)

# FAISS
queries = np.array([q1, q2, q3], dtype='float32')  # Shape: (3, 768)
distances, indices = index.search(queries, k=5)
# indices shape: (3, 5) - 5 results for each of 3 queries

# Pinecone
results = index.query(
    queries=[
        {"id": "q1", "values": q1},
        {"id": "q2", "values": q2}
    ],
    top_k=5
)

Key Takeaways

  • Cosine similarity is the default for text embeddings
  • Normalize vectors at index time for faster search via dot product
  • ANN algorithms provide 95-99% recall at 40-200x speed improvement
  • MMR balances relevance with diversity
  • Metadata filtering combines similarity with structured queries
  • Batch queries are more efficient than individual queries

In the next lesson, we'll dive deep into indexing algorithms (HNSW, IVF, LSH) that make fast approximate search possible.