Vector Databases & Embeddings

Distance Metrics and Similarity Measures

Choosing the right distance metric is crucial for vector search quality. This lesson provides a deep dive into distance metrics, their mathematical properties, and practical guidance on when to use each.

Distance vs Similarity

Distance: Lower = More Similar
  - Euclidean distance: 0 = identical, ∞ = very different
  
Similarity: Higher = More Similar  
  - Cosine similarity: 1 = identical, -1 = opposite
  
Conversion: distance = 1 - similarity (for similarities bounded above by 1, such as cosine)
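
The conversion in one line of Python:

similarity = 0.8
distance = 1 - similarity  # 0.2: high similarity maps to low distance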

The Big Three: Cosine, Euclidean, Dot Product

1. Cosine Similarity (Recommended for Text)

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Measures the angle between two vectors.
    
    cos(θ) = (a · b) / (||a|| × ||b||)
    
    Properties:
    - Range: [-1, 1]
    - Invariant to vector magnitude (length)
    - 1: identical direction
    - 0: orthogonal (perpendicular)
    - -1: opposite direction
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine distance
def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1 - cosine_similarity(a, b)

# Example: Magnitude doesn't matter!
v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])  # Same direction, 2x magnitude
v3 = np.array([1, 0, 0])  # Different direction

print(cosine_similarity(v1, v2))  # 1.0 (identical direction!)
print(cosine_similarity(v1, v3))  # 0.27 (different direction)

Use when: Text embeddings, semantic similarity, when magnitude shouldn't affect results.
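
Pairwise calls don't scale to a whole corpus; here is a vectorized sketch that scores one query against a matrix of embeddings (one row per vector), reusing v1, v2, v3 from above:

def cosine_similarity_batch(matrix: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `matrix` and `query`."""
    row_norms = np.linalg.norm(matrix, axis=1)
    return matrix @ query / (row_norms * np.linalg.norm(query))

scores = cosine_similarity_batch(np.stack([v1, v2, v3]), v1)
print(scores)  # [1.0, 1.0, 0.27]: v1 and v2 share a direction, v3 doesn't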

2. Euclidean Distance (L2)

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    Straight-line distance between points.
    
    d = sqrt(Σ(a_i - b_i)²)
    
    Properties:
    - Range: [0, ∞)
    - Affected by magnitude
    - Intuitive geometric interpretation
    """
    return np.sqrt(np.sum((a - b) ** 2))

# Squared Euclidean (faster: skips the sqrt, preserves ranking)
def squared_euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return np.sum((a - b) ** 2)

# Example: Magnitude matters!
v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])
v3 = np.array([1.1, 2.1, 3.1])

print(euclidean_distance(v1, v2))  # 3.74 (far due to magnitude)
print(euclidean_distance(v1, v3))  # 0.17 (close)

Use when: Clustering, image features, when magnitude is meaningful.

3. Dot Product (Inner Product)

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """
    Sum of element-wise products.
    
    a · b = Σ(a_i × b_i)
    
    Properties:
    - Range: (-∞, ∞)
    - Combines direction AND magnitude
    - For normalized vectors: dot = cosine similarity
    """
    return np.dot(a, b)

# Key insight: for normalized vectors, dot product equals cosine similarity
def normalize(v):
    return v / np.linalg.norm(v)

v1_norm = normalize(v1)
v2_norm = normalize(v2)

# These are equivalent:
print(cosine_similarity(v1, v2))  # Using cosine formula
print(dot_product(v1_norm, v2_norm))  # Dot of normalized vectors

Use when: Vectors are pre-normalized, or when magnitude encodes importance.
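
Because the dot product folds magnitude into the score, an unnormalized corpus can rank a longer vector above an exact match. Continuing with v1 and v2 from above:

print(dot_product(v1, v2))  # 28: same direction as v1, twice the magnitude
print(dot_product(v1, v1))  # 14: v1 scores lower against itself!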

Relationship Between Metrics

For normalized vectors (||a|| = ||b|| = 1):

cosine_similarity(a, b) = dot_product(a, b)

euclidean_distance(a, b)² = 2 - 2 × cosine_similarity(a, b)
                          = 2 × (1 - dot_product(a, b))
                          = 2 × cosine_distance(a, b)

So for normalized vectors:
- Euclidean and Cosine give the same ranking!
- Dot product is fastest (no normalization at query time)

Proof by Example

import numpy as np

# Generate random normalized vectors
np.random.seed(42)
vectors = np.random.randn(100, 768)
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

query = np.random.randn(768)
query = query / np.linalg.norm(query)

# Compute all three metrics
cosine_sims = np.dot(vectors, query)  # Same as cosine for normalized
euclidean_dists = np.linalg.norm(vectors - query, axis=1)

# Rankings should be identical!
cosine_ranking = np.argsort(-cosine_sims)  # Descending (higher = better)
euclidean_ranking = np.argsort(euclidean_dists)  # Ascending (lower = better)

print("Rankings match:", np.array_equal(cosine_ranking, euclidean_ranking))  # True

Other Distance Metrics

Manhattan Distance (L1)

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """
    Sum of absolute differences.
    
    d = Σ|a_i - b_i|
    
    Also called: L1, taxicab, city-block distance
    """
    return np.sum(np.abs(a - b))

Use when: High-dimensional sparse data, robust to outliers.
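
Compared with L2 on the earlier vectors, L1 simply sums the per-dimension gaps:

print(manhattan_distance(v1, v2))  # 6 (= 1 + 2 + 3)
print(euclidean_distance(v1, v2))  # 3.74 (= sqrt(1 + 4 + 9))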

Hamming Distance

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """
    Count of positions where elements differ.
    
    Used for binary vectors (0/1).
    """
    return np.sum(a != b)

# Example with binary vectors
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
print(hamming_distance(a, b))  # 2 (positions 1 and 3 differ, 0-indexed)

Use when: Binary hash codes, locality-sensitive hashing.
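
Production systems usually store binary codes packed eight bits per byte; a sketch of the same distance via XOR plus a bit count:

def hamming_packed(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between uint8 arrays of packed bits."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Pad the 5-bit example above to a full byte before packing
a_packed = np.packbits([1, 0, 1, 1, 0, 0, 0, 0])
b_packed = np.packbits([1, 1, 1, 0, 0, 0, 0, 0])
print(hamming_packed(a_packed, b_packed))  # 2, same answer as above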

Jaccard Similarity

def jaccard_similarity(set_a: set, set_b: set) -> float:
    """
    Ratio of intersection to union.
    
    J(A, B) = |A ∩ B| / |A ∪ B|
    """
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    return intersection / union if union > 0 else 0.0

# For binary vectors (0/1 integer or boolean arrays)
def jaccard_binary(a: np.ndarray, b: np.ndarray) -> float:
    intersection = np.sum(a & b)
    union = np.sum(a | b)
    return intersection / union if union > 0 else 0.0

Use when: Sparse sets, document similarity, recommendations.
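
A quick usage example with token sets:

doc_a = {"vector", "database", "search"}
doc_b = {"vector", "search", "index"}
print(jaccard_similarity(doc_a, doc_b))  # 0.5 (2 shared / 4 total)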

Choosing the Right Metric

Use Case                | Recommended Metric | Why
------------------------|--------------------|------------------------------------------
Text embeddings (RAG)   | Cosine / Dot       | Semantic similarity, magnitude-invariant
OpenAI embeddings       | Cosine             | Recommended by OpenAI
Normalized vectors      | Dot Product        | Fastest (equivalent to cosine)
Image features          | Euclidean          | Position in feature space matters
Clustering              | Euclidean          | Natural for k-means centroids
Sparse high-dimensional | Manhattan          | Robust to outliers
Binary codes (LSH)      | Hamming            | Fast bit operations
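
If the metric needs to stay configurable, one option is a small dispatch table over the functions defined in this lesson; the negated dot product is an illustrative convention (so lower = better across the board), not any library's API:

METRICS = {
    "cosine": cosine_distance,
    "euclidean": euclidean_distance,
    "manhattan": manhattan_distance,
    "dot": lambda a, b: -np.dot(a, b),  # negated so lower = better
}

def pairwise_distance(a: np.ndarray, b: np.ndarray, metric: str = "cosine") -> float:
    return float(METRICS[metric](a, b))

print(pairwise_distance(v1, v2, metric="cosine"))  # ~0.0 (same direction)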

Configuring Metrics in Vector Databases

# Pinecone
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone()  # reads PINECONE_API_KEY from the environment
pc.create_index(
    name="my-index",
    dimension=768,
    metric="cosine",  # or "euclidean", "dotproduct"
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

# Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumes a local instance
client.create_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(
        size=768,
        distance=Distance.COSINE  # or EUCLID, DOT
    )
)

# ChromaDB
import chromadb

client = chromadb.Client()  # in-memory client; PersistentClient for on-disk
collection = client.create_collection(
    name="my_collection",
    metadata={"hnsw:space": "cosine"}  # or "l2", "ip"
)

# Weaviate (class schema configuration)
{
    "class": "Document",
    "vectorIndexConfig": {
        "distance": "cosine"  # or "l2-squared", "dot"
    }
}
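
The same decision shows up in local libraries. In FAISS, for instance, the metric is baked into the index type; a sketch of cosine search via inner product over normalized vectors:

import faiss
import numpy as np

dim = 768
index = faiss.IndexFlatIP(dim)  # inner product; IndexFlatL2 for (squared) L2

vectors = np.random.randn(100, dim).astype("float32")
faiss.normalize_L2(vectors)  # in-place unit normalization, so IP == cosine
index.add(vectors)

query = np.random.randn(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 neighbors by cosine similarity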

Performance Optimization

# Best practice: normalize at index time for speed.
# (`model` is any embedding model and `vector_store` any store with
# add/search methods; both are illustrative placeholders, not a specific API.)

# At indexing
def add_document(text: str, vector_store):
    embedding = model.encode(text)
    # Normalize once at index time
    normalized = embedding / np.linalg.norm(embedding)
    vector_store.add(normalized)

# At query time - use dot product (faster than cosine)
def search(query: str, vector_store):
    embedding = model.encode(query)
    normalized = embedding / np.linalg.norm(embedding)
    # Dot product = cosine for normalized vectors
    return vector_store.search(normalized, metric="dot")

# Some models normalize by default; sentence-transformers can do it for you:
embeddings = model.encode(texts, normalize_embeddings=True)
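
# If you rely on dot product at query time, verify the normalization
# (uses `embeddings` from the call above):
assert np.allclose(np.linalg.norm(embeddings, axis=1), 1.0)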

Common Pitfalls

  • Mixing metrics: Don't index with cosine and query with euclidean
  • Forgetting normalization: Dot product requires normalized vectors to match cosine
  • Using wrong direction: Cosine similarity (higher = better) vs distance (lower = better)
  • Not validating: Test that your metric produces expected results (see the sanity check below)
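
A minimal sanity check, assuming sentence-transformers (the model name is just an example): near-duplicate texts should score close to 1 under cosine, unrelated text clearly lower.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model

a, b, c = model.encode([
    "How do I reset my password?",
    "What are the steps to change my password?",
    "The weather in Paris is lovely in spring.",
])

def cos(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

assert cos(a, b) > cos(a, c)  # the related pair should outrank the unrelated one
print(round(cos(a, b), 2), round(cos(a, c), 2))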

Key Takeaways

  • Cosine is the default for text embeddings and RAG
  • Normalize vectors at index time, use dot product for speed
  • For normalized vectors, cosine, dot, and Euclidean give the same rankings
  • Match the metric to how your embeddings were trained
  • Euclidean is better when magnitude is meaningful (clustering, images)

In the next lesson, we'll explore Pinecone, a fully managed vector database.