Vector Databases & Embeddings

Sentence and Document Embeddings

While word embeddings capture individual word meanings, RAG systems need to compare entire sentences or documents. Sentence transformers are models specifically trained to produce meaningful embeddings for longer text sequences.

From Words to Sentences

How do we get sentence embeddings from word embeddings?

Naive Approaches (Don't Work Well)

# Approach 1: Average word embeddings
import numpy as np

sentence = "The quick brown fox"
word_embeddings = [embed(word) for word in sentence.split()]  # embed() = any word-vector lookup
sentence_embedding = np.mean(word_embeddings, axis=0)
# Problem: "Dog bites man" ≈ "Man bites dog" (same words, same average!)

# Approach 2: Use BERT's [CLS] token directly
sentence_embedding = bert_model(sentence)[0][0]  # hidden state of the first token
# Problem: BERT wasn't trained for this! [CLS] is tuned for classification, not similarity.
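The word-order problem with averaging can be made concrete. Below is a toy demonstration with made-up 2-dimensional word vectors (the `vectors` dict is purely illustrative): any permutation of the same words averages to an identical sentence vector.

```python
import numpy as np

# Toy word vectors, made up for illustration only
vectors = {
    "dog": np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man": np.array([1.0, 1.0]),
}

def average_embedding(sentence: str) -> np.ndarray:
    """Naive sentence embedding: mean of the word vectors."""
    return np.mean([vectors[w] for w in sentence.split()], axis=0)

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")
print(np.allclose(a, b))  # True -- word order is lost entirely
```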

The Solution: Sentence Transformers

Sentence transformers are transformer models fine-tuned specifically to produce sentence embeddings where similar sentences have similar vectors.

Sentence-BERT (SBERT)

The breakthrough paper (Reimers & Gurevych, 2019) that made sentence embeddings practical.

Architecture

Siamese Network Architecture:
                                                    
Sentence A ──→ [BERT] ──→ Pooling ──→ Embedding A ─┐
                                                    ├─→ Similarity
Sentence B ──→ [BERT] ──→ Pooling ──→ Embedding B ─┘

Same BERT weights for both (Siamese = shared weights)
Pooling: Usually mean of token embeddings

Training Objective

# Contrastive Learning: Pull similar sentences together, push dissimilar apart
import numpy as np

# Multiple Negatives Ranking Loss (common for retrieval)
def mnrl_loss(query_emb, positive_emb, negative_embs, temperature=0.05):
    """
    Maximize similarity with the positive, minimize with the negatives.
    In practice the negatives are in-batch negatives (the other
    positives in the batch serve as negatives for this query).
    """
    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos_score = cosine_sim(query_emb, positive_emb) / temperature
    neg_scores = [cosine_sim(query_emb, neg) / temperature for neg in negative_embs]

    # Softmax cross-entropy with the positive as the target class:
    # -log( exp(pos) / (exp(pos) + sum_i exp(neg_i)) )
    return -pos_score + np.log(np.exp(pos_score) + sum(np.exp(s) for s in neg_scores))

# Training data examples:
# (query, positive_passage) pairs from:
# - NLI datasets (premise-entailment pairs)
# - Paraphrase datasets
# - QA datasets (question-answer pairs)
# - Search logs (query-clicked_result pairs)
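With in-batch negatives, this loss is usually computed for a whole batch at once: every other positive in the batch doubles as a negative for a given query. A vectorized numpy sketch (not the actual sentence-transformers implementation):

```python
import numpy as np

def mnrl_batch_loss(query_embs, passage_embs, temperature=0.05):
    """Batched MNRL: query i's positive is passage i; every passage
    j != i serves as an in-batch negative for query i."""
    # L2-normalize so dot products are cosine similarities
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = q @ p.T / temperature               # [batch, batch] similarity matrix
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the diagonal as the target class
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
positives = queries + 0.01 * rng.normal(size=(4, 8))  # near-duplicates of the queries
loss = mnrl_batch_loss(queries, positives)
print(round(loss, 4))  # small: each query is most similar to its own positive
```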

Pooling Strategies

# Given BERT output: [batch, seq_len, hidden_dim]
import torch

# 1. Mean Pooling (most common for retrieval)
def mean_pooling(token_embeddings, attention_mask):
    # Zero out padding tokens, then average over real tokens only
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = (token_embeddings * mask).sum(1)
    sum_mask = mask.sum(1).clamp(min=1e-9)  # avoid division by zero
    return sum_embeddings / sum_mask

# 2. CLS Pooling
def cls_pooling(token_embeddings):
    return token_embeddings[:, 0]  # hidden state of the first ([CLS]) token

# 3. Max Pooling
def max_pooling(token_embeddings, attention_mask):
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size())
    # masked_fill avoids mutating the caller's tensor in place
    token_embeddings = token_embeddings.masked_fill(mask == 0, -1e9)
    return torch.max(token_embeddings, dim=1)[0]
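To see why the attention mask matters, consider a padded batch: the zero vectors at padding positions drag an unmasked mean toward zero. A minimal self-contained check of the masked-mean computation (the tensor values here are arbitrary):

```python
import torch

# One "sentence": 2 real tokens + 2 padding tokens, hidden_dim = 3
token_embeddings = torch.tensor([[[1.0, 2.0, 3.0],
                                  [3.0, 2.0, 1.0],
                                  [0.0, 0.0, 0.0],    # padding
                                  [0.0, 0.0, 0.0]]])  # padding
attention_mask = torch.tensor([[1, 1, 0, 0]])

# Masked mean: average over real tokens only
mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
masked_mean = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(masked_mean)               # tensor([[2., 2., 2.]])

# Naive mean over every position is dragged toward zero by padding
print(token_embeddings.mean(1))  # tensor([[1., 1., 1.]])
```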

Using Sentence Transformers

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode sentences
sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "Machine learning is fascinating",
]

embeddings = model.encode(sentences)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Calculate similarities
sim_matrix = cosine_similarity(embeddings)
print("Similarity between sentence 0 and 1:", sim_matrix[0, 1])  # ~0.82
print("Similarity between sentence 0 and 2:", sim_matrix[0, 2])  # ~0.15

# Encode with normalization (for dot product similarity)
embeddings = model.encode(sentences, normalize_embeddings=True)

# Batch encoding for efficiency
large_corpus = ["..." for _ in range(10000)]
embeddings = model.encode(
    large_corpus,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    device='cuda'  # Use GPU
)

Popular Sentence Embedding Models

Model                          Dim   Speed      Quality
all-MiniLM-L6-v2               384   Very fast  Good (prototyping)
all-mpnet-base-v2              768   Medium     Very good
BAAI/bge-large-en-v1.5         1024  Slower     Excellent (MTEB top)
intfloat/e5-large-v2           1024  Slower     Excellent
OpenAI text-embedding-3-small  1536  API        Excellent
OpenAI text-embedding-3-large  3072  API        Best (general)

Asymmetric vs Symmetric Embeddings

Symmetric (Sentence Similarity)

# Both inputs are of the same type (sentence-to-sentence)
# Example: Finding paraphrases, duplicate detection

model = SentenceTransformer('all-MiniLM-L6-v2')

sentence_a = "How do I create a new account?"
sentence_b = "What are the steps to register?"

emb_a = model.encode(sentence_a)
emb_b = model.encode(sentence_b)
# Compare directly

Asymmetric (Query-Document Search)

# Different types: short query vs long document
# Example: RAG, semantic search

# Models like E5 and BGE use prefixes:

# E5 model
query = "query: How do I reset my password?"
document = "passage: To reset your password, go to Settings, then..."

# BGE model uses instruction for queries
instruction = "Represent this sentence for searching relevant passages: "
query = instruction + "How do I reset my password?"
document = "To reset your password, go to Settings, then..."

# Cohere uses input_type parameter
import cohere
co = cohere.Client()

query_emb = co.embed(
    texts=["How do I reset my password?"],
    model="embed-english-v3.0",
    input_type="search_query"
).embeddings[0]

doc_emb = co.embed(
    texts=["To reset your password, go to Settings..."],
    model="embed-english-v3.0",
    input_type="search_document"
).embeddings[0]
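These prefixing conventions are easy to get wrong in application code, so it helps to centralize them in one helper. A minimal sketch for the E5 family (the `query: ` / `passage: ` strings follow the E5 model cards; the function itself is hypothetical, to be adapted per model family):

```python
def add_e5_prefix(texts, kind):
    """Prepend the 'query: ' / 'passage: ' prefix that E5 models expect."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {t}" for t in texts]

queries = add_e5_prefix(["How do I reset my password?"], "query")
docs = add_e5_prefix(["To reset your password, go to Settings..."], "passage")
print(queries[0])  # query: How do I reset my password?
```

Forgetting the prefix at query time (but using it at indexing time, or vice versa) silently degrades retrieval quality, which is why baking it into one code path is worth the small indirection.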

Document Embeddings

For longer documents, we need strategies to handle context length limits.

Strategy 1: Truncation

# Simple but loses information from the end
model = SentenceTransformer('all-MiniLM-L6-v2')
# Max tokens: 256 for MiniLM

long_document = "..." * 10000  # Very long
embedding = model.encode(long_document)  # Truncates to first 256 tokens

Strategy 2: Chunking and Averaging

import numpy as np

def embed_long_document(text: str, model, chunk_size: int = 256) -> np.ndarray:
    """Embed a long document by chunking and length-weighted averaging."""
    
    # Split into chunks
    words = text.split()
    chunks = [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
    
    # Embed each chunk
    chunk_embeddings = model.encode(chunks)
    
    # Average (weighted by chunk length)
    weights = [len(chunk.split()) for chunk in chunks]
    weighted_avg = np.average(chunk_embeddings, axis=0, weights=weights)
    
    return weighted_avg / np.linalg.norm(weighted_avg)  # Normalize

Strategy 3: Long-Context Models

# Some models support longer contexts:
# - jina-embeddings-v2-base-en: 8192 tokens
# - NV-Embed: 32K tokens
# - OpenAI text-embedding-3: 8191 tokens

# Use when document fits within limit
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
embedding = model.encode(long_document)  # Handles up to 8K tokens

Strategy 4: Store Chunks, Not Documents

# Most common for RAG: chunk documents and store each chunk
# This gives better retrieval granularity

def process_document(doc: str, doc_id: str, chunk_size: int = 500):
    chunks = chunk_document(doc, chunk_size)
    
    embeddings = model.encode(chunks)
    
    return [
        {
            "id": f"{doc_id}_chunk_{i}",
            "embedding": emb,
            "text": chunk,
            "metadata": {"doc_id": doc_id, "chunk_idx": i}
        }
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ]
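The `chunk_document` helper above is left undefined; here is a minimal word-based version with overlap between consecutive chunks. Production systems typically split on tokens or sentence boundaries instead, so treat this as a placeholder:

```python
def chunk_document(doc: str, chunk_size: int = 500, overlap: int = 50):
    """Split a document into word-based chunks; consecutive chunks
    share `overlap` words so context isn't cut mid-thought."""
    words = doc.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_document("word " * 1200, chunk_size=500, overlap=50)
print(len(chunks))  # 3 chunks: 500, 500, and 300 words
```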

Multimodal Embeddings

Some models can embed multiple modalities (text, images) into the same space.

CLIP (OpenAI)

from sentence_transformers import SentenceTransformer
from PIL import Image

# Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Embed text
text_embedding = model.encode("a photo of a cat")

# Embed image
image = Image.open("cat.jpg")
image_embedding = model.encode(image)

# These are in the SAME vector space!
# Can search images with text queries
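Once text and image embeddings share a space, text-to-image search is just nearest-neighbor ranking. A sketch of that step with synthetic stand-in vectors (in practice each would come from `model.encode(...)` on the real text or image):

```python
import numpy as np

# Stand-in embeddings; real ones come from the CLIP model
text_emb = np.array([0.9, 0.1, 0.0])     # "a photo of a cat"
image_embs = np.array([[0.8, 0.2, 0.1],  # cat.jpg
                       [0.1, 0.9, 0.2],  # dog.jpg
                       [0.0, 0.1, 0.9]]) # car.jpg
names = ["cat.jpg", "dog.jpg", "car.jpg"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank images by similarity to the text query, most similar first
ranked = sorted(names, key=lambda n: -cosine(text_emb, image_embs[names.index(n)]))
print(ranked[0])  # cat.jpg
```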

Fine-tuning Sentence Transformers

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare training data (query, positive_doc pairs)
train_examples = [
    InputExample(texts=["What is RAG?", "RAG stands for Retrieval Augmented Generation..."]),
    InputExample(texts=["How does chunking work?", "Chunking divides documents into smaller..."]),
    # ... more examples
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Use MultipleNegativesRankingLoss for retrieval
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine_tuned_model"
)

# Evaluate on your domain
model = SentenceTransformer('./fine_tuned_model')

Evaluating Embedding Models

Use MTEB (Massive Text Embedding Benchmark) to compare models:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Run on retrieval tasks
evaluation = MTEB(tasks=["MSMARCOv2", "NFCorpus", "SciFact"])
results = evaluation.run(model, eval_splits=["test"])

Check the MTEB leaderboard at huggingface.co/spaces/mteb/leaderboard for the latest rankings.

Key Takeaways

  • Sentence transformers are specifically trained for sentence-level similarity
  • Contrastive learning trains by pulling similar sentences together
  • Mean pooling of token embeddings is the common approach
  • Asymmetric models (BGE, E5) are best for query-document retrieval
  • Chunk long documents for better retrieval in RAG systems
  • Check MTEB leaderboard for model comparisons
  • Fine-tune on your domain for best results

In the next lesson, we'll explore vector search and similarity metrics in detail.