RAG Systems

Embedding Generation and Storage

Embeddings are the bridge between human-readable text and machine-processable vectors. In this lesson, we'll take a deep dive into embedding models: how to choose the right one, and best practices for generating and storing embeddings.

What are Embeddings?

Embeddings are dense vector representations of text that capture semantic meaning. Similar concepts map to nearby points in vector space, enabling similarity-based search.

Properties of Good Embeddings

  • Semantic Similarity - Similar meanings → close vectors
  • Compositionality - Capture meaning of phrases, not just words
  • Generalization - Work across different domains and phrasings
  • Discriminative - Different concepts should be far apart
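The first and last properties can be checked directly with cosine similarity. A toy sketch with hand-made stand-in vectors (real embeddings would of course come from a model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors standing in for real embeddings: "dog" and "puppy"
# point in nearly the same direction, "invoice" in a different one.
dog     = np.array([0.9, 0.1, 0.0])
puppy   = np.array([0.8, 0.2, 0.1])
invoice = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(dog, puppy) > cosine_similarity(dog, invoice))  # → True
```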

Types of Text Embeddings

1. Word Embeddings (Legacy)

Each word gets a single vector, regardless of context:

  • Word2Vec - Skip-gram and CBOW architectures
  • GloVe - Global vectors from co-occurrence statistics
  • FastText - Subword embeddings for handling OOV words

Limitation: "bank" has the same embedding whether it refers to a river bank or a financial institution.

2. Token-Level Contextual Embeddings

Each token gets a different embedding based on surrounding context:

  • BERT - Bidirectional context from masked language modeling
  • RoBERTa - Optimized BERT training
  • GPT - Unidirectional (left-to-right) context

These produce per-token embeddings; for RAG, we need sentence/document-level vectors.
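The usual way to collapse per-token embeddings into one vector is mean pooling over non-padding tokens. A minimal NumPy sketch (in practice the token embeddings would come from the model's last hidden state):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average per-token vectors into one sequence vector, ignoring padding."""
    mask = attention_mask[..., np.newaxis].astype(float)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)        # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)              # real-token counts
    return summed / counts

# Toy batch: one sequence of 4 tokens (last is padding), embedding dim 3
tokens = np.array([[[1.0, 0, 0], [3.0, 0, 0], [2.0, 0, 0], [99.0, 0, 0]]])
mask = np.array([[1, 1, 1, 0]])
print(mean_pool(tokens, mask)[0, 0])  # → 2.0
```

Note that the padded position (the 99.0 vector) does not affect the result.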

3. Sentence Embeddings (For RAG)

Sentence transformers are specifically trained to produce meaningful sentence-level representations:

  • Single vector per text sequence (sentence, paragraph, or document)
  • Trained on sentence similarity datasets (NLI, STS)
  • Optimized for semantic similarity comparison

Sentence Transformers: Training and Architecture

Bi-Encoder Architecture

Bi-encoders (the standard for RAG) encode query and document independently:

Query    → Encoder → Query Vector ──┐
                                    ├── Similarity Score
Document → Encoder → Doc Vector ────┘

Advantage: Documents can be pre-embedded and cached
Disadvantage: No direct query-document interaction

Training Process

Sentence transformers are typically trained with contrastive learning:

  1. Collect pairs - (query, positive_doc) and (query, negative_doc)
  2. Encode separately - Get embeddings for query and documents
  3. Contrastive loss - Push positive pairs together, negative pairs apart

# Contrastive (InfoNCE) loss example
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_embs, temperature=0.05):
    # Positive similarity, scaled by temperature
    pos_sim = F.cosine_similarity(query_emb, pos_emb, dim=-1) / temperature

    # Negative similarities
    neg_sims = torch.stack([
        F.cosine_similarity(query_emb, neg, dim=-1) / temperature
        for neg in neg_embs
    ])

    # InfoNCE loss: -log(exp(pos) / (exp(pos) + Σ exp(neg)))
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sims])
    return -F.log_softmax(logits, dim=0)[0]

Popular Embedding Models for RAG

Proprietary Models (API-based)

Model                          Dimensions  Max Tokens  Cost
OpenAI text-embedding-3-large  3072        8191        $0.13/1M tokens
OpenAI text-embedding-3-small  1536        8191        $0.02/1M tokens
Cohere embed-v3                1024        512         $0.10/1M tokens
Voyage AI voyage-2             1024        4000        $0.10/1M tokens

Open Source Models (Self-hosted)

Model               Dimensions  Parameters  Notes
BGE-large-en-v1.5   1024        335M        Top MTEB performer, instruction-tuned
E5-large-v2         1024        335M        Prefix-based (query:/passage:)
GTE-large           1024        335M        Strong multilingual support
all-MiniLM-L6-v2    384         22M         Fast, lightweight, good for prototyping
jina-embeddings-v2  768         137M        8K context length, good for late chunking

Model Selection Criteria

1. Quality (Benchmark Performance)

Check MTEB (Massive Text Embedding Benchmark) leaderboard for retrieval tasks:

  • MTEB Retrieval - Average across retrieval datasets
  • MS MARCO - Standard passage retrieval benchmark
  • BEIR - Zero-shot retrieval across diverse domains

2. Domain Fit

  • General purpose - OpenAI, BGE, E5 work well across domains
  • Scientific/Medical - Consider PubMedBERT, BioLinkBERT embeddings
  • Legal - Legal-BERT or domain-fine-tuned models
  • Code - CodeBERT, StarCoder embeddings

3. Practical Constraints

  • Latency - Smaller models (MiniLM) for real-time applications
  • Cost - Open source for high-volume; API for convenience
  • Privacy - Self-hosted if data cannot leave your infrastructure
  • Context Length - Jina, NV-Embed for long documents
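A quick back-of-envelope calculation often settles the cost question before anything else. A sketch using the API prices listed in the table above:

```python
# Back-of-envelope embedding cost estimate for API-based models.
def embedding_cost(num_chunks: int, avg_tokens_per_chunk: int,
                   price_per_million_tokens: float) -> float:
    """Total cost in dollars to embed a corpus once."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# 1M chunks of ~500 tokens with text-embedding-3-small ($0.02/1M tokens)
print(embedding_cost(1_000_000, 500, 0.02))  # → 10.0
```

At these prices a one-time corpus embedding is often cheap; recurring re-embedding or high query volume is where self-hosting starts to pay off.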

Generating Embeddings

Using OpenAI

from openai import OpenAI

client = OpenAI()

def get_embeddings(texts: list[str], model: str = "text-embedding-3-small"):
    response = client.embeddings.create(
        model=model,
        input=texts,
        encoding_format="float"  # or "base64" for efficiency
    )
    return [item.embedding for item in response.data]

# Batch for efficiency (max 2048 inputs per request)
embeddings = get_embeddings(["Hello world", "Goodbye world"])

Using Sentence Transformers (Local)

from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# For BGE models, add instruction prefix for queries
def embed_query(query: str):
    instruction = "Represent this sentence for searching relevant passages: "
    return model.encode(instruction + query)

def embed_documents(documents: list[str]):
    # Documents don't need instruction prefix
    return model.encode(documents, show_progress_bar=True)

# Batch processing with GPU
embeddings = model.encode(
    documents,
    batch_size=64,
    device="cuda",
    convert_to_tensor=True,
    normalize_embeddings=True  # L2 normalize for cosine similarity
)

Using LangChain

from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

# OpenAI
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Hugging Face (local)
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={'device': 'cuda'},
    encode_kwargs={'normalize_embeddings': True}
)

# Embed
vectors = embeddings.embed_documents(documents)

Query vs Document Embeddings

Some models use different encoding strategies for queries vs documents:

Asymmetric Embedding

# E5 models use prefixes
query = "query: What is machine learning?"
document = "passage: Machine learning is a subset of AI..."

# BGE models use instruction for queries
query_instruction = "Represent this sentence for searching relevant passages: "
query_embedding = model.encode(query_instruction + "What is machine learning?")
doc_embedding = model.encode("Machine learning is a subset of AI...")

# Cohere has explicit input_type parameter
import cohere
co = cohere.Client()

# For documents
doc_embeddings = co.embed(
    texts=documents,
    model="embed-english-v3.0",
    input_type="search_document"
).embeddings

# For queries
query_embedding = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query"
).embeddings[0]

Embedding Optimization

1. Dimensionality Reduction

OpenAI's embedding-3 models support dimension reduction via the 'dimensions' parameter:

# Reduce from 3072 to 512 dimensions
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=512  # Reduce dimensionality
)

Trade-off: Lower dimensions = faster search + less storage, but slightly lower quality.
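The same trade-off can also be made client-side: OpenAI documents that full-length embedding-3 vectors can be shortened by truncating the leading dimensions and re-normalizing. A minimal sketch (the random vector stands in for a real 3072-dim embedding):

```python
import numpy as np

def shorten(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Truncate a full-length embedding and re-normalize to unit length."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a 3072-dim text-embedding-3-large vector
full = np.random.default_rng(0).normal(size=3072)
short = shorten(full, 512)
print(short.shape)  # → (512,)
```

This lets you store one full-length embedding and derive smaller ones later without re-calling the API.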

2. Normalization

Always L2-normalize embeddings when using cosine similarity:

import numpy as np

def normalize(embeddings):
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

# With normalized vectors, cosine similarity = dot product
similarity = np.dot(query_embedding, document_embedding)

3. Batching for Efficiency

def batch_embed(texts: list[str], batch_size: int = 100):
    """Embed texts in batches to manage memory and API limits."""
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = model.encode(batch)
        all_embeddings.extend(embeddings)
    
    return all_embeddings

Storing Embeddings

Once generated, embeddings need to be stored efficiently for retrieval.

Vector Database Options

# ChromaDB (embedded, development)
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection("documents")

collection.add(
    embeddings=embeddings,
    documents=documents,
    metadatas=[{"source": "web"} for _ in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)

# Pinecone (managed, production)
from pinecone import Pinecone
pc = Pinecone(api_key="your-api-key")
index = pc.Index("my-index")

index.upsert(
    vectors=[
        {"id": f"doc_{i}", "values": emb, "metadata": {"source": "web"}}
        for i, emb in enumerate(embeddings)
    ]
)

# Qdrant (self-hosted, production)
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=i, vector=emb, payload={"text": doc})
        for i, (emb, doc) in enumerate(zip(embeddings, documents))
    ]
)

Embedding Caching

Cache embeddings to avoid regenerating for the same content:

import hashlib
import json
from pathlib import Path

class EmbeddingCache:
    def __init__(self, cache_dir: str = ".embedding_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
    
    def _get_key(self, text: str, model: str) -> str:
        content = f"{model}:{text}"
        return hashlib.md5(content.encode()).hexdigest()
    
    def get(self, text: str, model: str) -> list | None:
        key = self._get_key(text, model)
        cache_file = self.cache_dir / f"{key}.json"
        
        if cache_file.exists():
            return json.loads(cache_file.read_text())
        return None
    
    def set(self, text: str, model: str, embedding: list):
        key = self._get_key(text, model)
        cache_file = self.cache_dir / f"{key}.json"
        cache_file.write_text(json.dumps(embedding))
    
    def get_or_compute(self, text: str, model: str, compute_fn):
        cached = self.get(text, model)
        if cached is not None:
            return cached
        
        embedding = compute_fn(text)
        self.set(text, model, embedding)
        return embedding

Fine-Tuning Embeddings (Advanced)

For domain-specific applications, you can fine-tune embedding models on your data:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare (query, positive_doc) pairs; MultipleNegativesRankingLoss
# treats the other documents in each batch as negatives
train_examples = [
    InputExample(texts=["What is RAG?", "RAG combines retrieval and generation..."]),
    InputExample(texts=["How does chunking work?", "Chunking splits documents into..."]),
    # ... more examples
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Use MultipleNegativesRankingLoss
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine_tuned_embeddings"
)

Multimodal Embeddings

For RAG over images, tables, and mixed content:

  • CLIP - Text and image in shared embedding space
  • OpenAI Vision Embeddings - Describe images, then embed descriptions
  • ColPali - Direct document image embeddings (no OCR needed)

# CLIP for image embeddings
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Embed text
text_embedding = model.encode("a photo of a cat")

# Embed image
image = Image.open("cat.jpg")
image_embedding = model.encode(image)

# These are in the same vector space - can compare directly!
similarity = util.cos_sim(text_embedding, image_embedding)

Key Takeaways

  • Use sentence embeddings for RAG, not word embeddings
  • Check MTEB leaderboard when selecting embedding models
  • Use asymmetric encoding (query vs document) when the model supports it
  • Normalize embeddings for cosine similarity
  • Batch embedding generation for efficiency
  • Cache embeddings to avoid recomputation
  • Consider fine-tuning for domain-specific applications
  • Match embedding dimensions to your latency and quality requirements

In the next lesson, we'll explore implementing the retrieval layer, including vector store operations and search patterns.