RAG Systems

Retrieval Strategies and Techniques

Retrieval is the heart of any RAG system. The quality of retrieved documents directly impacts the quality of generated responses. In this lesson, we'll explore the three main retrieval paradigms: lexical, semantic, and hybrid retrieval.

Overview of Retrieval Paradigms

Each retrieval approach has distinct strengths and weaknesses:

  • Lexical Retrieval - Matches exact words and terms; precise but misses semantic relationships
  • Semantic Retrieval - Matches meaning via embeddings; handles synonyms but may miss exact terms
  • Hybrid Retrieval - Combines both for comprehensive coverage

Lexical Retrieval

Lexical retrieval methods match documents based on exact word occurrences. They're fast, interpretable, and require no ML models.

Core Assumptions

  • Documents containing query terms are more relevant
  • Rare terms are more discriminative than common terms
  • Term frequency correlates with relevance (with diminishing returns)

TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF weights terms by their frequency in a document relative to their frequency across all documents.

Formula

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

Where:
  TF(t, d) = frequency of term t in document d
  IDF(t, D) = log(N / df(t))
  N = total number of documents
  df(t) = number of documents containing term t
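The formula above can be sketched in a few lines of plain Python (a minimal illustration for intuition, not a production implementation; real systems use optimized libraries):

```python
import math

# Toy corpus: each document is pre-tokenized into a list of terms
documents = [
    "machine learning is fun".split(),
    "deep learning uses neural networks".split(),
    "networks of computers".split(),
]
N = len(documents)

def tf(term, doc):
    # Raw term frequency of `term` in tokenized document `doc`
    return doc.count(term)

def idf(term, docs):
    # log(N / df): rarer terms receive higher weight
    df = sum(1 for d in docs if term in d)
    return math.log(N / df) if df else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "networks" appears in 2 of 3 documents, "machine" in only 1 of 3,
# so "machine" gets a higher weight where it occurs.
print(tf_idf("machine", documents[0], documents))   # 1 * log(3/1)
print(tf_idf("networks", documents[1], documents))  # 1 * log(3/2)
```

Note that production systems typically also normalize or sublinearly scale the TF component; this sketch uses raw counts to mirror the formula as written.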

Strengths of TF-IDF

  • Simple and interpretable
  • Fast to compute and query
  • Effective for exact term matching
  • No training required

Limitations of TF-IDF

  • No understanding of synonyms or related concepts
  • Term frequency can dominate long documents
  • Sensitive to vocabulary mismatch

BM25 (Best Matching 25)

BM25 is the industry-standard lexical retrieval algorithm, used by Elasticsearch, Lucene, and most search engines. It improves on TF-IDF with saturation and length normalization.

Key Improvements over TF-IDF

  • Term Frequency Saturation - Diminishing returns for repeated terms (controlled by parameter k1)
  • Document Length Normalization - Adjusts for document length (controlled by parameter b)
  • Better Probabilistic Foundation - Based on probabilistic relevance framework

BM25 Formula

BM25(d, q) = Σ IDF(t) × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × |d|/avgdl))

Where:
  tf = term frequency in document d
  |d| = document length
  avgdl = average document length
  k1 = term frequency saturation (typically 1.2-2.0)
  b = length normalization (typically 0.75)

Code Example: BM25 Retrieval

from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

# One-time tokenizer data download ('punkt_tab' on newer NLTK versions)
nltk.download('punkt', quiet=True)

# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing enables computers to understand text.",
    "Transformers revolutionized NLP with attention mechanisms."
]

# Tokenize documents
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]

# Create BM25 index
bm25 = BM25Okapi(tokenized_docs)

# Query
query = "neural networks for language"
tokenized_query = word_tokenize(query.lower())

# Get scores
scores = bm25.get_scores(tokenized_query)

# Rank documents
ranked = sorted(zip(scores, documents), reverse=True)
for score, doc in ranked:
    print(f"Score: {score:.4f} - {doc}")

Advantages of Lexical Retrieval

  • Exact Matching - Guaranteed to find documents with query terms
  • No Training Required - Works out of the box
  • Interpretable - Easy to understand why documents matched
  • Fast - Inverted indices enable sub-millisecond queries
  • Handles Rare Terms - Excellent for names, product IDs, technical terms

Limitations of Lexical Retrieval

  • Vocabulary Mismatch - "car" won't match "automobile"
  • No Semantic Understanding - Can't understand intent or meaning
  • Sensitive to Phrasing - Different phrasings may not match

Semantic Retrieval

Semantic retrieval uses dense vector embeddings to capture meaning, enabling matching based on conceptual similarity rather than exact words.

Core Idea

Both queries and documents are encoded into dense vectors in a shared embedding space. Documents close to the query vector are considered relevant.

Vector Encoding

Embedding models (bi-encoders) convert text to fixed-dimensional vectors:

  • Query embedding: q = embed(query)
  • Document embedding: d = embed(document)
  • Similarity: sim(q, d) = cosine_similarity(q, d) or dot_product(q, d)

Semantic Matching and Retrieval

from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing enables computers to understand text.",
    "Transformers revolutionized NLP with attention mechanisms."
]

# Encode documents
doc_embeddings = model.encode(documents)

# Query
query = "How do computers understand human language?"
query_embedding = model.encode(query)

# Calculate cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

# Rank documents
ranked = sorted(zip(similarities, documents), reverse=True)
for sim, doc in ranked:
    print(f"Similarity: {sim:.4f} - {doc}")

Advanced Semantic Retrieval Approaches

1. Sentence-Window Retrieval (Small-to-Large)

Embed and retrieve small chunks, but return expanded context (surrounding sentences/paragraphs).

  • Benefit - Precise retrieval with rich context for generation
  • Implementation - Store chunk → parent mapping, expand after retrieval
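The chunk → parent mapping can be sketched as follows (hypothetical data structures, not a specific library's API; frameworks like LlamaIndex ship this pattern built in):

```python
# Small units (sentences) are what gets embedded and retrieved;
# after retrieval, each hit is expanded to a window of neighbours.
sentences = [
    "The system uses 256-bit encryption.",
    "Keys are rotated every 90 days.",
    "Rotation is logged to the audit trail.",
]

def expand(chunk_id, window=1):
    # Return the matched sentence plus `window` sentences of context
    # on each side, clamped to the document boundaries.
    start = max(0, chunk_id - window)
    end = min(len(sentences), chunk_id + window + 1)
    return " ".join(sentences[start:end])

# Suppose retrieval matched sentence 1 ("Keys are rotated...");
# the generator receives it with its neighbours for richer context.
print(expand(1))
```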

2. Auto-merging Retriever (Hierarchical)

Build a hierarchy of chunks (document → section → paragraph → sentence). If multiple child chunks are retrieved, automatically merge to the parent.
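The merge step can be sketched like this (a toy with a hypothetical two-level hierarchy and threshold; real implementations walk a deeper tree):

```python
from collections import Counter

# parent_of maps each child chunk to its parent; when enough siblings
# of the same parent are retrieved, return the parent instead.
parent_of = {"s1": "para1", "s2": "para1", "s3": "para1", "s4": "para2"}
merge_threshold = 2  # merge when >= 2 siblings are retrieved

def auto_merge(retrieved_ids):
    by_parent = Counter(parent_of[c] for c in retrieved_ids)
    merged = []
    for chunk in retrieved_ids:
        parent = parent_of[chunk]
        if by_parent[parent] >= merge_threshold:
            if parent not in merged:
                merged.append(parent)  # siblings collapse into the parent
        else:
            merged.append(chunk)      # lone hits stay as-is
    return merged

print(auto_merge(["s1", "s2", "s4"]))  # ['para1', 's4']
```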

3. Contextual Retrieval (Anthropic)

Prepend contextual information (document title, section headers, summary) to each chunk before embedding. This helps the embedding capture the chunk's role in the broader document.

# Standard chunk
chunk = "The system uses 256-bit encryption."

# Contextual chunk (better for retrieval)
contextual_chunk = """
Document: Security Architecture Guide
Section: Data Protection
Chapter: Encryption Standards

The system uses 256-bit encryption.
"""

Using Approximate Nearest Neighbors (ANN)

Exact nearest neighbor search is O(n) - too slow for large datasets. ANN algorithms trade accuracy for speed:

  • HNSW (Hierarchical Navigable Small World) - Graph-based, best for most use cases
  • IVF (Inverted File Index) - Clustering-based, good with large datasets
  • LSH (Locality Sensitive Hashing) - Hash-based, fast but lower accuracy
  • Product Quantization (PQ) - Compression for memory efficiency
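To make the ANN idea concrete, here is a tiny random-hyperplane LSH sketch for cosine similarity (a toy for intuition only; production systems use libraries such as FAISS or hnswlib):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 8, 6

# Each vector is hashed to the sign pattern of its projections onto
# random hyperplanes; nearby vectors tend to share hash buckets.
planes = rng.normal(size=(n_planes, dim))

def lsh_hash(v):
    return tuple((planes @ v > 0).astype(int))

vectors = rng.normal(size=(100, dim))
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_hash(v), []).append(i)

# At query time, only candidates in the matching bucket are scored
# exactly - a fraction of the corpus instead of all n vectors.
query = vectors[42]
candidates = buckets[lsh_hash(query)]
print(f"Scored {len(candidates)} of {len(vectors)} vectors")
```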

Advantages of Semantic Retrieval

  • Semantic Understanding - "car" matches "automobile", "vehicle"
  • Paraphrase Handling - Different phrasings, same meaning
  • Multilingual - Cross-language retrieval possible with multilingual models
  • Intent Understanding - Captures user intent beyond keywords

Challenges of Semantic Retrieval

  • Rare Terms - May not surface exact matches for specific terms, names, or IDs
  • Out-of-Domain - Embeddings may not generalize to specialized domains
  • Computational Cost - Embedding generation and storage require resources
  • Black Box - Harder to interpret why documents matched

Hybrid Retrieval (Lexical + Semantic)

Hybrid retrieval combines the strengths of both approaches, providing comprehensive coverage. This is the recommended approach for production systems.

Why Hybrid Retrieval is Necessary

  • Lexical catches exact matches - Names, product codes, technical terms
  • Semantic catches conceptual matches - Paraphrases, related concepts
  • Together they maximize recall - Fewer relevant documents missed

Common Hybrid Retrieval Architectures

1. Parallel Retrieval + Score Fusion

Run both retrievers in parallel, then combine results using score fusion:

# Linear score fusion
final_score = alpha * semantic_score + (1 - alpha) * lexical_score

# Where alpha is typically tuned (0.5 is a good starting point)
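Because BM25 scores are unbounded while cosine similarities live in [-1, 1], the scores need to be put on a common scale before the linear blend. A min-max normalization sketch:

```python
def min_max(scores):
    # Rescale a score list to [0, 1]; constant lists map to all zeros
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(semantic_scores, lexical_scores, alpha=0.5):
    sem = min_max(semantic_scores)
    lex = min_max(lexical_scores)
    return [alpha * s + (1 - alpha) * l for s, l in zip(sem, lex)]

# Cosine similarities (bounded) fused with raw BM25 scores (unbounded)
print(fuse([0.9, 0.2, 0.5], [12.0, 3.0, 7.5]))
```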

2. Reciprocal Rank Fusion (RRF)

RRF combines ranked lists without requiring score normalization:

def reciprocal_rank_fusion(ranked_lists, k=60):
    """
    Combine multiple ranked lists using RRF.
    
    RRF_score(d) = Σ 1 / (k + rank(d))
    
    Args:
        ranked_lists: List of lists, each containing (doc_id, score) tuples
        k: Ranking constant (typically 60)
    
    Returns:
        Combined ranking
    """
    doc_scores = {}
    
    for ranked_list in ranked_lists:
        for rank, (doc_id, _) in enumerate(ranked_list, 1):
            if doc_id not in doc_scores:
                doc_scores[doc_id] = 0
            doc_scores[doc_id] += 1 / (k + rank)
    
    # Sort by combined score
    return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)

3. Lexical Retrieval + Semantic Re-ranking

The dominant production architecture: use BM25 for fast initial retrieval, then re-rank top-k with a semantic model.

  • Step 1: BM25 retrieves top 100-500 candidates
  • Step 2: Cross-encoder re-ranks to find top 5-10
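The two-stage shape can be sketched as below. The scorers here are deliberately trivial stand-ins (term overlap) so the pipeline structure is visible; in practice the first stage would be BM25 and the second a cross-encoder such as one loaded via sentence-transformers' `CrossEncoder`:

```python
corpus = {
    "d1": "resetting your password",
    "d2": "password strength rules",
    "d3": "shipping and returns",
}

def cheap_score(query, text):
    # Stand-in for BM25: raw term overlap
    return len(set(query.split()) & set(text.split()))

def expensive_score(query, text):
    # Stand-in for a cross-encoder: overlap normalized by length
    overlap = len(set(query.split()) & set(text.split()))
    return overlap / max(len(text.split()), 1)

def retrieve_then_rerank(query, first_k=2, final_k=1):
    # Stage 1: fast scorer narrows the corpus to first_k candidates
    stage1 = sorted(corpus, key=lambda d: cheap_score(query, corpus[d]),
                    reverse=True)[:first_k]
    # Stage 2: expensive scorer reorders only the survivors
    stage2 = sorted(stage1, key=lambda d: expensive_score(query, corpus[d]),
                    reverse=True)
    return stage2[:final_k]

print(retrieve_then_rerank("password rules"))
```

The point of the split is cost: the expensive scorer runs on first_k documents instead of the whole corpus.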

Hybrid Retrieval Implementation

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Create semantic retriever
# (documents is assumed to be a list of LangChain Document objects)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Create lexical retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Combine with ensemble
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.5, 0.5]  # Equal weighting
)

# Retrieve
results = hybrid_retriever.invoke("What is machine learning?")

Failure Modes Avoided by Hybrid RAG

Query Type             | Lexical Only                  | Semantic Only                 | Hybrid
"What is ML?"          | ❌ Misses "machine learning"  | ✅ Understands abbreviation   | ✅
"Error code E-1234"    | ✅ Exact match                | ❌ May not find specific code | ✅
"automobile safety"    | ❌ Misses "car" docs          | ✅ Finds related docs         | ✅
"John Smith contact"   | ✅ Exact name match           | ❌ Names are OOV              | ✅

Metadata Filtering

Beyond content matching, metadata filtering allows you to constrain results based on document properties.

Standard Pattern

# Query with metadata filter (operator syntax varies by vector store;
# this Mongo-style form is used by stores such as Chroma)
results = vectorstore.similarity_search(
    query="security best practices",
    k=10,
    filter={
        "department": "engineering",
        "date": {"$gte": "2024-01-01"},
        "document_type": {"$in": ["policy", "guideline"]}
    }
)

Hard Filters vs. Soft Boosts

  • Hard Filters (Pre-filtering) - Exclude non-matching documents entirely. Use when filters are mandatory.
  • Soft Boosts (Re-ranking) - Prefer matching documents but don't exclude others. Use when filters are preferences.
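The two behaviors can be sketched side by side (toy data and a hypothetical `boost` value, chosen only to make the reordering visible):

```python
docs = [
    {"id": "a", "score": 0.9, "dept": "engineering"},
    {"id": "b", "score": 0.8, "dept": "sales"},
    {"id": "c", "score": 0.7, "dept": "engineering"},
]

def hard_filter(docs, dept):
    # Pre-filtering: non-matching documents are excluded entirely
    return [d for d in docs if d["dept"] == dept]

def soft_boost(docs, dept, boost=0.15):
    # Re-ranking: matching documents are boosted, but none are dropped
    boosted = [
        {**d, "score": d["score"] + (boost if d["dept"] == dept else 0.0)}
        for d in docs
    ]
    return sorted(boosted, key=lambda d: d["score"], reverse=True)

print([d["id"] for d in hard_filter(docs, "engineering")])  # ['a', 'c']
print([d["id"] for d in soft_boost(docs, "engineering")])   # ['a', 'c', 'b']
```

Note how the soft boost lifts "c" above "b" without removing "b" from the results.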

System Tuning

Choosing the Candidate Set Size (K)

  • Too small (k=5) - May miss relevant documents, especially with diverse topics
  • Too large (k=100) - Noisy context, slower re-ranking, higher cost
  • Sweet spot - Typically k=10-20 for final context, k=50-100 for re-ranking input

Latency and Cost Considerations

  • Lexical retrieval: ~1-10ms
  • Semantic retrieval: ~10-100ms (depends on index type)
  • Re-ranking (cross-encoder): ~50-200ms for 100 docs
  • LLM generation: ~500-3000ms

Query-Aware Tuning

Different queries benefit from different strategies:

  • Keyword-heavy queries - Weight lexical higher
  • Natural language questions - Weight semantic higher
  • Mixed queries - Use balanced hybrid
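One way to act on these heuristics is to pick the fusion weight alpha per query. The rules and thresholds below are illustrative assumptions, not tuned values:

```python
import re

def choose_alpha(query):
    """Heuristic: weight the semantic side higher for natural-language
    questions, the lexical side higher for code/ID-style queries."""
    tokens = query.split()
    # ID-like tokens: letter-dash-digit patterns or long digit runs
    has_code_like = any(re.search(r"[A-Z]-?\d|\d{3,}", t) for t in tokens)
    is_question = query.strip().endswith("?") or tokens[0].lower() in {
        "what", "how", "why", "when", "who", "where"
    }
    if has_code_like:
        return 0.2   # mostly lexical
    if is_question and len(tokens) >= 4:
        return 0.7   # mostly semantic
    return 0.5       # balanced hybrid

print(choose_alpha("Error code E-1234"))            # 0.2
print(choose_alpha("How do I reset my password?"))  # 0.7
print(choose_alpha("automobile safety"))            # 0.5
```

In practice these thresholds would be set by evaluating on a labeled query set, as described in the next section's methodology.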

Evaluation and Tuning Methodology

To tune retrieval, you need evaluation datasets and metrics:

Key Retrieval Metrics

  • Recall@K - What fraction of relevant docs are in the top K?
  • Precision@K - What fraction of top K docs are relevant?
  • MRR (Mean Reciprocal Rank) - Average 1/rank of first relevant doc
  • NDCG@K - Normalized Discounted Cumulative Gain

Building an Evaluation Set

# Evaluation set format
eval_set = [
    {
        "query": "What is the refund policy?",
        "relevant_doc_ids": ["doc_42", "doc_108"],  # Ground truth
    },
    {
        "query": "How do I reset my password?",
        "relevant_doc_ids": ["doc_7"],
    },
    # ... more examples
]

def evaluate_retriever(retriever, eval_set, k=10):
    recall_scores = []
    mrr_scores = []
    
    for item in eval_set:
        results = retriever.retrieve(item["query"], k=k)
        retrieved_ids = [r.id for r in results]
        
        # Recall@K
        relevant_found = len(set(retrieved_ids) & set(item["relevant_doc_ids"]))
        recall = relevant_found / len(item["relevant_doc_ids"])
        recall_scores.append(recall)
        
        # MRR
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in item["relevant_doc_ids"]:
                mrr_scores.append(1 / rank)
                break
        else:
            mrr_scores.append(0)
    
    return {
        "recall@k": sum(recall_scores) / len(recall_scores),
        "mrr": sum(mrr_scores) / len(mrr_scores)
    }

Key Takeaways

  • Lexical retrieval excels at exact matches; use BM25 over TF-IDF
  • Semantic retrieval captures meaning; choose embedding models based on domain
  • Hybrid retrieval is the production standard - combine both approaches
  • RRF is a robust score fusion method that doesn't require normalization
  • Metadata filtering adds precision when document properties matter
  • Evaluate systematically using Recall@K, MRR, and NDCG

In the next module, we'll dive into building RAG pipelines, starting with document processing and chunking strategies.