Retrieval is the heart of any RAG system. The quality of retrieved documents directly impacts the quality of generated responses. In this lesson, we'll explore the three main retrieval paradigms: lexical, semantic, and hybrid retrieval.
Each retrieval approach has distinct strengths and weaknesses:
Lexical retrieval methods match documents based on exact word occurrences. They're fast, interpretable, and require no ML models.
TF-IDF weights terms by their frequency in a document relative to their frequency across all documents.
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
Where:
TF(t, d) = frequency of term t in document d
IDF(t, D) = log(N / df(t))
N = total number of documents
df(t) = number of documents containing term t

BM25 is the industry-standard lexical retrieval algorithm, used by Elasticsearch, Lucene, and most search engines. It improves on TF-IDF with saturation and length normalization.
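The TF-IDF formula above can be sketched in a few lines of plain Python over a toy corpus (a minimal illustration, not a production implementation):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs_tokens):
    # TF(t, d): raw count of the term in the document
    tf = Counter(doc_tokens)[term]
    # df(t): number of documents containing the term
    df = sum(1 for doc in all_docs_tokens if term in doc)
    # IDF(t, D) = log(N / df(t)); guard against an unseen term (df = 0)
    idf = math.log(len(all_docs_tokens) / df) if df else 0.0
    return tf * idf

docs = [
    "machine learning is fun".split(),
    "deep learning uses neural networks".split(),
    "networks of computers".split(),
]
score = tf_idf("learning", docs[0], docs)  # "learning" appears in 2 of 3 docs
```

Here the score is 1 × log(3/2) ≈ 0.405: the term occurs once in the document and in two of the three documents overall.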
BM25(d, q) = Σ IDF(t) × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × |d|/avgdl))
Where:
tf = term frequency in document d
|d| = document length
avgdl = average document length
k1 = term frequency saturation (typically 1.2-2.0)
b = length normalization (typically 0.75)

from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data required by word_tokenize
# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing enables computers to understand text.",
    "Transformers revolutionized NLP with attention mechanisms.",
]
# Tokenize documents
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]
# Create BM25 index
bm25 = BM25Okapi(tokenized_docs)
# Query
query = "neural networks for language"
tokenized_query = word_tokenize(query.lower())
# Get scores
scores = bm25.get_scores(tokenized_query)
# Rank documents
ranked = sorted(zip(scores, documents), reverse=True)
for score, doc in ranked:
    print(f"Score: {score:.4f} - {doc}")

Semantic retrieval uses dense vector embeddings to capture meaning, enabling matching based on conceptual similarity rather than exact words.
Both queries and documents are encoded into dense vectors in a shared embedding space. Documents close to the query vector are considered relevant.
Embedding models (bi-encoders) convert text to fixed-dimensional vectors:
from sentence_transformers import SentenceTransformer
import numpy as np
# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing enables computers to understand text.",
    "Transformers revolutionized NLP with attention mechanisms.",
]
# Encode documents
doc_embeddings = model.encode(documents)
# Query
query = "How do computers understand human language?"
query_embedding = model.encode(query)
# Calculate cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
# Rank documents
ranked = sorted(zip(similarities, documents), reverse=True)
for sim, doc in ranked:
    print(f"Similarity: {sim:.4f} - {doc}")

Embed and retrieve small chunks, but return expanded context (surrounding sentences/paragraphs).
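A minimal sketch of this small-to-big (sentence-window) idea — token overlap stands in for embedding similarity here, purely to keep the example self-contained:

```python
def sentence_window_retrieve(query_tokens, sentences, window=1):
    """Match against individual sentences, but return the best hit
    together with its neighbors (the expanded context)."""
    def score(sent):
        # Stand-in for embedding similarity: shared-token count
        return len(set(query_tokens) & set(sent.lower().split()))

    best = max(range(len(sentences)), key=lambda i: score(sentences[i]))
    lo = max(0, best - window)
    hi = min(len(sentences), best + window + 1)
    return " ".join(sentences[lo:hi])  # expanded window around the hit

sentences = [
    "The API supports batch requests.",
    "Rate limits apply to all endpoints.",
    "Exceeding the limit returns HTTP 429.",
]
context = sentence_window_retrieve(["rate", "limits"], sentences)
```

The middle sentence matches best, but the caller receives all three sentences, giving the generator the surrounding context it needs.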
Build a hierarchy of chunks (document → section → paragraph → sentence). If multiple child chunks are retrieved, automatically merge to the parent.
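The auto-merging step can be sketched with a simple child-to-parent map (a hypothetical structure for illustration; real implementations such as LlamaIndex's auto-merging retriever work along similar lines):

```python
def auto_merge(retrieved_ids, parent_of, children_of, threshold=0.5):
    """Replace retrieved child chunks with their parent when more than
    `threshold` of the parent's children were retrieved (sketch only)."""
    retrieved = set(retrieved_ids)
    merged = set()
    for cid in retrieved_ids:
        parent = parent_of.get(cid)
        if parent is None:
            merged.add(cid)  # top-level chunk: nothing to merge into
            continue
        siblings = children_of[parent]
        hit_ratio = len(retrieved & set(siblings)) / len(siblings)
        # Enough siblings retrieved -> return the parent instead
        merged.add(parent if hit_ratio > threshold else cid)
    return merged

# Hypothetical two-parent hierarchy
parent_of = {"p1.a": "p1", "p1.b": "p1", "p2.a": "p2", "p2.b": "p2"}
children_of = {"p1": ["p1.a", "p1.b"], "p2": ["p2.a", "p2.b"]}
result = auto_merge(["p1.a", "p1.b", "p2.a"], parent_of, children_of)
```

Both children of `p1` were retrieved, so they collapse into `p1`; only half of `p2`'s children matched, so `p2.a` is kept as-is.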
Prepend contextual information (document title, section headers, summary) to each chunk before embedding. This helps the embedding capture the chunk's role in the broader document.
# Standard chunk
chunk = "The system uses 256-bit encryption."
# Contextual chunk (better for retrieval)
contextual_chunk = """
Document: Security Architecture Guide
Section: Data Protection
Chapter: Encryption Standards
The system uses 256-bit encryption.
"""

Exact nearest neighbor search is O(n), which becomes too slow for large datasets. Approximate nearest neighbor (ANN) algorithms trade a small amount of accuracy for large gains in speed.
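To make the trade-off concrete, here is a toy locality-sensitive hashing (LSH) sketch: vectors are bucketed by the sign of their dot product with random hyperplanes, and search only scores the query's bucket. Production systems use optimized libraries (FAISS, HNSW-based indexes) rather than anything like this:

```python
import random
from collections import defaultdict

random.seed(42)  # deterministic hyperplanes for the example

def make_planes(dim, n_planes):
    # Each random hyperplane contributes one bit of the hash
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lsh_key(vec, planes):
    return tuple(dot(vec, p) >= 0 for p in planes)

def build_index(vectors, planes):
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[lsh_key(v, planes)].append(i)
    return buckets

def ann_search(query, vectors, buckets, planes):
    # Score only the query's bucket: fast, but may miss the true nearest
    candidates = buckets.get(lsh_key(query, planes), [])
    return max(candidates, key=lambda i: dot(query, vectors[i]), default=None)

vectors = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2], [0.0, -1.0]]
planes = make_planes(2, 8)
buckets = build_index(vectors, planes)
nearest = ann_search([1.0, 0.05], vectors, buckets, planes)
```

Nearby vectors tend to fall in the same bucket, but a close neighbor can land just across a hyperplane and be missed — that is exactly the accuracy being traded for speed.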
Hybrid retrieval combines the strengths of both approaches, providing comprehensive coverage. This is the recommended approach for production systems.
Run both retrievers in parallel, then combine results using score fusion:
# Linear score fusion
final_score = alpha * semantic_score + (1 - alpha) * lexical_score
# Where alpha is typically tuned (0.5 is a good starting point)

Note that BM25 scores and cosine similarities live on different scales, so both should be normalized (e.g. min-max) before linear fusion. Reciprocal Rank Fusion (RRF) sidesteps this: it combines ranked lists without requiring score normalization:
def reciprocal_rank_fusion(ranked_lists, k=60):
    """
    Combine multiple ranked lists using RRF.

    RRF_score(d) = Σ 1 / (k + rank(d))

    Args:
        ranked_lists: List of lists, each containing (doc_id, score) tuples
        k: Ranking constant (typically 60)

    Returns:
        Combined ranking
    """
    doc_scores = {}
    for ranked_list in ranked_lists:
        for rank, (doc_id, _) in enumerate(ranked_list, 1):
            if doc_id not in doc_scores:
                doc_scores[doc_id] = 0
            doc_scores[doc_id] += 1 / (k + rank)
    # Sort by combined score
    return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)

The dominant production architecture: use BM25 for fast initial retrieval, then re-rank top-k with a semantic model.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# Create semantic retriever
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Create lexical retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10
# Combine with ensemble
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.5, 0.5],  # Equal weighting
)
# Retrieve
results = hybrid_retriever.invoke("What is machine learning?")

| Query Type | Lexical Only | Semantic Only | Hybrid |
|---|---|---|---|
| "What is ML?" | ❌ Misses "machine learning" | ✅ Understands abbreviation | ✅ |
| "Error code E-1234" | ✅ Exact match | ❌ May not find specific code | ✅ |
| "automobile safety" | ❌ Misses "car" docs | ✅ Finds related | ✅ |
| "John Smith contact" | ✅ Exact name match | ❌ Names are OOV | ✅ |
Beyond content matching, metadata filtering allows you to constrain results based on document properties.
# Query with metadata filter
results = vectorstore.similarity_search(
    query="security best practices",
    k=10,
    filter={
        "department": "engineering",
        "date": {"$gte": "2024-01-01"},
        "document_type": {"$in": ["policy", "guideline"]},
    },
)

Different queries benefit from different strategies, so routing each query to the retriever best suited to it can pay off.
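One simple routing heuristic, purely illustrative: send queries containing quoted phrases or exact identifiers to the lexical retriever, and everything else to hybrid. The patterns below are assumptions for the sketch, not a recommendation:

```python
import re

def route_query(query):
    """Pick a retrieval strategy from surface features of the query
    (illustrative heuristics only)."""
    if re.search(r'"[^"]+"', query):            # quoted phrase: exact match wanted
        return "lexical"
    if re.search(r"\b[A-Z]-?\d{3,}\b", query):  # identifiers like E-1234
        return "lexical"
    return "hybrid"                             # default: broadest coverage

print(route_query("Error code E-1234"))        # "lexical"
print(route_query("how do transformers work")) # "hybrid"
```

In practice such rules are often replaced by a small classifier, but the principle is the same: the query's form is a useful signal for strategy selection.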
To tune retrieval, you need evaluation datasets and metrics:
# Evaluation set format
eval_set = [
    {
        "query": "What is the refund policy?",
        "relevant_doc_ids": ["doc_42", "doc_108"],  # Ground truth
    },
    {
        "query": "How do I reset my password?",
        "relevant_doc_ids": ["doc_7"],
    },
    # ... more examples
]
def evaluate_retriever(retriever, eval_set, k=10):
    recall_scores = []
    mrr_scores = []
    for item in eval_set:
        results = retriever.retrieve(item["query"], k=k)
        retrieved_ids = [r.id for r in results]
        # Recall@K
        relevant_found = len(set(retrieved_ids) & set(item["relevant_doc_ids"]))
        recall = relevant_found / len(item["relevant_doc_ids"])
        recall_scores.append(recall)
        # MRR
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in item["relevant_doc_ids"]:
                mrr_scores.append(1 / rank)
                break
        else:
            mrr_scores.append(0)
    return {
        "recall@k": sum(recall_scores) / len(recall_scores),
        "mrr": sum(mrr_scores) / len(mrr_scores),
    }

In the next module, we'll dive into building RAG pipelines, starting with document processing and chunking strategies.