In this lesson, we'll dive deep into each core component of a RAG system, understanding their role, implementation considerations, and how they work together to create intelligent retrieval-augmented applications.
A production RAG system is composed of several interconnected components organized into two main pipelines: an ingestion pipeline that processes documents offline, and a query pipeline that answers questions online.
Document loaders are responsible for extracting content from various data sources and converting them into a unified format for processing.
```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    NotionDirectoryLoader,
    CSVLoader,
)

# Load PDF documents
pdf_loader = PyPDFLoader("documents/report.pdf")
pdf_docs = pdf_loader.load()

# Load web pages
web_loader = WebBaseLoader(["https://example.com/docs"])
web_docs = web_loader.load()

# Load from Notion
notion_loader = NotionDirectoryLoader("notion_exports/")
notion_docs = notion_loader.load()

# Each document has: page_content + metadata
for doc in pdf_docs:
    print(f"Content: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")
```

Text splitters divide large documents into smaller, semantically coherent chunks. This is crucial because embedding models have input token limits, smaller chunks make retrieval more precise, and the LLM's context window can only hold so much retrieved text.
Splits text by character or token count with optional overlap.
```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,    # Max characters per chunk
    chunk_overlap=200,  # Overlap between chunks
    separator="\n\n"    # Split on double newlines first
)
chunks = splitter.split_documents(documents)
```

Tries multiple separators hierarchically (paragraphs → sentences → words) to find natural breaks.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)
```

Uses embeddings to identify semantic boundaries, grouping related sentences together.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = splitter.split_documents(documents)
```

Respects document structure (Markdown headers, HTML tags, code blocks).
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split)
chunks = splitter.split_text(markdown_document)
```

The optimal chunk size depends on several factors: the structure and density of your documents, the embedding model's token limit, how specific typical queries are, and the trade-off between retrieval precision (smaller chunks) and context completeness (larger chunks).
Best Practice: Start with 500-1000 characters with 10-20% overlap, then tune based on evaluation metrics.
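To see how size and overlap interact mechanically, here is a minimal sliding-window chunker (a simplified sketch; real splitters like those above also respect separators and document boundaries):

```python
# Minimal sliding-window chunker: fixed chunk size with fractional overlap.
def chunk_text(text, chunk_size=1000, overlap_ratio=0.15):
    step = int(chunk_size * (1 - overlap_ratio))  # advance less than a full chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2500 characters, 1000-char chunks, 20% overlap -> step of 800 characters
chunks = chunk_text("x" * 2500, chunk_size=1000, overlap_ratio=0.2)
print(len(chunks))     # 4
print(len(chunks[0]))  # 1000
```

Note that increasing the overlap ratio increases storage and embedding cost, since more text is duplicated across chunks.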
Embedding models convert text into dense vector representations that capture semantic meaning. Similar concepts are positioned close together in vector space.
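"Close together" is usually measured with cosine similarity; a dependency-free sketch with toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the first two point the same way, the third is nearly orthogonal.
cat = [1.0, 0.9, 0.0]
kitten = [0.9, 1.0, 0.1]
car = [0.0, 0.1, 1.0]
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```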
| Model | Dimensions | Max Tokens | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | Best quality, API-based |
| OpenAI text-embedding-3-small | 1536 | 8191 | Good balance of quality/cost |
| Cohere embed-v3 | 1024 | 512 | Multilingual, query/doc modes |
| BGE-large-en-v1.5 | 1024 | 512 | Open source, high quality |
| E5-large-v2 | 1024 | 512 | Prefix-based (query:/passage:) |
| all-MiniLM-L6-v2 | 384 | 256 | Fast, lightweight, good for prototyping |
For RAG, sentence- and passage-level dense embeddings are preferred because they capture the meaning of an entire chunk in a single vector, so a query can match relevant text even when the two share no exact keywords.
```python
from langchain_openai import OpenAIEmbeddings
from sentence_transformers import SentenceTransformer

# Using OpenAI
openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = openai_embeddings.embed_documents(["Hello world", "Goodbye world"])

# Using Sentence Transformers (local)
model = SentenceTransformer('all-MiniLM-L6-v2')
vectors = model.encode(["Hello world", "Goodbye world"])

# Each vector is a list of floats
print(f"Vector dimension: {len(vectors[0])}")  # 384 for MiniLM
```

Vector stores are specialized databases optimized for storing and searching high-dimensional vectors. They use approximate nearest neighbor (ANN) algorithms for fast similarity search.
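What ANN indexes approximate is exact nearest-neighbor search, which scans every vector. A sketch with unit-length toy vectors (so the dot product equals cosine similarity):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query_vec, doc_vecs, k=2):
    # Score every document and keep the k best -- O(n) per query.
    # ANN indexes (e.g., HNSW) avoid this full scan at scale,
    # trading a little recall for large speedups.
    scored = sorted(range(len(doc_vecs)),
                    key=lambda i: dot(query_vec, doc_vecs[i]),
                    reverse=True)
    return scored[:k]

# Unit-length 2-d vectors, so dot product == cosine similarity
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(exact_top_k([1.0, 0.0], docs, k=2))  # [0, 1]
```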
```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize client
client = chromadb.PersistentClient(path="./chroma_db")

# Create embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    model_name="text-embedding-3-small"
)

# Create or get collection
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}  # Distance metric
)

# Add documents
collection.add(
    documents=["Document 1 content", "Document 2 content"],
    metadatas=[{"source": "web"}, {"source": "pdf"}],
    ids=["doc1", "doc2"]
)

# Query
results = collection.query(
    query_texts=["What is the content?"],
    n_results=5,
    where={"source": "web"}  # Metadata filter
)
```

Retrievers are responsible for finding relevant documents given a query. They abstract the retrieval logic and can combine multiple strategies.
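One common way to combine strategies (for example, keyword search and vector search) is reciprocal rank fusion, which merges ranked result lists using only the ranks. A minimal sketch over two hypothetical result lists:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # RRF: each appearance contributes 1 / (k + rank); sum across lists.
    # k=60 is the conventional smoothing constant from the original paper.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc4"]  # e.g., from BM25
vector_hits = ["doc1", "doc2", "doc3"]   # e.g., from the vector store
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['doc1', 'doc3', 'doc2', 'doc4'] -- doc1 wins by appearing high in both lists
```

Because RRF needs only ranks, not scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.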
Re-rankers are models that reorder retrieved results to improve precision. They typically use cross-encoder architectures that jointly process query-document pairs.
```python
from sentence_transformers import CrossEncoder

# Initialize reranker
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Documents retrieved from vector store
query = "What is machine learning?"
retrieved_docs = [
    "Machine learning is a subset of AI...",
    "The weather today is sunny...",
    "ML algorithms learn from data..."
]

# Score each query-document pair
pairs = [[query, doc] for doc in retrieved_docs]
scores = reranker.predict(pairs)

# Sort by score (descending)
ranked = sorted(zip(scores, retrieved_docs), reverse=True)
for score, doc in ranked:
    print(f"Score: {score:.4f} - {doc[:50]}...")
```

The LLM generates the final response using retrieved context. It synthesizes information from multiple sources into a coherent answer.
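Before a prompt like the one below can cite sources as [1], [2], the retrieved chunks must be assembled into a numbered context string. A minimal sketch (the helper name is our own):

```python
def assemble_context(chunks):
    # Number each chunk so the LLM can cite it as [1], [2], ...
    return "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))

context = assemble_context([
    "RAG combines retrieval with generation.",
    "Re-rankers improve retrieval precision.",
])
print(context)
# [1] RAG combines retrieval with generation.
#
# [2] Re-rankers improve retrieval precision.
```

Production systems often also append each chunk's source metadata here so the numbered citations can be mapped back to documents.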
```python
RAG_PROMPT = """You are a helpful assistant. Answer the user's question
based ONLY on the following context. If the context doesn't contain
enough information to answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Instructions:
- Use only information from the provided context
- Cite sources using [1], [2], etc.
- If unsure, acknowledge uncertainty
- Be concise but complete

Answer:"""
```

The orchestrator coordinates all components, managing the flow from query to response. Popular orchestration frameworks include LangChain, LlamaIndex, and Haystack.
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize components
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 5}
)

# Create prompt
prompt = PromptTemplate(
    template=RAG_PROMPT,
    input_variables=["context", "question"]
)

# Create chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Simple context stuffing
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

# Query
result = qa_chain.invoke({"query": "What is RAG?"})
print(result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
```

Here's how all components work together in a production RAG system:
```
┌─────────────────────────────────────────────────────────────────┐
│                   INGESTION PIPELINE (Offline)                  │
├─────────────────────────────────────────────────────────────────┤
│  Documents → Loader → Splitter → Embeddings → Vector Store      │
│  (PDF,Web)   (Parse)   (Chunk)   (Vectorize)    (Index)         │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                     QUERY PIPELINE (Online)                     │
├─────────────────────────────────────────────────────────────────┤
│  User Query → Query Enhancement → Retriever → Re-ranker         │
│  → Context Assembly → LLM → Response + Citations                │
└─────────────────────────────────────────────────────────────────┘
```

In the next lesson, we'll dive deeper into retrieval strategies, exploring lexical, semantic, and hybrid approaches in detail.