RAG Systems

Core Components of RAG

In this lesson, we'll dive deep into each core component of a RAG system, understanding their role, implementation considerations, and how they work together to create intelligent retrieval-augmented applications.

RAG Architecture Overview

A production RAG system is composed of several interconnected components organized into two main pipelines:

  • Ingestion Pipeline (Offline) - Processes documents and builds the searchable index
  • Query Pipeline (Online) - Handles user queries and generates responses

Component 1: Document Loaders

Document loaders are responsible for extracting content from various data sources and converting them into a unified format for processing.

Supported Data Sources

  • Files - PDF, DOCX, TXT, Markdown, HTML, CSV, XLSX
  • Web - Web pages, sitemaps, RSS feeds
  • Databases - SQL databases, MongoDB, Elasticsearch
  • APIs - Notion, Confluence, Slack, GitHub
  • Cloud Storage - S3, Google Drive, Azure Blob

Key Considerations

  • Format Preservation - Maintain structure (headings, lists, tables) when possible
  • Metadata Extraction - Capture document properties (title, author, date, source URL)
  • Incremental Loading - Handle updates without reprocessing entire corpus
  • Error Handling - Gracefully handle corrupted or inaccessible documents
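
The incremental-loading idea can be sketched with a minimal content-hash registry (the function and variable names here are illustrative, not part of any loader API): on each ingestion run, yield only documents whose content has changed since the last run.

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Stable hash used to detect unchanged documents between ingestion runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_filter(docs, seen: dict):
    """Yield only new or changed documents and update the seen-hash registry.

    `docs` is an iterable of (doc_id, text) pairs; `seen` maps doc_id -> hash.
    """
    for doc_id, text in docs:
        fp = content_fingerprint(text)
        if seen.get(doc_id) == fp:
            continue  # unchanged since the last ingestion run; skip reprocessing
        seen[doc_id] = fp
        yield doc_id, text

seen: dict = {}
batch1 = [("a.pdf", "hello"), ("b.pdf", "world")]
print([d for d, _ in incremental_filter(batch1, seen)])  # both are new

batch2 = [("a.pdf", "hello"), ("b.pdf", "world v2")]
print([d for d, _ in incremental_filter(batch2, seen)])  # only b.pdf changed
```

In practice the registry would be persisted (e.g., alongside the vector store) so re-ingestion only touches changed documents.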

Code Example: Loading Documents

from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    NotionDirectoryLoader,
    CSVLoader
)

# Load PDF documents
pdf_loader = PyPDFLoader("documents/report.pdf")
pdf_docs = pdf_loader.load()

# Load web pages
web_loader = WebBaseLoader(["https://example.com/docs"])
web_docs = web_loader.load()

# Load from Notion
notion_loader = NotionDirectoryLoader("notion_exports/")
notion_docs = notion_loader.load()

# Each document has: page_content + metadata
for doc in pdf_docs:
    print(f"Content: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")

Component 2: Text Splitters (Chunking)

Text splitters divide large documents into smaller, semantically coherent chunks. This is crucial because:

  • Embedding models have token limits (typically 512-8192 tokens)
  • Smaller chunks enable more precise retrieval
  • LLM context windows are limited and expensive to fill

Chunking Strategies

1. Fixed-Size Chunking

Splits text by character or token count with optional overlap.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,      # Max characters per chunk
    chunk_overlap=200,    # Overlap between chunks
    separator="\n\n"      # Split on double newlines first
)
chunks = splitter.split_documents(documents)

2. Recursive Character Splitting

Tries multiple separators hierarchically (paragraphs → sentences → words) to find natural breaks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)

3. Semantic Chunking

Uses embeddings to identify semantic boundaries, grouping related sentences together.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = splitter.split_documents(documents)

4. Document-Aware Splitting

Respects document structure (Markdown headers, HTML tags, code blocks).

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split)
chunks = splitter.split_text(markdown_document)

Choosing an Ideal Chunk Size

The optimal chunk size depends on several factors:

  • Embedding Model Limits - Stay within token limits (e.g., 8191 for text-embedding-3-large)
  • Query Type - Short factoid queries favor smaller chunks; complex questions may need larger context
  • Content Type - Technical docs may need larger chunks to preserve context; FAQs work well with small chunks
  • Retrieval Strategy - If using sentence-window retrieval, chunks can be smaller

Best Practice: Start with 500-1000 characters with 10-20% overlap, then tune based on evaluation metrics.
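
The overlap arithmetic behind that starting point is simple. A small sketch using 15% overlap on 800-character chunks, plus a rough chunk-count estimate for a hypothetical 10,000-character document:

```python
# Starting point suggested above: 500-1000 characters with 10-20% overlap
chunk_size = 800
chunk_overlap = chunk_size * 15 // 100   # 15% overlap -> 120 characters

# Stride between consecutive chunk starts
stride = chunk_size - chunk_overlap      # 680 characters

# A window of `chunk_size` advancing by `stride` needs roughly
# ceil((doc_length - chunk_overlap) / stride) chunks to cover the document
doc_length = 10_000
estimated_chunks = -(-(doc_length - chunk_overlap) // stride)  # ceiling division
print(chunk_overlap, stride, estimated_chunks)
```

Estimates like this help predict index size and embedding cost before committing to a configuration.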

Component 3: Embedding Models

Embedding models convert text into dense vector representations that capture semantic meaning. Similar concepts are positioned close together in vector space.

Types of Embeddings

  • Word Embeddings - Word2Vec, GloVe (one vector per word)
  • Sentence/Document Embeddings - Single vector for entire text sequence
  • Token-Level Embeddings - Contextual vectors for each token (BERT, GPT)

Popular Embedding Models for RAG

Model                          | Dimensions | Max Tokens | Notes
-------------------------------|------------|------------|------------------------------------------
OpenAI text-embedding-3-large  | 3072       | 8191       | Best quality, API-based
OpenAI text-embedding-3-small  | 1536       | 8191       | Good balance of quality/cost
Cohere embed-v3                | 1024       | 512        | Multilingual, query/doc modes
BGE-large-en-v1.5              | 1024       | 512        | Open source, high quality
E5-large-v2                    | 1024       | 512        | Prefix-based (query:/passage:)
all-MiniLM-L6-v2               | 384        | 256        | Fast, lightweight, good for prototyping

Sentence Embeddings for RAG

For RAG, sentence-level embeddings (also called dense passage embeddings) are preferred because they:

  • Capture the full semantic meaning of text passages
  • Enable direct similarity comparison between queries and documents
  • Are specifically trained for semantic similarity tasks

Code Example: Generating Embeddings

from langchain_openai import OpenAIEmbeddings
from sentence_transformers import SentenceTransformer

# Using OpenAI
openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = openai_embeddings.embed_documents(["Hello world", "Goodbye world"])

# Using Sentence Transformers (local)
model = SentenceTransformer('all-MiniLM-L6-v2')
vectors = model.encode(["Hello world", "Goodbye world"])

# Each vector is a list of floats
print(f"Vector dimension: {len(vectors[0])}")  # 384 for MiniLM

Component 4: Vector Stores

Vector stores are specialized databases optimized for storing and searching high-dimensional vectors. They use approximate nearest neighbor (ANN) algorithms for fast similarity search.

Key Features

  • Similarity Search - Find vectors closest to a query vector
  • Metadata Filtering - Filter results based on document attributes
  • Hybrid Search - Combine vector and keyword search
  • Scalability - Handle millions or billions of vectors

Popular Vector Databases

  • Pinecone - Fully managed, serverless, enterprise-grade
  • Weaviate - Open source, GraphQL API, multimodal support
  • Qdrant - High-performance, rich filtering, Rust-based
  • ChromaDB - Embedded, developer-friendly, great for prototyping
  • Milvus - Scalable, enterprise-ready, GPU acceleration
  • FAISS - Facebook's library, not a database, but highly optimized
  • pgvector - PostgreSQL extension, use existing infra

Code Example: Using ChromaDB

import os

import chromadb
from chromadb.utils import embedding_functions

# Initialize a client that persists to disk at the given path
client = chromadb.PersistentClient(path="./chroma_db")

# Create embedding function (OpenAIEmbeddingFunction needs an API key)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)

# Create or get collection
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}  # Distance metric
)

# Add documents
collection.add(
    documents=["Document 1 content", "Document 2 content"],
    metadatas=[{"source": "web"}, {"source": "pdf"}],
    ids=["doc1", "doc2"]
)

# Query
results = collection.query(
    query_texts=["What is the content?"],
    n_results=5,
    where={"source": "web"}  # Metadata filter
)

Component 5: Retrievers

Retrievers are responsible for finding relevant documents given a query. They abstract the retrieval logic and can combine multiple strategies.

Retrieval Strategies

  • Semantic (Dense) Retrieval - Uses vector similarity to find semantically related documents
  • Lexical (Sparse) Retrieval - Uses keyword matching (BM25, TF-IDF)
  • Hybrid Retrieval - Combines both approaches for better coverage
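
One common way to combine the two result sets is Reciprocal Rank Fusion (RRF), which needs only each retriever's rank positions, not its raw scores. A minimal sketch (the doc ids and rankings are illustrative):

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank).

    k=60 is the constant commonly used in the RRF literature; it damps the
    influence of any single list's top position.
    """
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc3", "doc1", "doc7"]   # from vector similarity
sparse_ranking = ["doc3", "doc9", "doc1"]  # from BM25 keyword matching
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```

Documents appearing near the top of both lists dominate the fused ranking, which is why hybrid retrieval often beats either method alone.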

Advanced Retrieval Patterns

  • Multi-Query Retrieval - Generate multiple query variations, retrieve for each, merge results
  • Parent Document Retrieval - Retrieve chunks but return parent documents
  • Sentence-Window Retrieval - Retrieve small chunks, expand to include surrounding context
  • Contextual Compression - Extract only relevant portions from retrieved documents
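
Sentence-window retrieval, for example, can be sketched in a few lines: index small chunks for precise matching, then return the matched chunk together with its neighbors (the chunk list and function name here are illustrative):

```python
def expand_window(chunks, hit_index: int, window: int = 1) -> str:
    """Return the matched chunk plus `window` neighbors on each side."""
    lo = max(hit_index - window, 0)
    hi = min(hit_index + window + 1, len(chunks))
    return " ".join(chunks[lo:hi])

chunks = ["S1.", "S2.", "S3.", "S4.", "S5."]
# Suppose similarity search matched the small chunk at index 2 ("S3.")
print(expand_window(chunks, 2))  # "S2. S3. S4."
```

This lets the index stay fine-grained for precision while the LLM still receives enough surrounding context to answer well.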

Component 6: Re-rankers

Re-rankers are models that reorder retrieved results to improve precision. They typically use cross-encoder architectures that jointly process query-document pairs.

Why Re-ranking Matters

  • Initial retrieval prioritizes recall (finding all relevant docs)
  • Re-ranking optimizes precision (putting best docs first)
  • Cross-encoders are more accurate but slower than bi-encoders
  • Applied to top-k results (typically 20-100) for efficiency

Popular Re-ranking Models

  • Cohere Rerank - API-based, high quality, multilingual
  • BGE-Reranker - Open source, available in various sizes
  • ColBERT - Late interaction model, efficient at scale
  • Cross-Encoder models - ms-marco-MiniLM series

Code Example: Re-ranking

from sentence_transformers import CrossEncoder

# Initialize reranker
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Documents retrieved from vector store
query = "What is machine learning?"
retrieved_docs = [
    "Machine learning is a subset of AI...",
    "The weather today is sunny...",
    "ML algorithms learn from data..."
]

# Score each query-document pair
pairs = [[query, doc] for doc in retrieved_docs]
scores = reranker.predict(pairs)

# Sort by score (descending)
ranked = sorted(zip(scores, retrieved_docs), key=lambda pair: pair[0], reverse=True)
for score, doc in ranked:
    print(f"Score: {score:.4f} - {doc[:50]}...")

Component 7: LLM (Generator)

The LLM generates the final response using retrieved context. It synthesizes information from multiple sources into a coherent answer.

Key Considerations

  • Context Window - Ensure retrieved content + prompt + response fit within limits
  • Prompt Engineering - Structure prompts to guide the model to use retrieved context
  • Grounding Instructions - Instruct the model to only use provided context
  • Citation Generation - Ask the model to cite sources in its response
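
A rough way to respect the context window is to keep the highest-ranked passages until a token budget is exhausted. This sketch uses the common ~4-characters-per-token rule of thumb as an estimator (the function name and budget are illustrative; a real system would use the model's tokenizer):

```python
def fit_to_budget(passages, budget_tokens, est_tokens=lambda s: len(s) // 4):
    """Greedily keep the highest-ranked passages within the token budget.

    Passages are assumed to arrive already sorted by relevance.
    """
    kept, used = [], 0
    for passage in passages:
        cost = est_tokens(passage)
        if used + cost > budget_tokens:
            break  # everything after this point is lower-ranked anyway
        kept.append(passage)
        used += cost
    return kept

# Three equally long passages, already ranked by the retriever/re-ranker
passages = ["a" * 400, "b" * 400, "c" * 400]  # ~100 estimated tokens each
print(len(fit_to_budget(passages, budget_tokens=250)))  # room for the top two
```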

RAG Prompt Template Example

RAG_PROMPT = """You are a helpful assistant. Answer the user's question
based ONLY on the following context. If the context doesn't contain
enough information to answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Instructions:
- Use only information from the provided context
- Cite sources using [1], [2], etc.
- If unsure, acknowledge uncertainty
- Be concise but complete

Answer:"""

Component 8: Orchestrator

The orchestrator coordinates all components, managing the flow from query to response. Popular orchestration frameworks include:

  • LangChain - Chains, agents, extensive integrations
  • LlamaIndex - Data ingestion, query engines, specialized for RAG
  • Haystack - Pipeline-based, production-focused

Code Example: Full RAG Chain

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize components
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance for diversity
    search_kwargs={"k": 5}
)

# Create prompt
prompt = PromptTemplate(
    template=RAG_PROMPT,
    input_variables=["context", "question"]
)

# Create chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Simple context stuffing
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

# Query
result = qa_chain.invoke({"query": "What is RAG?"})
print(result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])

Component Integration Diagram

Here's how all components work together in a production RAG system:

┌─────────────────────────────────────────────────────────────────┐
│                     INGESTION PIPELINE (Offline)                 │
├─────────────────────────────────────────────────────────────────┤
│  Documents → Loader → Splitter → Embeddings → Vector Store      │
│  (PDF,Web)   (Parse)  (Chunk)    (Vectorize)   (Index)          │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                      QUERY PIPELINE (Online)                     │
├─────────────────────────────────────────────────────────────────┤
│  User Query → Query Enhancement → Retriever → Re-ranker         │
│             → Context Assembly → LLM → Response + Citations      │
└─────────────────────────────────────────────────────────────────┘

Key Takeaways

  • RAG systems consist of modular components that can be independently optimized
  • Document loaders handle diverse data sources; choose based on your data
  • Chunking strategy significantly impacts retrieval quality
  • Embedding model choice affects both quality and cost
  • Vector stores enable fast similarity search at scale
  • Re-ranking improves precision for top results
  • The LLM prompt should clearly instruct grounding in retrieved context
  • Orchestrators simplify building and maintaining RAG pipelines

In the next lesson, we'll dive deeper into retrieval strategies, exploring lexical, semantic, and hybrid approaches in detail.