RAG Systems


Introduction to RAG Systems

Welcome to RAG Systems! This course will teach you how to build Retrieval-Augmented Generation systems - one of the most powerful architectures for creating AI applications that can access and reason over custom knowledge bases.

What is RAG?

Retrieval-Augmented Generation (RAG) is an architectural pattern that enhances Large Language Models by retrieving relevant information from external knowledge sources before generating responses. Instead of relying solely on the model's training data, RAG systems dynamically fetch contextual information to provide more accurate, up-to-date, and grounded answers.

The seminal paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Lewis et al. (2020) introduced RAG as a method that combines parametric memory (the LLM's trained weights) with non-parametric memory (external documents retrieved at inference time).

Motivation: Why RAG?

Large Language Models have revolutionized NLP, but they face several critical limitations that RAG addresses:

  • Knowledge Cutoff - LLMs are frozen in time. Their knowledge ends at the training cutoff date, making them unable to answer questions about recent events without external data sources.
  • Hallucinations - LLMs may confidently generate plausible-sounding but incorrect information. RAG grounds responses in retrieved facts, significantly reducing false information.
  • Domain Specificity - Generic LLMs lack specialized knowledge. RAG enables domain expertise without expensive fine-tuning by providing specialized documents at query time.
  • Transparency & Citability - RAG systems can show sources and citations for generated content, enabling fact-checking and building user trust.
  • Privacy & Security - Sensitive organizational data remains in your control, not baked into model weights. You can update, remove, or restrict access to specific documents.
  • Cost Efficiency - Fine-tuning LLMs is expensive and requires retraining when knowledge changes. RAG allows dynamic knowledge updates without model retraining.

The RAG Pipeline: End-to-End Flow

A production RAG system consists of two main phases: an offline ingestion phase and an online query phase.

Phase 1: Ingestion (Offline/Asynchronous)

This phase prepares documents for efficient retrieval:

  1. Document Loading - Extract content from PDFs, web pages, databases, APIs, and other sources
  2. Preprocessing - Clean text, remove boilerplate, handle special characters
  3. Chunking - Split documents into semantically meaningful segments (covered in detail later)
  4. Metadata Extraction - Extract or assign metadata (title, date, source, category) for filtering
  5. Embedding Generation - Convert text chunks into dense vector representations using embedding models
  6. Indexing - Store embeddings in a vector database with appropriate indices for fast similarity search
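The ingestion steps above can be sketched end to end in a few lines. This is a toy, offline-runnable version: `embed()` is a stand-in for a real embedding model (e.g. a Sentence-Transformers model) that just hashes tokens into a fixed-size bag-of-words vector, and the "index" is a plain Python list rather than a vector database.

```python
# Toy ingestion pipeline: chunk -> embed -> index.
# embed() is a stand-in for a real embedding model; it hashes tokens into a
# fixed-size bag-of-words vector so the sketch runs offline with no dependencies.
import hashlib
import math

DIM = 64

def embed(text: str) -> list[float]:
    """Deterministic stand-in embedding: hash each token into one of DIM buckets."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-normalize so dot product = cosine similarity

def chunk(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Fixed-size chunking by word count, with overlap between consecutive chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# The "index": a list of (chunk_text, vector, metadata) records standing in
# for a vector database collection.
index: list[tuple[str, list[float], dict]] = []

def ingest(doc_text: str, source: str) -> None:
    for i, c in enumerate(chunk(doc_text)):
        index.append((c, embed(c), {"source": source, "chunk_id": i}))

ingest("RAG combines retrieval with generation. " * 20, source="intro.md")
print(len(index))  # number of indexed chunks
```

In production, each stand-in is replaced by a real component: a neural embedding model, a smarter splitter, and a vector store with an ANN index.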

Phase 2: Query (Online/Per Request)

This phase handles user queries in real-time:

  1. Query Understanding - Parse and understand user intent, optionally rewrite or expand queries
  2. Query Embedding - Convert the query into a vector using the same embedding model
  3. Retrieval - Find the most similar document chunks using vector similarity search
  4. Re-ranking (Optional) - Reorder retrieved documents using a more sophisticated model
  5. Context Augmentation - Combine retrieved context with the user query in a prompt
  6. Generation - LLM generates a response based on the augmented prompt
  7. Post-processing - Add citations, validate against retrieved context, format response
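Steps 2-6 of the query phase can be sketched as follows. To keep the example self-contained, the "embedding" is just a token set and similarity is Jaccard overlap - stand-ins for a real embedding model and vector similarity search.

```python
# Toy query-phase sketch: embed the query, retrieve the most similar chunks,
# and assemble an augmented prompt for the LLM. The token-set "embedding" and
# Jaccard similarity are offline stand-ins for a real model and vector store.

CHUNKS = [
    "RAG retrieves documents before generation.",
    "Fine-tuning updates the model's weights.",
    "Vector databases support fast similarity search.",
]

def embed(text: str) -> set[str]:
    return {t.strip(".,?!") for t in text.lower().split()}

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(CHUNKS, key=lambda c: similarity(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Context augmentation: combine retrieved chunks with the user query.
    context = "\n".join(f"- {c}" for c in retrieve(query))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

print(build_prompt("How does RAG use retrieval?"))
```

The resulting prompt is what gets sent to the LLM in the generation step; post-processing would then attach the retrieved chunks' metadata as citations.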

Core Components of RAG

A typical RAG system consists of several key components that work together:

  • Document Loaders - Handle various formats: PDF, HTML, Markdown, DOCX, databases, APIs
  • Text Splitters - Chunk documents using strategies like fixed-size, recursive, semantic, or sentence-based splitting
  • Embedding Models - Convert text to vectors (OpenAI Ada, Cohere, Sentence-Transformers, BGE, E5)
  • Vector Stores - Store and index embeddings (Pinecone, Weaviate, Qdrant, ChromaDB, Milvus, FAISS)
  • Retrievers - Find relevant documents using semantic, lexical, or hybrid search
  • Re-rankers - Cross-encoder models that reorder results (Cohere Rerank, BGE Reranker, ColBERT)
  • LLM - Generate responses using retrieved context (GPT-4, Claude, Gemini, Llama, Mistral)
  • Orchestrator - Coordinate the entire pipeline (LangChain, LlamaIndex, Haystack)
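To make the text-splitter component concrete, here is a simplified sketch of recursive splitting - in the spirit of LangChain's RecursiveCharacterTextSplitter, though not its actual implementation: try to split on paragraph breaks first, then sentences, then words, so chunks stay semantically coherent while respecting a maximum size.

```python
# Simplified recursive splitter sketch: coarse separators first (paragraphs),
# falling back to finer ones (sentences, words) only when a piece is too long.

SEPARATORS = ["\n\n", ". ", " "]

def recursive_split(text: str, max_len: int = 80, seps=SEPARATORS) -> list[str]:
    text = text.strip()
    if len(text) <= max_len or not seps:
        return [text] if text else []
    sep, rest = seps[0], seps[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = (current + sep + part) if current else part
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(part) > max_len:
                # A single part may itself exceed max_len: recurse with finer separators.
                chunks.extend(recursive_split(part, max_len, rest))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks

doc = ("RAG systems retrieve before they generate. " * 3 + "\n\n"
       + "Chunking keeps each piece small enough to embed and retrieve precisely.")
for c in recursive_split(doc):
    print(len(c), repr(c))
```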

RAG vs. Fine-Tuning: When to Use Which?

Understanding when to use RAG versus fine-tuning is crucial for building effective AI systems:

Use RAG When:

  • Knowledge needs to be updated frequently (news, documentation, product catalogs)
  • You need source attribution and citations for compliance or trust
  • The knowledge base is too large to fit in context or training data
  • You want to control access to specific documents dynamically
  • Budget and time constraints prevent fine-tuning
  • You need to combine multiple specialized knowledge sources

Use Fine-Tuning When:

  • You need to change the model's behavior, style, or format consistently
  • Knowledge is relatively static and well-defined
  • Low latency is critical and retrieval overhead is unacceptable
  • You want the model to learn specialized reasoning patterns

Best Practice: Combine Both

In many production systems, the optimal approach combines both techniques: fine-tune for style/format/reasoning, then use RAG for dynamic knowledge retrieval.

RAG Variants and Patterns

The RAG ecosystem has evolved with several advanced patterns:

  • Naive/Vanilla RAG - Basic retrieve-then-generate approach. Simple but effective baseline.
  • Advanced RAG - Incorporates query rewriting, hybrid search, re-ranking, and metadata filtering for improved precision and recall.
  • Modular RAG - Combines multiple retrieval strategies (lexical + semantic + knowledge graph) with flexible orchestration.
  • Agentic RAG - Uses AI agents to dynamically decide when and how to retrieve, enabling multi-step reasoning and tool use.
  • Self-RAG - The model self-reflects on whether retrieval is needed and evaluates retrieved document relevance.
  • Corrective RAG (CRAG) - Evaluates retrieval quality and takes corrective actions if retrieved documents are irrelevant.
  • RAPTOR - Recursive Abstractive Processing for Tree-Organized Retrieval. Builds hierarchical summaries for multi-level retrieval.
  • HyDE (Hypothetical Document Embeddings) - Generates a hypothetical answer first, then uses it for retrieval instead of the original query.
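HyDE is simple to sketch: embed a hypothetical answer instead of the question, since an answer tends to sit closer in embedding space to relevant passages than the question does. Both `llm()` and `embed()` below are stand-ins - swap in a real LLM call and embedding model in practice.

```python
# HyDE sketch: generate a hypothetical answer with an LLM, then retrieve with
# the embedding of that answer rather than the original query.

def llm(prompt: str) -> str:
    # Stand-in for a real LLM API call.
    return "RAG reduces hallucinations by grounding answers in retrieved documents."

def embed(text: str) -> set[str]:
    # Stand-in embedding: a token set, compared by overlap in hyde_retrieve().
    return {t.strip(".,?!").lower() for t in text.split()}

def hyde_retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    hypothetical = llm(f"Write a short passage answering: {question}")
    q_vec = embed(hypothetical)  # retrieve with the hypothetical answer, not the question
    return sorted(corpus, key=lambda d: len(q_vec & embed(d)), reverse=True)[:k]

corpus = [
    "Grounding generation in retrieved documents reduces hallucinations.",
    "Fine-tuning changes a model's weights permanently.",
]
print(hyde_retrieve("Why does RAG hallucinate less?", corpus))
```

Note how the question itself shares few words with the relevant passage, but the hypothetical answer overlaps heavily with it - that gap is exactly what HyDE exploits.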

Agentic RAG: The Next Evolution

Agentic RAG represents a significant evolution in RAG architectures, where an LLM acts as an intelligent agent that orchestrates retrieval dynamically:

  • Query Analysis - The agent analyzes whether the query requires retrieval at all
  • Source Selection - Chooses which knowledge sources to query (internal docs, web search, databases)
  • Iterative Retrieval - Can retrieve multiple times, refining queries based on initial results
  • Multi-Step Reasoning - Breaks complex questions into sub-questions, retrieving for each
  • Tool Integration - Can call other tools (calculators, APIs) alongside retrieval

This approach is particularly powerful for complex, multi-hop questions that require synthesizing information from multiple sources.
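The agentic control flow above can be sketched as a simple loop. Every helper here (`needs_retrieval`, `retrieve`, `sufficient`, `refine`) is a stand-in for what would be an LLM call or tool invocation in a real system.

```python
# Minimal agentic-RAG control loop: decide whether retrieval is needed at all,
# then iteratively retrieve and refine the query until the gathered context
# is judged sufficient (or a step budget runs out).

MAX_STEPS = 3

def needs_retrieval(query: str) -> bool:
    # Stand-in policy: skip retrieval only for trivial numeric queries.
    return not query.strip().rstrip("?").replace(" ", "").isdigit()

def retrieve(query: str) -> list[str]:
    return [f"doc about: {query}"]  # stand-in for a vector-store search

def sufficient(context: list[str]) -> bool:
    return len(context) >= 2  # stand-in for an LLM self-assessment

def refine(query: str, context: list[str]) -> str:
    return query + " (refined)"  # stand-in for LLM-driven query rewriting

def agentic_rag(query: str) -> dict:
    if not needs_retrieval(query):
        return {"answer": "direct", "context": []}
    context: list[str] = []
    for _ in range(MAX_STEPS):
        context += retrieve(query)
        if sufficient(context):
            break
        query = refine(query, context)
    return {"answer": "grounded", "context": context}

print(agentic_rag("What changed in our refund policy?"))
```

The step budget matters in practice: without it, an agent that never judges its context sufficient would loop (and bill) indefinitely.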

RAG vs. Long Context Windows

With models supporting 100K+ token contexts, you might wonder if RAG is still relevant. The answer is yes, for several reasons:

  • Computational Cost - Long context inference is expensive. RAG with targeted retrieval is more cost-efficient.
  • Lost in the Middle - Research shows LLMs often miss information buried in the middle of long contexts. RAG keeps the prompt short and focused, so relevant passages are less likely to be lost.
  • Latency - Processing 100K tokens is slower than retrieving a focused 2K context.
  • Scalability - RAG scales to millions of documents; context windows don't.
  • Needle in a Haystack - RAG is more reliable at finding specific information than stuffing everything in context.

That said, long context windows complement RAG by allowing more retrieved chunks to be included in the prompt.

Real-World Applications

RAG powers numerous production applications across industries:

  • Customer Support - AI chatbots with access to product documentation, FAQs, and support history
  • Documentation Search - Intelligent technical documentation assistants (GitHub Copilot Chat, Stripe Docs)
  • Legal Research - Finding relevant cases, regulations, and precedents from vast legal databases
  • Medical Diagnosis Support - Assisting healthcare professionals with medical literature and guidelines
  • Financial Analysis - Analyzing earnings reports, SEC filings, and market research
  • Research Assistants - Helping researchers find and synthesize relevant academic papers
  • Enterprise Search - Semantic search across company documents, Slack, email, and knowledge bases
  • Code Assistants - Retrieving relevant code snippets, documentation, and examples from codebases

Popular RAG Frameworks

Several frameworks simplify RAG development:

  • LangChain - Comprehensive framework with extensive integrations for document loaders, vector stores, and LLMs
  • LlamaIndex - Specialized for data ingestion, indexing, and advanced retrieval strategies
  • Haystack - Production-ready NLP framework with focus on search and QA pipelines
  • Semantic Kernel - Microsoft's AI orchestration framework with enterprise features
  • DSPy - Declarative framework for programmatic optimization of LLM pipelines
  • Vercel AI SDK - Lightweight SDK for building AI applications with RAG support

What You'll Learn in This Course

This comprehensive course covers the full RAG development lifecycle:

  • RAG Architecture - Deep dive into components, data flow, and design patterns
  • Document Processing - Chunking strategies (naive, semantic, late chunking), metadata extraction, and preprocessing
  • Embeddings - Word vs. sentence embeddings, model selection, and optimization
  • Retrieval Strategies - Lexical (BM25), semantic, and hybrid retrieval with score fusion
  • Re-ranking - Cross-encoders, bi-encoders, ColBERT, and learning-to-rank
  • Query Enhancement - Query rewriting, expansion, HyDE, and multi-query strategies
  • Advanced Techniques - Contextual retrieval, sentence-window retrieval, auto-merging
  • Evaluation - Retrieval metrics (precision, recall, MRR) and generation metrics (faithfulness, relevance)
  • Production Optimization - Caching, batching, async processing, and cost optimization
  • Multimodal RAG - Handling images, tables, and other non-text content
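As a taste of the hybrid-retrieval material, here is Reciprocal Rank Fusion (RRF), a common way to fuse a lexical (e.g. BM25) ranking with a semantic ranking: each document scores the sum of 1/(k + rank) across rankings, with k=60 as the conventional default. The two input rankings below are illustrative, not produced by real retrievers.

```python
# Reciprocal Rank Fusion: merge multiple rankings into one by summing
# 1/(k + rank) per document. Documents ranked well by several retrievers
# rise to the top without any score normalization.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]      # illustrative lexical results
semantic_ranking = ["doc_c", "doc_a", "doc_d"]  # illustrative vector-search results

fused = rrf([bm25_ranking, semantic_ranking])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.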

Prerequisites

  • Understanding of Large Language Models (or take our LLM course first)
  • Basic knowledge of vector databases and embeddings (or take our Vector DB course)
  • Python programming experience
  • Familiarity with APIs and basic web development

Key Takeaways

By the end of this course, you'll be able to:

  • Design and implement production-grade RAG architectures
  • Choose appropriate chunking, embedding, and retrieval strategies for your use case
  • Evaluate and optimize RAG systems using industry-standard metrics
  • Build end-to-end RAG applications with proper error handling and observability
  • Deploy and scale RAG systems in production environments

Let's build powerful RAG systems!