RAG Systems

Document Processing and Chunking

Document processing and chunking are critical stages in the RAG ingestion pipeline. How you split documents directly impacts retrieval quality. This lesson covers chunking strategies from basic to advanced, including the cutting-edge late chunking technique.

Why Chunking Matters

Chunking is not just about fitting text into embedding model limits. It fundamentally affects:

  • Retrieval Precision - Smaller chunks enable more targeted retrieval
  • Context Coherence - Chunks should be semantically complete
  • Generation Quality - LLMs perform better with focused, relevant context
  • Cost Efficiency - Optimal chunk sizes minimize token usage

Document Preprocessing

Before chunking, documents typically require preprocessing:

1. Text Extraction

  • PDFs - Use PyPDF2, pdfplumber, or unstructured.io for complex layouts
  • HTML - Strip tags, preserve structure (BeautifulSoup, trafilatura)
  • DOCX - Extract text and preserve formatting (python-docx)
  • OCR - For scanned documents (Tesseract, Azure Document Intelligence)
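As a minimal illustration of tag stripping, here is a sketch that uses only Python's standard-library html.parser; the libraries listed above (BeautifulSoup, trafilatura) offer far more robust handling of malformed markup and structure preservation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text content, skipping script/style blocks."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

html = "<html><head><style>p{color:red}</style></head><body><h1>Title</h1><p>Body text.</p></body></html>"
print(html_to_text(html))  # Title Body text.
```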

2. Text Cleaning

import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize unicode first so NFKC-folded characters (e.g., fullwidth forms)
    # survive the special-character filter below
    text = unicodedata.normalize('NFKC', text)
    
    # Collapse excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Remove special characters (keep common punctuation)
    text = re.sub(r'[^\w\s.,!?;:\-\'"]', '', text)
    
    # Remove boilerplate (headers, footers, page numbers)
    text = re.sub(r'Page \d+ of \d+', '', text)
    
    return text.strip()

3. Metadata Extraction

import os
from datetime import datetime, timezone

def extract_metadata(document: str, source_path: str) -> dict:
    return {
        "source": source_path,
        "filename": os.path.basename(source_path),
        "file_type": os.path.splitext(source_path)[1].lstrip('.'),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "title": extract_title(document),    # Custom extraction (not shown)
        "author": extract_author(document),  # Custom extraction (not shown)
        "word_count": len(document.split()),
    }

Chunking Strategies

1. Naive/Fixed-Size Chunking

The simplest approach: split text by character/token count with optional overlap.

What is Naive Chunking?

Naive chunking divides text into fixed-size segments without regard for semantic boundaries. Each chunk is embedded independently.

Example

from langchain.text_splitter import CharacterTextSplitter

text = """
Machine learning is a subset of artificial intelligence that enables
systems to learn from data. Deep learning, a subset of ML, uses neural
networks with multiple layers. These networks can automatically learn
representations from raw data.
"""

splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separator=" "
)

chunks = splitter.split_text(text)
# Output:
# Chunk 1: "Machine learning is a subset of artificial intelligence that enables systems to learn from data."
# Chunk 2: "from data. Deep learning, a subset of ML, uses neural networks with multiple layers."
# Chunk 3: "multiple layers. These networks can automatically learn representations from raw data."

Advantages

  • Simple to implement
  • Predictable chunk sizes
  • Fast processing

Limitations

  • May split mid-sentence or mid-concept
  • Chunks lose context about their position in the document
  • Pronouns and references may become ambiguous
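The mechanics of fixed-size chunking with overlap can be sketched in a few lines of plain Python; this is a simplified character-level stand-in for library splitters like the one above, and it makes the mid-word splits in the limitations list easy to see:

```python
def fixed_size_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into character windows of chunk_size, stepping by chunk_size - overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step) if text[i:i + chunk_size]]

text = "Machine learning is a subset of artificial intelligence that enables systems to learn from data."
for chunk in fixed_size_chunks(text, chunk_size=50, overlap=10):
    print(repr(chunk))  # note the mid-word splits
```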

2. Recursive Character Splitting

Tries multiple separators hierarchically to find natural break points.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
    length_function=len
)

# Tries to split on paragraphs first, then sentences, then words
chunks = splitter.split_text(document_text)

Best Practice: This is the most commonly used splitter and works well for general text.

3. Document-Aware Splitting

Respects document structure based on format:

Markdown Splitting

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split)
chunks = splitter.split_text(markdown_text)

# Each chunk includes header hierarchy as metadata
# {"h1": "Introduction", "h2": "Overview", "content": "..."}

Code Splitting

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language
)

# Python code splitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50
)

# Respects function/class boundaries
chunks = python_splitter.split_text(python_code)

4. Semantic Chunking

Uses embeddings to identify semantic boundaries between sentences.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Semantic chunker identifies topic shifts
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95  # Split where the distance between adjacent sentences exceeds the 95th percentile
)

chunks = splitter.split_text(document_text)

How It Works

  1. Split text into sentences
  2. Embed each sentence
  3. Calculate similarity between adjacent sentences
  4. Identify significant drops in similarity (topic boundaries)
  5. Group sentences between boundaries into chunks
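The steps above can be sketched in plain Python. The "embeddings" here are toy hand-made vectors so the example is self-contained; a real pipeline would embed each sentence with a model and derive the threshold from a percentile of the similarity distribution, as in the SemanticChunker example:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group sentences, starting a new chunk where adjacent similarity drops below threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))  # similarity drop = topic boundary
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = ["Cats purr.", "Cats also meow.", "GDP rose 2%.", "Inflation fell."]
# Toy embeddings: first two point one way ("cats"), last two another ("economics")
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(semantic_chunks(sentences, embeddings))
# ['Cats purr. Cats also meow.', 'GDP rose 2%. Inflation fell.']
```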

Late Chunking: A Better Approach

Late chunking is a novel technique that addresses the fundamental problem with naive chunking: loss of contextual information.

The Problem with Naive Chunking

When you chunk first and embed second, each chunk loses awareness of the broader document context:

Document: "Berlin is the capital of Germany. It has a population of 3.6 million."

Naive Chunking:
  Chunk 1: "Berlin is the capital of Germany."
  Chunk 2: "It has a population of 3.6 million."

Problem: Chunk 2's embedding doesn't know "It" refers to "Berlin"!

What is Late Chunking?

Late chunking reverses the order: embed the entire document first using a long-context embedding model, then chunk the resulting token embeddings.

How Late Chunking Works

  1. Embed Full Document - Pass the entire document through a long-context transformer
  2. Get Token Embeddings - Obtain contextualized embedding for each token
  3. Chunk Token Sequences - Split the token embeddings (not the text)
  4. Pool per Chunk - Average or max-pool token embeddings within each chunk

Late Chunking Example

# Conceptual implementation of late chunking
from transformers import AutoModel, AutoTokenizer
import torch

# Long-context model (jina-embeddings-v2-base-en supports 8192 tokens);
# load once, outside the function. This model ships custom code, hence
# trust_remote_code=True.
MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def late_chunking(text: str, chunk_size: int = 256):
    # Step 1: Tokenize and embed the ENTIRE document
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    
    with torch.no_grad():
        # Get token-level embeddings (not pooled)
        outputs = model(**tokens)
        token_embeddings = outputs.last_hidden_state[0]  # [seq_len, hidden_dim]
    
    # Step 2: Chunk the token embeddings, not the text
    chunk_embeddings = []
    for i in range(0, len(token_embeddings), chunk_size):
        chunk_tokens = token_embeddings[i:i + chunk_size]
        # Mean pooling within this chunk
        chunk_embeddings.append(chunk_tokens.mean(dim=0))
    
    return chunk_embeddings

Advantages of Late Chunking

  • Preserves Context - Each token embedding is aware of the full document
  • Resolves References - Pronouns and references are contextualized
  • Better Retrieval - Studies show 2-10% improvement in recall@10

Trade-offs

  • Requires Long-Context Models - Not all embedding models support 8K+ tokens
  • Higher Compute - Embedding full documents is more expensive
  • More Complex Pipeline - Requires special handling of the embedding process

Late Interaction (ColBERT and ColPali)

Late interaction is a related but distinct concept used in re-ranking and retrieval.

What is Late Interaction?

Instead of a single embedding per query/document, late interaction keeps per-token embeddings and computes similarity at query time with the MaxSim operator.

ColBERT: Late Interaction in Practice

# ColBERT scoring with MaxSim
import torch

def colbert_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> torch.Tensor:
    """
    query_embeddings: [num_query_tokens, dim], L2-normalized
    doc_embeddings: [num_doc_tokens, dim], L2-normalized
    
    For each query token, find the max similarity to any doc token,
    then sum these max similarities.
    """
    # Compute all pairwise (cosine) similarities
    similarities = query_embeddings @ doc_embeddings.T  # [query_tokens, doc_tokens]
    
    # MaxSim: for each query token, take the max similarity
    max_sims = similarities.max(dim=1).values  # [query_tokens]
    
    # Sum across query tokens
    return max_sims.sum()

# Example
# Query: "What is ML?" -> 4 token embeddings
# Doc: "Machine learning is a type of AI..." -> 8 token embeddings
# Score = MaxSim aggregated across all query tokens

ColPali: Multimodal Late Interaction

ColPali extends late interaction to multimodal retrieval, enabling similarity computation between text queries and document images (PDFs, screenshots).

  • No OCR required - works directly on document images
  • Preserves layout and visual information
  • Excellent for tables, figures, and complex documents

Comparative Analysis: Chunking Methods

Method             Context Aware   Compute Cost   Best For
Naive/Fixed        No              Low            Simple documents, prototyping
Recursive          No              Low            General purpose, production baseline
Document-Aware     Partial         Low            Structured docs (Markdown, code)
Semantic           Partial         Medium         Topic-diverse documents
Late Chunking      High            High           High-stakes retrieval, complex docs
Late Interaction   High            High           Re-ranking, precision-critical

Figuring Out the Ideal Chunk Size

Chunk size is one of the most important hyperparameters in RAG. Here's how to determine the optimal size:

Factors to Consider

  • Embedding Model Limit - Most models cap at 512 tokens; newer ones support 8K+
  • Query Type - Factoid questions → smaller chunks; complex questions → larger chunks
  • Content Type - Dense technical content → larger; FAQs → smaller
  • LLM Context Window - More chunks = less room for response

Empirical Tuning

# Sweep chunk sizes and evaluate retrieval quality
# (split_documents, build_index, and evaluate_retrieval are placeholders
# for your own pipeline functions)
chunk_sizes = [256, 512, 1024, 2048]
overlaps = [0, 64, 128, 256]

results = []
for size in chunk_sizes:
    for overlap in overlaps:
        if overlap >= size:
            continue
        
        # Build index with this configuration
        chunks = split_documents(documents, chunk_size=size, overlap=overlap)
        index = build_index(chunks)
        
        # Evaluate on test set
        metrics = evaluate_retrieval(index, test_queries)
        results.append({
            "chunk_size": size,
            "overlap": overlap,
            "recall@5": metrics["recall@5"],
            "mrr": metrics["mrr"]
        })

# Find best configuration
best = max(results, key=lambda x: x["recall@5"])

Rules of Thumb

  • Start with - 500-1000 characters with 10-20% overlap
  • For code - Respect function/class boundaries (1000-2000 chars)
  • For conversational - Smaller chunks (300-500 chars)
  • For long-form - Larger chunks (1000-2000 chars) with sentence-window retrieval
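These heuristics can be encoded as simple defaults. This is a hypothetical helper; the content-type labels and values just restate the rules of thumb above and should be tuned empirically:

```python
# Hypothetical defaults derived from the rules of thumb above (characters)
CHUNK_DEFAULTS = {
    "general":        {"chunk_size": 800,  "overlap": 120},  # ~15% overlap
    "code":           {"chunk_size": 1500, "overlap": 200},  # also respect function/class boundaries
    "conversational": {"chunk_size": 400,  "overlap": 60},
    "long_form":      {"chunk_size": 1500, "overlap": 225},  # pair with sentence-window retrieval
}

def chunk_params(content_type: str) -> dict:
    """Return chunking defaults, falling back to general-purpose settings."""
    return CHUNK_DEFAULTS.get(content_type, CHUNK_DEFAULTS["general"])

print(chunk_params("code"))  # {'chunk_size': 1500, 'overlap': 200}
```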

Sentence-Window Retrieval (Small-to-Large)

A powerful pattern that decouples embedding granularity from context window:

How It Works

  1. Embed small chunks (e.g., 1-2 sentences) for precise retrieval
  2. Store parent references linking chunks to larger context
  3. After retrieval, expand to include surrounding sentences/paragraphs

from llama_index.core.node_parser import SentenceWindowNodeParser

# Parse with sentence windows
parser = SentenceWindowNodeParser.from_defaults(
    sentence_splitter=lambda text: text.split(". "),
    window_size=3,  # Include 3 sentences before/after
    window_metadata_key="window",
    original_text_metadata_key="original_text"
)

nodes = parser.get_nodes_from_documents(documents)

# Each node contains:
# - node.text: The specific sentence (embedded)
# - node.metadata["window"]: Surrounding context (returned to LLM)

Contextual Retrieval (Anthropic)

Anthropic's contextual retrieval technique prepends document-level context to each chunk:

import anthropic

def add_context_to_chunk(chunk: str, document: str) -> str:
    """Use Claude to generate contextual prefix for each chunk."""
    
    client = anthropic.Anthropic()
    
    prompt = f"""<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within 
the overall document for the purposes of improving search retrieval 
of the chunk. Answer only with the succinct context and nothing else."""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    
    context = response.content[0].text
    return f"{context}\n\n{chunk}"

# Example output:
# Original: "The system uses 256-bit AES encryption."
# Contextual: "This section from the Security Architecture Guide describes 
#             the encryption standards. The system uses 256-bit AES encryption."

Benefits of Contextual Retrieval

  • 35% reduction in retrieval failures from contextual embeddings alone (Anthropic benchmarks)
  • 49% reduction when combined with contextual BM25; 67% with reranking added
  • Works with any embedding model

Production Pipeline Example

from dataclasses import dataclass
from typing import List
import hashlib

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Assumes clean_text, extract_metadata, and add_context_to_chunk from above,
# plus an embed_model (e.g., a sentence-transformers model)

@dataclass
class ProcessedChunk:
    id: str
    text: str
    embedding: List[float]
    metadata: dict

def process_document(
    document: str,
    source: str,
    chunk_size: int = 512,
    overlap: int = 50,
    add_context: bool = True
) -> List[ProcessedChunk]:
    
    # Step 1: Clean and preprocess
    clean_doc = clean_text(document)
    
    # Step 2: Extract metadata
    metadata = extract_metadata(document, source)
    
    # Step 3: Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )
    chunks = splitter.split_text(clean_doc)
    
    # Step 4: Add contextual prefixes (optional)
    if add_context:
        chunks = [add_context_to_chunk(chunk, clean_doc) for chunk in chunks]
    
    # Step 5: Generate embeddings
    embeddings = embed_model.encode(chunks)
    
    # Step 6: Create processed chunks with IDs and metadata
    processed = []
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        chunk_id = hashlib.md5(f"{source}:{i}".encode()).hexdigest()
        processed.append(ProcessedChunk(
            id=chunk_id,
            text=chunk,
            embedding=embedding.tolist(),
            metadata={
                **metadata,
                "chunk_index": i,
                "total_chunks": len(chunks)
            }
        ))
    
    return processed

Key Takeaways

  • Preprocessing matters - Clean text and extract metadata before chunking
  • Recursive splitting is the reliable baseline for most use cases
  • Semantic chunking helps with topic-diverse documents
  • Late chunking preserves document context but requires long-context models
  • Sentence-window retrieval enables precise retrieval with rich context
  • Contextual retrieval significantly improves retrieval quality
  • Tune chunk size empirically - start at 500-1000 chars

In the next lesson, we'll explore embedding generation and storage in detail, including model selection and optimization strategies.