RAG Systems

Document Processing and Chunking

Document processing and chunking are critical stages in the RAG ingestion pipeline. How you split documents directly impacts retrieval quality. This lesson covers chunking strategies from basic to advanced, including the cutting-edge late chunking technique.

Why Chunking Matters

Chunking is not just about fitting text into embedding model limits. It fundamentally affects:

  • Retrieval Precision - Smaller chunks enable more targeted retrieval
  • Context Coherence - Chunks should be semantically complete
  • Generation Quality - LLMs perform better with focused, relevant context
  • Cost Efficiency - Optimal chunk sizes minimize token usage

Document Preprocessing

Before chunking, documents typically require preprocessing:

1. Text Extraction

  • PDFs - Use PyPDF2, pdfplumber, or unstructured.io for complex layouts
  • HTML - Strip tags, preserve structure (BeautifulSoup, trafilatura)
  • DOCX - Extract text and preserve formatting (python-docx)
  • OCR - For scanned documents (Tesseract, Azure Document Intelligence)
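As a minimal illustration of tag stripping, here is a sketch that uses only Python's standard-library html.parser; the libraries listed above (BeautifulSoup, trafilatura) offer far more robust handling of malformed markup and structure preservation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text content, skipping script/style blocks."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

html = "<html><head><style>p{color:red}</style></head><body><h1>Title</h1><p>Body text.</p></body></html>"
print(html_to_text(html))  # Title Body text.
```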

2. Text Cleaning

import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize unicode first so NFKC-folded characters (e.g., fullwidth forms)
    # survive the special-character filter below
    text = unicodedata.normalize('NFKC', text)
    
    # Collapse excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Remove special characters (keep common punctuation)
    text = re.sub(r'[^\w\s.,!?;:\-\'"]', '', text)
    
    # Remove boilerplate (headers, footers, page numbers)
    text = re.sub(r'Page \d+ of \d+', '', text)
    
    return text.strip()

3. Metadata Extraction

import os
from datetime import datetime, timezone

def extract_metadata(document: str, source_path: str) -> dict:
    return {
        "source": source_path,
        "filename": os.path.basename(source_path),
        "file_type": os.path.splitext(source_path)[1].lstrip('.'),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "title": extract_title(document),    # Custom extraction (not shown)
        "author": extract_author(document),  # Custom extraction (not shown)
        "word_count": len(document.split()),
    }

Chunking Strategies

1. Naive/Fixed-Size Chunking

The simplest approach: split text by character/token count with optional overlap.

What is Naive Chunking?

Naive chunking divides text into fixed-size segments without regard for semantic boundaries. Each chunk is embedded independently.

Example

from langchain.text_splitter import CharacterTextSplitter

text = """
Machine learning is a subset of artificial intelligence that enables
systems to learn from data. Deep learning, a subset of ML, uses neural
networks with multiple layers. These networks can automatically learn
representations from raw data.
"""

splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separator=" "
)

chunks = splitter.split_text(text)
# Output:
# Chunk 1: "Machine learning is a subset of artificial intelligence that enables systems to learn from data."
# Chunk 2: "from data. Deep learning, a subset of ML, uses neural networks with multiple layers."
# Chunk 3: "multiple layers. These networks can automatically learn representations from raw data."

Advantages

  • Simple to implement
  • Predictable chunk sizes
  • Fast processing

Limitations

  • May split mid-sentence or mid-concept
  • Chunks lose context about their position in the document
  • Pronouns and references may become ambiguous
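The mechanics of fixed-size chunking with overlap can be sketched in a few lines of plain Python; this is a simplified character-level stand-in for library splitters like the one above, and it makes the mid-word splits in the limitations list easy to see:

```python
def fixed_size_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into character windows of chunk_size, stepping by chunk_size - overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step) if text[i:i + chunk_size]]

text = "Machine learning is a subset of artificial intelligence that enables systems to learn from data."
for chunk in fixed_size_chunks(text, chunk_size=50, overlap=10):
    print(repr(chunk))  # note the mid-word splits
```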

2. Recursive Character Splitting

Tries multiple separators hierarchically to find natural break points.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
    length_function=len
)

# Tries to split on paragraphs first, then sentences, then words
chunks = splitter.split_text(document_text)

Best Practice: This is the most commonly used splitter and works well for general text.

3. Document-Aware Splitting

Respects document structure based on format:

Markdown Splitting

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split)
chunks = splitter.split_text(markdown_text)

# Each chunk includes header hierarchy as metadata
# {"h1": "Introduction", "h2": "Overview", "content": "..."}

Code Splitting

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language
)

# Python code splitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50
)

# Respects function/class boundaries
chunks = python_splitter.split_text(python_code)

4. Semantic Chunking

Uses embeddings to identify semantic boundaries between sentences.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Semantic chunker identifies topic shifts
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95  # Split where the distance between adjacent sentences exceeds the 95th percentile
)

chunks = splitter.split_text(document_text)

How It Works

  1. Split text into sentences
  2. Embed each sentence
  3. Calculate similarity between adjacent sentences
  4. Identify significant drops in similarity (topic boundaries)
  5. Group sentences between boundaries into chunks
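The steps above can be sketched in plain Python. The "embeddings" here are toy hand-made vectors so the example is self-contained; a real pipeline would embed each sentence with a model and derive the threshold from a percentile of the similarity distribution, as in the SemanticChunker example:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group sentences, starting a new chunk where adjacent similarity drops below threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))  # similarity drop = topic boundary
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = ["Cats purr.", "Cats also meow.", "GDP rose 2%.", "Inflation fell."]
# Toy embeddings: first two point one way ("cats"), last two another ("economics")
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(semantic_chunks(sentences, embeddings))
# ['Cats purr. Cats also meow.', 'GDP rose 2%. Inflation fell.']
```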

Late Chunking: A Better Approach

Late chunking is a novel technique that addresses the fundamental problem with naive chunking: loss of contextual information.

The Problem with Naive Chunking

When you chunk first and embed second, each chunk loses awareness of the broader document context:

Document: "Berlin is the capital of Germany. It has a population of 3.6 million."

Naive Chunking:
  Chunk 1: "Berlin is the capital of Germany."
  Chunk 2: "It has a population of 3.6 million."

Problem: Chunk 2's embedding doesn't know "It" refers to "Berlin"!

What is Late Chunking?

Late chunking reverses the order: embed the entire document first using a long-context embedding model, then chunk the resulting token embeddings.

How Late Chunking Works

  1. Embed Full Document - Pass the entire document through a long-context transformer
  2. Get Token Embeddings - Obtain contextualized embedding for each token
  3. Chunk Token Sequences - Split the token embeddings (not the text)
  4. Pool per Chunk - Average or max-pool token embeddings within each chunk

Late Chunking Example

# Conceptual implementation of late chunking
from transformers import AutoModel, AutoTokenizer
import torch

# Long-context model (jina-embeddings-v2-base-en supports 8192 tokens);
# load once, outside the function. This model ships custom code, hence
# trust_remote_code=True.
MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def late_chunking(text: str, chunk_size: int = 256):
    # Step 1: Tokenize and embed the ENTIRE document
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    
    with torch.no_grad():
        # Get token-level embeddings (not pooled)
        outputs = model(**tokens)
        token_embeddings = outputs.last_hidden_state[0]  # [seq_len, hidden_dim]
    
    # Step 2: Chunk the token embeddings, not the text
    chunk_embeddings = []
    for i in range(0, len(token_embeddings), chunk_size):
        chunk_tokens = token_embeddings[i:i + chunk_size]
        # Mean pooling within this chunk
        chunk_embeddings.append(chunk_tokens.mean(dim=0))
    
    return chunk_embeddings

Advantages of Late Chunking

  • Preserves Context - Each token embedding is aware of the full document
  • Resolves References - Pronouns and references are contextualized
  • Better Retrieval - Studies show 2-10% improvement in recall@10

Trade-offs

  • Requires Long-Context Models - Not all embedding models support 8K+ tokens
  • Higher Compute - Embedding full documents is more expensive
  • More Complex Pipeline - Requires special handling of the embedding process

Late Interaction (ColBERT and ColPali)

Late interaction is a related but distinct concept used in re-ranking and retrieval.

What is Late Interaction?

Instead of a single embedding per query/document, late interaction keeps per-token embeddings and computes similarity at query time with the MaxSim operator.

ColBERT: Late Interaction in Practice

# ColBERT scoring with MaxSim
import torch

def colbert_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> torch.Tensor:
    """
    query_embeddings: [num_query_tokens, dim], L2-normalized
    doc_embeddings: [num_doc_tokens, dim], L2-normalized
    
    For each query token, find the max similarity to any doc token,
    then sum these max similarities.
    """
    # Compute all pairwise (cosine) similarities
    similarities = query_embeddings @ doc_embeddings.T  # [query_tokens, doc_tokens]
    
    # MaxSim: for each query token, take the max similarity
    max_sims = similarities.max(dim=1).values  # [query_tokens]
    
    # Sum across query tokens
    return max_sims.sum()

# Example
# Query: "What is ML?" -> 4 token embeddings
# Doc: "Machine learning is a type of AI..." -> 8 token embeddings
# Score = MaxSim aggregated across all query tokens

ColPali: Multimodal Late Interaction

ColPali extends late interaction to multimodal retrieval, enabling similarity computation between text queries and document images (PDFs, screenshots).

  • No OCR required - works directly on document images
  • Preserves layout and visual information
  • Excellent for tables, figures, and complex documents

Comparative Analysis: Chunking Methods

Method             Context Aware   Compute Cost   Best For
Naive/Fixed        No              Low            Simple documents, prototyping
Recursive          No              Low            General purpose, production baseline
Document-Aware     Partial         Low            Structured docs (Markdown, code)
Semantic           Partial         Medium         Topic-diverse documents
Late Chunking      High            High           High-stakes retrieval, complex docs
Late Interaction   High            High           Re-ranking, precision-critical

Figuring Out the Ideal Chunk Size

Chunk size is one of the most important hyperparameters in RAG. Here's how to determine the optimal size:

Factors to Consider

  • Embedding Model Limit - Most models cap at 512 tokens; newer ones support 8K+
  • Query Type - Factoid questions → smaller chunks; complex questions → larger chunks
  • Content Type - Dense technical content → larger; FAQs → smaller
  • LLM Context Window - More chunks = less room for response

Empirical Tuning

# Sweep chunk sizes and evaluate retrieval quality
# (split_documents, build_index, and evaluate_retrieval are placeholders
# for your own pipeline functions)
chunk_sizes = [256, 512, 1024, 2048]
overlaps = [0, 64, 128, 256]

results = []
for size in chunk_sizes:
    for overlap in overlaps:
        if overlap >= size:
            continue
        
        # Build index with this configuration
        chunks = split_documents(documents, chunk_size=size, overlap=overlap)
        index = build_index(chunks)
        
        # Evaluate on test set
        metrics = evaluate_retrieval(index, test_queries)
        results.append({
            "chunk_size": size,
            "overlap": overlap,
            "recall@5": metrics["recall@5"],
            "mrr": metrics["mrr"]
        })

# Find best configuration
best = max(results, key=lambda x: x["recall@5"])

Rules of Thumb

  • Start with - 500-1000 characters with 10-20% overlap
  • For code - Respect function/class boundaries (1000-2000 chars)
  • For conversational - Smaller chunks (300-500 chars)
  • For long-form - Larger chunks (1000-2000 chars) with sentence-window retrieval
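These heuristics can be encoded as simple defaults. This is a hypothetical helper; the content-type labels and values just restate the rules of thumb above and should be tuned empirically:

```python
# Hypothetical defaults derived from the rules of thumb above (characters)
CHUNK_DEFAULTS = {
    "general":        {"chunk_size": 800,  "overlap": 120},  # ~15% overlap
    "code":           {"chunk_size": 1500, "overlap": 200},  # also respect function/class boundaries
    "conversational": {"chunk_size": 400,  "overlap": 60},
    "long_form":      {"chunk_size": 1500, "overlap": 225},  # pair with sentence-window retrieval
}

def chunk_params(content_type: str) -> dict:
    """Return chunking defaults, falling back to general-purpose settings."""
    return CHUNK_DEFAULTS.get(content_type, CHUNK_DEFAULTS["general"])

print(chunk_params("code"))  # {'chunk_size': 1500, 'overlap': 200}
```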

Sentence-Window Retrieval (Small-to-Large)

A powerful pattern that decouples embedding granularity from context window:

How It Works

  1. Embed small chunks (e.g., 1-2 sentences) for precise retrieval
  2. Store parent references linking chunks to larger context
  3. After retrieval, expand to include surrounding sentences/paragraphs

from llama_index.core.node_parser import SentenceWindowNodeParser

# Parse with sentence windows
parser = SentenceWindowNodeParser.from_defaults(
    sentence_splitter=lambda text: text.split(". "),
    window_size=3,  # Include 3 sentences before/after
    window_metadata_key="window",
    original_text_metadata_key="original_text"
)

nodes = parser.get_nodes_from_documents(documents)

# Each node contains:
# - node.text: The specific sentence (embedded)
# - node.metadata["window"]: Surrounding context (returned to LLM)

Contextual Retrieval (Anthropic)

Anthropic's contextual retrieval technique prepends document-level context to each chunk:

import anthropic

def add_context_to_chunk(chunk: str, document: str) -> str:
    """Use Claude to generate contextual prefix for each chunk."""
    
    client = anthropic.Anthropic()
    
    prompt = f"""<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within 
the overall document for the purposes of improving search retrieval 
of the chunk. Answer only with the succinct context and nothing else."""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    
    context = response.content[0].text
    return f"{context}\n\n{chunk}"

# Example output:
# Original: "The system uses 256-bit AES encryption."
# Contextual: "This section from the Security Architecture Guide describes 
#             the encryption standards. The system uses 256-bit AES encryption."

Benefits of Contextual Retrieval

  • 35% reduction in retrieval failures from contextual embeddings alone (Anthropic benchmarks)
  • 49% reduction when combined with contextual BM25; 67% with reranking added
  • Works with any embedding model

Production Pipeline Example

from dataclasses import dataclass
from typing import List
import hashlib

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Assumes clean_text, extract_metadata, and add_context_to_chunk from above,
# plus an embed_model (e.g., a sentence-transformers model)

@dataclass
class ProcessedChunk:
    id: str
    text: str
    embedding: List[float]
    metadata: dict

def process_document(
    document: str,
    source: str,
    chunk_size: int = 512,
    overlap: int = 50,
    add_context: bool = True
) -> List[ProcessedChunk]:
    
    # Step 1: Clean and preprocess
    clean_doc = clean_text(document)
    
    # Step 2: Extract metadata
    metadata = extract_metadata(document, source)
    
    # Step 3: Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )
    chunks = splitter.split_text(clean_doc)
    
    # Step 4: Add contextual prefixes (optional)
    if add_context:
        chunks = [add_context_to_chunk(chunk, clean_doc) for chunk in chunks]
    
    # Step 5: Generate embeddings
    embeddings = embed_model.encode(chunks)
    
    # Step 6: Create processed chunks with IDs and metadata
    processed = []
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        chunk_id = hashlib.md5(f"{source}:{i}".encode()).hexdigest()
        processed.append(ProcessedChunk(
            id=chunk_id,
            text=chunk,
            embedding=embedding.tolist(),
            metadata={
                **metadata,
                "chunk_index": i,
                "total_chunks": len(chunks)
            }
        ))
    
    return processed

Key Takeaways

  • Preprocessing matters - Clean text and extract metadata before chunking
  • Recursive splitting is the reliable baseline for most use cases
  • Semantic chunking helps with topic-diverse documents
  • Late chunking preserves document context but requires long-context models
  • Sentence-window retrieval enables precise retrieval with rich context
  • Contextual retrieval significantly improves retrieval quality
  • Tune chunk size empirically - start at 500-1000 chars

In the next lesson, we'll explore embedding generation and storage in detail, including model selection and optimization strategies.