What is text chunking in RAG systems?

Text chunking is the process of splitting long documents into smaller, semantically coherent segments (chunks) suitable for embedding and retrieval. In RAG pipelines, chunking determines the granularity at which knowledge is indexed -- instead of embedding entire documents, we embed chunks. This allows retrieval systems to return specific relevant passages rather than whole documents, improving precision and reducing wasted context in the LLM prompt. Think of it this way: without chunking, asking your RAG system a specific question is like searching for a quote by reading entire books. With chunking, you're searching an index of paragraphs.

What is the best chunk size for RAG?

There is no universal optimal chunk size -- it depends on your corpus, embedding model, and downstream task. A common starting point is 512-1024 tokens with 10-20% overlap. Smaller chunks (128-256 tokens) maximize retrieval precision but risk losing context. Larger chunks (1024-2048 tokens) preserve context but reduce precision. The right answer? Empirically validate on a representative evaluation set and tune based on retrieval recall and answer quality metrics. In practice, I've found that 512 tokens with 100-token overlap works well for most English technical documentation. For multilingual corpora (e.g., mixing Hindi and English), you may need to adjust since tokenization varies significantly across languages.

Should I use character-based or token-based chunk size?

Token-based. Always. Embedding models enforce token limits, not character limits. A 1000-character chunk might be 250 tokens (mostly English prose) or 500 tokens (if it contains code, special characters, or non-Latin scripts like Devanagari). Character-based sizing leads to unpredictable truncation. Always use the same tokenizer as your embedding model -- for OpenAI models, that's tiktoken with cl100k_base encoding.

What is chunk overlap and why is it important?

Chunk overlap means consecutive chunks share a region of text. For example, if overlap is 200 tokens, the last 200 tokens of chunk $N$ are the first 200 tokens of chunk $N+1$. Why does this matter? It ensures that a sentence or thought spanning a chunk boundary still appears whole in at least one chunk, instead of being cut in two and losing critical context. Without overlap, a sentence like 'The refund policy described above applies to all transactions over Rs. 10,000' might get split -- with 'The refund policy described above' in one chunk and 'applies to all transactions over Rs. 10,000' in another. Neither chunk is useful on its own. Overlap adds storage and embedding cost (a 20% overlap means roughly 25% more chunks, because each chunk advances by only 80% of its length) but significantly improves retrieval quality for content near boundaries. 10-20% overlap is the standard recommendation.
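To make the overlap mechanics concrete, here is a minimal sketch of a sliding window over a token list. It assumes the tokens are already available as a Python list (whitespace splitting stands in for a real tokenizer), and the helper name `chunk_with_overlap` is illustrative, not taken from any library.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting is a stand-in for a real tokenizer.
tokens = "The refund policy described above applies to all transactions over Rs. 10,000".split()
chunks = chunk_with_overlap(tokens, chunk_size=8, overlap=3)
for chunk in chunks:
    print(" ".join(chunk))
```

The last three tokens of the first chunk reappear at the start of the second, so a phrase that straddles the boundary is still readable in one piece.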

When should I use semantic chunking instead of fixed-size chunking?

Use semantic chunking when your documents have clear topic shifts that fixed-size splits would violate -- think encyclopedia articles, research papers with distinct sections, or product catalogs with varied categories. Semantic chunking detects these boundaries by embedding sentences and splitting where similarity drops. However, it requires embedding at chunking time (not just at indexing), which adds cost. For a 50,000-document corpus, semantic chunking might cost $50-100 (Rs. 4,000-8,500) in compute compared to near-zero for recursive splitting. My recommendation: for most applications, recursive character splitting with sentence-aware separators provides 80% of the benefit at 20% of the cost. Upgrade to semantic chunking only when your retrieval metrics justify it.

What is late chunking?

Late chunking flips the traditional approach on its head. Instead of chunk-then-embed, it embeds the entire document with a long-context model, *then* splits the resulting contextualized embeddings at sentence or paragraph boundaries. Why is this powerful? Because cross-chunk context is preserved in the embeddings themselves. A pronoun reference in chunk 2 can still attend to its antecedent in chunk 1 during the embedding process, since the full document was embedded together. However, it requires expensive long-context models and is only viable for documents that fit within the model's context window (typically 8K-32K tokens). For a 100-page document, you'd need a 32K+ context model. This makes late chunking best suited for medium-length, high-value documents where retrieval quality justifies the cost.

How does chunking affect retrieval precision vs. recall?

Let's break this down:

- **Smaller chunks increase precision**: Retrieved chunks are more likely to be fully relevant. BUT recall can suffer if the answer spans multiple chunks that aren't all retrieved.
- **Larger chunks improve recall**: More content per retrieved chunk means less chance of missing the answer. BUT precision drops because chunks contain more irrelevant content alongside the answer.
- **Overlap helps recall**: By ensuring boundary-spanning content is fully captured in at least one chunk.

The optimal balance depends on your downstream task. Citation-heavy applications (legal, medical) need high precision -- use smaller chunks. Exploratory search applications benefit from higher recall -- use larger chunks with parent-child retrieval.

Should I chunk before or after removing headers and footers?

Always clean documents *before* chunking. This is non-negotiable. If you chunk first, headers/footers/boilerplate get distributed across many chunks, polluting many embeddings. If you clean first, non-content elements are removed once, and all chunks contain only meaningful text. This is especially critical for PDFs and scanned documents where page numbers, watermarks, and disclaimers repeat on every page. I've seen RAG systems where 30% of retrieved chunks were dominated by repeated footer text like 'Confidential -- For Internal Use Only'. Cleaning before chunking eliminated the problem entirely.
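As a rough illustration of cleaning before chunking, the sketch below drops any line that repeats on most pages, which is a common heuristic for headers and footers. The function name `strip_repeated_lines` and the `min_fraction` threshold are assumptions made for this example; real pipelines usually combine this with layout information from the PDF parser.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on at least min_fraction of pages (likely headers/footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page, so a line repeated inside
        # a single page does not look like cross-page boilerplate.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(min_fraction * len(pages)))
    boilerplate = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Annual Report\nRevenue grew 12% year over year.\nConfidential -- For Internal Use Only",
    "Annual Report\nOperating costs fell by 3%.\nConfidential -- For Internal Use Only",
    "Annual Report\nHeadcount was flat.\nConfidential -- For Internal Use Only",
]
clean_text = "\n".join(strip_repeated_lines(pages))
print(clean_text)
```

Exact-match counting will not catch lines that vary per page, such as "Page 3 of 10"; those typically need a regex pass as well.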


Text Chunker in Machine Learning

Let's talk about text chunking -- the unsung hero of every RAG pipeline.

A text chunker is a preprocessing component that partitions long documents into smaller, semantically coherent segments suitable for embedding and retrieval. In retrieval-augmented generation (RAG) systems, chunking determines the granularity at which knowledge is indexed and retrieved. Get it wrong and everything downstream suffers: too-small chunks lack the context an LLM needs to generate a useful answer; too-large chunks conflate semantically distinct content into a single vector, and retrieval precision tanks.

The chunker sits at a critical junction -- after document ingestion but before embedding -- and it defines the retrieval unit that every downstream component operates on. Think of it as the decision that echoes through your entire pipeline.

Modern chunking strategies range from naive fixed-size splitting to context-aware methods that preserve sentence boundaries, leverage embedding models to detect semantic shifts, or maintain hierarchical parent-child relationships. The choice of strategy directly impacts retrieval recall, answer quality, and the computational cost of indexing and querying your knowledge base.

The chunker doesn't just "split text" -- it decides what your retrieval system can and cannot find. Choose wisely.

Concept Snapshot

What It Is: A preprocessing module that splits long documents into smaller, contextually coherent text segments (chunks) optimized for embedding-based retrieval in RAG pipelines.
Category: RAG Pipeline
Complexity: Intermediate
Inputs / Outputs: Input: raw documents (text, markdown, PDF extracts). Output: array of text chunks with metadata (source document ID, character offsets, chunk index).
System Placement: Sits between the document loader (upstream) and the embedding model (downstream) in a RAG ingestion pipeline.
Also Known As: document chunker, text splitter, text segmentation, passage extractor, chunk generator
Typical Users: ML engineers, RAG engineers, search/retrieval specialists, NLP engineers
Prerequisites: Document parsing, Text tokenization, Embedding models, Basic NLP
Key Terms: chunk size, chunk overlap, semantic boundary, context window, retrieval granularity, sentence window, parent-child chunks, late chunking

Why This Concept Exists

The Mismatch Problem

Here's the fundamental issue: embedding models and retrieval systems impose practical constraints that raw documents simply don't satisfy.

Transformer-based encoders have finite context windows. BERT variants typically handle 512 tokens, while modern long-context models extend to 8K-32K tokens -- but even they can't ingest an entire book or codebase in one pass. And even when it's technically feasible to squeeze a full document into one embedding, you lose granularity: the resulting vector averages over all content, making it nearly impossible to retrieve specific facts or passages.

Retrieval at Scale

Retrieval systems suffer similar problems at scale. When a user at, say, an Indian fintech company asks "What are the RBI guidelines for KYC verification?" your vector store should return the most relevant passage -- not an entire 120-page regulatory document. Downstream LLMs have their own context limits too. Feeding retrieved text into GPT-4 or Claude means staying within the prompt budget, which often means 2-4 pages of context maximum.

The Solution

The text chunker emerged as the bridge across this mismatch. By splitting documents into semantically meaningful units of appropriate size, it enables:

Embedding models to produce focused, high-quality representations
Retrieval systems to return precise passages
LLMs to consume relevant context without exceeding token limits

The alternative -- retrieving whole documents or arbitrary character ranges -- produces poor recall and wasteful context that dilutes answer quality. That's not a system anyone wants to ship to production.

Without chunking, your RAG pipeline is a search engine that returns entire books when you asked for a paragraph.

Core Intuition & Mental Model

The Central Tension: Granularity vs. Context

The core challenge in chunking is balancing two competing forces: granularity and context.

Small chunks maximize retrieval precision by isolating specific facts. BUT they risk losing surrounding context that gives meaning to the isolated passage. Large chunks preserve context but dilute the semantic focus of the embedding, reducing retrieval accuracy.

The Library Analogy

Let me give you a mental model. Imagine searching a library:

If books are the retrieval unit (large chunks), you get tons of context but must scan entire volumes to find your answer.
If sentences are the unit (tiny chunks), you find exact quotes fast but lose the surrounding explanation that makes them meaningful.
Paragraphs or subsections (medium chunks) often strike the right balance.

A 10,000-word document split into 512-token chunks gives us roughly 25-40 chunks, depending on how many tokens your tokenizer produces per word (about 1.3 for English prose, closer to 2 for code or non-Latin scripts). That's a few dozen focused vectors in your embedding space, each one retrievable independently. But what happens when a critical piece of context gets split across two chunks? That's precisely the problem overlap and advanced strategies try to solve.
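The chunk count for a given document is easy to estimate with a little arithmetic. The sketch below is a back-of-the-envelope helper, not a tokenizer: the `tokens_per_word` ratios (1.3 for English prose, 2.0 for token-dense text) are rules of thumb I'm assuming here, and the function name is made up for illustration.

```python
import math


def estimate_chunk_count(num_words, chunk_size, overlap=0, tokens_per_word=1.3):
    """Estimate how many chunks a document yields; tokens_per_word is an assumption."""
    total_tokens = math.ceil(num_words * tokens_per_word)
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # The first chunk covers chunk_size tokens; each further chunk adds `stride` new tokens.
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


for ratio in (1.3, 2.0):
    n = estimate_chunk_count(10_000, chunk_size=512, tokens_per_word=ratio)
    print(f"{ratio} tokens/word -> {n} chunks")
```

The spread between the two ratios is a good reminder to measure with the real tokenizer before sizing a vector store.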

How Strategies Differ

Chunking strategies differ in how they define boundaries:

Fixed-size methods prioritize simplicity and uniform memory footprint.
Semantic methods attempt to detect topic shifts and break at natural boundaries -- the end of a paragraph, a heading, or a sentence that signals a new subject.
Hierarchical methods maintain both: small chunks for precise retrieval with links to larger parent chunks that provide context to the LLM.

It's okay if these distinctions feel blurry right now -- we'll walk through each one in detail soon.

The chunker doesn't interpret meaning. It applies heuristics (character counts, sentence delimiters, embedding similarity) to approximate semantic coherence. The quality ceiling is set by how well these heuristics align with the true information structure of your corpus.

Technical Foundations

The Math Behind Chunking

Let's formalize what a text chunker actually does. Don't worry -- I'll explain the intuition before dropping any formulas.

A text chunker is simply a function that takes a document and produces an ordered list of text segments. More formally, it implements a function $f: D \rightarrow C$ where $D$ is the set of documents and $C$ is the set of chunk sequences.

Given a document $d \in D$ with text content $T(d)$ of length $n$ characters, the chunker produces a sequence of chunks:

$C(d) = [c_1, c_2, \ldots, c_m]$

where each chunk $c_i$ is a substring of $T(d)$ satisfying:

Length constraint: $|c_i| \leq \text{max\_chunk\_size}$ (often 512-2048 characters or 128-512 tokens)
Overlap constraint: $c_i$ and $c_{i+1}$ may share an overlap region of length $o$ (typically 10-20% of chunk size) to preserve context across boundaries
Coverage: $\bigcup_i c_i \approx T(d)$, meaning the union of chunks approximates the original document (with possible small gaps or overlaps)
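The three constraints above can be checked mechanically. Here is a small sketch using a fixed-size character chunker that returns (start, end) offsets; the function name and the toy values (30-character chunks, 5-character overlap) are illustrative assumptions, not recommendations.

```python
def fixed_size_chunks(text, max_chunk_size, overlap):
    """Return (start, end) character offsets of fixed-size chunks with overlap."""
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than max_chunk_size")
    spans = []
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        spans.append((start, end))
        if end == len(text):
            break
        start = end - overlap  # step back by o characters so neighbors share a region
    return spans


text = "abcdefghijklmnopqrstuvwxyz" * 4  # 104 characters of toy content
spans = fixed_size_chunks(text, max_chunk_size=30, overlap=5)

# Length constraint: every chunk is at most max_chunk_size characters.
length_ok = all(end - start <= 30 for start, end in spans)
# Overlap constraint: each chunk starts exactly o characters before the previous one ends.
overlap_ok = all(prev[1] - cur[0] == 5 for prev, cur in zip(spans, spans[1:]))
# Coverage: the union of all spans contains every character position of T(d).
covered = set()
for start, end in spans:
    covered.update(range(start, end))
coverage_ok = covered == set(range(len(text)))

print(spans, length_ok, overlap_ok, coverage_ok)
```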

Strategy Breakdown

Chunking strategies differ in how they select chunk boundaries. Let's walk through each:

Fixed-size chunking: Split every $\text{max\_chunk\_size}$ characters (or tokens), optionally with overlap. Boundaries are arbitrary and may split sentences or words. That was pretty simple, wasn't it?

Recursive character splitting: Recursively attempt to split on paragraph delimiters ("\n\n"), then sentence delimiters (".", "!", "?"), then word boundaries, falling back to character-level splits only if necessary. This is the workhorse of most RAG applications.

Semantic chunking: Embed consecutive sentences or small text units $s_i$ as vectors $\mathbf{e}_i$ and compute the cosine similarity of each adjacent pair:

$\text{sim}(s_i, s_{i+1}) = \frac{\mathbf{e}_i \cdot \mathbf{e}_{i+1}}{\|\mathbf{e}_i\| \cdot \|\mathbf{e}_{i+1}\|}$

Insert a chunk boundary wherever similarity drops below a threshold $\tau$, indicating a topic shift.

Sentence-window chunking: Treat each sentence as a retrieval unit but embed it with surrounding sentences (e.g., $\pm 1$ sentence) to provide context. At retrieval time, the window is expanded for the LLM.

Late chunking: Embed the entire document with a long-context model, then pool the contextualized token embeddings within each sentence or paragraph span to get one vector per chunk, preserving cross-chunk context in the embeddings themselves (Günther et al., 2024). This is the newest approach and arguably the most elegant -- but also the most expensive.
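The pooling step of late chunking is simple once the token-level embeddings exist. The sketch below uses plain Python lists as a stand-in for the output of a long-context encoder and assumes mean pooling over each chunk's token span; the function names and the toy 2-dimensional vectors are made up for illustration.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk(token_embeddings, chunk_spans):
    """Pool contextualized token embeddings into one vector per chunk span.

    token_embeddings: one vector per token, produced by a single forward pass
    over the whole document (so each token has already attended to the rest).
    chunk_spans: (start, end) token indices, end exclusive.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in chunk_spans]


# Toy stand-in for encoder output: 6 tokens, 2 dimensions.
token_embeddings = [
    [1.0, 0.0], [3.0, 0.0], [2.0, 0.0],   # tokens belonging to chunk 1
    [0.0, 4.0], [0.0, 2.0], [0.0, 6.0],   # tokens belonging to chunk 2
]
chunk_vectors = late_chunk(token_embeddings, [(0, 3), (3, 6)])
print(chunk_vectors)
```

The key difference from chunk-then-embed is where the boundaries are applied: after the encoder has seen the full document, not before.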

Internal Architecture

A text chunker architecture consists of four core stages: document preprocessing (cleaning, normalization), boundary detection (where to split), chunk extraction (creating the chunks with overlap), and metadata attachment (tracking provenance and position). Optional components include a tokenizer (to enforce token-based size limits) and a semantic scorer (to evaluate chunk coherence). Let's walk through each component.


Key Components

Preprocessor

Cleans raw text -- removes excessive whitespace, normalizes line breaks, handles encoding issues, and optionally strips non-content elements (headers, footers, boilerplate). This is your first line of defense against noisy embeddings.

Boundary Detector

Identifies candidate split points based on the chosen strategy: character offsets (fixed-size), delimiter matches (recursive), or semantic similarity drops (semantic chunking). This is where the strategy decision materializes into actual boundaries.

Chunk Extractor

Extracts substrings between boundaries, optionally adding overlap by extending chunks backward by $o$ characters or tokens. Ensures no chunk exceeds max size. Think of it as the scissors that actually cut the text.

Tokenizer (Optional)

Counts tokens (not just characters) to ensure chunks fit embedding model context windows. Uses the same tokenizer as the downstream embedding model. This is essential -- a 1000-character chunk can be anywhere from 200 to 500 tokens depending on content.

Semantic Scorer (Optional)

For semantic chunking, embeds consecutive text units and computes cosine similarity between adjacent embeddings to detect topic boundaries. Typically uses a lightweight model like all-MiniLM-L6-v2 (~80MB).

Metadata Annotator

Attaches metadata to each chunk: source document ID, character start/end offsets, chunk index, and optionally parent chunk IDs (for hierarchical strategies). Without this, you can't trace retrieved chunks back to source documents -- and citation becomes impossible.
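One way to represent the annotated output is a small record type that carries the chunk text together with its provenance. This is a sketch under my own naming: the `Chunk` dataclass, the `annotate` helper, and the document ID `rbi-kyc-2024` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    text: str
    source_document_id: str
    chunk_index: int
    char_start: int                         # inclusive offset into the source text
    char_end: int                           # exclusive offset into the source text
    parent_chunk_id: Optional[str] = None   # only used by hierarchical strategies


def annotate(document_id, text, spans):
    """Turn (start, end) character spans into Chunk records with provenance."""
    return [
        Chunk(text[start:end], document_id, index, start, end)
        for index, (start, end) in enumerate(spans)
    ]


source = "KYC must be completed before onboarding. Records are retained for five years."
first_end = source.index(".") + 1  # end of the first sentence
chunks = annotate("rbi-kyc-2024", source, [(0, first_end), (first_end + 1, len(source))])

# Provenance check: the stored offsets must reproduce the chunk text from the source.
for chunk in chunks:
    assert source[chunk.char_start:chunk.char_end] == chunk.text
print(chunks[1])
```

Storing offsets rather than only the text is what makes citation and highlighting possible later.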

Data Flow

Raw document -> Preprocessor -> Boundary Detector -> Chunk Extractor (applies overlap) -> Tokenizer (validates size) -> Metadata Annotator -> Output chunks with metadata. For semantic strategies, the Boundary Detector queries the Semantic Scorer to evaluate candidate split points before finalizing boundaries.

A linear pipeline: Document Loader -> Text Chunker (with internal stages: Preprocess -> Detect Boundaries -> Extract Chunks -> Annotate Metadata) -> Embedding Model -> Vector Store.

How to Implement

Implementation patterns depend on the chosen strategy. Fixed-size and recursive character splitting require only string manipulation and regex -- you can get a solid chunker running in under 50 lines of Python. Semantic chunking requires an embedding model (often a lightweight SBERT model) to compute sentence similarities. Late chunking requires a long-context embedding model that supports post-hoc splitting.

Libraries like LangChain, LlamaIndex, and Haystack provide pre-built chunkers that work out of the box. However, production systems often customize these for domain-specific boundary rules -- for instance, a legal document chunker at an Indian law firm might split on section headers defined in the Indian IT Act format.

Let's look at concrete implementations.

Recursive Character Text Splitter (LangChain)

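# Recent LangChain releases ship the splitters in the langchain-text-splitters package;
# older releases used: from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter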

# Define separators in priority order
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,          # max chunk size in characters
    chunk_overlap=200,        # overlap between consecutive chunks
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # try paragraph, then sentence, then word boundaries
)

document_text = "..."  # long document text
chunks = text_splitter.split_text(document_text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} chars\n{chunk[:100]}...\n")

Recursive character splitting attempts to split at natural boundaries (paragraphs, sentences) before falling back to character-level splits. The 200-character overlap ensures that context spanning chunk boundaries is not lost. This strategy balances simplicity with respect for document structure and is the default for most RAG applications.

I'd recommend starting here for any new project. It handles 80% of use cases well, and you can always upgrade to semantic chunking later if retrieval quality demands it.
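If you want to see the idea without a library, here is a simplified pure-Python sketch of recursive splitting. It is not LangChain's implementation: it has no overlap, it drops the separator at chunk boundaries, and the function name `recursive_split` is my own.

```python
def recursive_split(text, max_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_size characters, coarse separators first."""
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character-level cuts.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_size:
            current = candidate  # keep merging neighbors while they fit
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_size:
            current = part
        else:
            # This piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(part, max_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Refunds are processed within 7 days. Contact support for delays.\n\n"
    "Chargebacks follow a separate process. They may take up to 45 days."
)
pieces = recursive_split(doc, max_size=70)
for piece in pieces:
    print(repr(piece))
```

With a 70-character limit the two paragraphs stay intact; with a smaller limit the function falls back to sentence and then word boundaries before it ever cuts mid-word.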

Semantic Chunking with Sentence Embeddings

from sentence_transformers import SentenceTransformer
import numpy as np
import re

model = SentenceTransformer('all-MiniLM-L6-v2')  # lightweight SBERT model

def semantic_chunking(text, similarity_threshold=0.5):
    # Split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text)
    if len(sentences) < 2:
        return [text]
    
    # Embed all sentences
    embeddings = model.encode(sentences)
    
    # Compute pairwise cosine similarity between consecutive sentences
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = np.dot(embeddings[i], embeddings[i+1]) / (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i+1]))
        similarities.append(sim)
    
    # Insert chunk boundaries where similarity drops below threshold
    chunks = []
    current_chunk = [sentences[0]]
    for i, sim in enumerate(similarities):
        if sim < similarity_threshold:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i+1]]
        else:
            current_chunk.append(sentences[i+1])
    chunks.append(' '.join(current_chunk))
    
    return chunks

chunks = semantic_chunking(document_text, similarity_threshold=0.6)

Semantic chunking embeds each sentence and measures similarity between adjacent pairs. When similarity drops significantly, it signals a topic shift and a chunk boundary is inserted. This produces variable-length chunks that respect semantic coherence.

However, there's a catch: it's computationally more expensive because you're running an embedding model at chunking time, not just at indexing time. For a corpus of 100,000 documents averaging 5,000 words each, that's roughly 25 million sentence embeddings just for chunking (assuming about 20 words per sentence) -- which could cost around $50-100 (approximately Rs. 4,000-8,500) in compute on a cloud GPU.

Sentence-Window Retrieval (LlamaIndex)

# Import paths below are for llama-index >= 0.10; older releases imported from
# `llama_index` directly and configured the parser through ServiceContext.
from llama_index.core import Document, Settings
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Create documents
long_text = "..."  # long document text
documents = [Document(text=long_text)]

# Configure sentence window parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,           # keep 3 sentences on each side as context metadata
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)

# Make this parser the default for index construction
Settings.node_parser = node_parser

# At retrieval time, replace the retrieved sentence with its full window context
postprocessor = MetadataReplacementPostProcessor(target_metadata_key="window")

# Index and query as usual; the postprocessor injects context

Sentence-window chunking stores individual sentences as retrieval units and keeps each sentence's surrounding context (the 'window') alongside it as metadata. In this LlamaIndex setup, only the sentence itself is embedded; at retrieval time, the retrieved sentence is replaced with its full window before being passed to the LLM. This maximizes retrieval precision (you're matching against a focused sentence) while ensuring the LLM receives adequate context (it gets the surrounding sentences too).

This is one of my favorite strategies for documentation and knowledge base use cases. It's elegant and surprisingly effective.
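The data structure behind this pattern is small enough to sketch without a framework. The example below builds one record per sentence plus its window; the function name, the record keys, and the regex-based sentence splitter are simplifying assumptions (a production system would use a proper sentence tokenizer).

```python
import re


def sentence_windows(text, window_size=1):
    """Return one record per sentence together with its surrounding window."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        records.append({
            "sentence": sentence,                   # the unit matched at retrieval time
            "window": " ".join(sentences[lo:hi]),   # what is handed to the LLM
        })
    return records


text = (
    "Refunds are issued via the API. Call the refund endpoint with a charge ID. "
    "Partial refunds need an amount. Refunds settle in 5-10 days."
)
records = sentence_windows(text, window_size=1)
print(records[1]["sentence"])
print(records[1]["window"])
```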

Configuration Example

# Example chunking config (YAML)
chunking:
  strategy: recursive_character
  chunk_size: 1000          # measured by length_function (tokens here)
  chunk_overlap: 200        # same unit as chunk_size
  separators:
    - "\n\n"                # paragraph
    - "\n"                  # line break
    - ". "                  # sentence
    - " "                   # word
  length_function: token_count  # use tokenizer, not len()
  tokenizer_model: "text-embedding-3-small"
  metadata:
    - source_document_id
    - chunk_index
    - char_start
    - char_end

Common Implementation Mistakes

Using character-based chunk size when the embedding model enforces token limits -- a 1000-character chunk may exceed 512 tokens depending on tokenization, causing silent truncation or errors. Always measure in tokens.
Setting zero overlap, which causes sentences or thoughts spanning chunk boundaries to be split, losing critical context for retrieval. Even 50-100 tokens of overlap makes a significant difference.
Chunking before removing boilerplate (headers, footers, page numbers), causing chunks to contain non-content text that pollutes embeddings. Clean first, chunk second -- always.
Failing to validate chunk size distribution -- if most chunks hit the max size, boundaries are likely splitting mid-sentence or mid-word, degrading quality. Plot the histogram!
Using semantic chunking on highly technical or domain-specific text (e.g., Indian patent filings or Ayurvedic medical texts) without a domain-adapted sentence embedding model, leading to poor boundary detection.
Not storing chunk metadata (source doc ID, offsets) -- makes it impossible to trace retrieved chunks back to original documents or implement citation. This will come back to bite you in production.
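Checking the chunk size distribution takes only a few lines. The sketch below reports what share of chunks hit the size cap and prints a coarse text histogram; the helper name `size_report` and the sample lengths are hypothetical, and in practice you would feed in real token counts from your chunker.

```python
from collections import Counter


def size_report(chunk_lengths, max_size, bucket=100):
    """Summarize chunk lengths: share at the cap plus a coarse text histogram."""
    at_cap = sum(1 for n in chunk_lengths if n >= max_size)
    buckets = Counter((n // bucket) * bucket for n in chunk_lengths)
    lines = [
        f"{start:>5}-{start + bucket - 1:<5} {'#' * count}"
        for start, count in sorted(buckets.items())
    ]
    return at_cap / len(chunk_lengths), "\n".join(lines)


# Hypothetical token counts for ten chunks produced with max_size=512.
lengths = [512, 512, 498, 512, 130, 512, 505, 512, 512, 77]
share_at_cap, histogram = size_report(lengths, max_size=512)
print(f"{share_at_cap:.0%} of chunks hit the size cap")
print(histogram)
```

A large share of chunks sitting exactly at the cap is a hint that the splitter is being forced into hard cuts and that the separators or the size limit deserve another look.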

When Should You Use This?

Use When

Your documents exceed the embedding model's context window (typically >512 tokens for older models, >8K tokens for modern long-context models)
You are building a RAG system that needs precise passage-level retrieval rather than whole-document retrieval
Your LLM has a limited context window and can only consume 2-4 pages of retrieved text at inference time
You need to maintain provenance -- tracking which specific passage in a document was used for answer generation
Your corpus contains long-form content (reports, books, government documents, legal contracts) where different sections address different topics

Avoid When

Your documents are already short (e.g., tweets, product titles, single paragraphs) -- chunking adds no value and increases complexity
You are using a long-context embedding model (e.g., Jina AI's 8K model) on documents that fit entirely within the context window
Your retrieval task requires understanding relationships across the entire document (e.g., detecting contradictions between Section 3 and Section 7, or full-document summarization) -- chunking loses global structure
You are performing document-level classification or clustering rather than passage-level retrieval

Key Tradeoffs

The core tradeoff is precision vs. context.

Small chunks (128-256 tokens) maximize retrieval precision by isolating facts but risk losing surrounding explanation. Large chunks (1024-2048 tokens) preserve context but reduce precision as the embedding averages over more content.

Overlap adds redundancy -- higher storage and embedding costs -- but reduces the risk of boundary-related context loss. With 20% overlap, each chunk advances by only 80% of its length, so a corpus that would produce 100,000 non-overlapping chunks produces about 125,000: that's 25,000 extra chunks to embed and store. At an illustrative $0.0001 per embedding (roughly Rs. 0.0085), that's about $2.50 (roughly Rs. 210) in extra embedding cost -- usually worth it.
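The overlap overhead can be worked out from the stride. The sketch below assumes chunks advance by (1 - overlap) of their length, which gives a somewhat higher figure than simply taking the overlap percentage of the chunk count; the per-embedding price is the illustrative figure from the paragraph above, and the function name is my own.

```python
def overlap_overhead(base_chunks, overlap_fraction, cost_per_embedding):
    """Estimate extra chunks and extra embedding cost caused by overlap."""
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap_fraction must be in [0, 1)")
    # Each chunk advances by (1 - overlap) of its length,
    # so the chunk count grows by a factor of 1 / (1 - overlap).
    total_chunks = round(base_chunks / (1 - overlap_fraction))
    extra_chunks = total_chunks - base_chunks
    return extra_chunks, extra_chunks * cost_per_embedding


extra, cost = overlap_overhead(100_000, 0.20, cost_per_embedding=0.0001)
print(f"{extra} extra chunks, about ${cost:.2f} in extra embedding cost")
```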

Semantic strategies improve boundary quality but require additional computation (embedding at chunking time) compared to fixed-size or recursive methods. Late chunking gives the best boundary-context tradeoff but demands expensive long-context models.

Alternatives & Comparisons

Full Document Retrieval

Instead of chunking, embed entire documents and retrieve whole documents. Simpler, but only viable when documents are short (<512 tokens) or when downstream tasks (summarization, classification) require full-document context. For long documents with diverse content, full-document retrieval produces low precision -- you'll retrieve a 50-page annual report when you only needed one paragraph about revenue growth.

Hierarchical Indexing (Parent-Child Chunks)

Index both small chunks (for precise retrieval) and large parent chunks (for LLM context). At retrieval time, fetch the small chunk but return its parent to the LLM. This is an enhancement to chunking, not a replacement -- it provides the best of both precision and context at the cost of increased storage and indexing complexity. We'll cover this in detail in the hierarchical indexing block.
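Here is a toy sketch of the parent-child pattern. Sentences act as child chunks, groups of sentences act as parents, and keyword overlap stands in for vector similarity so the example runs without an embedding model. All names (`build_index`, `retrieve_with_parent`, the `parent-N` IDs) are illustrative assumptions.

```python
import re


def build_index(sentences, sentences_per_parent):
    """Group sentences into parent chunks and index each sentence as a child."""
    parents, children = {}, []
    for start in range(0, len(sentences), sentences_per_parent):
        parent_id = f"parent-{start // sentences_per_parent}"
        group = sentences[start:start + sentences_per_parent]
        parents[parent_id] = " ".join(group)
        children.extend({"text": s, "parent_id": parent_id} for s in group)
    return parents, children


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def retrieve_with_parent(query, parents, children):
    """Match the query against small children, but return the larger parent."""
    # Keyword overlap is a stand-in for embedding similarity in this toy example.
    best = max(children, key=lambda c: len(words(query) & words(c["text"])))
    return best["text"], parents[best["parent_id"]]


sentences = [
    "Refunds can be issued from the dashboard.",
    "A refund takes 5-10 days to settle.",
    "Disputes are opened by the cardholder.",
    "Evidence must be submitted within 7 days.",
]
parents, children = build_index(sentences, sentences_per_parent=2)
child, parent = retrieve_with_parent("how long does a refund take to settle", parents, children)
print("matched child:", child)
print("context for LLM:", parent)
```

The match happens on the small unit, while the LLM receives the larger one, which is exactly the precision-plus-context split described above.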

Proposition-Based Chunking

Use an LLM to extract atomic propositions (self-contained factual statements) from documents and index those. This produces maximally precise retrieval units. BUT it's expensive -- requiring LLM inference at indexing time. For a 100,000-document corpus, proposition extraction could cost $500-1000 (Rs. 42,000-85,000) in API calls. Best reserved for high-value corpora where retrieval quality justifies the cost.

Pros, Cons & Tradeoffs

Advantages

Enables embedding models to produce focused, high-quality representations by fitting within context windows -- a direct requirement for accurate vector search
Increases retrieval precision by returning specific relevant passages rather than entire documents, so users get answers fast
Reduces LLM context consumption -- only relevant chunks are sent to the generator, lowering inference cost (which matters when you're paying $15/M tokens for GPT-4o, roughly Rs. 1,275/M tokens)
Provides fine-grained provenance -- each retrieved chunk can be traced back to its source document and character offset, enabling proper citation
Flexible -- strategy and chunk size can be tuned per corpus to balance precision, context, and cost

Disadvantages

Introduces a lossy transformation -- relationships spanning chunk boundaries are weakened or lost, and no amount of overlap fully fixes this
Adds preprocessing latency and storage overhead (especially with overlap or hierarchical structures)
Incorrect chunk boundaries (e.g., splitting mid-sentence) degrade embedding quality and retrieval recall -- and these errors are silent, making them hard to debug
Requires tuning (chunk size, overlap, strategy) that is corpus-dependent and often discovered through trial-and-error on evaluation sets
Semantic strategies add embedding cost at indexing time, increasing total preprocessing compute

Reduce chunk size to 512-1024 tokens. If context loss is a concern, implement parent-child retrieval or sentence-window strategy instead of simply increasing chunk size. Don't try to solve a boundary problem by making boundaries disappear -- solve it with smarter boundary strategies.

Placement in an ML System

In a RAG ingestion pipeline, the text chunker sits immediately after the document loader has extracted raw text from files. It transforms long documents into chunk sequences suitable for embedding. These chunks flow to the embedding model, which produces vector representations, and then to the vector store for indexing.

The chunker defines the retrieval granularity for the entire system. The decision made here propagates through every downstream component and directly determines retrieval precision and answer quality. As I like to say: your RAG system is only as good as your chunks.

Document Loader -> Text Chunker -> Embedding Model -> Vector Store -> Semantic Search -> Hybrid Search -> Re-Ranker -> Context Assembler

Pipeline Stage

Data Ingestion / Preprocessing

Upstream

Document Loader
PDF Parser
Web Scraper

Downstream

Embedding Model
Vector Store
Metadata Extractor

Scaling Bottlenecks

For fixed-size and recursive strategies, chunking is CPU-bound and scales linearly with corpus size -- parallelization is straightforward, and you can process millions of documents on a modest machine. Semantic chunking, however, adds GPU/embedding cost that scales with $\text{number of documents} \times \text{sentences per document}$. For a corpus of 1 million documents averaging 100 sentences each, that's 100 million sentence embeddings just for chunking. On an A100 GPU, this takes roughly 8-10 hours; on OpenAI's API at $0.0001/1K tokens, it could cost $200-400 (Rs. 17,000-34,000). Late chunking requires long-context embedding models, which are even slower. Storage scales linearly with chunk count; 20% overlap increases storage by roughly 25%, since each chunk advances by only 80% of its length.

Production Case Studies

Anthropic -- AI Research & Development

Anthropic's internal documentation retrieval system for Claude development uses a hierarchical chunking strategy. Technical documentation is split into 512-token chunks for embedding and retrieval, but each chunk maintains a reference to a 2048-token parent chunk. At query time, the system retrieves the precise 512-token chunk but sends the 2048-token parent to the LLM for answer generation. This ensures sufficient context without sacrificing retrieval precision.

This is a textbook example of the parent-child pattern we discussed earlier -- precision at retrieval time, context at generation time.

Outcome:

Improved answer quality by 23% (as measured by human eval) compared to a baseline using 1024-token chunks without hierarchy, while reducing average prompt tokens by 18% by avoiding retrieval of full documents.

Hebbia -- Enterprise Search (Financial Services)

Hebbia, a RAG-powered search engine for financial documents, uses late chunking for analyst reports and SEC filings. Documents are embedded with a 32K-context model, then split at section boundaries post-embedding. This preserves cross-section context in the embeddings -- for example, a forward-looking statement referencing earlier financial data retains that contextual link even after chunking.

This is particularly important in finance where a number in Section 5 might only make sense with the methodology described in Section 2. Indian SEBI filings and annual reports from BSE/NSE-listed companies have similar cross-referencing patterns.

Outcome:

Achieved 91% recall on a proprietary financial Q&A benchmark, a 12-point improvement over recursive character splitting, attributed to better handling of cross-references and forward references in financial narratives.
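The pooling step of late chunking can be sketched in a few lines. This assumes the token embeddings were already produced by running a long-context embedding model over the whole document (hardcoded toy vectors here) and that section boundaries are available as token offsets; the function name `late_chunk` is hypothetical, and mean pooling is one common choice rather than the only one.

```python
def late_chunk(token_embeddings, boundaries):
    """Mean-pool contextual token embeddings over each (start, end) token span."""
    chunk_vectors = []
    for start, end in boundaries:
        span = token_embeddings[start:end]
        if not span:
            raise ValueError(f"empty span {(start, end)}")
        dim = len(span[0])
        chunk_vectors.append([sum(vec[d] for vec in span) / len(span) for d in range(dim)])
    return chunk_vectors


# Toy 2-dimensional token embeddings for a 4-token "document" (assumed model output).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
section_boundaries = [(0, 2), (2, 4)]  # two sections, given as token offsets
print(late_chunk(token_embeddings, section_boundaries))
```

Because every token vector was computed with the full document in view, each pooled chunk vector carries context from the other sections, which is what distinguishes this from embedding each section separately.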

Stripe

Payments / FinTech

Stripe's documentation site uses sentence-window chunking for its AI-powered support assistant. Each sentence in API documentation is stored as a retrieval unit, but embedded with +/-2 sentences of context. When a developer asks 'how do I refund a charge?', the system retrieves the specific sentence about refunds but returns the surrounding sentences (including code examples) to GPT-4 for answer synthesis.

This pattern works beautifully for technical documentation where code snippets and explanations are interleaved. Payment platforms in India like Razorpay and PhonePe could adopt the same approach for their developer docs.

Outcome:

Support ticket deflection increased by 19% after deploying the assistant, with 87% of users rating answers as 'helpful or very helpful' -- significantly higher than the 68% rating achieved by a baseline system using 1024-character chunks without context windows.
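A sentence-window index along these lines can be sketched with the standard library. This is an illustration under stated assumptions, not the system described above: the regex sentence splitter is naive (it breaks on abbreviations and code), the function name is hypothetical, and which field gets embedded is a design choice left open here. In this sketch the `window` field is what would be handed to the LLM.

```python
import re


def sentence_windows(text, window=2):
    """Return one record per sentence, with the surrounding +/- `window` sentences as context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        records.append({"sentence": sentence, "window": " ".join(sentences[lo:hi])})
    return records


docs = (
    "Create a charge with the charges endpoint. Capture it later if needed. "
    "Refund a charge with the refunds endpoint. Partial refunds are allowed. "
    "Refunds settle in a few days."
)
records = sentence_windows(docs, window=1)
print(records[2]["sentence"])
print(records[2]["window"])
```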

Tooling & Ecosystem

LangChain Text Splitters

Python | Open Source

Collection of text splitters including RecursiveCharacterTextSplitter, TokenTextSplitter, and MarkdownHeaderTextSplitter. Supports chunk size, overlap, and custom separators. Well-integrated with LangChain's document loaders and vector stores.

LlamaIndex Node Parsers

Python | Open Source

Node parsers for chunking including SimpleNodeParser, SentenceWindowNodeParser, and SemanticSplitterNodeParser. Focuses on hierarchical and context-preserving strategies. Excellent for advanced RAG patterns.

Haystack Document Splitters

Python | Open Source

Provides recursive splitters with respect for document structure (paragraphs, sentences). Integrates with Haystack's pipeline abstraction for preprocessing and indexing workflows.

tiktoken

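Python / Rust | Open Source

OpenAI's fast tokenizer library. Essential for token-based chunking to ensure chunks respect model context windows. Ships the encodings used by OpenAI models (for example, cl100k_base for GPT-4, GPT-3.5 Turbo, and text-embedding-ada-002; p50k_base for Codex and text-davinci-002/003; r50k_base for the original GPT-3 models).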

Sentence-Transformers

Python | Open Source

Library for sentence embeddings, essential for semantic chunking. Provides lightweight models (all-MiniLM-L6-v2, ~80MB) suitable for embedding sentences during chunking to detect topic boundaries.

Unstructured

Python | Open Source

Document parsing library with built-in chunking capabilities. Handles PDF, DOCX, HTML and preserves document structure (headings, tables). Good for chunking after extraction from complex document formats.

Research & References

Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Günther et al. (2024), arXiv preprint

Introduced late chunking, which embeds entire documents with long-context models then splits embeddings at chunk boundaries. Preserves cross-chunk context in representations, achieving 4-8% recall improvements on multi-hop retrieval tasks compared to chunking-then-embedding.

Dense X Retrieval: What Retrieval Granularity Should We Use?

Chen et al. (2023), arXiv preprint

Systematically evaluated retrieval granularity (proposition, sentence, passage) across multiple benchmarks. Found that proposition-level retrieval (atomic factual statements) achieves best precision but at high extraction cost; 256-token passages provide the best precision-cost tradeoff for most tasks.

Financial Report Chunking for Effective Retrieval Augmented Generation

Jimeno Yepes et al. (2024), arXiv preprint

Analyzed chunking strategies for financial documents (10-K filings, analyst reports). Found that section-aware chunking (respecting 10-K section boundaries) outperformed fixed-size by 15% on domain-specific Q&A, and that chunk sizes of 512-768 tokens optimized retrieval-generation tradeoffs.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel & Kiela (2020), NeurIPS 2020

Foundational RAG paper that combined a dense passage retriever with a sequence-to-sequence generator. Used 100-word Wikipedia passages as retrieval units, demonstrating that passage-level granularity (rather than document-level) is critical for retrieval quality in open-domain QA.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers & Gurevych (2019), EMNLP 2019

Introduced efficient sentence embeddings via siamese BERT networks. Enables semantic chunking by providing fast, high-quality embeddings for boundary detection. The all-MiniLM-L6-v2 model derived from this work is widely used in semantic chunking implementations.

TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages

Hearst (1997), Computational Linguistics, Vol. 23, No. 1

Classic algorithm for unsupervised text segmentation based on lexical cohesion. Detects topic boundaries by analyzing term overlap in adjacent text blocks. Though predating embeddings, the core intuition -- detect shifts in term distribution to find boundaries -- informs modern semantic chunking methods.
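The core intuition (low term overlap between adjacent blocks signals a topic shift) can be illustrated with a simplified sketch. This is not the full TextTiling algorithm, which scores the depth of valleys in a smoothed similarity curve; here a fixed cosine threshold stands in for that step, there is no stemming or stop-word removal, and the function names, block size, and threshold are all assumptions chosen for the toy example.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def lexical_boundaries(sentences, block=2, threshold=0.1):
    """Return gap indices i (a boundary before sentences[i]) where adjacent blocks share few terms."""
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    boundaries = []
    for i in range(1, len(sentences)):
        left = sum(bags[max(0, i - block):i], Counter())   # terms in the block before the gap
        right = sum(bags[i:i + block], Counter())          # terms in the block after the gap
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


sentences = [
    "Refunds are issued to the original card.",
    "Card refunds usually settle within five days.",
    "Webhooks notify your server about events.",
    "Your server should verify webhook events.",
]
print(lexical_boundaries(sentences))
```

Embedding-based semantic chunkers follow the same shape, with sentence embeddings replacing the bag-of-words vectors.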

Lost in the Middle: How Language Models Use Long Contexts

Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni & Liang (2023), arXiv preprint

Showed that LLM accuracy follows a U-shaped curve over the position of relevant information in long contexts -- models use information at the beginning and end well but information in the middle poorly. Motivates careful chunk ordering and size selection: retrieved chunks should be small enough that relevant info appears near context boundaries, not buried in the middle.
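One way to act on this finding is to reorder retrieved chunks so the strongest ones sit at the edges of the prompt. The sketch below is a simple heuristic under the assumption that the input list is already sorted most-relevant first; the function name is hypothetical and this is not a method taken from the paper itself.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Place the most relevant chunks at the start and end, least relevant in the middle.

    Input is ordered most-relevant first.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the top retrieval hit
print(reorder_for_long_context(ranked))
```

With five chunks, the top hit opens the context, the second-best closes it, and the weakest lands in the middle.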

Precise Zero-Shot Dense Retrieval without Relevance Labels

Gao, Ma, Lin & Callan (2022), ACL 2023

Introduced HyDE (Hypothetical Document Embeddings), which has an LLM generate a hypothetical answer document, embeds it, and retrieves real documents near that embedding. Relevant to chunking: it suggests that retrieval granularity should match the expected answer length -- if answers are short, chunks should be short; if answers require explanation, chunks should be longer.

Interview & Evaluation Perspective

Common Interview Questions

● How would you design a chunking strategy for a RAG system serving legal contracts?
● What are the tradeoffs between fixed-size chunking and semantic chunking?
● Explain how chunk size affects retrieval precision and context quality.
● How would you handle documents with hierarchical structure (chapters, sections, subsections)?
● When would you use sentence-window retrieval instead of standard chunking?

Key Points to Mention

● Chunk size is a fundamental tradeoff: small chunks maximize retrieval precision but lose context; large chunks preserve context but reduce precision. Always frame this as a spectrum, not a binary choice.
● Overlap (10-20% of chunk size) is critical to prevent context loss at boundaries -- mention specific numbers, e.g., '200-token overlap on a 1000-token chunk'
● Token-based sizing (not character-based) is essential to respect embedding model context windows. This is a detail that separates senior candidates from juniors.
● Semantic strategies improve boundary quality but add embedding cost at indexing time -- quantify this: 'semantic chunking roughly doubles ingestion compute cost'
● Hierarchical (parent-child) chunking provides the best of both precision and context at the cost of storage and complexity -- this is the answer interviewers are usually looking for
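The '200-token overlap on a 1000-token chunk' example can be made concrete with a sliding-window sketch. It operates on a list of token IDs, which is assumed to come from the same tokenizer as your embedding model; integers stand in for real token IDs here, and the function name is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=200):
    """Slide a window of `chunk_size` tokens forward by `chunk_size - overlap` each step."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = list(range(2600))  # stand-in token IDs for a 2,600-token document
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # consecutive chunks share 200 tokens
```

Because sizes are counted in tokens rather than characters, no chunk can exceed the limit you set, regardless of script or how much code the text contains.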

Pitfalls to Avoid

● Claiming one chunk size is universally optimal -- it is corpus-dependent and task-dependent. 512 tokens works for many cases but must be validated empirically.
● Ignoring tokenization -- using character counts when the embedding model enforces token limits leads to silent truncation that's extremely hard to debug
● Over-engineering with semantic chunking when recursive character splitting would suffice for the use case. Start simple and upgrade only when metrics justify it.
● Failing to mention metadata (source doc ID, offsets) -- essential for provenance and citation in production systems. If you forget this, the interviewer will question your production experience.
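To make the metadata point concrete, here is a minimal sketch of chunk records that carry a source document ID and character offsets. The `ChunkRecord` schema and the paragraph-based splitter are hypothetical and kept deliberately simple; a production schema would likely add fields such as page number, section title, or ingestion timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    doc_id: str   # source document identifier
    start: int    # character offset in the source text (inclusive)
    end: int      # character offset in the source text (exclusive)
    text: str


def paragraphs_with_provenance(doc_id, text):
    """Split on blank lines and record where each paragraph sits in the source."""
    records, cursor = [], 0
    for para in text.split("\n\n"):
        stripped = para.strip()
        if stripped:
            start = text.index(stripped, cursor)  # search from the end of the previous chunk
            end = start + len(stripped)
            records.append(ChunkRecord(doc_id, start, end, stripped))
            cursor = end
    return records


source = "Refunds are issued within 5 days.\n\nDisputes must be filed within 60 days."
for record in paragraphs_with_provenance("policy-001", source):
    print(record.doc_id, record.start, record.end, repr(record.text))
```

The offsets let a citation UI highlight the exact passage in the original document, and they make it possible to re-derive a chunk after the fact when debugging retrieval.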

Senior-Level Expectation

A senior candidate should discuss the full ingestion pipeline -- document cleaning, chunking strategy selection (with quantitative justification like 'we benchmarked 256, 512, and 1024-token chunks on our eval set'), token-based sizing, metadata schema, and validation (e.g., chunk size distribution analysis, embedding quality checks).

They should be able to propose hierarchical or late chunking strategies for high-value corpora and explain when the added complexity is justified -- not just how it works, but why you'd choose it.

They should also mention monitoring in production: track chunk size distribution over time, measure retrieval recall on evaluation sets, and detect when chunk boundaries degrade after corpus updates (e.g., new document formats that break your separator assumptions).
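A chunk size distribution check of the kind mentioned above might look like the following sketch. The thresholds (a 512-token limit and a 32-token floor for suspiciously small chunks) and the function name are assumptions for illustration; the token counts would come from your own tokenizer.

```python
import statistics


def chunk_size_report(token_counts, max_tokens=512, min_tokens=32):
    """Summarize chunk token counts and flag chunks outside the expected range."""
    if not token_counts:
        raise ValueError("no chunks to report on")
    return {
        "count": len(token_counts),
        "mean": statistics.mean(token_counts),
        "median": statistics.median(token_counts),
        "max": max(token_counts),
        "over_limit": sum(1 for n in token_counts if n > max_tokens),  # would be truncated
        "tiny": sum(1 for n in token_counts if n < min_tokens),        # likely fragments
    }


report = chunk_size_report([500, 480, 512, 20, 700, 300])
print(report)
```

Tracking `over_limit` and `tiny` per ingestion run is a cheap way to notice when a new document format breaks the separator assumptions.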

Summary

Key Takeaways

A text chunker splits long documents into smaller, semantically coherent segments suitable for embedding and retrieval in RAG pipelines. It defines the retrieval granularity for the entire system -- get it right and everything downstream benefits.
The core tradeoff is precision vs. context: small chunks (128-256 tokens) maximize retrieval precision but risk losing surrounding explanation; large chunks (1024-2048 tokens) preserve context but dilute embedding focus.
Chunk size should be token-based (not character-based) to respect embedding model context windows. A starting point of 512-1024 tokens with 10-20% overlap works for most corpora.
Recursive character splitting (splitting on paragraphs, then sentences, then words) balances simplicity with respect for document structure. It's the default for most applications -- start here.
Semantic chunking detects topic boundaries using sentence embeddings but adds embedding cost at chunking time. Late chunking preserves cross-chunk context using long-context models but is the most expensive approach.
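The recursive splitting idea from the takeaways above can be sketched in plain Python. This is a simplified illustration, not the implementation of any particular library: size is measured in whitespace-separated words as a stand-in for tokens, the separator list is an assumption, and separators that fall on chunk edges are dropped rather than preserved.

```python
def recursive_split(text, max_words=50, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones only for oversized pieces."""
    if len(text.split()) <= max_words or not separators:
        return [text.strip()] if text.strip() else []
    head, *rest = separators
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = f"{current}{head}{piece}" if current else piece
        if len(candidate.split()) <= max_words:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current.strip())
        if len(piece.split()) > max_words:
            # This piece alone is too big: fall back to the next, finer separator.
            chunks.extend(recursive_split(piece, max_words, tuple(rest)))
            current = ""
        else:
            current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


text = "Alpha beta gamma.\n\nOne two three four. Five six seven eight."
print(recursive_split(text, max_words=5))
```

Paragraph breaks are tried first, so a chunk only gets cut mid-paragraph when the paragraph itself exceeds the limit.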

The text chunker defines the retrieval unit for every downstream component. Its design determines retrieval precision, context quality, and ultimately the ceiling on answer quality in your RAG system. As we've seen, careful tuning of strategy, size, and overlap is critical to production performance. Moving on to the next block -- the embedding model -- we'll see how chunk quality directly shapes vector representations.
