
Every 512 tokens is not a chunking strategy: a 2026 practical guide to choosing how to split your documents for RAG

Chunking is the single highest-leverage and most under-treated decision in a RAG pipeline, and most teams leave it on the default. Here is a source-checked 2026 guide to the five chunking strategies that actually matter — fixed, recursive, semantic, late, and proposition-based — when to use each, the retrieval-quality tradeoffs, and why the right answer is never 'whatever the tutorial used.'



This is the second piece in the production RAG cluster, following "RAG did not solve hallucinations — it moved them." That piece named chunking as the number-one failure mode in production RAG; this one is the practical "so what do I do about it." If the diagnosis piece told you where RAG breaks, this piece tells you how to fix the layer that breaks most often.

My read, after going through the chunking literature, is blunt: "chunk every 512 tokens with a 50-token overlap" is the most common chunking decision in the wild, and it is almost never the right one. It is the default in most tutorials, and teams leave it on because chunking feels like plumbing rather than a real engineering decision. It is a real engineering decision, arguably the highest-leverage one in the whole pipeline, and treating it as a default is how you ship a RAG system that works on a demo corpus and silently returns fragments of arguments on the real one.

Why chunking is the highest-leverage decision

Recall the data point from the diagnosis piece: when RAG fails in production, roughly 73% of the time the failure is in retrieval, not generation. And the single biggest input to retrieval quality is how your documents were chunked in the first place. Everything downstream — embedding choice, retrieval method, reranking — operates on the chunks you produced. Garbage chunks in, garbage context out, no matter how good your retriever is.

The reason chunking matters so much is that it determines the unit of meaning your retriever can return. If a chunk boundary splits a coherent argument across two chunks, neither chunk is a useful retrieval result for a question about that argument — they are both substantively incomplete, even if both are semantically near the query. The retriever is not broken; it was handed broken units. This is why chunking is upstream of every other retrieval decision: it sets the ceiling on retrieval quality that no amount of reranking or hybrid search can lift.

The honest framing: chunking is not a preprocessing step you set once and forget. It is an engineering decision you make, measure, and revisit. The right strategy depends on your document type, your query patterns, and your embedding model — and it changes as those change.

The five chunking strategies that actually matter

Drawing on the Firecrawl, Digital Applied, Redis, and Pinecone practitioner guides, here are the strategies worth knowing, when each wins, and the tradeoff each carries.

1. Fixed-size chunking (the default you should beat)

Split the document every N tokens, with some overlap. Simple, fast, requires no understanding of document structure.

Use when: you are prototyping, your documents are short and uniform, or you need a baseline to beat.
Tradeoff: splits mid-thought. A 512-token window can cut a coherent argument in half, leaving both chunks substantively incomplete. This is the failure mode that causes "the retriever returned something relevant but useless."
When to move off it: as soon as you have a real corpus and a real eval set. If a smarter chunking strategy beats fixed-size on retrieval recall, switch.
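
To make the baseline concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name and the use of whitespace-separated words in place of real tokenizer output are assumptions for illustration; in a real pipeline you would pass in token IDs from your embedding model's tokenizer.

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace-separated words stand in for tokenizer output here (an assumption).
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in fixed_size_chunks(words, size=4, overlap=1):
    print(" ".join(chunk))
```

Note that nothing in the loop looks at the content: the window boundary falls wherever the arithmetic puts it, which is exactly why this strategy splits mid-thought.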

2. Recursive chunking (structure-aware splitting)

Split on structural boundaries first — Markdown headers, HTML tags, code blocks, paragraph breaks — and only fall back to token-count within a structural unit.

Use when: your documents have meaningful structure (Markdown docs, HTML pages, code, technical specs). This is the workhorse for mixed-format corpora.
Tradeoff: still heuristic. It respects structure better than fixed-size, but it can over-split when structural boundaries are dense (lots of short headers) or under-split when they are sparse.
Why it usually beats fixed-size: it preserves the structural unit as the retrieval unit. A section under a header is usually a more coherent retrieval result than an arbitrary 512-token window.
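
A rough sketch of the idea, under stated assumptions: the separator list (Markdown level-2 headers, blank lines, newlines, spaces) and the character-based length limit are illustrative choices, not a reference implementation of any particular library's splitter. It tries the coarsest separator first, merges small neighbours, and recurses with finer separators only when a piece is still too long.

```python
def recursive_split(text, max_len=200, separators=("\n## ", "\n\n", "\n", " ")):
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No structure left to exploit: fall back to a hard cut by length.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator to the piece that follows it, so a header stays with its section.
    pieces = [parts[0]] + [sep + part for part in parts[1:]]
    chunks, buffer = [], ""
    for piece in pieces:
        if len(buffer) + len(piece) <= max_len:
            buffer += piece  # merge small neighbouring pieces into one chunk
            continue
        if buffer.strip():
            chunks.append(buffer)
        buffer = ""
        if len(piece) <= max_len:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, max_len, finer))
    if buffer.strip():
        chunks.append(buffer)
    return chunks


doc = (
    "# Guide\n\nIntro paragraph.\n## Setup\n\nInstall the tool.\n\n"
    "Configure it.\n## Usage\n\nRun the command."
)
for chunk in recursive_split(doc, max_len=40):
    print(repr(chunk))
```

On this toy document the "Setup" section is too long for one chunk, so only that section is split further at its paragraph break; the other sections come back whole.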

3. Semantic chunking (meaning-aware splitting)

Use an embedding model to detect topic shifts within the document and split at those boundaries, so each chunk is topically coherent.

Use when: your documents are narrative prose where topic boundaries matter more than structural boundaries — essays, research papers, long-form articles, transcripts.
Tradeoff: slower at ingestion time and depends on embedding model quality. A weak embedding model produces weak semantic boundaries, which gives you the worst of both worlds — the cost of semantic chunking with the coherence of fixed-size.
Why it can beat recursive: it splits on meaning, not on markup. For prose without strong structural signals, semantic boundaries align better with what a query is actually looking for.
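
The boundary-detection step can be sketched as follows. To keep the example self-contained, `toy_embed` is a bag-of-words stand-in, not a real embedding model, and the 0.2 similarity threshold is an arbitrary assumption; in practice you would pass your own embedding function and tune the threshold on your corpus.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding: a bag-of-words count vector. Swap in a real embedding model."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever similarity between adjacent sentences drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The cache stores recent query results.",
    "A cache hit returns results without a database query.",
    "Billing runs on the first day of each month.",
    "Each month the billing job emails an invoice.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

The quality of the split depends entirely on `embed`: the toy version only sees shared words, which is a blunt illustration of how a weak model produces weak boundaries.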

4. Late chunking (embed-first, chunk-second)

Run the full document through a long-context embedding model first, then pool the resulting token-level embeddings within each chunk's span to get one vector per chunk. Every chunk's representation accounts for the entire document's context, not just the tokens inside the chunk.

Use when: you have long documents where context bleeds across sections — legal contracts, research papers, technical specifications where "Section 3" only makes sense after "Section 2."
Tradeoff: higher compute cost and requires a long-context embedding model that supports this pattern. You are paying for richer representations with more ingestion work.
Why it can beat everything else for long docs: it targets the "context bleed" problem that the other strategies leave open, since they all embed each chunk in isolation. A chunk from the middle of a document, embedded with full-document context, is a better retrieval unit than the same chunk embedded in isolation.
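
The pooling step is small enough to show directly. This sketch assumes you already have per-token embeddings from a long-context model run over the whole document (the `token_embeddings` values below are made up for illustration) and that chunk boundaries are given as token index spans. Mean pooling is one common choice; your model may recommend another.

```python
def late_chunk(token_embeddings, spans):
    """Mean-pool contextual token embeddings over each (start, end) span, end exclusive."""
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span {(start, end)}")
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


# Hypothetical per-token output of a long-context embedding model (values are invented).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries chosen by any strategy, in token indices
print(late_chunk(token_embeddings, spans))  # [[2.0, 1.0], [1.0, 2.0]]
```

The difference from ordinary chunking is the order of operations: the model sees the whole document before any boundary is applied, so each token vector being pooled already reflects its surrounding sections.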

5. Proposition-based / LLM-based chunking (precision extraction)

Use an LLM to extract self-contained propositions or factual statements from the document, and index those propositions instead of text chunks.

Use when: you are doing precision question-answering over complex documents where each retrievable unit should be a discrete, verifiable claim — medical, legal, compliance, or scientific QA.
Tradeoff: the most expensive option at ingestion time (you run an LLM over every document), the slowest to index, and the hardest to debug. Reserve it for cases where precision is worth the cost.
Why it is the precision ceiling: a proposition is the smallest meaningful retrieval unit. When it works, retrieval returns exactly the claim that answers the query, with no surrounding noise.
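
Here is a sketch of the indexing shape, with the LLM deliberately left out. `Proposition`, `index_propositions`, and `naive_extract` are hypothetical names; `naive_extract` is a placeholder that treats each sentence as a proposition, which a real LLM extractor would improve on by resolving pronouns and splitting compound claims. The part worth copying is the `source_id` pointer back to the original passage.

```python
from dataclasses import dataclass


@dataclass
class Proposition:
    text: str
    source_id: str  # which passage the claim was extracted from


def index_propositions(passages, extract):
    """passages: dict of id -> text; extract: callable text -> list of claim strings."""
    index = []
    for source_id, text in passages.items():
        for claim in extract(text):
            claim = claim.strip()
            if claim:
                index.append(Proposition(claim, source_id))
    return index


def naive_extract(text):
    """Placeholder for the LLM call: one 'proposition' per sentence (an assumption)."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


passages = {"doc1#p3": "The service retries failed requests. Retries use exponential backoff."}
for prop in index_propositions(passages, naive_extract):
    print(prop.source_id, "->", prop.text)
```

Keeping `source_id` on every proposition lets you retrieve on the precise claim and still hand the generator the surrounding passage when it needs the connective context.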

A decision table, not a leaderboard

The mistake is treating these as ranked (propositions "best," fixed "worst") and picking the top of the list. They are not ranked; they are matched. Here is how I would decide:

If your documents are...	Lean toward...	Because...
Short, uniform, prototyping	Fixed-size	You need a baseline, and structure does not matter yet
Structured (Markdown, HTML, code)	Recursive	Structural units are coherent retrieval units
Long-form prose, topic-driven	Semantic	Topic boundaries align with query intent
Long docs with cross-section context	Late chunking	Full-document context fixes "context bleed"
Precision QA over complex claims	Proposition-based	Each retrievable unit is a discrete verifiable claim

The decision is not "which is best in general." It is "which matches my documents and queries." A strategy that wins on legal contracts can lose on chat transcripts. Measure on your data.

How to actually choose (and know you chose right)

The practical selection process:

1. Start from your document type, not from the strategy. Pick a candidate strategy from the table above based on what your corpus actually looks like.
2. Build a small retrieval eval set. Fifty to a hundred queries with known-relevant chunks (this connects directly to the golden set discipline from our eval cluster).
3. Measure retrieval recall and precision for each candidate strategy. Measure retrieval quality, not generation quality. The question is "did the right chunk come back," not "was the answer good," because chunking acts on the pipeline through the retrieval layer, and generation quality mixes in prompt and model effects.
4. Pick the winner and re-measure in production. Offline eval tells you what is plausible; production traffic tells you what is true. Re-run when your corpus changes significantly.
5. Revisit when you change the embedding model. Chunking and embedding interact: the chunking strategy that won with one embedding model can lose with another, because what counts as a "coherent unit" depends on how the embedding model represents text.
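
The measurement step above can be sketched in a few lines. The function names and the dictionary shapes are assumptions for illustration. One caveat the sketch does not solve: chunk IDs differ between chunking strategies, so relevance labels are best kept against source passages or character spans and mapped to each strategy's chunk IDs before scoring.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant chunk")
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def evaluate(results, golden, k=5):
    """results: query -> ranked chunk ids; golden: query -> set of relevant chunk ids."""
    if not golden:
        raise ValueError("golden set is empty")
    recalls = [recall_at_k(results.get(q, []), rel, k) for q, rel in golden.items()]
    precisions = [precision_at_k(results.get(q, []), rel, k) for q, rel in golden.items()]
    return {"recall@k": sum(recalls) / len(golden), "precision@k": sum(precisions) / len(golden)}


golden = {"q1": {"a", "b"}, "q2": {"c"}}
results = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}
print(evaluate(results, golden, k=2))  # {'recall@k': 0.25, 'precision@k': 0.25}
```

Run `evaluate` once per candidate strategy over the same queries and compare the numbers side by side; the absolute values matter less than which strategy wins on your corpus.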

The sharp edges that are not in the marketing copy

A few risks worth knowing:

A bigger chunk is not a better chunk. Long chunks give the model more context but dilute retrieval precision — the retriever has more chances to return a chunk that is mostly irrelevant with a relevant paragraph buried in it. Chunk size is a precision/recall tradeoff, not a "bigger is better" dial.
Overlap is not a substitute for good boundaries. A 50-token overlap between fixed chunks does not restore the argument you split in half; it just means both fragments repeat some tokens. Overlap mitigates boundary losses; it does not eliminate them.
Semantic chunking inherits your embedding model's blind spots. If your embedding model conflates two distinct topics, semantic chunking will not split them. The strategy is only as good as the model that drives it.
Late chunking is not free even when your embedding model supports it. Long-context embedding is more expensive per document, and the indexing pipeline is more complex. Budget for it before committing.
Proposition-based chunking can lose context the model needed. By reducing documents to discrete claims, you can strip the connective tissue that makes some answers possible. It is a precision tool, not a universal default.

How this connects to the rest of the RAG stack

This piece is the second in the RAG cluster and connects outward:

It is the practical follow-up to the RAG failure diagnosis piece, which named chunking as failure mode #1.
It connects to the LLM evaluation cluster: the retrieval metrics that tell you whether a chunking strategy is working (context precision, context recall) are the same metrics that diagnose RAG failures generally.
It connects to the golden set piece: choosing a chunking strategy requires a retrieval eval set, which is a golden set applied to the retrieval layer.
It connects to the per-task cost observability piece: late chunking and proposition-based chunking both add ingestion cost, and whether that cost is worth it is a per-task cost question.

My take

The 2026 story is that chunking is treated as plumbing when it is actually the load-bearing decision in a RAG pipeline. The teams whose RAG retrieves well are not the ones with the most expensive embedding model or the fanciest reranker. They are the ones who matched their chunking strategy to their document type, measured retrieval recall on a real eval set, and treated "every 512 tokens" as a baseline to beat rather than a final answer.

If you take one thing from this piece: chunking is upstream of every other retrieval decision, and the default is almost never the right answer for a real corpus. Pick a strategy based on your documents, measure it, and revisit it when your corpus or embedding model changes.

This is the second piece in the production RAG cluster. Start with "RAG did not solve hallucinations — it moved them" for the failure diagnosis that names chunking as the top failure mode, then this piece for how to choose a chunking strategy, then "A RAG system without evaluation is a guess" for the evaluation that tells you whether your chunking is working. For the broader evaluation discipline, see the LLM evaluation cluster. For a maintained provider reference, see our AI pricing data page.

