RAG Chunk Estimator — Plan Retrieval-Augmented Generation Costs

🔒 Runs in your browser — nothing is sent to a server

This RAG chunk estimator walks a retrieval-augmented generation pipeline end to end: source document count and size, chunk size with overlap, embedding model, query volume, top-K retrieval, and LLM answer model. It returns the total number of chunks, total tokens to embed, the one-time index-build bill, and the per-query cost broken down into query embedding plus LLM call. A context-fit check verifies that system prompt plus top-K chunks plus answer reservation actually fits the chosen LLM's context window; overflow is the most common bug when scaling retrieval. Use it to size a RAG project before signing for a vector store, to pick a chunk size that balances retrieval quality and LLM input cost, and to estimate monthly burn under realistic traffic.

Pricing snapshot: May 16, 2026
Chunks: 15,000 (15 per document · 7.50M total tokens to embed)
Index build (one-time): $0.1500 (text-embedding-3-small @ $0.020/M)
Per query (embed + LLM): $0.0148 ($0.00000080 embed + $0.0148 LLM)

Query-time projections
Per day (1,000 queries): $14.82
Per month (30 days): $444.62
Per year: $5,335.49


Context fit check · Claude Sonnet 4.6
LLM input (sys + topK × chunk + query): 2,940
+ answer reserved: 400
Total vs window: 3,340 / 1.00M
Fits: uses 0.33% of window
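
The context fit check above is plain addition, and a minimal Python sketch of it might look like the following. The function name and parameters are illustrative, not the calculator's actual code. The split of the 2,940 input tokens is an assumption: top-K 5 and 500-token chunks come from the first example below, the 40-token query comes from the FAQ, and the 400-token system prompt is simply what is left over.

```python
def context_fit(system_tokens, top_k, chunk_tokens, query_tokens,
                answer_reserve, context_window):
    """Return (llm_input, total, fits, fraction_of_window_used)."""
    # LLM input = system prompt + retrieved chunks + the user query
    llm_input = system_tokens + top_k * chunk_tokens + query_tokens
    # The answer reservation must also fit inside the same window
    total = llm_input + answer_reserve
    return llm_input, total, total <= context_window, total / context_window


llm_input, total, fits, used = context_fit(
    system_tokens=400,        # assumed: 2,940 - 5 * 500 - 40
    top_k=5,                  # assumed from the first example
    chunk_tokens=500,
    query_tokens=40,          # assumed from the FAQ's 40-token query
    answer_reserve=400,
    context_window=1_000_000,
)
print(llm_input, total, fits, f"{used:.2%}")  # 2940 3340 True 0.33%
```

Under these assumptions the sketch reproduces the snapshot figures; with a 5,000-token chunk size and the same top-K the input alone would exceed 25K tokens, which is where smaller context windows start to overflow.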

How chunk size moves the cost curve

Two regimes. Small chunks (200–500 tokens): high chunk count and more vectors to store and search, with a higher index build cost when a fixed overlap repeats a larger share of each chunk, but small per-query LLM input. Best when retrieval precision matters and you can afford to re-rank. Large chunks (1500–5000 tokens): few chunks and a small index, but each query pulls in a lot of LLM context, which makes the LLM call expensive and risks context overflow. Best when documents are coherent and the answer must reference whole sections. Most production systems land at 500–1000 tokens with 50–100 overlap, top-K 4–6. Use the calculator to verify the choice fits your context budget on the answer model.

The hidden cost: query growth

The biggest source of RAG bill surprises is query growth, not corpus growth. Doubling the corpus doubles the one-time index cost: a few dollars to a few hundred, paid once. Doubling daily queries doubles the LLM bill, which dominates the budget and recurs every month. When sizing a RAG project, project query volume out 12–24 months and pick the LLM and chunking strategy for that scale, not the launch traffic. The cost-per-query line on this calculator times your projected query volume is the number that drives the architecture, not the chunk count.
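
To make the point concrete, here is a small Python sketch that projects the monthly query bill as traffic doubles. It uses the rounded $0.0148 per-query figure from the snapshot above, so its totals differ by a few cents from the calculator, which works from the unrounded value. The function names are illustrative assumptions.

```python
COST_PER_QUERY = 0.0148        # rounded per-query cost from the snapshot
INDEX_BUILD_ONE_TIME = 0.15    # one-time index build from the snapshot


def monthly_query_cost(cost_per_query, queries_per_day, days=30):
    """Recurring query-time cost for one month."""
    return cost_per_query * queries_per_day * days


def growth_table(start_queries_per_day, doublings):
    """Monthly cost after 0, 1, ..., `doublings` doublings of daily traffic."""
    rows = []
    for n in range(doublings + 1):
        queries_per_day = start_queries_per_day * 2 ** n
        rows.append((queries_per_day,
                     monthly_query_cost(COST_PER_QUERY, queries_per_day)))
    return rows


for queries_per_day, monthly in growth_table(1_000, 3):
    print(f"{queries_per_day:>6,} queries/day -> ${monthly:,.2f}/month")
print(f"one-time index build: ${INDEX_BUILD_ONE_TIME:.2f}")
```

Three doublings take the monthly bill from roughly $444 to roughly $3,552, while the index build stays a one-time $0.15.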

Examples

Internal help centre: 1,000 pages, 500-token chunks, Claude Sonnet 4.6
Input
1,000 docs × 10 pages × 6,670 tokens, chunk 500, overlap 50, top-K 5
Output
~14,800 chunks. Index build ~$0.15 on text-embedding-3-small. Per query: ~$0.012 LLM + ~$0.0000008 embed → $36/month at 100 queries/day.

Customer docs over 5K-token chunks for Gemini 2.5 Pro
Input
500 docs × 5 pages, chunk 5,000, overlap 500, top-K 3
Output
~370 chunks. Heavy index but cheap one-time. Per query injects 15K input tokens → ~$0.020/query on Gemini 2.5 Pro.

Code search over 100K files: small chunks, lots of overlap
Input
100,000 files × 200 tokens, chunk 256, overlap 64, top-K 8
Output
~104K chunks. Index build ~$0.53 on text-embedding-3-small. 800K embeddings × 6 KB ≈ 5 GB raw fp32; quantize before storing.

FAQ

What chunk size should I pick for RAG?

The honest answer is "measure your own corpus", but useful defaults: 200–500 tokens for short conversational answers, 500–1000 for documentation lookup, 1500–3000 when the model needs to see whole sections at once. Smaller chunks improve precision (the retrieved passage stays focused) but require larger top-K and more retrieval round-trips. Larger chunks waste LLM context on irrelevant text but reduce ranking overhead. The estimator on this page makes the trade-off visible — increasing chunk size shrinks chunk count and index cost but inflates per-query LLM input.

Why do I need overlap between chunks?

Chunks cut the document at arbitrary boundaries; a sentence relevant to the user query might straddle the cut, leaving half the answer in one chunk and half in the next. A 10–20% overlap (50–100 tokens for a 500-token chunk) buys robustness: the relevant passage is much more likely to land entirely inside at least one chunk. The overhead is small: overlap inflates total embed tokens by roughly the overlap fraction (a 50-token overlap on 500-token chunks adds about 11%). The estimator handles overlap arithmetic automatically.
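
As a rough check on that overhead, the sketch below computes the inflation factor for long documents, assuming each new chunk advances by chunk size minus overlap. The function name is illustrative, and the factor ignores the rounding at the end of each document.

```python
def overlap_inflation(chunk_size, overlap):
    """Approximate ratio of embedded tokens to source tokens for long documents."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    # Each chunk embeds chunk_size tokens but only advances by chunk_size - overlap
    return chunk_size / (chunk_size - overlap)


for overlap in (0, 50, 100):
    factor = overlap_inflation(500, overlap)
    print(f"chunk 500, overlap {overlap:>3}: x{factor:.3f} embed tokens")
```

For 500-token chunks this gives about 1.11× at a 50-token overlap and 1.25× at 100 tokens, so the overhead is slightly larger than the nominal 10% and 20% overlap fractions.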

What does top-K mean and how do I choose it?

Top-K is the number of most-similar chunks the vector store returns for each query, which then get injected into the LLM prompt. K=3 to K=8 is the typical range. Higher K improves recall (you're less likely to miss the right passage) but multiplies LLM input cost linearly. Most production systems pick K based on the model context budget: if 8 × 500 = 4K extra tokens per query is acceptable for your model and cost target, use K=8; otherwise lower K until the retrieved chunks fit.
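
One way to turn that budget rule into arithmetic is sketched below: given an input-token budget per query, find the largest K whose chunks still fit. The function, its parameters, and the cap of 8 are illustrative assumptions, not part of the calculator.

```python
def max_top_k(input_budget_tokens, system_tokens, query_tokens,
              chunk_tokens, k_cap=8):
    """Largest K (up to k_cap) whose retrieved chunks fit the input budget."""
    # Tokens left for retrieved chunks after the fixed parts of the prompt
    available = input_budget_tokens - system_tokens - query_tokens
    # Floor division counts whole chunks; clamp to the range [0, k_cap]
    return max(0, min(k_cap, available // chunk_tokens))


print(max_top_k(4_500, system_tokens=400, query_tokens=40, chunk_tokens=500))  # 8
print(max_top_k(3_000, system_tokens=400, query_tokens=40, chunk_tokens=500))  # 5
```

A budget-derived K is only an upper bound; retrieval quality measurements should decide whether a smaller K is good enough.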

How does the calculator estimate the number of chunks?

For each document of length D tokens, with chunk size C and overlap O, it computes the sliding-window count: 1 if D ≤ C, else ⌈(D − O) / (C − O)⌉. Total chunks equals documents times chunks-per-doc. Total embed tokens equals total chunks times chunk size (slight upper bound — the final chunk of a document may be shorter, but the approximation is within a few percent and conservative).
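
The formula above translates directly into a few lines of Python. This is a sketch of the stated arithmetic rather than the calculator's source; the function names are assumptions, and the 6,670-token document length is read from the first example as a per-document figure.

```python
import math


def chunks_per_doc(doc_tokens, chunk_size, overlap):
    """Sliding-window chunk count: 1 if D <= C, else ceil((D - O) / (C - O))."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if doc_tokens <= chunk_size:
        return 1
    return math.ceil((doc_tokens - overlap) / (chunk_size - overlap))


def corpus_totals(num_docs, doc_tokens, chunk_size, overlap):
    """Return (chunks per doc, total chunks, total embed tokens)."""
    per_doc = chunks_per_doc(doc_tokens, chunk_size, overlap)
    total_chunks = num_docs * per_doc
    # Upper bound: treats every chunk, including the last one, as full size
    embed_tokens = total_chunks * chunk_size
    return per_doc, total_chunks, embed_tokens


print(corpus_totals(1_000, 6_670, 500, 50))  # (15, 15000, 7500000)
```

With 1,000 documents of 6,670 tokens each, 500-token chunks, and a 50-token overlap, the sketch gives the 15 chunks per document, 15,000 chunks, and 7.50M embed tokens shown in the snapshot at the top of the page.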

Why is per-query cost dominated by the LLM call, not the embedding?

Embedding a single query costs well under a millicent: 40 tokens × $0.02/M = $0.0000008. The LLM answer call processes top-K chunks (often 2,500–5,000 tokens) plus a system prompt and an answer (typically 200–800 tokens output), which is on the order of 100× more tokens at a much higher per-token rate (10–100× or more). In the snapshot above, the LLM call costs about 18,500× as much as the query embedding ($0.0148 vs $0.0000008). Optimising RAG cost means optimising the LLM input: fewer chunks, smaller chunks, cheaper LLM, prompt caching for the static parts.
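
The split between the two cost components can be sketched as follows. The embedding rate and token counts come from this page; the LLM rates of $3 per million input tokens and $15 per million output tokens are assumptions chosen because they reproduce the snapshot's $0.0148, so check them against current pricing before relying on them.

```python
def per_query_cost(query_tokens, embed_rate_per_m,
                   llm_input_tokens, input_rate_per_m,
                   output_tokens, output_rate_per_m):
    """Return (embedding cost, LLM cost) in dollars for one query."""
    embed = query_tokens * embed_rate_per_m / 1_000_000
    llm = (llm_input_tokens * input_rate_per_m
           + output_tokens * output_rate_per_m) / 1_000_000
    return embed, llm


embed, llm = per_query_cost(
    query_tokens=40, embed_rate_per_m=0.02,       # text-embedding-3-small
    llm_input_tokens=2_940, input_rate_per_m=3.0,  # assumed input rate
    output_tokens=400, output_rate_per_m=15.0,     # assumed output rate
)
print(f"embed ${embed:.8f}  LLM ${llm:.4f}  ratio {llm / embed:,.0f}x")
```

Under these assumed rates the answer tokens account for about 40% of the LLM cost despite being a small share of the token count, so capping answer length is also worth measuring.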

Does the estimator account for vector storage cost?

Only as a raw byte estimate (dimensions × 4 bytes per fp32 value × number of vectors). Real vector databases (Pinecone, Qdrant Cloud, Weaviate, pgvector) apply their own overhead and pricing tiers, often $0.10–$1 per million vectors stored per month. Production storage rarely dominates the bill (under 10% of total RAG spend); accurate vector storage modelling is on the MVP-2 roadmap. For now, multiply raw byte estimates by 2–3× as a rough provisioning factor.
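
A minimal sketch of that raw byte estimate, using the 800K-vector figure from the code search example and assuming 1,536 dimensions (the default output size of text-embedding-3-small). The function name and the int8 comparison are illustrative.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for dense vectors, before any database overhead."""
    return num_vectors * dimensions * bytes_per_value


GB = 1_000_000_000

fp32 = raw_vector_bytes(800_000, 1_536)                     # 4 bytes per value
int8 = raw_vector_bytes(800_000, 1_536, bytes_per_value=1)  # scalar-quantized

print(f"fp32 raw: {fp32 / GB:.2f} GB")
print(f"provision at 2-3x: {2 * fp32 / GB:.1f} to {3 * fp32 / GB:.1f} GB")
print(f"int8 raw: {int8 / GB:.2f} GB")
```

Quantizing from fp32 to int8 cuts the raw footprint to a quarter, at some cost in retrieval accuracy that should be measured on your own corpus.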

How do I reduce my RAG bill?

Five proven levers, roughly in order of impact: (1) cache the static system prompt and any always-injected context, which saves 50–90% on the cached input tokens on Anthropic/OpenAI once the first request has populated the cache; (2) drop top-K from 8 → 5 once you measure retrieval quality is good enough; (3) route easy queries to a cheaper LLM (Haiku or Flash) and reserve the flagship for hard ones; (4) increase chunk size when documents are coherent so you fetch fewer chunks per query; (5) re-rank with a small cross-encoder before sending to the LLM so top-K can be smaller.

Glossary

Retrieval-Augmented Generation (RAG)

RAG is an architecture pattern that augments an LLM with retrieval from an external knowledge base. At query time, the user question is embedded, the most similar passages are fetched from a vector store, and those passages are injected into the LLM prompt as additional context. RAG gives LLMs access to up-to-date or private information without fine-tuning, at the cost of a per-query retrieval step.

Chunking

Chunking is the preprocessing step that splits source documents into smaller pieces sized for embedding and retrieval. Common chunk sizes are 200–1000 tokens with a 10–20% overlap between adjacent chunks. The chunking strategy is the single largest design choice in a RAG pipeline — it determines retrieval precision, recall, and LLM input cost.

Top-K retrieval

Top-K retrieval is the operation of fetching the K most similar chunks to a query from the vector store. K is typically 3–8. Higher K improves recall but multiplies LLM input tokens linearly. Many production systems use two-stage retrieval: a large top-K (e.g. 50) from the vector store, then a small top-K (e.g. 5) after a cross-encoder re-rank.

Chunk overlap

Chunk overlap is the number of tokens shared between adjacent chunks of the same document. A 50-token overlap on a 500-token chunk means each chunk repeats the last 50 tokens of the previous one. Overlap protects against passages that straddle a chunk boundary, at the cost of inflating total embed tokens by the overlap fraction.
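
A sliding-window chunker that produces this kind of overlap can be sketched in a few lines. It works on any already-tokenized list and is an illustration under that assumption, not the code of a particular library; the function name is hypothetical.

```python
def sliding_window_chunks(tokens, chunk_size, overlap):
    """Split a token list into chunks that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the document
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


tokens = list(range(1_200))  # stand-in for real token IDs
chunks = sliding_window_chunks(tokens, chunk_size=500, overlap=50)
print([len(c) for c in chunks])          # [500, 500, 300]
print(chunks[1][:50] == chunks[0][-50:])  # True
```

The final chunk is shorter than the rest, which is why multiplying chunk count by chunk size gives a slight upper bound on embed tokens.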

Index build

Index build is the one-time process of embedding every chunk in a corpus and writing the vectors to the vector store. The bill is the total embed tokens times the embedding model rate. After the initial build, only changed documents need re-embedding, plus the per-query embedding of incoming user questions.
