Prompt Caching Savings Calculator

๐Ÿ”’ Runs in your browser โ€” nothing is sent to a server

Prompt caching savings calculator for Claude, GPT and Gemini. Feed in the size of the cacheable prefix (system prompt, RAG context, few-shot demos), the size of the per-request dynamic input, the cache hit rate you expect, and the daily request volume. The calculator returns per-request, per-day, per-month and per-year savings against the uncached baseline. Cache reads are billed at 10% of standard on Anthropic, 50% on OpenAI, 25% on Google โ€” this calculator picks the right rate automatically for each model. Useful when deciding whether the engineering effort of structuring a cache-friendly prompt pays off at your traffic volume.

Pricing snapshot: May 16, 2026
80%
Without caching
$0.0735
per request
With caching
$0.0333
per request
You save
$0.0402 (54.7%)
per request
Volume projections
Per day (5,000 req)$201.00
Per month (30 days)$6,030.00
Per year$72,360.00
Rate breakdown (Claude Sonnet 4.6)
Standard input$3.00/M
Cache read$0.30/M (90% off)
Cache write (first call)$3.75/M
Output$15.00/M
Snapshot: 2026-05-16
Worked example

At a 80% cache hit rate, 20.0K static tokens are billed at $0.30/M instead of $3.00/M for 80% of requests. The dynamic 500 tokens per request keep paying the standard input rate. That nets $0.0402 per request, or $6,030.00 per month at 5,000 req/day.

Provider-by-provider cheat sheet

Anthropic Claude: 90% discount on cache reads, 25% premium on cache writes, 5-minute default TTL, breakpoints declared explicitly with `cache_control: { type: "ephemeral" }` markers. OpenAI: automatic caching for prompts โ‰ฅ1,024 tokens, 50% discount on cache reads, no write premium, no manual control โ€” the provider hashes the prefix automatically. Google Gemini: explicit "context caching" API where you create a cached content object and reuse it across requests, 75% discount, configurable TTL from minutes to hours, useful for shared corpora rather than per-user prompts.

When the savings model breaks

Three failure modes worth knowing. First, if your dynamic content sits at the start of the prompt instead of the end, the cache never hits โ€” fix the prompt structure. Second, if the model context window is large but most prompts in production are short, caching cannot help: there is no shared prefix to amortise. Third, if your prompts include timestamps, request IDs or user IDs at the top of the system message (a common debugging habit), every request looks unique to the cache. Move all such fields to the end of the request before measuring caching ROI.

Examples

Input
Static 8,000 tokens, dynamic 500 tokens, output 600 tokens, hit rate 90%
Output
Cache saves ~$0.022/request, ~$1,080/day, ~$32,400/month vs. uncached. Break-even on cache writes: after the second call.
Long system prompt โ€” 8K tokens ร— 50k calls/day on Claude Sonnet 4.6
Input
Static 30,000 tokens, dynamic 200 tokens, output 500 tokens, hit rate 70%
Output
OpenAI cache discount is 90% off โ€” saves ~$0.047/request, ~$33,800/month at 24K requests/day. Without the cache, same workload costs ~$36,000/mo.
Retrieval-heavy assistant on GPT-5.4 โ€” 30K cached context
Input
Static 4,000 tokens, dynamic 300 tokens, output 200 tokens, hit rate 95%
Output
Saves ~$0.0034/request. At 100K req/day โ†’ ~$340/day, ~$10,200/month โ€” caching is worth it even on the cheapest tier when the prefix is reused this heavily.
Static few-shot demos on Claude Haiku 4.5 โ€” high hit rate

FAQ

What is prompt caching in LLM APIs?

Prompt caching lets you mark a static prefix of your request โ€” a long system prompt, retrieved RAG context, few-shot examples โ€” as cache-able. After the first call processes and stores that prefix, subsequent calls within the cache window pay a sharply discounted rate for those tokens instead of the full input price. Anthropic gives a 90% discount on cache reads, OpenAI 50%, Google 75%.

When is prompt caching worth the engineering effort?

Whenever the same prefix is reused two or more times within the cache TTL. Concretely: chatbots with a long system prompt, RAG pipelines where the retrieved context lives in a few hot documents, code-assistant tools that always send the same project skeleton, agentic loops where every tool turn re-sends the same instructions. If your prefix changes every request โ€” fresh per-user data, no shared context โ€” caching can't help.

How is cache hit rate determined?

You decide the structure. Put genuinely static content at the very start of the request (the cache key is a prefix hash) and append dynamic content at the end. Hit rate then equals the fraction of requests whose prefix matches a still-warm cache entry. For chat apps with a global system prompt, this approaches 100%. For per-tenant prompts, it depends on traffic concentration. Measure with the provider's `cached_tokens` response field after rollout.

Do I pay extra to write a cache entry?

On Anthropic, yes โ€” cache writes cost 1.25ร— the normal input rate. On OpenAI and Google the write is free; only reads are discounted. The calculator on this page models Anthropic's write premium automatically when you pick a Claude model. The break-even point for cache writes is usually after the second hit โ€” even on Anthropic, if the prefix is read twice you come out ahead.

How long does a cached prompt stay warm?

Provider-specific: Anthropic's default is 5 minutes idle TTL, extendable to 1 hour or 24 hours with paid tiers. OpenAI keeps cached prefixes for typically 5โ€“10 minutes with no explicit guarantee. Google Gemini context caching is configurable from minutes to hours. Plan your traffic shape around this โ€” caching helps high-throughput workloads more than spiky ones.

Does cache hit rate affect output cost?

No. Caching only reduces input token cost. Output tokens are always billed at the model's standard output rate (or batch rate, if using the Batch API). The calculator separates the two so you can see exactly which portion of the bill is affected โ€” the output line stays constant across cached and uncached scenarios.

Can I combine prompt caching with the Batch API?

Yes โ€” both providers support stacking. Batch API gives ~50% off all token rates; cache discounts compound on top. On Anthropic that means a cached input token in a batch request can cost as little as 5% of the standard non-batch rate. The savings calculator on this page covers cache savings only; for batch see the LLM Cost Calculator and toggle the Batch API checkbox.

Glossary

Prompt cache

A prompt cache is a provider-side store of the activations produced when processing a prompt prefix. On a subsequent request whose prefix matches, the provider skips re-processing and bills the matched tokens at a discounted rate. The cache is keyed on the prefix exactly โ€” even a one-character difference invalidates the entry. Caching is invisible to the model output; only the bill changes.

Cache hit rate

Cache hit rate is the fraction of requests whose prefix is found in the provider's warm cache. A 100% hit rate means every request reuses a stored prefix; 0% means every request writes a new entry. Production chatbots with a shared system prompt commonly see 95%+. Per-user dynamic prompts may see 0โ€“30%. Hit rate directly multiplies the savings number on this calculator.

Cache write premium

Anthropic charges 1.25ร— the standard input rate to write a prompt to the cache; OpenAI and Google charge no write premium. The write premium means caching is slightly more expensive than non-caching for one-shot requests, but pays off as soon as the prefix is read a second time. The break-even point is the second call.

Cache TTL (time-to-live)

Cache TTL is how long a cached prompt stays warm without being read. Anthropic's default is 5 minutes; longer durations (1h, 24h) cost more to write. OpenAI keeps entries on a best-effort basis for ~5โ€“10 minutes. After the TTL expires, the next call writes a fresh entry, which on Anthropic incurs the write premium again.

Cacheable prefix

A cacheable prefix is the leading portion of an LLM request that does not change between calls. The system prompt, a few-shot exemplar block, retrieved RAG context shared across many users, and tool definitions all qualify. Per-user identifiers, timestamps and the actual user message should appear after the cacheable prefix so they don't bust the cache.

Related tools