Ctrl+K

LLM Cost Calculator — Compare API Pricing Across Providers

🔒 Runs in your browser — nothing is sent to a server

LLM cost calculator that compares OpenAI GPT-5.5, Anthropic Claude Opus 4.7, Google Gemini 2.5 Pro, Meta Llama 3.3 and Mistral Large side-by-side for the exact token budget you plan to use. Type input and output tokens per request, pick any subset of models, and the table updates with per-request, daily and monthly cost — sorted from cheapest to priciest. Toggle on cached-input share or the asynchronous Batch tier to model how your bill changes with caching or non-urgent jobs. Everything runs in your browser; no API keys, no signup, nothing logged.

Pricing snapshot: May 16, 2026

Input tokens per request

Output tokens per request

Cached input share (%)

Requests per day

Use Batch API (≈50% discount where supported)

Provider:

Model	Input	Output	Per request	Per day	Per month	vs cheapest
Gemini 2.5 Flash Google · $0.3/M in · $2.5/M out	$0.000600	$0.001250	$0.001850	$1.85	$55.50	cheapest
GPT-5.4 Mini OpenAI · $0.75/M in · $4.5/M out	$0.001500	$0.002250	$0.003750	$3.75	$112.50	2.03×
Claude Haiku 4.5 Anthropic · $1/M in · $5/M out	$0.002000	$0.002500	$0.004500	$4.50	$135.00	2.43×
Claude Sonnet 4.6 Anthropic · $3/M in · $15/M out	$0.006000	$0.007500	$0.0135	$13.50	$405.00	7.30×
GPT-5.5 OpenAI · $5/M in · $30/M out	$0.0100	$0.0150	$0.0250	$25.00	$750.00	13.51×

Per-request tokens

2,000 in · 500 out

Daily volume

1,000 req/day · 2.50M tokens/day

Pricing tier

Standard

Pick the right model for the right workload

There is no universally cheapest LLM. For high-volume customer-support style replies (short input, short output, accuracy-tolerant), Gemini Flash-Lite or Claude Haiku 4.5 typically win. For RAG over moderate context (10–50k input, ~1k output), the long-context discounts on Gemini 2.5 Pro and Claude Sonnet 4.6 often beat the OpenAI flagship. For agentic tool use with structured outputs, GPT-5.5 and Claude Opus 4.7 trade leads depending on the benchmark. Self-hosted Llama 3.3 70B on Together or Anyscale is often 3–10× cheaper at the same workload but with infrastructure overhead. Run this calculator with your real traffic numbers before locking a contract.

Where the savings actually come from

Three levers dominate any LLM bill: (1) shortening output — tightening the system prompt to demand brevity drops cost more than swapping models; (2) caching static prefixes — a long system prompt re-used across 1M requests pays for caching after the second hit; (3) routing easy queries to a cheaper model — a small classifier deciding between Haiku-tier and Opus-tier captures a 10× spread without quality loss on the easy 80%. Model swapping alone is at best a 2–3× win on a homogenous workload; combined with these three, real production deployments see 5–20× cost reduction without quality drop.

Examples

Input

Compare Claude Haiku 4.5, Gemini Flash, GPT-5.4 Mini

Output

Gemini Flash ≈ $0.0019/req → $570/mo. Claude Haiku 4.5 ≈ $0.0045/req → $1,350/mo. GPT-5.4 Mini ≈ $0.0037/req → $1,110/mo.

Customer-support bot · 2,000 in / 500 out · 10,000 req/day

Input

Compare GPT-5.5, Claude Sonnet 4.6, Gemini 2.5 Pro

Output

Gemini 2.5 Pro ≈ $0.071/req → $213/mo. Claude Sonnet 4.6 ≈ $0.162/req → $486/mo. GPT-5.5 ≈ $0.274/req → $822/mo.

RAG over 50k-token context · 50k in / 800 out · 100 req/day

Input

Compare GPT-5.4, Mistral Medium, Claude Haiku 4.5

Output

At Batch (50% off), Mistral Medium ≈ $0.0018/req → $270/mo. Claude Haiku 4.5 Batch ≈ $0.0049/req → $735/mo.

Batch translation · 1,500 in / 1,500 out · 5,000 req/day · Batch API on

FAQ

How does the LLM cost calculator compute the per-request price?

Per-request cost equals input tokens × input rate ÷ 1,000,000 plus output tokens × output rate ÷ 1,000,000. When you set a cached-input share, that fraction of input tokens is billed at the cache-read rate instead of the standard rate (10% of input price on OpenAI and Anthropic, 25% on Google). The Batch toggle replaces both rates with each provider's batch-tier rate where one is published — typically a flat 50% discount.

Why is the same prompt cheaper on Gemini Flash than Claude Haiku?

Different providers price for different workload niches. Gemini 2.5 Flash is tuned for high-volume, low-latency requests at $0.30 input / $2.50 output per million tokens; Claude Haiku 4.5 charges $1.00 / $5.00 with a stronger instruction-following baseline. For a customer-support bot answering FAQs you may not notice the quality difference and save ~3×; for nuanced reasoning Claude often wins on quality per request even if the bill is higher.

What is prompt caching and when is it worth turning on?

Both OpenAI and Anthropic let you mark a static prefix — long system prompt, retrieved RAG context, a few-shot demo — as cache-able. Subsequent calls that share the prefix charge 10% of the normal input rate for those tokens. The break-even is the second call: any prompt re-used twice or more saves money. Set the "cached input share" slider to roughly the percentage of your input tokens that will be re-used by the next call to model this effect.

When should I use the Batch API instead of regular API calls?

When you do not need a synchronous response within seconds. Batch APIs on OpenAI, Anthropic and Google process your requests asynchronously over up to 24 hours and bill 50% less. Suitable for overnight enrichment, document classification jobs, summarising the day's support tickets, embedding a million product descriptions. Not suitable for chat UIs, live coding assistants or any user-facing latency-sensitive flow.

How accurate are these cost estimates compared to my real provider bill?

Within 1–2% on token counts and exact on per-million-token rates as of the snapshot date shown in the table footer. Discrepancies come from: token counting differences (the counter uses an approximation, your provider uses its real tokenizer), reasoning tokens for models like GPT-5.5 Pro or o-series which can dominate output cost, and image/audio tokens which the calculator does not include in MVP-1.

Why are output tokens so much more expensive than input tokens?

Output requires the model to actually generate, which involves running every layer for every token sequentially. Input is processed in parallel and is much cheaper per FLOP. Every major LLM provider prices output 3–6× higher than input — Anthropic uses a flat 5×, OpenAI uses 6× on GPT-5.5, Gemini uses 8× on 2.5 Pro. Designing prompts to keep responses short is the single largest lever on API spend.

Do these prices include image, audio or vision tokens?

No — MVP-1 covers text-only billing. Vision-capable models charge separately per image at fixed sizes (e.g. GPT-4o's 765 tokens per low-res image, Claude's per-tile pricing). Audio for Realtime API and TTS is billed by audio-tokens. Add those manually for now; multimodal cost is on the MVP-2 roadmap for this site.

Glossary

Input tokens

Input tokens are everything you send to the model in one request: system prompt, conversation history, user message, tool definitions and any retrieved context. Billed at the model's input rate. Counting them accurately is the first step in any cost estimate; under-estimating by 2× is a common cause of surprise bills.

Output tokens

Output tokens are what the model generates back: the assistant's reply, tool-call arguments and any internal "reasoning" or "thinking" tokens for reasoning-tier models. Output is priced 3–8× higher than input by every major provider, so the cheapest way to reduce LLM spend is usually to shorten responses, not the prompt.

Cached input

Cached input is the portion of your prompt the provider already processed in a recent prior request and can reuse. OpenAI bills cache reads at 50% of standard, Anthropic at 10%, Google at 25%. The cache hit window is short (5–60 minutes depending on provider), so caching only pays off for high-traffic shared prefixes — not for one-off requests.

Batch API

A batch API submits many requests at once and returns results asynchronously, typically within 24 hours. OpenAI, Anthropic and Google all offer batch tiers at roughly half the synchronous rate. The trade-off is no live response — useful for offline data processing, bulk evaluation runs, classification jobs and embedding ingest.

Tokens-per-dollar

Tokens-per-dollar is a comparison metric: how many input or output tokens does $1 buy on each model? Gemini Flash-Lite at $0.10/M input yields 10M tokens per dollar; GPT-5.5 Pro at $30/M yields 33,333 tokens — a 300× spread. Always compute tokens-per-dollar weighted by your input/output mix, not headline rates.

Related tools

LLM Token Counter

Count tokens for GPT, Claude, Gemini, Llama, and Mistral with live cost estimate

Context Window Fit Checker

Check whether system + history + prompt fit any model context window, with output headroom

Prompt Caching Savings

Estimate how much prompt caching saves on Claude, GPT and Gemini at a given cache hit rate

Embeddings Cost Calculator

Estimate cost to embed a corpus with OpenAI, Voyage, Gemini or Mistral embedding models

RAG Chunk Estimator

Estimate chunks, embedding cost and per-query LLM cost for a RAG pipeline end-to-end

JSON Beautify

Pretty-print and format minified JSON with proper 2-space indentation in your browser

XML to JSON

Convert XML documents to JSON with attribute mapping and namespace support

HTML Encode

Escape HTML special characters into safe entities with a free online HTML encoder