Context Window Fit Checker — Will Your Prompt Fit?

🔒 Runs in your browser — nothing is sent to a server

This context window fit checker splits an LLM request into its four parts (system message, conversation history, current user prompt, and reserved output) and tells you whether the combined token count fits inside the model's context window. Token counts are approximate (about ±10%). Pick GPT-5.5, Claude Opus 4.7, Gemini 2.5 Pro, Llama 3.3 or any other supported model, paste each section into its own slot, set the number of output tokens to reserve and a safety buffer, and a stacked bar shows where every token goes. Use it to debug "context length exceeded" 400 errors, plan RAG chunk sizes, and decide whether to truncate or summarise chat history.

How to use this checker

Start by picking the target model from the dropdown. Every model has its own context window and max-output cap, both shown in the Status panel. Then paste your request into the three slots: the System message (your instructions, persona, tool definitions), Conversation history (every prior user and assistant turn that needs to be re-sent), and Current user prompt (the new question). All three textareas share a synced height: drag the corner-grip on any of them to make all three taller or shorter at once. Token counts update live under each slot as you type.

In the right column, set Reserved output to how many tokens you expect the model to write back. Be generous: for reasoning-tier models like GPT-5.5 Pro and o-series, multiply your visible-answer estimate by 3–4× to cover invisible thinking tokens. Set Safety buffer to the percentage of the window you want to leave free as a margin (15% is a good default).

The stacked bar visualises where every token goes (System, History, Prompt, formatting overhead, reserved output, and buffer) against the model's context window. A green Status banner means the request fits; amber means tight; red means it overflows. Switching the model dropdown re-counts everything under that model's tokenizer family, so you can compare instantly without retyping anything.

How to plan a context budget

Start with the model's context window and call it C. Subtract a 15% safety buffer to get C × 0.85 of usable space. Out of that, reserve M tokens for output (M ≤ the model's max-output cap; typically 4K–16K for chat, 32K+ for code generation). The remaining C × 0.85 − M is your input budget, which you must split across system prompt, retrieved context and conversation history. Static parts (system prompt plus few-shot examples) should fit in roughly 10% of the budget; the rest is for dynamic context. If retrieval dominates, prefer a larger-window model, or chunk the documents and pick the top K chunks; K × chunk_size + query + system usually stays well under any 1M-token window.
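The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the names `plan_budget`, `Budget` and `max_chunks`, the example window sizes, and the way the 10% static share is applied are assumptions made for the sketch, not the checker's actual code.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    usable: int          # window left after the safety buffer (C x 0.85 by default)
    input_budget: int    # usable space minus reserved output (C x 0.85 - M)
    static_budget: int   # suggested share for system prompt + few-shot examples
    dynamic_budget: int  # what remains for retrieved context and history


def plan_budget(context_window: int, reserved_output: int, max_output_cap: int,
                buffer_fraction: float = 0.15,
                static_fraction: float = 0.10) -> Budget:
    """Split a context window into buffer, output and input budgets."""
    if reserved_output > max_output_cap:
        raise ValueError("reserved output exceeds the model's max-output cap")
    buffer_tokens = round(context_window * buffer_fraction)
    usable = context_window - buffer_tokens
    input_budget = usable - reserved_output
    if input_budget <= 0:
        raise ValueError("nothing left for input after buffer and output")
    static_budget = round(input_budget * static_fraction)
    return Budget(usable, input_budget, static_budget, input_budget - static_budget)


def max_chunks(budget: Budget, chunk_size: int, query_tokens: int) -> int:
    """Largest K such that K x chunk_size + query fits in the dynamic budget."""
    return max(0, (budget.dynamic_budget - query_tokens) // chunk_size)


# Hypothetical 200K-window model with a 64K output cap, reserving 8K of output.
b = plan_budget(200_000, 8_000, 64_000)
print(b)
print(max_chunks(b, chunk_size=512, query_tokens=80))
```

With these example numbers the usable space is 170,000 tokens, the input budget is 162,000, and 284 chunks of 512 tokens fit alongside an 80-token query. Treating the system prompt as part of the static share is a simplification; adjust the split to match your own prompt layout.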

Why context windows are sometimes misleading

A model advertising a 1M-token window does not always deliver 1M-token quality. Long-context benchmarks often show measurable quality degradation past roughly 50–100K tokens of context: needle-in-a-haystack accuracy drops, the model starts ignoring early instructions, and reasoning quality degrades. Use the full window for storage-style tasks (find this fact, summarise this document) and a smaller fraction for instruction-following tasks (system prompt at the top, recent history closest to the query). When fitting matters but quality matters more, prefer a 200K-window model running at 50% utilisation over a 1M-window model running at 95%.

Examples

Default chat: short system + a few turns + question
Input: 600-token system, 1,200-token history, 80-token prompt
Output: Fits easily in any 128K+ model. A 4,000-token output reservation still leaves about 122K of headroom on GPT-4o.

RAG-heavy assistant: 50K of retrieved context
Input: 1,000-token system, 50,000-token history (retrieved chunks), 80-token prompt
Output: The 51,080 input tokens fill about 26% of the Claude Haiku 4.5 200K window and about 5% of the Claude Sonnet 4.6 1M window; they overflow any 32K-only model.

Coding agent: long history of code edits
Input: 2,000-token system, 90,000-token history, 1,500-token prompt
Output: The 93,500 input tokens fit Claude Sonnet 4.6 with ~900K free. They are tight on GPT-4o's 128K window: about 34K is left for output, and only about 15K once a 15% safety buffer is set aside, so a larger output reservation fails on 128K-only models.

FAQ

What counts toward the context window in an LLM request?

Every token you send is billed and consumes window: the system prompt, all prior conversation turns, retrieved RAG context, tool definitions, the new user message, plus 3–5 control tokens of formatting overhead per message. The model's eventual output also lives inside the same window — your input plus the response cannot exceed the model's context limit, or the API returns HTTP 400 "context length exceeded".

Why does the checker add a per-message overhead?

Chat completion APIs wrap every message in role markers like <|im_start|>system, <|im_end|>, role:"user", and similar control tokens behind the scenes. Each user/assistant/system message costs roughly 3–5 invisible tokens of formatting on top of its content. With 20 conversation turns that's 60–100 hidden tokens, which matters once you approach the limit. The fit checker estimates 4 tokens per non-empty section.
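As a rough illustration of that estimate, the sketch below charges a flat per-message cost to every non-empty message. The function name `formatting_overhead` and the choice of 4 tokens (the midpoint of the 3–5 range) are assumptions for the example; real overhead depends on the provider's chat template.

```python
TOKENS_PER_MESSAGE = 4  # assumed midpoint of the 3-5 token range


def formatting_overhead(messages: list[dict[str, str]],
                        per_message: int = TOKENS_PER_MESSAGE) -> int:
    """Estimate hidden role-marker tokens, counting only non-empty messages."""
    non_empty = sum(1 for m in messages if m["content"].strip())
    return per_message * non_empty


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a context window?"},
    {"role": "assistant", "content": "The maximum number of tokens per request."},
    {"role": "user", "content": ""},  # empty message: not counted
]
print(formatting_overhead(history))  # 12
```

Passing `per_message=3` or `per_message=5` reproduces the low and high ends of the range, for example 60 to 100 tokens for 20 non-empty messages.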

How big a safety buffer should I leave?

Most production systems reserve 10–20% of the context window unused, so the model has room to generate a complete answer plus invisible reasoning tokens for o-series and GPT-5.5 Pro. The default 15% works for typical chat workloads. Drop to 5% only if your output is strictly capped (e.g. JSON with five fields) and you have measured the worst-case response.

What is the difference between context window and max output?

Context window is the model's total budget; max output is a separate per-response cap, usually much smaller. Claude Sonnet 4.6 has a 1,000,000-token context window but caps output at 64,000 tokens. GPT-5.5 has a 1M window but caps output at 16K. If you want the full output cap to stay available, your input can be at most the context window minus the max output. The fit checker shows the effective max output as the lower of (remaining window) and (model max output cap).
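The "lower of the two" rule can be written down directly. This is a minimal sketch with hypothetical function names (`effective_max_output`, `fits`) and an optional `buffer_tokens` argument added for illustration; it is not the checker's own implementation.

```python
def effective_max_output(context_window: int, input_tokens: int,
                         model_output_cap: int, buffer_tokens: int = 0) -> int:
    """Lower of (window remaining after input and buffer) and (model output cap)."""
    remaining = max(0, context_window - input_tokens - buffer_tokens)
    return min(remaining, model_output_cap)


def fits(context_window: int, input_tokens: int, reserved_output: int,
         model_output_cap: int, buffer_tokens: int = 0) -> bool:
    """True when the reserved output is achievable for this request."""
    return reserved_output <= effective_max_output(
        context_window, input_tokens, model_output_cap, buffer_tokens)


# 1M window with a 64K cap: the cap is the binding limit.
print(effective_max_output(1_000_000, 93_500, 64_000))  # 64000
# 128K window with the same input: the remaining window is the binding limit.
print(effective_max_output(128_000, 93_500, 64_000))    # 34500
# Input larger than the window: nothing is left for output.
print(fits(32_000, 51_080, 4_000, 16_000))              # False
```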

How do reasoning tokens fit in?

Reasoning models (GPT-5.5 Pro, OpenAI o-series, DeepSeek R1) generate internal "thinking" tokens before producing the visible answer. Thinking tokens count toward your output budget and toward billing exactly like normal output. For hard tasks the thinking pass can consume 2,000–10,000 tokens before the user-facing reply starts. Reserve roughly 3–4× the visible-output estimate when calling reasoning models, leaning toward 4× for hard tasks.
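One way to apply that rule of thumb is to scale the visible-answer estimate and then clamp it to the model's output cap. The helper name `reserve_output` and its default multiplier of 4 are illustrative assumptions; measure real thinking-token usage for your workload before relying on any fixed factor.

```python
def reserve_output(visible_estimate: int, model_output_cap: int,
                   reasoning: bool = False, multiplier: int = 4) -> int:
    """Scale the visible-answer estimate for reasoning models, then clamp to the cap."""
    wanted = visible_estimate * multiplier if reasoning else visible_estimate
    return min(wanted, model_output_cap)


print(reserve_output(1_500, 64_000))                   # 1500
print(reserve_output(1_500, 64_000, reasoning=True))   # 6000
print(reserve_output(20_000, 64_000, reasoning=True))  # 64000 (clamped to the cap)
```

When the scaled value hits the cap, as in the last call, the visible answer may be cut short, so a shorter requested answer or a model with a higher cap is the safer choice.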

What strategies help when my prompt exceeds the window?

Four options, roughly in order of effort: (1) switch to a model with a larger window (Gemini 2.5 Pro and Claude Sonnet 4.6 both expose 1M); (2) truncate the oldest conversation turns when usage passes ~85% — the classic sliding-window strategy; (3) summarise older history with a cheap model and replace it with the summary; (4) move static context (knowledge base, large system prompt) to a retrieval system and only inject relevant chunks per query.

Why does the same context look smaller on Claude than on GPT?

Different tokenizers split text differently. Claude's vocabulary is slightly denser than OpenAI's o200k for English prose — the same paragraph counts 5–7% fewer tokens. Code, JSON and non-Latin text amplify the gap. The fit checker re-counts your text under each model's tokenizer family, so switching the model dropdown gives you a live comparison.

Glossary

Context window

Context window is the maximum number of tokens — input plus output combined — a model can process in one request. Modern flagship models expose 1M-token windows (GPT-5.5, Claude Opus 4.7, Gemini 2.5 Pro); compact production models stick to 128K or 200K. Exceeding the limit returns HTTP 400 immediately; the request never reaches the model.

Max output tokens

Max output tokens is the per-response cap a provider applies on top of the context window. Even with 1M tokens of room, Claude Sonnet 4.6 will not generate more than 64K in one call. The cap exists for latency reasons: generation is sequential, so very long outputs would tie up infrastructure. Set `max_tokens` below the cap if you want shorter responses.

Formatting overhead

Chat completion APIs inject special control tokens around each message to mark its role (system / user / assistant) and boundaries. The exact tokens vary by provider but typically add 3–5 tokens per message. In long conversations with dozens of turns this overhead reaches 100+ tokens. The fit checker estimates ~4 tokens per non-empty section to stay conservative.

Safety buffer

A safety buffer is the unused fraction of the context window you intentionally leave free as a margin for token-count error, reasoning tokens, and conversation growth. Production deployments commonly reserve 10–20% of the window. Too small a buffer causes intermittent 400 errors when traffic edges past the limit; too large wastes capacity and costs more than necessary.

Sliding window truncation

Sliding window truncation is a chat-history management pattern that drops the oldest user/assistant message pairs whenever the running token count crosses a threshold (typically 85% of the context window). The system prompt is preserved, the most recent turns stay, and the conversation appears continuous to the user. It is widely used in production chatbots to keep request size bounded.
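A minimal sketch of the pattern follows. It assumes history is a list of alternating user/assistant messages, uses a crude four-characters-per-token counter as a stand-in for a real tokenizer, and charges 4 formatting tokens per message; the names `truncate_history` and `rough_token_count` are hypothetical.

```python
from typing import Callable

Message = dict[str, str]


def rough_token_count(text: str) -> int:
    """Crude stand-in tokenizer: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def truncate_history(system: Message, history: list[Message], limit: int,
                     count: Callable[[str], int] = rough_token_count,
                     threshold_percent: int = 85,
                     per_message: int = 4) -> list[Message]:
    """Drop the oldest user/assistant pairs until the request is under the threshold."""
    budget = limit * threshold_percent // 100

    def total(msgs: list[Message]) -> int:
        return sum(count(m["content"]) + per_message for m in msgs)

    kept = list(history)
    # The system prompt is always kept; history shrinks one pair at a time.
    while kept and total([system] + kept) > budget:
        kept = kept[2:]
    return [system] + kept


system = {"role": "system", "content": "s" * 40}
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": str(i) + "x" * 79}
    for i in range(6)
]
trimmed = truncate_history(system, history, limit=120)
print(len(trimmed))  # 3: the system prompt plus the most recent pair
```

In practice you would pass the model's real tokenizer as `count`, and you may prefer to summarise the dropped turns rather than discard them.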
