Context Window Fit Checker — Will Your Prompt Fit?
🔒 Runs in your browser: nothing is sent to a server
This context window fit checker splits an LLM request into its four real parts (system message, conversation history, current user prompt, and reserved output) and tells you whether the combined token count fits inside the model's context window. Pick GPT-5.5, Claude Opus 4.7, Gemini 2.5 Pro, Llama 3.3 or any other supported model, paste each section into its own slot, set how many output tokens you need to reserve and a safety buffer, and a stacked bar shows where every token goes. Use it to debug "context length exceeded" 400 errors, plan RAG chunk sizes, and decide whether to truncate or summarise chat history.
How to use this checker
Start by picking the target model from the dropdown. Every model has its own context window and max-output cap, both shown in the Status panel below. Then paste your request into the three slots: the System message (your instructions, persona, tool definitions), Conversation history (every prior user and assistant turn that needs to be re-sent), and Current user prompt (the new question). All three textareas share a synced height: drag the corner-grip on any of them to make all three taller or shorter at once. Token counts update live under each slot as you type.
In the right column, set Reserved output to how many tokens you expect the model to write back. Be generous: for reasoning-tier models like GPT-5.5 Pro and o-series, multiply your visible-answer estimate by 3–4× to cover invisible thinking tokens. Set Safety buffer to the percentage of the window you want to leave free as a margin (15% is a good default).
The stacked bar visualises where every token goes (System, History, Prompt, formatting overhead, reserved output, and buffer) against the model's context window. A green Status banner means the request fits; amber means it is tight; red means it overflows. Switching the model dropdown re-counts everything under that model's tokenizer family, so you can compare models instantly without retyping anything.
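The fit check described above comes down to adding up the segments of the bar and comparing the total with the window. The sketch below is a minimal illustration, not the checker's actual code: the names `check_fit` and `reserve_for_reasoning`, the `overhead` parameter, and the 90% threshold for the amber state are assumptions made for this example, and token counts are passed in as plain numbers instead of coming from a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class FitResult:
    used: int    # input + overhead + reserved output + buffer, in tokens
    free: int    # tokens left in the window (negative means overflow)
    status: str  # "green", "amber", or "red"


def check_fit(window, system, history, prompt, reserved_output,
              buffer_pct=15.0, overhead=0, amber_at=0.90):
    """Classify a request as green (fits), amber (tight), or red (overflows).

    amber_at is an assumed threshold: usage at or above this fraction of the
    window, but still inside it, is reported as tight.
    """
    buffer_tokens = round(window * buffer_pct / 100)
    used = system + history + prompt + overhead + reserved_output + buffer_tokens
    free = window - used
    if free < 0:
        status = "red"
    elif used >= window * amber_at:
        status = "amber"
    else:
        status = "green"
    return FitResult(used=used, free=free, status=status)


def reserve_for_reasoning(visible_answer_tokens, multiplier=3):
    """Scale a visible-answer estimate to cover hidden thinking tokens."""
    return visible_answer_tokens * multiplier


small = check_fit(window=128_000, system=600, history=1_200, prompt=80,
                  reserved_output=4_000)
print(small.status, small.free)
```

Keeping the buffer as one more segment of `used` mirrors the stacked bar: a request can overflow because of the margin even when the raw input and output would fit.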
How to plan a context budget
Start with the model's context window, call it C. Subtract a 15% safety buffer to get C × 0.85 of usable space. Out of that, reserve M tokens for output (M must not exceed the model's max-output cap; typically 4K–16K for chat, 32K+ for code generation). The remaining C × 0.85 − M is your input budget, which you must split across system prompt, retrieved context and conversation history. Static parts (the system prompt plus a few few-shot examples) should fit in roughly 10% of the budget; the rest is for dynamic context. If retrieval dominates, prefer a larger-window model, or chunk the documents and pick the top K chunks. The total, K × chunk_size + query + system, usually stays well under a 1M-token window, but check it against the input budget of the model you actually use.
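The budget arithmetic above can be written out in a few lines of Python. This is a sketch under stated assumptions: `plan_budget` and `fits_top_k` are hypothetical helper names, the 200,000-token window, 8,000-token reservation and 16,000-token cap in the usage lines are example inputs, and the 10% static share is the rough guideline from the paragraph above, not a hard rule.

```python
def plan_budget(window, reserved_output, max_output_cap,
                buffer_pct=15, static_pct=10):
    """Split a context window C into buffer, output (M), and input budget."""
    if reserved_output > max_output_cap:
        raise ValueError("reserved output M exceeds the model's max-output cap")
    # Integer arithmetic avoids floating-point rounding surprises.
    usable = window * (100 - buffer_pct) // 100        # C x 0.85
    input_budget = usable - reserved_output            # C x 0.85 - M
    if input_budget <= 0:
        raise ValueError("nothing left for input after buffer and output")
    static_budget = input_budget * static_pct // 100   # system + few-shot
    return {
        "usable": usable,
        "input_budget": input_budget,
        "static_budget": static_budget,
        "dynamic_budget": input_budget - static_budget,
    }


def fits_top_k(k, chunk_size, query, system, input_budget):
    """Check that K retrieved chunks plus query and system fit the budget."""
    return k * chunk_size + query + system <= input_budget


budget = plan_budget(window=200_000, reserved_output=8_000,
                     max_output_cap=16_000)
print(budget)
print(fits_top_k(k=20, chunk_size=1_000, query=80, system=1_000,
                 input_budget=budget["input_budget"]))
```

With these example inputs the input budget is 170,000 − 8,000 = 162,000 tokens, of which about 16,200 would go to static parts.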
Why context windows are sometimes misleading
A model advertising a 1M-token window does not always deliver 1M-token quality. Long-context benchmarks often show measurable quality degradation past roughly 50–100K tokens of context: needle-in-a-haystack accuracy drops, the model starts ignoring early instructions, and reasoning quality degrades. Use the full window for storage-style tasks (find this fact, summarise this document) and a smaller fraction for instruction-following tasks (system prompt at the top, recent history closest to the query). When fitting matters but quality matters more, prefer a 200K-window model running at 50% utilisation over a 1M-window model running at 95%.
Examples
600-token system, 1,200-token history, 80-token prompt
Fits easily in any 128K+ model. A 4,000-token output reservation still leaves ~120K of headroom on GPT-4o.
1,000-token system, 50,000-token history (retrieved chunks), 80-token prompt
Input alone fills about 26% of the Claude Haiku 4.5 200K window and about 5% of the Claude Sonnet 4.6 1M window; overflows any 32K-only model.
2,000-token system, 90,000-token history, 1,500-token prompt
Fits Claude Sonnet 4.6 with ~900K free; tight on GPT-4o (roughly 35K left for output before any safety buffer); on 128K-only models it overflows once a 15% safety buffer is combined with an output reservation above about 15K.
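To reproduce the arithmetic behind examples like these, a short script can total the input and compare it with a few window sizes. The window sizes used here (32,000, 128,000, 200,000 and 1,000,000 tokens) are round-number assumptions for illustration, `headroom` is a hypothetical helper, and the figures cover raw input only, before formatting overhead, output reservation or safety buffer.

```python
# Assumed round-number window sizes, not tied to any specific model.
WINDOWS = {"32K": 32_000, "128K": 128_000, "200K": 200_000, "1M": 1_000_000}

# Token counts taken from the three example requests.
EXAMPLES = [
    {"system": 600, "history": 1_200, "prompt": 80},
    {"system": 1_000, "history": 50_000, "prompt": 80},
    {"system": 2_000, "history": 90_000, "prompt": 1_500},
]


def headroom(example, window):
    """Return (input tokens, tokens left in window, percent of window used)."""
    total = example["system"] + example["history"] + example["prompt"]
    return total, window - total, 100 * total / window


for example in EXAMPLES:
    for name, window in WINDOWS.items():
        total, left, pct = headroom(example, window)
        verdict = "overflows" if left < 0 else "fits"
        print(f"{total:>7,} tokens in {name:>4}: {pct:6.1f}% used, "
              f"{left:>9,} left ({verdict})")
```

Whatever is left after the input still has to cover the reserved output and the safety buffer, so a positive number here is necessary but not sufficient for a green status.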