Context Windows
Length, Cost, and What Happens at the Limits
Context length is one of the most visible specs on a model's data sheet. It's also one of the most misunderstood — because the cost of using a long context doesn't scale the way most people expect, and the quality often doesn't either.
What the context window actually is
The context window is the total number of tokens a model can see at once during a single forward pass. This includes everything: the system prompt, conversation history, any documents you've retrieved, the current user message, and space for the model's response. All of it competes for the same fixed budget.
When people say a model has a "128k context," they mean 128,000 tokens can be present simultaneously. In rough terms, that's somewhere around 90,000–100,000 words — but token count and word count don't map cleanly, and the exact ratio depends on the language and content type.
// How context is used — example allocation
128k token budget — one request
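The shared-budget idea can be made concrete with a little arithmetic. The allocation numbers below are purely illustrative, not measured from any real system:

```python
# Hypothetical allocation of a 128k-token budget for one request.
# Every component draws from the same fixed pool.
CONTEXT_WINDOW = 128_000

allocation = {
    "system_prompt": 2_000,
    "conversation_history": 30_000,
    "retrieved_documents": 80_000,
    "current_user_message": 1_000,
}

used = sum(allocation.values())
reserved_for_response = CONTEXT_WINDOW - used

print(f"used: {used}, reserved for response: {reserved_for_response}")
assert used <= CONTEXT_WINDOW
```

Growing any one component (say, retrieved documents) directly shrinks the space left for the model's answer.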
How attention cost scales with context
Attention is the mechanism that lets each token attend to the other tokens in the context (every earlier token, in a causal decoder). The cost of computing attention scales with the square of the sequence length. Double the context, quadruple the attention computation. This is the O(n²) complexity that comes up repeatedly in discussions of long-context inference.
In practice, attention is only part of the total compute cost — the feed-forward layers scale linearly with context length, not quadratically. But attention dominates at long sequence lengths, which is why going from a short context to a very long one increases cost disproportionately.
// The quadratic problem
Doubling context from 8k to 16k tokens doesn't double the prefill cost — it roughly quadruples the attention computation. This is why FlashAttention and similar optimisations exist: they reduce the memory footprint of attention without changing its mathematical output, making long-context inference feasible on hardware that would otherwise run out of memory.
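The shifting balance between quadratic attention and linear feed-forward cost can be sketched with standard back-of-envelope FLOP formulas. The hidden size (4096) and FFN multiplier (4) below are illustrative assumptions, not the dimensions of any particular model:

```python
# Rough per-layer prefill FLOP counts for a transformer, showing why
# attention's share of compute grows with sequence length n.
def per_layer_flops(n, d=4096, ffn_mult=4):
    attn_scores = 2 * n * n * d                 # QK^T plus attention-weighted V (quadratic in n)
    qkv_proj = 2 * n * 3 * d * d                # Q, K, V projections (linear in n)
    ffn = 2 * n * 2 * ffn_mult * d * d          # up- and down-projections (linear in n)
    return attn_scores, qkv_proj + ffn

for n in (8_000, 16_000, 128_000):
    attn, linear = per_layer_flops(n)
    share = attn / (attn + linear)
    print(f"n={n:>7}: attention share of per-layer FLOPs ~ {share:.0%}")

# Doubling n quadruples the quadratic term:
assert per_layer_flops(16_000)[0] == 4 * per_layer_flops(8_000)[0]
```

Under these assumptions, attention is a minority of per-layer compute at 8k tokens but dominates by 128k — the "disproportionate cost" the text describes.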
Memory cost of long context
The KV cache stores the key and value tensors for every token in the context, for every layer of the model. It grows linearly with context length. A short context uses little KV cache memory; a long context can consume a significant portion of available VRAM — and in extreme cases, more VRAM than the model weights themselves.
This is why context length and batch size trade off directly. If you're serving a model at a very long context, the KV cache for a single request takes up so much memory that you can't fit many concurrent requests. Throughput drops even if the hardware is otherwise capable.
// relative kv cache memory — same model, different context lengths
KV cache scales linearly with context length. At 128k tokens it can exceed model weight memory for many architectures.
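The linear growth is easy to quantify. A minimal sketch, assuming a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128, fp16) — treat these as illustrative assumptions, since real models vary and grouped-query attention shrinks the KV head count considerably:

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.
def kv_cache_bytes(seq_len, layers=32, kv_heads=32, head_dim=128, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

for n in (4_096, 32_000, 128_000):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:.1f} GiB of KV cache (fp16)")
```

With these dimensions, a 128k-token context needs tens of GiB of KV cache — comfortably more than the ~13 GiB of fp16 weights for a 7B-parameter model, which is the "more VRAM than the weights" case described above.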
Prefill cost at long context
The prefill phase processes the entire context in parallel before generating the first token. At short contexts, prefill is fast. At very long contexts — say, a document-heavy RAG request with 50k+ tokens of retrieved content — prefill can take seconds. This directly increases TTFT (time to first token), which users experience as the model "thinking" before it starts responding.
This is one reason why chunked prefill matters in production: by breaking long prefill into smaller chunks, the serving system can interleave prefill work with ongoing decode for other requests, rather than blocking everything while one long request completes its prefill.
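The interleaving can be illustrated with a toy scheduling loop. The chunk size and the one-decode-step-per-active-request policy are simplifying assumptions, not how any particular serving system actually schedules:

```python
# Toy chunked-prefill schedule: instead of processing a 50k-token prompt
# in one blocking pass, split it into chunks and let already-running
# requests each emit a token between chunks.
def schedule(prefill_tokens, chunk_size=4_096, active_decodes=3):
    remaining = prefill_tokens
    timeline = []
    while remaining > 0:
        step = min(chunk_size, remaining)
        timeline.append(("prefill_chunk", step))
        remaining -= step
        # Other requests make progress instead of stalling until the
        # long prefill finishes.
        timeline.extend(("decode", 1) for _ in range(active_decodes))
    return timeline

events = schedule(50_000)
chunks = [e for e in events if e[0] == "prefill_chunk"]
print(f"{len(chunks)} prefill chunks, decode interleaved after each")
```

Without chunking, all of the decode steps shown here would be delayed until the entire 50k-token prefill completed.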
Quality at the limits
Models are trained and evaluated on specific context distributions, and most training data consists of relatively short sequences. Models are often fine-tuned to extend their context length, but their ability to make effective use of information spread across a very long context is not uniform — it degrades in specific ways that are worth knowing.
The lost-in-the-middle problem
Research consistently shows that models are better at using information near the beginning and end of their context than information buried in the middle. If you have a long context with critical information in the middle, the model may effectively ignore it even though it's technically within the context window. This isn't a bug — it's a reflection of how attention patterns form during training and fine-tuning.
Effective context vs. nominal context
A model with a 128k token context window doesn't uniformly utilise all 128k tokens. The effective context — the range within which information is reliably attended to — is typically shorter than the nominal maximum. How much shorter depends on the model, the task, and how the context is structured. Testing on your specific workload is the only reliable way to know.
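One way to run that test is a needle-at-depth probe: plant a known fact at different relative positions in filler text and check whether the model can recall it. A minimal sketch — `query_model` is a placeholder for whatever inference API you use, and the filler/needle strings are invented for illustration:

```python
# Needle-at-depth probe for effective context length.
FILLER = "The sky was grey and nothing of note happened. "
NEEDLE = "The access code for the vault is 7A3F."
QUESTION = "What is the access code for the vault?"

def build_prompt(total_sentences=2_000, depth=0.5):
    # depth=0.0 puts the needle at the start; depth=1.0 at the end.
    before = int(total_sentences * depth)
    body = FILLER * before + NEEDLE + " " + FILLER * (total_sentences - before)
    return body + "\n\n" + QUESTION

def probe(query_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    # Degraded recall at the middle depths is the lost-in-the-middle
    # signature; the depth at which recall collapses approximates the
    # model's effective context on this task.
    return {d: "7A3F" in query_model(build_prompt(depth=d)) for d in depths}
```

Sweeping both depth and total length gives a per-model, per-task map of effective context rather than relying on the nominal spec.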
// Don't assume the context window is the ceiling
Just because a model supports 128k tokens doesn't mean you should use 128k tokens. Longer context means higher cost, higher latency, more VRAM pressure, and potentially lower quality on information in the middle. The right context length is the shortest one that contains what the model actually needs.
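In practice, "the shortest context that contains what the model needs" means assembling context against a budget rather than stuffing everything in. A minimal sketch, assuming chunks arrive pre-sorted by relevance; `count_tokens` is a crude stand-in for a real tokenizer:

```python
# Budget-aware context assembly: include chunks in relevance order
# until the token budget is exhausted.
def count_tokens(text):
    return len(text.split())  # placeholder; use your model's tokenizer

def assemble_context(chunks_by_relevance, budget=8_000):
    selected, used = [], 0
    for chunk in chunks_by_relevance:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected
```

The budget here is a deliberate choice, tuned to the task, not the model's nominal maximum.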
Long context vs. RAG
One of the practical decisions teams face is whether to give the model a very long context (stuffing in all potentially relevant documents) or to use retrieval to select only the most relevant content and keep the context shorter. This is a cost, latency, and quality tradeoff.
| Approach | Cost | Latency | Quality risk |
|---|---|---|---|
| Long context (stuff everything in) | High — scales with total content | High TTFT from long prefill | Lost-in-the-middle; model may miss key information |
| RAG (retrieve then generate) | Lower — only relevant chunks in context | Adds retrieval latency, lower prefill cost | Retrieval can miss relevant content; chunking quality matters |
| Hybrid (retrieve + rerank + longer context) | Moderate | Moderate | Better coverage than RAG alone; less bloat than stuffing |
There's no universal winner. For tasks where the model needs to reason across the full document (contracts, codebases, long transcripts), long context can outperform RAG. For tasks where specific facts need to be retrieved from a large corpus, RAG is typically more practical and cost-effective.
// In short