Context Window
The context window is an AI’s short-term memory. It is the total amount of text the model can “see” at any one time, including your entire conversation history, any documents you have shared, and the response it is currently writing. Once the conversation exceeds the window, the oldest content must be dropped or summarized, and the model can no longer reference it. A bigger context window means the AI can work with longer documents and remember more of your conversation.
The context window (also called context length) is the maximum number of tokens a language model can process in a single inference pass. It includes all input tokens (system prompt, conversation history, user message, retrieved documents) and all output tokens (the model’s response).
Technical constraints:
- The Transformer’s self-attention mechanism computes pairwise relationships between all tokens, resulting in O(n^2) memory and compute complexity relative to sequence length
- Extending context windows requires architectural innovations: RoPE (Rotary Position Embedding), ALiBi (Attention with Linear Biases), sliding window attention, or sparse attention patterns
- Larger context windows increase memory usage (KV cache grows linearly with sequence length) and inference cost
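The KV-cache point above can be made concrete with a back-of-the-envelope calculation. This is a sketch assuming a Llama-style fp16 decoder; the function name and parameter values are illustrative, not tied to any particular library:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_param=2):
    """Memory for the KV cache: K and V each store n_kv_heads * head_dim
    values per layer, per token, so the total grows linearly with seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_param * seq_len

# A 7B-class model (32 layers, 32 KV heads, head_dim 128) at a 128K context:
full = kv_cache_bytes(131072, n_layers=32, n_kv_heads=32, head_dim=128)
print(full / 2**30)  # 64.0 (GiB)
```

With grouped-query attention (say 8 KV heads instead of 32), the same context needs a quarter of the memory, which is one reason long-context models favor GQA.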
Representative context window sizes:
| Model | Context Window |
|---|---|
| Claude (Anthropic) | 200K tokens |
| GPT-4 Turbo (OpenAI) | 128K tokens |
| Gemini 1.5 Pro (Google) | 2M tokens |
| Llama 3.1 (Meta) | 128K tokens |
Practical implications:
- A 200K token window holds roughly 150,000 words (about 500 pages of text)
- System prompts, conversation history, and retrieved documents all consume context
- “Lost in the middle” phenomenon: models tend to pay more attention to tokens at the beginning and end of the context, with reduced recall for content in the middle
- Prompt caching: providers cache and reuse the KV cache for repeated prefixes, cutting cost and latency when the same long prefix (such as a system prompt plus reference documents) is sent on every request
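One common tactic implied by the points above is to keep only the most recent messages that fit a token budget. A minimal sketch, using a crude ~4-characters-per-token estimate (the heuristic and helper names are illustrative; production code should count tokens with the provider's tokenizer or token-counting API):

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English prose
    return len(text) // 4

def trim_history(messages, budget_tokens):
    """Drop the oldest messages until the history fits the budget.
    Keeps the most recent messages, which models recall best."""
    kept, total = [], 0
    for msg in reversed(messages):          # walk newest to oldest
        cost = estimate_tokens(msg["content"])
        if total + cost > budget_tokens:
            break                           # everything older is dropped
        kept.append(msg)
        total += cost
    return list(reversed(kept))             # restore chronological order
```

A fancier version would summarize the dropped messages into a single synthetic message rather than discarding them outright.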
Context window management
```python
import anthropic

client = anthropic.Anthropic()

# System prompt consumes context
system = "You are a network engineer. Answer questions about infrastructure."

# Long document injected as context
with open("network_design.md") as f:
    document = f.read()  # ~50K tokens

# Check if we're within limits
# Claude: 200K context window
# Budget: 200K - system_tokens - doc_tokens - response_reserve
# Rough: 200,000 - 50 - 50,000 - 4,096 = ~146K tokens remaining

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=system,
    messages=[
        {
            "role": "user",
            "content": f"Based on this document:\n\n{document}\n\nWhat VLANs are configured?",
        }
    ],
)
```

Context window size is one of the primary differentiators between AI models. A larger context window means you can paste an entire codebase, a full legal contract, or hours of meeting transcripts and ask questions about them. In coding tools like Claude Code, the context window holds the conversation history, file contents, tool results, and system instructions simultaneously. When context is exhausted, older messages are compressed or dropped. RAG (retrieval-augmented generation) is partly a solution to context window limits: instead of stuffing everything in, you retrieve only the relevant chunks. For production AI applications, context window management (what to include, what to summarize, what to drop) is a critical engineering problem.
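The retrieval idea can be sketched with a toy relevance score. Here a naive word-overlap ranking stands in for the embedding similarity search a real RAG system would use; all names are illustrative:

```python
def score(chunk, query):
    # Naive relevance: count chunk words that also appear in the query
    q = set(query.lower().split())
    return sum(1 for w in chunk.lower().split() if w in q)

def select_chunks(chunks, query, top_k=3):
    """Send only the top_k most relevant chunks to the model,
    instead of stuffing every chunk into the context window."""
    ranked = sorted(chunks, key=lambda c: score(c, query), reverse=True)
    return ranked[:top_k]
```

Real systems rank chunks by cosine similarity between embedding vectors, but the context-budget effect is the same: only the selected chunks consume window space.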