
Context Window

Plain English

The context window is an AI’s short-term memory. It is the total amount of text the model can “see” at any one time, including your entire conversation history, any documents you have shared, and the response it is currently writing. Once the context window fills up, older content must be dropped or summarized, and the model can no longer reference it. A bigger context window means the AI can work with longer documents and remember more of your conversation.

Technical Definition

The context window (also called context length) is the maximum number of tokens a language model can process in a single inference pass. It includes all input tokens (system prompt, conversation history, user message, retrieved documents) and all output tokens (the model’s response).

Technical constraints:

  • The Transformer’s self-attention mechanism computes pairwise relationships between all tokens, resulting in O(n^2) memory and compute complexity relative to sequence length
  • Extending context windows requires architectural innovations: RoPE (Rotary Position Embedding), ALiBi (Attention with Linear Biases), sliding window attention, or sparse attention patterns
  • Larger context windows increase memory usage (KV cache grows linearly with sequence length) and inference cost
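The linear KV-cache growth noted above is easy to make concrete with a back-of-the-envelope calculation. A minimal sketch; the layer, head, and dimension numbers below are illustrative assumptions, not any particular model's published configuration:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes needed to cache keys and values for one sequence.

    Grows linearly with seq_len: 2 tensors (K and V) per layer,
    each of shape (seq_len, n_kv_heads, head_dim).
    """
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes)
per_token = kv_cache_bytes(1, 32, 8, 128)        # 131072 bytes = 128 KiB/token
full_window = kv_cache_bytes(200_000, 32, 8, 128)  # ~26 GB for a full 200K window
```

At 128 KiB per token, filling a 200K window costs on the order of tens of gigabytes of accelerator memory for the cache alone, which is why long-context serving leans on techniques like grouped-query attention (fewer KV heads) and quantized caches.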

Context window sizes (2026):

Model                 Context Window
Claude (Anthropic)    200K tokens
GPT-4                 128K tokens
Gemini 1.5 Pro        2M tokens
Llama 3               128K tokens

Practical implications:

  • A 200K token window holds roughly 150,000 words (about 500 pages of text)
  • System prompts, conversation history, and retrieved documents all consume context
  • “Lost in the middle” phenomenon: models tend to pay more attention to tokens at the beginning and end of the context, with reduced recall for content in the middle
  • Prompt caching: providers cache and reuse the KV cache for repeated prefixes, reducing cost and latency for repeated context
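Because system prompt, history, documents, and the response reserve all share one window, a quick budget check before sending a request helps avoid overflows. A minimal sketch, assuming the widely used rule of thumb of roughly 4 characters per token for English text (exact counts require the provider's tokenizer):

```python
# Rough heuristic only: ~4 characters per token for English prose.
# Real applications should use the provider's tokenizer or token-counting API.

CONTEXT_WINDOW = 200_000   # e.g. a 200K-token window
RESPONSE_RESERVE = 4_096   # tokens held back for the model's reply

def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count."""
    return len(text) // 4

def fits_in_context(system: str, history: str, document: str) -> bool:
    """True if all inputs plus the response reserve fit in the window."""
    used = sum(estimate_tokens(t) for t in (system, history, document))
    return used + RESPONSE_RESERVE <= CONTEXT_WINDOW
```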

Context window management

import anthropic

client = anthropic.Anthropic()

# System prompt consumes context
system = "You are a network engineer. Answer questions about infrastructure."

# Long document injected as context
with open("network_design.md") as f:
    document = f.read()  # ~50K tokens

# Check if we're within limits
# Claude: 200K context window
# Budget: 200K - system_tokens - doc_tokens - response_reserve
# Rough: 200K - 50 - 50000 - 4096 = ~145K tokens remaining

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=system,
    messages=[
        {"role": "user", "content": f"Based on this document:\n\n{document}\n\nWhat VLANs are configured?"}
    ],
)
In the Wild

Context window size is one of the primary differentiators between AI models. A larger context window means you can paste an entire codebase, a full legal contract, or hours of meeting transcripts and ask questions about them. In coding tools like Claude Code, the context window holds the conversation history, file contents, tool results, and system instructions simultaneously. When context is exhausted, older messages are compressed or dropped. RAG (retrieval-augmented generation) is partly a solution to context window limits: instead of stuffing everything in, you retrieve only the relevant chunks. For production AI applications, context window management (what to include, what to summarize, what to drop) is a critical engineering problem.
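The drop-oldest-messages strategy mentioned above can be sketched as a simple sliding window over the conversation. Token counts are assumed precomputed here; a real system would get them from the provider's tokenizer or token-counting endpoint:

```python
# Minimal sliding-window truncation: keep the newest messages that fit a
# token budget, dropping the oldest first. Token counts are assumed known.

def truncate_history(messages, budget):
    """Return the longest suffix of `messages` whose token counts fit `budget`.

    messages: list of dicts like {"role": ..., "content": ..., "tokens": int},
    oldest first. The newest messages are preserved.
    """
    kept = []
    total = 0
    for msg in reversed(messages):          # walk newest -> oldest
        if total + msg["tokens"] > budget:
            break
        kept.append(msg)
        total += msg["tokens"]
    return list(reversed(kept))             # restore chronological order

history = [
    {"role": "user", "content": "old question", "tokens": 120_000},
    {"role": "assistant", "content": "old answer", "tokens": 50_000},
    {"role": "user", "content": "recent question", "tokens": 500},
]
recent = truncate_history(history, budget=60_000)
# Only the newest messages survive: 500 + 50_000 fits within 60_000,
# but adding the 120_000-token message would exceed the budget.
```

Production systems usually combine this with summarization (replacing dropped messages with a compact summary) rather than discarding them outright.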