How AI Actually Works: Tokens, Context Windows, and Why Your Chatbot Forgets Things
You use AI every day. You probably have no idea what it’s actually doing. That stops now.
No CS degree required. No hand-holding. Just the mechanics: tokens, transformers, context windows, and why dumping your entire knowledge base into a chat and expecting perfect recall is a losing strategy.
What Is a Token?
The AI does not read words. Full stop.
It reads numbered chunks called tokens. One token is roughly 3 to 4 characters, or about 0.75 words. Common words like “the” or “is” get a single token. Longer or rarer words get chopped. “Unroutable” becomes three tokens: un, rout, able. The model sees a sequence of integers. Not letters, not meaning. Numbers.
Rule of thumb: 1,000 words burns roughly 1,300 to 1,500 tokens. Code and non-English text cost more per word.
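Want to see it? A minimal sketch using OpenAI's open-source tiktoken tokenizer. The exact splits and counts vary by model and encoding, so treat the numbers as illustrative:

# pip install tiktoken  -- OpenAI's open-source tokenizer library
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one common encoding; every model family differs

text = "The packet was unroutable."
token_ids = enc.encode(text)                # text -> list of integers

print(token_ids)                            # the integers the model actually sees
for tid in token_ids:
    print(tid, repr(enc.decode([tid])))     # which chunk of text each ID maps back to

print(f"{len(text.split())} words -> {len(token_ids)} tokens")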
Interactive: Token Visualizer
The sentence below is broken into tokens. Hover over any token to see its ID. Notice how "unroutable" splits into 3 separate tokens.
From Prompt to Output
The model never generates a full response in one shot. It runs a loop. Here’s the pipeline:
1. You submit raw text.
2. The tokenizer splits it into numbered chunks.
3. The Transformer processes every token against every other token. This is the attention mechanism; it builds contextual understanding.
4. It scores every candidate next word across a 50,000+ token vocabulary.
5. It picks one. Outputs it. Loops back to step 3.
That’s why responses stream word by word. One token per cycle. Every single time.
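The loop is simple enough to sketch. This is a toy, not a real model: toy_model stands in for the Transformer and just returns made-up scores, but the control flow is the point: score everything, pick one, append, repeat.

import random

VOCAB_SIZE = 50_000    # same order of magnitude as a real vocabulary
EOS_TOKEN = 0          # hypothetical end-of-sequence ID

def toy_model(tokens: list[int]) -> list[float]:
    """Stand-in for the Transformer: returns a score for every vocabulary entry.
    A real model attends over every token in `tokens` to produce these scores."""
    random.seed(sum(tokens))                      # deterministic toy scores
    return [random.random() for _ in range(VOCAB_SIZE)]

def generate(prompt_tokens: list[int], max_new_tokens: int = 20) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = toy_model(tokens)                # score every candidate next token
        next_token = max(range(VOCAB_SIZE), key=scores.__getitem__)  # pick one
        tokens.append(next_token)                 # loop back: the new token is now part of the input
        if next_token == EOS_TOKEN:
            break
    return tokens

print(generate([101, 2023, 2003]))                # hypothetical prompt token IDs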
Interactive: Pipeline Stepper
The Context Window
Think of the context window as the model’s working memory. It holds everything the model can see at once: system instructions, conversation history, uploaded docs, and the current message.
It has a hard ceiling. When it fills up, the oldest content gets dropped. Gone. No warning.
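Mechanically, “oldest content gets dropped” looks something like this sketch. Real chat backends use smarter strategies (summarization, caching), but the hard ceiling works the same way:

def fit_to_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest messages until the conversation fits the window.
    Assumes a precomputed token count per message."""
    kept = list(messages)
    while kept and sum(m["tokens"] for m in kept) > max_tokens:
        kept.pop(0)                               # the oldest message silently disappears
    return kept

history = [
    {"role": "user", "text": "...", "tokens": 40_000},
    {"role": "assistant", "text": "...", "tokens": 55_000},
    {"role": "user", "text": "...", "tokens": 20_000},
]
print(len(fit_to_window(history, max_tokens=80_000)))   # 2: the first message is gone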
Current window sizes as of April 2026:
| Model | Context Window |
|---|---|
| Claude Sonnet 4.6 | 1M tokens (~750,000 words) |
| GPT-5.4 | 272K standard, 1M via API |
| Gemini 3 Pro | 1 to 2M tokens |
| Llama 4 Scout | 10M tokens (open-weight, self-hosted) |
Bigger is not free. Attention compute scales quadratically with context length: double the window and the work roughly quadruples. Plan accordingly.
Interactive: Context Window
The Lost in the Middle Problem
More context does not mean better attention. This is critical.
Research confirms that models attend strongly to the beginning and end of their context window. The middle fades.
Measured recall accuracy:
- Beginning of context: 85 to 95%
- Middle of context: 76 to 82%
- End of context: 85 to 95%
Operational takeaway: put your highest priority instructions at the top or the bottom. Period. If you bury mission-critical context in the middle of a 50-message thread or a massive document dump, the model will miss it.
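You can probe this yourself with a crude needle-in-a-haystack test: bury one fact at a chosen depth in a wall of filler, then ask for it. A sketch using the OpenAI Python SDK; the model name comes from the table above, so swap in whatever you actually have access to, and expect the call to cost real tokens:

from openai import OpenAI    # pip install openai

client = OpenAI()            # assumes OPENAI_API_KEY is set in your environment

def needle_test(depth: float, model: str = "gpt-5.4") -> str:
    """Bury one fact at `depth` (0.0 = start, 1.0 = end) of a long filler context."""
    filler = "The quarterly report was unremarkable. " * 4_000
    needle = " The override code for the core router is 7431. "
    cut = int(len(filler) * depth)
    context = filler[:cut] + needle + filler[cut:]
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": context + "\n\nWhat is the override code for the core router?",
        }],
    )
    return response.choices[0].message.content

for depth in (0.0, 0.5, 1.0):
    print(depth, needle_test(depth))    # recall is most likely to slip at depth 0.5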
Interactive: Attention Heatmap
The model pays uneven attention across its context window. Strong zones (teal) get reliable recall. The middle zone (gray) is where information gets lost.
How the AI Picks Its Words
After the Transformer processes context, it scores every word in its vocabulary by probability. A parameter called temperature controls how deterministic or random the selection is.
- Temperature 0: Always picks the highest probability token. Robotic. Predictable. Consistent.
- Temperature 1: Lower-ranked candidates get a real shot. More varied output, occasionally surprising.
Most production deployments run between 0.3 and 0.8. Lower for factual tasks, higher for creative work.
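Here is what that selection step looks like in miniature. The candidate words and scores are made up; the mechanics (a softmax over scores divided by temperature) are the real thing:

import math, random

def sample_next_token(scores: dict[str, float], temperature: float) -> str:
    """Pick one candidate from raw scores, with temperature controlling randomness."""
    if temperature == 0:
        return max(scores, key=scores.get)        # greedy: always the top candidate
    # Softmax with temperature: lower T sharpens the distribution, higher T flattens it
    scaled = {tok: s / temperature for tok, s in scores.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

candidates = {"cat": 2.0, "dog": 1.5, "ferret": 0.3}   # hypothetical raw scores
print(sample_next_token(candidates, temperature=0))     # always "cat"
print(sample_next_token(candidates, temperature=1.0))   # "dog" and "ferret" get a real shot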
Interactive: Temperature Word Picker
Fixing It: How to Optimize Token Usage
Now that you know the mechanics, here is how you exploit them. These are proven techniques, not theory.
Strip the Fat From Your Prompts
Every filler word burns a token. Politeness tokens, hedging, redundant phrasing: all of it is dead weight.
Before (bloated, ~90 tokens):
I would really appreciate it if you could help me out with something.
I'm working on a project and I need you to write a Python function
that takes a list of numbers as input and then returns only the even
numbers from that list. Could you please write this function for me?
It would be great if you could also add some comments.
After (tight, ~25 tokens):
Write a Python function: input list of integers, return only evens.
Add inline comments.
Same output quality. 70% fewer tokens. The model does not care about “please” or “I would appreciate.” It cares about the instruction.
Rules:
- Cut “please,” “could you,” “I’d like you to,” and every other politeness wrapper.
- Use structured formats (bullets, schemas) instead of prose descriptions.
- One strong example beats three redundant ones.
- If the model already does something by default, do not waste tokens instructing it to.
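If you want a sanity check on the savings, the ~4 characters per token rule of thumb from earlier is good enough for a rough estimate. An exact count needs the model's own tokenizer:

def rough_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose."""
    return max(1, len(text) // 4)

bloated = ("I would really appreciate it if you could help me out with something. "
           "I'm working on a project and I need you to write a Python function "
           "that takes a list of numbers as input and then returns only the even "
           "numbers from that list. Could you please write this function for me? "
           "It would be great if you could also add some comments.")
tight = "Write a Python function: input list of integers, return only evens. Add inline comments."

saved = 1 - rough_tokens(tight) / rough_tokens(bloated)
print(rough_tokens(bloated), rough_tokens(tight), f"~{saved:.0%} saved")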
Structure Your Context Deliberately
The lost-in-the-middle problem is not theoretical. It will bite you in production. Structure accordingly.
Put high-priority instructions at the top and bottom. Never in the middle. If you have 10 reference documents, the most relevant ones go in positions 1, 2, 9, and 10.
Repeat your most critical constraint. State it in the system prompt, then again at the end of the user message:
[System: Output must be valid JSON. No markdown fences.]
... long context block ...
[User: Analyze the dataset above. Reminder: valid JSON only, no wrapper text.]
Use explicit section headers. Models anchor attention on structural markers:
## CRITICAL REQUIREMENTS
- Return valid JSON array
- Max 5 results
## REFERENCE DATA
... your documents here ...
## TASK
Analyze and return matches.
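Here is one way to wire those three rules together when you assemble prompts programmatically. The document ranking, constraint text, and task are placeholders; the point is the ordering: strongest documents at the edges, weakest in the middle, critical constraint stated first and repeated last.

def edge_order(ranked_docs: list[str]) -> list[str]:
    """Input is sorted most-relevant-first. Alternate docs between the front and the
    back of the list so the strongest ones land at the edges, the weakest in the middle."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

def assemble_prompt(critical: str, ranked_docs: list[str], task: str) -> str:
    sections = [
        "## CRITICAL REQUIREMENTS\n" + critical,
        "## REFERENCE DATA\n" + "\n\n".join(edge_order(ranked_docs)),
        "## TASK\n" + task + "\nReminder: " + critical,   # repeat the constraint at the very end
    ]
    return "\n\n".join(sections)

ranked = [f"Document {i} ..." for i in range(1, 11)]      # hypothetical, already ranked by relevance
print(assemble_prompt("Return a valid JSON array, max 5 results.", ranked, "Analyze and return matches."))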
Know When to Start Fresh
Continuing a stale conversation is one of the most common mistakes. Every new message re-sends the entire history. That history accumulates contradictions, outdated instructions, and noise.
Start a new conversation when:
- The topic shifts entirely. Old context is dead weight.
- You are past 60% of the context window. Attention quality drops.
- The conversation has accumulated conflicting instructions from iteration.
- You have a finalized artifact you can paste into a clean prompt.
Continue the conversation when:
- You are iterating on the same artifact and need prior corrections preserved.
- Steps depend on previous output.
- You are under 30% of the context window.
Rule of thumb: if you can summarize everything the model needs in under 500 tokens, start fresh and paste that summary as context.
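That rule of thumb is easy to turn into a habit, or even a crude check. A sketch using the 60% threshold and the ~1.4 tokens-per-word estimate from earlier; the inputs are whatever rough counts you have on hand:

def should_start_fresh(history_words: int, window_tokens: int,
                       topic_shifted: bool, conflicting_instructions: bool) -> bool:
    """Crude decision helper mirroring the checklist above."""
    est_tokens = int(history_words * 1.4)     # ~1.4 tokens per word, per the earlier rule of thumb
    if topic_shifted or conflicting_instructions:
        return True
    return est_tokens / window_tokens > 0.60  # past 60% of the window: summarize and hand off

print(should_start_fresh(history_words=150_000, window_tokens=272_000,
                         topic_shifted=False, conflicting_instructions=False))   # True: ~77% used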
Claude: Platform-Specific Commands
Claude has built-in tools for context management. Use them.
/compact: Summarizes older messages while preserving critical details. Recovers 30 to 50% of tokens in typical sessions. You can pass instructions: /compact keep the migration plan, drop the debugging. Use it when context usage exceeds 80%.
/clear: Wipes the conversation entirely. Use it when switching tasks. Do not carry context from one job into another.
Prompt caching (API): Static content (system prompts, reference docs, tool definitions) can be cached. Cached tokens cost 90% less on subsequent requests. The cache has a 5-minute TTL, refreshed on each hit. Place static content at the beginning of your prompt, variable content at the end.
{
  "model": "claude-sonnet-4-6-20260410",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a network operations analyst...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [...]
}
XML tags: Claude is specifically trained to parse XML structure. Use tags like <context>, <instructions>, and <examples> to delineate sections. The model processes these more efficiently than unstructured prose.
<instructions>
Analyze the firewall logs below. Flag any denied connections
from external IPs to internal management interfaces.
</instructions>
<context>
... firewall log data ...
</context>
CLAUDE.md files: For Claude Code users, persistent instructions go in CLAUDE.md files (global, project, or subdirectory level). These load automatically every session, so you never re-type or re-paste your standing instructions at the start of each job.
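A hypothetical project-level CLAUDE.md, just to make the shape concrete; the contents are entirely yours:

# CLAUDE.md (project root)
- Python 3.12. Type hints required. Format with ruff.
- Every new endpoint ships with a unit test.
- Never modify anything under /infra without asking first.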
ChatGPT: Platform-Specific Commands
Custom Instructions: Set your role, expertise level, and response preferences once. They apply to every conversation unless overridden by a Project or Custom GPT. Set these in your first 10 minutes with the platform and forget about them.
Memory: ChatGPT saves facts between conversations automatically: preferences, background, past decisions. You can also force a save: “Remember that I prefer YAML over JSON for config files.” Manage stored memories in Settings > Personalization > Memory.
The handoff process: When a conversation hits 60% capacity, summarize your progress, copy the summary, start a fresh chat, and paste it in. This is the single most effective fix for context degradation.
Summarize our progress so far in a format I can paste into a new
conversation. Include: decisions made, current state of the code,
and remaining tasks. Exclude all debugging discussion.
Conversation compaction (API): OpenAI’s Responses API supports automatic compaction. Set a compact_threshold in your request, and when the token count crosses it, the server compresses the history automatically. The compacted state is opaque but preserves key reasoning and decisions.
{
"model": "gpt-5.4",
"context_management": {
"compact_threshold": 100000
},
"input": [...]
}
You can also call /responses/compact manually for explicit control over when compression fires.
Quick Reference: Token Optimization Cheat Sheet
| Technique | Token Savings | When to Use |
|---|---|---|
| Strip filler and politeness | 30 to 50% of prompt | Every prompt, every time |
| Structured format vs prose | 20 to 40% of prompt | Output format specs, multi-field requests |
| Prompt caching (Claude API) | 90% cost on cached prefix | Static system prompts, reference docs |
| /compact (Claude Code) | 30 to 50% of history | Context usage above 80% |
| Handoff to new conversation | 70 to 90% of history | Topic shift, context above 60% |
| Compaction (OpenAI API) | 40 to 60% of history | Long-running agent loops |
| RAG instead of doc stuffing | 85 to 95% of context | Any workflow with reference documents |
| One example instead of three | 60 to 70% of examples | Repeated patterns in few-shot prompts |
Bottom Line
There is no magic here. It is pattern matching at massive scale, one token at a time, inside a fixed memory window with uneven attention distribution.
Know this, and you operate the tool better. You write tighter prompts. You structure your context with intent. You stop expecting the model to recall something you buried 50 messages ago in the middle of a thread.
Use the tool. Understand the tool. Do not trust the tool blindly.