9 min read · ai · machine-learning · explainer · tech-basics

How AI Actually Works: Tokens, Context Windows, and Why Your Chatbot Forgets Things


You use AI every day. You probably have no idea what it’s actually doing. That stops now.

No CS degree required. No hand-holding. Just the mechanics: tokens, transformers, context windows, and why dumping your entire knowledge base into a chat and expecting perfect recall is a losing strategy.

What Is a Token?

The AI does not read words. Full stop.

It reads numbered chunks called tokens: the smallest units of text a language model processes, typically a word, part of a word, or a punctuation mark. One token is roughly 3 to 4 characters, or about 0.75 words. Common words like “the” or “is” get a single token. Longer or rarer words get chopped. “Unroutable” becomes three tokens: un, rout, able. The model sees a sequence of integers. Not letters, not meaning. Numbers.

Rule of thumb: 1,000 words burns roughly 1,300 to 1,500 tokens. Code and non-English text cost more per word.
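That rule of thumb is easy to script. A minimal sketch: the 4-characters-per-token and 0.75-words-per-token ratios are heuristics from the text, not real tokenizer output, so treat the result as a budget estimate only.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters or ~0.75 words per token.

    Takes the larger of the two heuristics as a conservative guess.
    Real tokenizers will differ, especially for code and non-English text.
    """
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round(max(by_chars, by_words))

print(estimate_tokens("the quick brown fox jumps over the lazy dog"))
```

Nine words at 0.75 words per token comes out to 12 estimated tokens; a real tokenizer would land in the same neighborhood.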

Interactive: Token Visualizer

The sentence below is broken into tokens. Hover over any token to see its ID. Notice how "unroutable" splits into 3 separate tokens.

Words: 5 · Tokens: 7 · Ratio: 1.4x

From Prompt to Output

The model never generates a full response in one shot. It runs a loop. Here’s the pipeline:

  1. You submit raw text.
  2. The tokenizer splits it into numbered chunks.
  3. The Transformer, the neural network architecture behind modern language models, processes every token against every other token via its self-attention mechanism; this builds contextual understanding.
  4. It scores every candidate next word across a 50,000+ token vocabulary.
  5. It picks one. Outputs it. Loops back to step 3.

That’s why responses stream word by word. One token per cycle. Every single time.

Interactive: Pipeline Stepper

[Interactive widget: steps through the five pipeline stages, looping stages 3 to 5. Stage 1 of 5 shown: Raw Text — you type a message in plain text; the AI cannot read this directly.]

The Context Window

Think of the context window as the model’s working memory: the maximum amount of text, measured in tokens, that it can read and consider at once. It holds everything the model can see: system instructions, conversation history, uploaded docs, and the current message.

It has a hard ceiling. When it fills up, the oldest content gets dropped. Gone. No warning.

Current window sizes as of April 2026:

Model | Context Window
--- | ---
Claude Sonnet 4.6 | 1M tokens (~750,000 words)
GPT-5.4 | 272K standard, 1M via API
Gemini 3 Pro | 1 to 2M tokens
Llama 4 Scout | 10M tokens (open-weight, self-hosted)

Bigger is not free. Compute cost scales quadratically. Double the window, quadruple the bill. Plan accordingly.
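The quadratic claim follows directly from the mechanism: self-attention compares every token against every other token, so the work grows with the square of the window. A back-of-the-envelope sketch:

```python
def attention_comparisons(window_tokens: int) -> int:
    """Self-attention does one comparison per (token, token) pair,
    so total work scales with the square of the window size."""
    return window_tokens ** 2

small = attention_comparisons(100_000)
double = attention_comparisons(200_000)
print(double / small)  # → 4.0: double the window, quadruple the work
```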

Interactive: Context Window

[Interactive widget: a fill gauge per model showing system prompt, conversation, and documents stacked against the context ceiling (e.g. 48% used, 52% free). Capacities shown: Claude Sonnet 4.6 at 1M tokens (~750K words), GPT-5.4 at 272K / 1M via API (~200K words), Gemini 3 Pro at 1-2M tokens (~1.5M words), Llama 4 Scout at 10M tokens (~7.5M words).]

The Lost in the Middle Problem

More context does not mean better attention. This is critical.

Research confirms that models attend strongly to the beginning and end of their context window. The middle fades.

Measured recall accuracy:

  • Beginning of context: 85 to 95%
  • Middle of context: 76 to 82%
  • End of context: 85 to 95%

Operational takeaway: put your highest priority instructions at the top or the bottom. Period. If you bury mission-critical context in the middle of a 50-message thread or a massive document dump, the model will miss it.

Interactive: Attention Heatmap

The model pays uneven attention across its context window. Strong zones (teal) get reliable recall. The middle zone (gray) is where information gets lost.



How the AI Picks Its Words

After the Transformer processes context, it scores every word in its vocabulary by probability. A parameter called temperature controls how deterministic or random the selection is.

  • Temperature 0: Always picks the highest probability token. Robotic. Predictable. Consistent.
  • Temperature 1: Lower-ranked candidates get a real shot. More varied output, occasionally surprising.

Most production deployments run between 0.3 and 0.8. Lower for factual tasks, higher for creative work.
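Mechanically, temperature divides the model’s raw scores (logits) before they are converted to probabilities. A minimal sketch with made-up logits for three candidate words (the numbers are invented for illustration):

```python
import math

def softmax_with_temperature(logits: dict[str, float],
                             temperature: float) -> dict[str, float]:
    """Convert raw scores to probabilities. Dividing by a low temperature
    sharpens the distribution toward the top token; a high temperature
    flattens it, giving lower-ranked candidates a real shot."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"network": 2.0, "infrastructure": 1.5, "banana": -1.0}
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic
warm = softmax_with_temperature(logits, 1.0)  # varied, occasionally surprising
print(cold["network"], warm["network"])
```

At temperature 0.1 the top candidate takes essentially all of the probability mass; at 1.0 the runner-up stays in contention.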

Interactive: Temperature Word Picker

[Interactive widget: a temperature slider from 0 (predictable) to 1 (creative) that changes which word completes “The infrastructure network…”.]

Fixing It: How to Optimize Token Usage

Now that you know the mechanics, here is how you exploit them. These are proven techniques, not theory.

Strip the Fat From Your Prompts

Every filler word burns a token. Politeness tokens, hedging, redundant phrasing: all of it is dead weight.

Before (bloated, ~90 tokens):

I would really appreciate it if you could help me out with something.
I'm working on a project and I need you to write a Python function
that takes a list of numbers as input and then returns only the even
numbers from that list. Could you please write this function for me?
It would be great if you could also add some comments.

After (tight, ~25 tokens):

Write a Python function: input list of integers, return only evens.
Add inline comments.

Same output quality. 70% fewer tokens. The model does not care about “please” or “I would appreciate.” It cares about the instruction.

Rules:

  • Cut “please,” “could you,” “I’d like you to,” and every other politeness wrapper.
  • Use structured formats (bullets, schemas) instead of prose descriptions.
  • One strong example beats three redundant ones.
  • If the model already does something by default, do not waste tokens instructing it to.
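The first rule is mechanical enough to automate for a first pass. A naive sketch with a hand-picked filler list; this is a rough cleanup tool, not a substitute for writing tight prompts yourself:

```python
import re

# Hand-picked politeness and hedging phrases; extend as needed.
FILLER_PATTERNS = [
    r"i would really appreciate it if you could\s*",
    r"it would be great if you could\s*",
    r"could you please\s*",
    r"i'd like you to\s*",
    r"\bplease\s*",
]

def strip_filler(prompt: str) -> str:
    """Remove common politeness wrappers and collapse leftover whitespace."""
    out = prompt
    for pattern in FILLER_PATTERNS:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()

print(strip_filler("Could you please write a Python function that returns only evens?"))
```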

Structure Your Context Deliberately

The lost-in-the-middle problem is not theoretical. It will bite you in production. Structure accordingly.

Put high-priority instructions at the top and bottom. Never in the middle. If you have 10 reference documents, the most relevant ones go in positions 1, 2, 9, and 10.

Repeat your most critical constraint. State it in the system prompt, then again at the end of the user message:

[System: Output must be valid JSON. No markdown fences.]

... long context block ...

[User: Analyze the dataset above. Reminder: valid JSON only, no wrapper text.]
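Assembling a prompt this way, with the critical constraint up front and repeated at the end, is mechanical enough to script. A sketch (the function name and bracket format are illustrative, not any platform’s API):

```python
def build_prompt(critical: str, context: str, task: str) -> str:
    """Place the critical constraint at the top AND repeat it at the bottom,
    so it lands in the two high-attention zones of the context window."""
    return "\n\n".join([
        f"[System: {critical}]",
        context,                                  # long material in the middle
        f"[User: {task} Reminder: {critical}]",   # constraint restated at the end
    ])

prompt = build_prompt(
    critical="Output must be valid JSON. No markdown fences.",
    context="... long context block ...",
    task="Analyze the dataset above.",
)
print(prompt.count("valid JSON"))  # → 2: the constraint appears twice
```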

Use explicit section headers. Models anchor attention on structural markers:

## CRITICAL REQUIREMENTS
- Return valid JSON array
- Max 5 results

## REFERENCE DATA
... your documents here ...

## TASK
Analyze and return matches.

Know When to Start Fresh

Continuing a stale conversation is one of the most common mistakes. Every new message re-sends the entire history. That history accumulates contradictions, outdated instructions, and noise.

Start a new conversation when:

  • The topic shifts entirely. Old context is dead weight.
  • You are past 60% of the context window. Attention quality drops.
  • The conversation has accumulated conflicting instructions from iteration.
  • You have a finalized artifact you can paste into a clean prompt.

Continue the conversation when:

  • You are iterating on the same artifact and need prior corrections preserved.
  • Steps depend on previous output.
  • You are under 30% of the context window.

Rule of thumb: if you can summarize everything the model needs in under 500 tokens, start fresh and paste that summary as context.
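The decision rules above collapse into a small checklist function. The 60% threshold comes straight from the text; the function shape and parameter names are illustrative:

```python
def should_start_fresh(used_tokens: int, window_tokens: int,
                       topic_shifted: bool, has_conflicts: bool) -> bool:
    """Apply the start-fresh heuristics: a topic shift, accumulated
    conflicting instructions, or >60% window usage all favor a clean chat."""
    if topic_shifted or has_conflicts:
        return True
    usage = used_tokens / window_tokens
    return usage > 0.60  # past 60% of the window, attention quality drops

print(should_start_fresh(700_000, 1_000_000, False, False))  # → True
print(should_start_fresh(150_000, 1_000_000, False, False))  # → False
```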

Claude: Platform-Specific Commands

Claude has built-in tools for context management. Use them.

/compact: Summarizes older messages while preserving critical details. Recovers 30 to 50% of tokens in typical sessions. You can pass instructions: /compact keep the migration plan, drop the debugging. Use it when context usage exceeds 80%.

/clear: Wipes the conversation entirely. Use it when switching tasks. Do not carry context from one job into another.

Prompt caching (API): Static content (system prompts, reference docs, tool definitions) can be cached, so subsequent requests reuse it instead of reprocessing it from scratch. Cached tokens cost 90% less on subsequent requests. The cache has a 5-minute TTL, refreshed on each hit. Place static content at the beginning of your prompt, variable content at the end.

{
  "model": "claude-sonnet-4-6-20260410",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a network operations analyst...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [...]
}

XML tags: Claude is specifically trained to parse XML structure. Use tags like <context>, <instructions>, and <examples> to delineate sections. The model processes these more efficiently than unstructured prose.

<instructions>
Analyze the firewall logs below. Flag any denied connections
from external IPs to internal management interfaces.
</instructions>

<context>
... firewall log data ...
</context>

CLAUDE.md files: For Claude Code users, persistent instructions go in CLAUDE.md files (global, project, or subdirectory level). These load automatically at the start of every session, so you never re-type the same instructions by hand (they still count toward the context window).

ChatGPT: Platform-Specific Commands

Custom Instructions: Set your role, expertise level, and response preferences once. They apply to every conversation unless overridden by a Project or Custom GPT. Set these in your first 10 minutes with the platform and forget about them.

Memory: ChatGPT saves facts between conversations automatically: preferences, background, past decisions. You can also force a save: “Remember that I prefer YAML over JSON for config files.” Manage stored memories in Settings > Personalization > Memory.

The handoff process: When a conversation hits 60% capacity, summarize your progress, copy the summary, start a fresh chat, and paste it in. This is the single most effective fix for context degradation.

Summarize our progress so far in a format I can paste into a new
conversation. Include: decisions made, current state of the code,
and remaining tasks. Exclude all debugging discussion.

Conversation compaction (API): OpenAI’s Responses API supports automatic compaction. Set a compact_threshold in your request, and when the token count crosses it, the server compresses the history automatically. The compacted state is opaque but preserves key reasoning and decisions.

{
  "model": "gpt-5.4",
  "context_management": {
    "compact_threshold": 100000
  },
  "input": [...]
}

You can also call /responses/compact manually for explicit control over when compression fires.

Quick Reference: Token Optimization Cheat Sheet

Technique | Token Savings | When to Use
--- | --- | ---
Strip filler and politeness | 30 to 50% of prompt | Every prompt, every time
Structured format vs prose | 20 to 40% of prompt | Output format specs, multi-field requests
Prompt caching (Claude API) | 90% cost on cached prefix | Static system prompts, reference docs
/compact (Claude Code) | 30 to 50% of history | Context usage above 80%
Handoff to new conversation | 70 to 90% of history | Topic shift, context above 60%
Compaction (OpenAI API) | 40 to 60% of history | Long-running agent loops
RAG instead of doc stuffing | 85 to 95% of context | Any workflow with reference documents
One example instead of three | 60 to 70% of examples | Repeated patterns in few-shot prompts

Bottom Line

There is no magic here. It is pattern matching at massive scale, one token at a time, inside a fixed memory window with uneven attention distribution.

Know this, and you operate the tool better. You write tighter prompts. You structure your context with intent. You stop expecting the model to recall something you buried 50 messages ago in the middle of a thread.

Use the tool. Understand the tool. Do not trust the tool blindly.
