9 min read · ai · machine-learning · explainer · tech-basics

How AI Actually Works: Tokens, Context Windows, and Why Your Chatbot Forgets Things


You use AI every day. You probably have no idea what it’s actually doing. That stops now.

No CS degree required. No hand-holding. Just the mechanics: tokens, transformers, context windows, and why dumping your entire knowledge base into a chat and expecting perfect recall is a losing strategy.

What Is a Token?

The AI does not read words. Full stop.

It reads numbered chunks called tokens: the smallest units of text a language model processes, typically a word, part of a word, or a punctuation mark. One token is roughly 3 to 4 characters, or about 0.75 words. Common words like “the” or “is” get a single token. Longer or rarer words get chopped. “Unroutable” becomes three tokens: un, rout, able. The model sees a sequence of integers. Not letters, not meaning. Numbers.

Rule of thumb: 1,000 words burns roughly 1,300 to 1,500 tokens. Code and non-English text cost more per word.
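That rule of thumb is easy to script. A minimal sketch: the 4-characters-per-token and 0.75-words-per-token ratios are heuristics from the text, not real tokenizer output, so treat the result as a budget estimate only.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters or ~0.75 words per token.

    Takes the larger of the two heuristics as a conservative guess.
    Real tokenizers will differ, especially for code and non-English text.
    """
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round(max(by_chars, by_words))

print(estimate_tokens("the quick brown fox jumps over the lazy dog"))
```

Nine words at 0.75 words per token comes out to 12 estimated tokens; a real tokenizer would land in the same neighborhood.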

Interactive: Token Visualizer

The sentence below is broken into tokens. Hover over any token to see its ID. Notice how "unroutable" splits into 3 separate tokens.

Words: 5 · Tokens: 7 · Ratio: 1.4x

From Prompt to Output

The model never generates a full response in one shot. It runs a loop. Here’s the pipeline:

  1. You submit raw text.
  2. The tokenizer splits it into numbered chunks.
  3. The Transformer, the neural network architecture behind modern language models, processes every token against every other token via its self-attention mechanism; this builds contextual understanding.
  4. It scores every candidate next word across a 50,000+ token vocabulary.
  5. It picks one. Outputs it. Loops back to step 3.

That’s why responses stream word by word. One token per cycle. Every single time.

Interactive: Pipeline Stepper

[Interactive widget: steps through the five pipeline stages, looping stages 3 to 5. Stage 1 of 5 shown: Raw Text — you type a message in plain text; the AI cannot read this directly.]

The Context Window

Think of the context window as the model’s working memory: the maximum amount of text, measured in tokens, that it can read and consider at once. It holds everything the model can see: system instructions, conversation history, uploaded docs, and the current message.

It has a hard ceiling. When it fills up, the oldest content gets dropped. Gone. No warning.

Current window sizes as of April 2026:

Model | Context Window
--- | ---
Claude Sonnet 4.6 | 1M tokens (~750,000 words)
GPT-5.4 | 272K standard, 1M via API
Gemini 3 Pro | 1 to 2M tokens
Llama 4 Scout | 10M tokens (open-weight, self-hosted)

Bigger is not free. Compute cost scales quadratically. Double the window, quadruple the bill. Plan accordingly.
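The quadratic claim follows directly from the mechanism: self-attention compares every token against every other token, so the work grows with the square of the window. A back-of-the-envelope sketch:

```python
def attention_comparisons(window_tokens: int) -> int:
    """Self-attention does one comparison per (token, token) pair,
    so total work scales with the square of the window size."""
    return window_tokens ** 2

small = attention_comparisons(100_000)
double = attention_comparisons(200_000)
print(double / small)  # → 4.0: double the window, quadruple the work
```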

Interactive: Context Window

[Interactive widget: a fill gauge per model showing system prompt, conversation, and documents stacked against the context ceiling (e.g. 48% used, 52% free). Capacities shown: Claude Sonnet 4.6 at 1M tokens (~750K words), GPT-5.4 at 272K / 1M via API (~200K words), Gemini 3 Pro at 1-2M tokens (~1.5M words), Llama 4 Scout at 10M tokens (~7.5M words).]

The Lost in the Middle Problem

More context does not mean better attention. This is critical.

Research confirms that models attend strongly to the beginning and end of their context window. The middle fades.

Measured recall accuracy:

  • Beginning of context: 85 to 95%
  • Middle of context: 76 to 82%
  • End of context: 85 to 95%

Operational takeaway: put your highest priority instructions at the top or the bottom. Period. If you bury mission-critical context in the middle of a 50-message thread or a massive document dump, the model will miss it.

Interactive: Attention Heatmap

The model pays uneven attention across its context window. Strong zones (teal) get reliable recall. The middle zone (gray) is where information gets lost.



How the AI Picks Its Words

After the Transformer processes context, it scores every word in its vocabulary by probability. A parameter called temperature controls how deterministic or random the selection is.

  • Temperature 0: Always picks the highest probability token. Robotic. Predictable. Consistent.
  • Temperature 1: Lower-ranked candidates get a real shot. More varied output, occasionally surprising.

Most production deployments run between 0.3 and 0.8. Lower for factual tasks, higher for creative work.
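Mechanically, temperature divides the model’s raw scores (logits) before they are converted to probabilities. A minimal sketch with made-up logits for three candidate words (the numbers are invented for illustration):

```python
import math

def softmax_with_temperature(logits: dict[str, float],
                             temperature: float) -> dict[str, float]:
    """Convert raw scores to probabilities. Dividing by a low temperature
    sharpens the distribution toward the top token; a high temperature
    flattens it, giving lower-ranked candidates a real shot."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"network": 2.0, "infrastructure": 1.5, "banana": -1.0}
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic
warm = softmax_with_temperature(logits, 1.0)  # varied, occasionally surprising
print(cold["network"], warm["network"])
```

At temperature 0.1 the top candidate takes essentially all of the probability mass; at 1.0 the runner-up stays in contention.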

Interactive: Temperature Word Picker

[Interactive widget: a temperature slider from 0 (predictable) to 1 (creative) that changes which word completes “The infrastructure network…”.]

Fixing It: How to Optimize Token Usage

Now that you know the mechanics, here is how you exploit them. These are proven techniques, not theory.

Strip the Fat From Your Prompts

Every filler word burns a token. Politeness tokens, hedging, redundant phrasing: all of it is dead weight.

Before (bloated, ~90 tokens):

I would really appreciate it if you could help me out with something.
I'm working on a project and I need you to write a Python function
that takes a list of numbers as input and then returns only the even
numbers from that list. Could you please write this function for me?
It would be great if you could also add some comments.

After (tight, ~25 tokens):

Write a Python function: input list of integers, return only evens.
Add inline comments.

Same output quality. 70% fewer tokens. The model does not care about “please” or “I would appreciate.” It cares about the instruction.

Rules:

  • Cut “please,” “could you,” “I’d like you to,” and every other politeness wrapper.
  • Use structured formats (bullets, schemas) instead of prose descriptions.
  • One strong example beats three redundant ones.
  • If the model already does something by default, do not waste tokens instructing it to.
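The first rule is mechanical enough to automate for a first pass. A naive sketch with a hand-picked filler list; this is a rough cleanup tool, not a substitute for writing tight prompts yourself:

```python
import re

# Hand-picked politeness and hedging phrases; extend as needed.
FILLER_PATTERNS = [
    r"i would really appreciate it if you could\s*",
    r"it would be great if you could\s*",
    r"could you please\s*",
    r"i'd like you to\s*",
    r"\bplease\s*",
]

def strip_filler(prompt: str) -> str:
    """Remove common politeness wrappers and collapse leftover whitespace."""
    out = prompt
    for pattern in FILLER_PATTERNS:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()

print(strip_filler("Could you please write a Python function that returns only evens?"))
```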

Structure Your Context Deliberately

The lost-in-the-middle problem is not theoretical. It will bite you in production. Structure accordingly.

Put high-priority instructions at the top and bottom. Never in the middle. If you have 10 reference documents, the most relevant ones go in positions 1, 2, 9, and 10.

Repeat your most critical constraint. State it in the system prompt, then again at the end of the user message:

[System: Output must be valid JSON. No markdown fences.]

... long context block ...

[User: Analyze the dataset above. Reminder: valid JSON only, no wrapper text.]
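Assembling a prompt this way, with the critical constraint up front and repeated at the end, is mechanical enough to script. A sketch (the function name and bracket format are illustrative, not any platform’s API):

```python
def build_prompt(critical: str, context: str, task: str) -> str:
    """Place the critical constraint at the top AND repeat it at the bottom,
    so it lands in the two high-attention zones of the context window."""
    return "\n\n".join([
        f"[System: {critical}]",
        context,                                  # long material in the middle
        f"[User: {task} Reminder: {critical}]",   # constraint restated at the end
    ])

prompt = build_prompt(
    critical="Output must be valid JSON. No markdown fences.",
    context="... long context block ...",
    task="Analyze the dataset above.",
)
print(prompt.count("valid JSON"))  # → 2: the constraint appears twice
```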

Use explicit section headers. Models anchor attention on structural markers:

## CRITICAL REQUIREMENTS
- Return valid JSON array
- Max 5 results

## REFERENCE DATA
... your documents here ...

## TASK
Analyze and return matches.

Know When to Start Fresh

Continuing a stale conversation is one of the most common mistakes. Every new message re-sends the entire history. That history accumulates contradictions, outdated instructions, and noise.

Start a new conversation when:

  • The topic shifts entirely. Old context is dead weight.
  • You are past 60% of the context window. Attention quality drops.
  • The conversation has accumulated conflicting instructions from iteration.
  • You have a finalized artifact you can paste into a clean prompt.

Continue the conversation when:

  • You are iterating on the same artifact and need prior corrections preserved.
  • Steps depend on previous output.
  • You are under 30% of the context window.

Rule of thumb: if you can summarize everything the model needs in under 500 tokens, start fresh and paste that summary as context.
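The decision rules above collapse into a small checklist function. The 60% threshold comes straight from the text; the function shape and parameter names are illustrative:

```python
def should_start_fresh(used_tokens: int, window_tokens: int,
                       topic_shifted: bool, has_conflicts: bool) -> bool:
    """Apply the start-fresh heuristics: a topic shift, accumulated
    conflicting instructions, or >60% window usage all favor a clean chat."""
    if topic_shifted or has_conflicts:
        return True
    usage = used_tokens / window_tokens
    return usage > 0.60  # past 60% of the window, attention quality drops

print(should_start_fresh(700_000, 1_000_000, False, False))  # → True
print(should_start_fresh(150_000, 1_000_000, False, False))  # → False
```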

Claude: Platform-Specific Commands

Claude has built-in tools for context management. Use them.

/compact: Summarizes older messages while preserving critical details. Recovers 30 to 50% of tokens in typical sessions. You can pass instructions: /compact keep the migration plan, drop the debugging. Use it when context usage exceeds 80%.

/clear: Wipes the conversation entirely. Use it when switching tasks. Do not carry context from one job into another.

Prompt caching (API): Static content (system prompts, reference docs, tool definitions) can be cached, so subsequent requests reuse it instead of reprocessing it from scratch. Cached tokens cost 90% less on subsequent requests. The cache has a 5-minute TTL, refreshed on each hit. Place static content at the beginning of your prompt, variable content at the end.

{
  "model": "claude-sonnet-4-6-20260410",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a network operations analyst...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [...]
}

XML tags: Claude is specifically trained to parse XML structure. Use tags like <context>, <instructions>, and <examples> to delineate sections. The model processes these more efficiently than unstructured prose.

<instructions>
Analyze the firewall logs below. Flag any denied connections
from external IPs to internal management interfaces.
</instructions>

<context>
... firewall log data ...
</context>

CLAUDE.md files: For Claude Code users, persistent instructions go in CLAUDE.md files (global, project, or subdirectory level). These load automatically at the start of every session, so you never re-type the same instructions by hand (they still count toward the context window).

ChatGPT: Platform-Specific Commands

Custom Instructions: Set your role, expertise level, and response preferences once. They apply to every conversation unless overridden by a Project or Custom GPT. Set these in your first 10 minutes with the platform and forget about them.

Memory: ChatGPT saves facts between conversations automatically: preferences, background, past decisions. You can also force a save: “Remember that I prefer YAML over JSON for config files.” Manage stored memories in Settings > Personalization > Memory.

The handoff process: When a conversation hits 60% capacity, summarize your progress, copy the summary, start a fresh chat, and paste it in. This is the single most effective fix for context degradation.

Summarize our progress so far in a format I can paste into a new
conversation. Include: decisions made, current state of the code,
and remaining tasks. Exclude all debugging discussion.

Conversation compaction (API): OpenAI’s Responses API supports automatic compaction. Set a compact_threshold in your request, and when the token count crosses it, the server compresses the history automatically. The compacted state is opaque but preserves key reasoning and decisions.

{
  "model": "gpt-5.4",
  "context_management": {
    "compact_threshold": 100000
  },
  "input": [...]
}

You can also call /responses/compact manually for explicit control over when compression fires.

Quick Reference: Token Optimization Cheat Sheet

Technique | Token Savings | When to Use
--- | --- | ---
Strip filler and politeness | 30 to 50% of prompt | Every prompt, every time
Structured format vs prose | 20 to 40% of prompt | Output format specs, multi-field requests
Prompt caching (Claude API) | 90% cost on cached prefix | Static system prompts, reference docs
/compact (Claude Code) | 30 to 50% of history | Context usage above 80%
Handoff to new conversation | 70 to 90% of history | Topic shift, context above 60%
Compaction (OpenAI API) | 40 to 60% of history | Long-running agent loops
RAG instead of doc stuffing | 85 to 95% of context | Any workflow with reference documents
One example instead of three | 60 to 70% of examples | Repeated patterns in few-shot prompts

Bottom Line

There is no magic here. It is pattern matching at massive scale, one token at a time, inside a fixed memory window with uneven attention distribution.

Know this, and you operate the tool better. You write tighter prompts. You structure your context with intent. You stop expecting the model to recall something you buried 50 messages ago in the middle of a thread.

Use the tool. Understand the tool. Do not trust the tool blindly.
