Inference
Training teaches an AI model by showing it millions of examples. Inference is when the trained model actually does its job: you give it a question and it generates an answer. Every time you chat with Claude, ask GPT to write code, or use an AI image generator, you are running inference. Training happens once (and costs millions of dollars); inference happens billions of times a day and is what you pay for per token when using AI APIs.
Inference is the forward pass of a trained model: processing input data through the network to produce an output (prediction, classification, generation). In the context of LLMs, inference is the autoregressive token-by-token generation process.
Training vs. inference:
| Aspect | Training | Inference |
|---|---|---|
| Purpose | Learn parameters from data | Generate output from input |
| Compute | Massive (thousands of GPUs, weeks) | Moderate (single GPU, milliseconds per token) |
| Cost | Millions of dollars | Cents per request |
| Frequency | Once (or periodically) | Continuously, billions of times |
| Gradient computation | Yes (backpropagation) | No (forward pass only) |
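The last row of the table is easy to see in code. Below is a minimal PyTorch sketch (the tiny linear model and random batch are placeholders, not from any real system): inference is a forward pass with gradient tracking disabled, while training additionally computes a loss and backpropagates.

# Inference vs. training in PyTorch (illustrative sketch)
import torch
import torch.nn as nn

model = nn.Linear(16, 2)            # stand-in for a trained network
x = torch.randn(4, 16)              # a batch of 4 inputs

# Inference: forward pass only, no gradients tracked
model.eval()
with torch.no_grad():
    logits = model(x)
    prediction = logits.argmax(dim=-1)

# Training: forward pass + loss + backpropagation
model.train()
target = torch.tensor([0, 1, 0, 1])
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()                     # gradient computation happens only here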
LLM inference pipeline:
1. Tokenize input text into token IDs
2. Prefill: process all input tokens in parallel, building the KV cache
3. Decode: generate output tokens one at a time, each attending to all previous tokens via the KV cache
4. Sample: select the next token from the probability distribution (temperature, top-p, top-k control randomness)
5. Repeat steps 3-4 until a stop condition is hit (max tokens, stop sequence, or end-of-sequence token); a toy version of this loop is sketched below
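To make the loop concrete, here is a toy decode loop in Python. The forward function is a stand-in that returns random logits rather than running a real transformer (which would also build and reuse the KV cache during prefill and decode), so the focus is purely on the decode/sample control flow and stop condition.

# Toy autoregressive decode loop (illustrative; not a real model)
import numpy as np

VOCAB_SIZE = 1000
EOS_TOKEN = 0
rng = np.random.default_rng(0)

def forward(token_ids):
    """Stand-in for the model's forward pass: returns next-token logits."""
    return rng.normal(size=VOCAB_SIZE)

def sample(logits, temperature=0.7, top_k=50):
    """Temperature + top-k sampling over the next-token distribution."""
    logits = logits / temperature
    top = np.argsort(logits)[-top_k:]          # keep only the top-k candidates
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

prompt_ids = [42, 7, 99]                        # "tokenized" prompt (prefill would cache K/V here)
generated = list(prompt_ids)
for _ in range(32):                             # decode: one token per step
    next_id = sample(forward(generated))
    generated.append(next_id)
    if next_id == EOS_TOKEN:                    # stop condition
        break
print(generated)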
Inference optimization techniques:
- KV cache: store computed key-value pairs to avoid recomputation on each token
- Quantization: reduce model precision (FP16, INT8, INT4) for faster inference with minimal quality loss
- Batching: process multiple requests simultaneously to maximize GPU utilization
- Speculative decoding: use a small model to draft tokens, verified by the large model in parallel
- Prompt caching: reuse KV cache for repeated prompt prefixes
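To illustrate the first item, the KV cache, here is a schematic numpy sketch of single-head attention during decoding (the weights and dimensions are arbitrary placeholders): each step computes keys and values only for the newest token and appends them to the cache, instead of recomputing them for the whole sequence.

# Schematic KV cache for single-head attention during decoding
import numpy as np

d = 64
rng = np.random.default_rng(1)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []

def decode_step(x_new):
    """x_new: embedding of the newest token, shape (d,)."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)     # compute K/V only for the new token
    v_cache.append(x_new @ W_v)
    K = np.stack(k_cache)           # all cached keys, shape (t, d)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)     # attend over every previous token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V              # attention output for this step

for t in range(5):
    out = decode_step(rng.normal(size=d))
print(out.shape)  # (64,)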
Inference performance metrics:
- TTFT (Time to First Token): latency before the first output token appears
- Tokens per second (TPS): generation speed for a single request during the decode phase
- Throughput: total tokens processed per second across all concurrent requests
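These metrics can be measured directly against a streaming endpoint. The sketch below assumes a local Ollama server (the same setup as the example in the next section): TTFT is the time to the first streamed chunk, and tokens per second comes from the eval_count and eval_duration fields in the final summary record.

# Measure TTFT and tokens/sec against a local Ollama server (illustrative)
import json
import time
import requests

start = time.perf_counter()
ttft = None
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": "What is a VLAN?", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if ttft is None and chunk.get("response"):
            ttft = time.perf_counter() - start      # time to first token
        if chunk.get("done"):
            tps = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)

print(f"TTFT: {ttft:.2f}s, throughput: {tps:.1f} tokens/sec")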
Inference with a local model
# Local inference with Ollama
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3:8b",
"prompt": "What is a VLAN?",
"stream": False,
"options": {
"temperature": 0.3,
"num_predict": 256, # Max tokens to generate
}
})
result = response.json()
print(f"Response: {result['response']}")
print(f"Eval duration: {result['eval_duration'] / 1e9:.2f}s")
print(f"Tokens/sec: {result['eval_count'] / (result['eval_duration'] / 1e9):.1f}")# Measure inference performance
$ ollama run llama3:8b --verbose "Explain DNS in one sentence"
# Shows: prompt eval rate, eval rate (tokens/sec), total duration
# Check GPU memory usage during inference
$ nvidia-smi
| GPU   Memory-Usage   GPU-Util |
|   0   5.2GiB/8.0GiB      85%  |   # 8B model fits in 8GB VRAM at Q4 quantization

Inference is where AI delivers value and where costs accumulate. Cloud AI providers (Anthropic, OpenAI, Google) price their APIs per input and output token, making inference cost the primary concern for production AI applications. A chatbot handling 10,000 conversations per day at $0.01 per conversation costs $3,000/month in inference alone. Optimization matters: prompt caching can reduce costs by up to 90% for applications with repeated system prompts. Running inference locally with Ollama or vLLM on consumer GPUs is viable for smaller models (7B-13B parameters) and eliminates per-token costs. The GPU shortage and high inference demand have made NVIDIA GPUs the most valuable hardware in data centers, driving the company's trillion-dollar valuation.
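Back-of-envelope estimates like the one above can be scripted; the per-million-token prices and token counts below are assumed placeholders, not any provider's actual rates.

# Rough monthly inference cost estimate (placeholder prices and token counts)
INPUT_PRICE_PER_MTOK = 3.00      # $ per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.00    # $ per million output tokens (assumed)

def conversation_cost(input_tokens, output_tokens):
    return ((input_tokens / 1e6) * INPUT_PRICE_PER_MTOK
            + (output_tokens / 1e6) * OUTPUT_PRICE_PER_MTOK)

per_conversation = conversation_cost(input_tokens=2000, output_tokens=250)
monthly = per_conversation * 10_000 * 30     # 10,000 conversations/day
print(f"~${per_conversation:.3f}/conversation, ~${monthly:,.0f}/month")
# Roughly matches the $0.01/conversation, $3,000/month figure above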