Inference
Training teaches an AI model by showing it millions of examples. Inference is when the trained model actually does its job: you give it a question and it generates an answer. Every time you chat with Claude, ask GPT to write code, or use an AI image generator, you are running inference. Training happens once (and costs millions of dollars); inference happens billions of times a day and is what you pay for per token when using AI APIs.
Inference is the forward pass of a trained model: processing input data through the network to produce an output (prediction, classification, generation). In the context of LLMs, inference is the autoregressive token-by-token generation process.
Training vs. inference:
| Aspect | Training | Inference |
|---|---|---|
| Purpose | Learn parameters from data | Generate output from input |
| Compute | Massive (thousands of GPUs, weeks) | Moderate (single GPU, milliseconds per token) |
| Cost | Millions of dollars | Cents per request |
| Frequency | Once (or periodically) | Continuously, billions of times |
| Gradient computation | Yes (backpropagation) | No (forward pass only) |
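The last row of the table is easy to see in code. Below is a minimal PyTorch sketch (the tiny linear model and random batch are placeholders, not from any real system): inference is a forward pass with gradient tracking disabled, while training additionally computes a loss and backpropagates.

# Inference vs. training in PyTorch (illustrative sketch)
import torch
import torch.nn as nn

model = nn.Linear(16, 2)            # stand-in for a trained network
x = torch.randn(4, 16)              # a batch of 4 inputs

# Inference: forward pass only, no gradients tracked
model.eval()
with torch.no_grad():
    logits = model(x)
    prediction = logits.argmax(dim=-1)

# Training: forward pass + loss + backpropagation
model.train()
target = torch.tensor([0, 1, 0, 1])
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()                     # gradient computation happens only here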
LLM inference pipeline:
1. Tokenize input text into token IDs
2. Prefill: process all input tokens in parallel, building the KV cache
3. Decode: generate output tokens one at a time, each attending to all previous tokens via the KV cache
4. Sample: select the next token from the probability distribution (temperature, top-p, top-k control randomness)
5. Repeat steps 3-4 until a stop condition is hit (max tokens, stop sequence, or end-of-sequence token); a toy version of this loop is sketched below
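To make the loop concrete, here is a toy decode loop in Python. The forward function is a stand-in that returns random logits rather than running a real transformer (which would also build and reuse the KV cache during prefill and decode), so the focus is purely on the decode/sample control flow and stop condition.

# Toy autoregressive decode loop (illustrative; not a real model)
import numpy as np

VOCAB_SIZE = 1000
EOS_TOKEN = 0
rng = np.random.default_rng(0)

def forward(token_ids):
    """Stand-in for the model's forward pass: returns next-token logits."""
    return rng.normal(size=VOCAB_SIZE)

def sample(logits, temperature=0.7, top_k=50):
    """Temperature + top-k sampling over the next-token distribution."""
    logits = logits / temperature
    top = np.argsort(logits)[-top_k:]          # keep only the top-k candidates
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

prompt_ids = [42, 7, 99]                        # "tokenized" prompt (prefill would cache K/V here)
generated = list(prompt_ids)
for _ in range(32):                             # decode: one token per step
    next_id = sample(forward(generated))
    generated.append(next_id)
    if next_id == EOS_TOKEN:                    # stop condition
        break
print(generated)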
Inference optimization techniques:
- KV cache: store computed key-value pairs to avoid recomputation on each token
- Quantization: reduce model precision (FP16, INT8, INT4) for faster inference with minimal quality loss
- Batching: process multiple requests simultaneously to maximize GPU utilization
- Speculative decoding: use a small model to draft tokens, verified by the large model in parallel
- Prompt caching: reuse KV cache for repeated prompt prefixes
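To illustrate the first item, the KV cache, here is a schematic numpy sketch of single-head attention during decoding (the weights and dimensions are arbitrary placeholders): each step computes keys and values only for the newest token and appends them to the cache, instead of recomputing them for the whole sequence.

# Schematic KV cache for single-head attention during decoding
import numpy as np

d = 64
rng = np.random.default_rng(1)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []

def decode_step(x_new):
    """x_new: embedding of the newest token, shape (d,)."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)     # compute K/V only for the new token
    v_cache.append(x_new @ W_v)
    K = np.stack(k_cache)           # all cached keys, shape (t, d)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)     # attend over every previous token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V              # attention output for this step

for t in range(5):
    out = decode_step(rng.normal(size=d))
print(out.shape)  # (64,)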
Inference performance metrics:
- TTFT (Time to First Token): latency before the first output token appears
- Tokens per second (TPS): generation speed for a single request during the decode phase
- Throughput: total tokens processed per second across all concurrent requests
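These metrics can be measured directly against a streaming endpoint. The sketch below assumes a local Ollama server (the same setup as the example in the next section): TTFT is the time to the first streamed chunk, and tokens per second comes from the eval_count and eval_duration fields in the final summary record.

# Measure TTFT and tokens/sec against a local Ollama server (illustrative)
import json
import time
import requests

start = time.perf_counter()
ttft = None
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": "What is a VLAN?", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if ttft is None and chunk.get("response"):
            ttft = time.perf_counter() - start      # time to first token
        if chunk.get("done"):
            tps = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)

print(f"TTFT: {ttft:.2f}s, throughput: {tps:.1f} tokens/sec")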
Inference with a local model
# Local inference with Ollama
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3:8b",
"prompt": "What is a VLAN?",
"stream": False,
"options": {
"temperature": 0.3,
"num_predict": 256, # Max tokens to generate
}
})
result = response.json()
print(f"Response: {result['response']}")
print(f"Eval duration: {result['eval_duration'] / 1e9:.2f}s")
print(f"Tokens/sec: {result['eval_count'] / (result['eval_duration'] / 1e9):.1f}")# Measure inference performance
$ ollama run llama3:8b --verbose "Explain DNS in one sentence"
# Shows: prompt eval rate, eval rate (tokens/sec), total duration
# Check GPU memory usage during inference
$ nvidia-smi
| GPU   Memory-Usage   GPU-Util |
|   0   5.2GiB/8.0GiB      85%  |   # 8B model fits in 8GB VRAM at Q4 quantization

Inference is where AI delivers value and where costs accumulate. Cloud AI providers (Anthropic, OpenAI, Google) price their APIs per input and output token, making inference cost the primary concern for production AI applications. A chatbot handling 10,000 conversations per day at $0.01 per conversation costs $3,000/month in inference alone. Optimization matters: prompt caching can reduce costs by up to 90% for applications with repeated system prompts. Running inference locally with Ollama or vLLM on consumer GPUs is viable for smaller models (7B-13B parameters) and eliminates per-token costs. The GPU shortage and high inference demand have made NVIDIA GPUs the most valuable hardware in data centers, driving the company's trillion-dollar valuation.
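Back-of-envelope estimates like the one above can be scripted; the per-million-token prices and token counts below are assumed placeholders, not any provider's actual rates.

# Rough monthly inference cost estimate (placeholder prices and token counts)
INPUT_PRICE_PER_MTOK = 3.00      # $ per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.00    # $ per million output tokens (assumed)

def conversation_cost(input_tokens, output_tokens):
    return ((input_tokens / 1e6) * INPUT_PRICE_PER_MTOK
            + (output_tokens / 1e6) * OUTPUT_PRICE_PER_MTOK)

per_conversation = conversation_cost(input_tokens=2000, output_tokens=250)
monthly = per_conversation * 10_000 * 30     # 10,000 conversations/day
print(f"~${per_conversation:.3f}/conversation, ~${monthly:,.0f}/month")
# Roughly matches the $0.01/conversation, $3,000/month figure above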