Transformer
The Transformer is the engine inside virtually every modern AI language model. Before Transformers, AI processed text one word at a time (like reading a book word by word). Transformers can look at all the words simultaneously, understanding how every word relates to every other word in the text. This “attention” mechanism is what lets AI understand that “bank” means something different in “river bank” versus “bank account” by looking at the surrounding words.
The Transformer is a neural network architecture introduced in “Attention Is All You Need” (Vaswani et al., 2017). It replaced recurrent neural networks (RNNs/LSTMs) by processing entire sequences in parallel using self-attention, enabling dramatically better performance and scalability.
Core components:
- Self-Attention (Scaled Dot-Product Attention): for each token, compute attention scores against all other tokens. The output is a weighted sum of value vectors, where weights reflect how relevant each token is to the current one.
  - Q (Query), K (Key), V (Value) matrices derived from input embeddings
  - Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V
  - Complexity: O(n^2 * d) where n = sequence length, d = embedding dimension
- Multi-Head Attention: run multiple attention operations in parallel with different learned projections (8 heads in the original Transformer; large models use dozens). Each head can learn different relationship types (syntactic, semantic, positional).
- Feed-Forward Network: two linear layers with a nonlinearity (GELU/SiLU) applied independently at each token position; the hidden layer is typically 4x the embedding dimension.
- Layer Normalization + Residual Connections: stabilize training and enable gradient flow through deep networks (6 layers in the original Transformer; 96+ in the largest LLMs).
- Positional Encoding: since attention is permutation-invariant, positional information is injected via sinusoidal functions (original), learned embeddings, or Rotary Position Embeddings (RoPE).
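To make the components above concrete, here is a minimal NumPy sketch of a single pre-norm Transformer block: multi-head attention, a 4x feed-forward network, layer normalization, and residual connections. The weight initialization, parameter layout, and helper names are illustrative assumptions, not any real model's code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    n, d = x.shape
    d_k = d // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                  # (n, d) each
    # Split into heads: (n_heads, n, d_k)
    Q = Q.reshape(n, n_heads, d_k).transpose(1, 0, 2)
    K = K.reshape(n, n_heads, d_k).transpose(1, 0, 2)
    V = V.reshape(n, n_heads, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (n_heads, n, n)
    out = softmax(scores) @ V                         # (n_heads, n, d_k)
    out = out.transpose(1, 0, 2).reshape(n, d)        # concatenate heads
    return out @ Wo

def transformer_block(x, params, n_heads=4):
    # Pre-norm: LayerNorm -> sublayer -> residual add
    x = x + multi_head_attention(layer_norm(x), *params["attn"], n_heads)
    h = layer_norm(x)
    # FFN: expand 4x, nonlinearity (tanh approximation of GELU), project back
    W1, W2 = params["ffn"]
    gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
    return x + gelu(h @ W1) @ W2

rng = np.random.default_rng(0)
d = 16
params = {
    "attn": [rng.normal(0, 0.02, (d, d)) for _ in range(4)],   # Wq, Wk, Wv, Wo
    "ffn": [rng.normal(0, 0.02, (d, 4 * d)), rng.normal(0, 0.02, (4 * d, d))],
}
x = rng.normal(size=(5, d))      # 5 tokens, embedding dim 16
y = transformer_block(x, params)
print(y.shape)                   # (5, 16): same shape in, same shape out
```

Note how the residual stream keeps the same shape through every sublayer, which is what lets these blocks be stacked dozens of times.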
Variants:
- Encoder-only (BERT): bidirectional attention, good for classification and understanding
- Decoder-only (GPT, Claude, LLaMA): causal (left-to-right) attention, good for generation
- Encoder-decoder (T5, BART): cross-attention between encoder and decoder, good for translation
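The causal attention used by decoder-only models can be sketched with a small mask: positions above the diagonal (future tokens) are set to negative infinity before the softmax, so each token can only attend to itself and earlier tokens. The function name here is illustrative.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions before the softmax, as in decoder-only models."""
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above diagonal = future
    scores = np.where(mask, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# With uniform (all-zero) scores, the causal structure is easy to see:
w = causal_attention_weights(np.zeros((4, 4)))
print(w.round(2))
# Row i attends only to tokens 0..i: row 0 is [1, 0, 0, 0],
# row 3 is uniform over all four tokens.
```

Encoder-only models simply omit this mask, which is why their attention is bidirectional.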
Simplified attention mechanism:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core attention computation from 'Attention Is All You Need'."""
    d_k = K.shape[-1]
    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Numerically stable softmax to get attention weights
    # (how much to attend to each token)
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    # Weighted sum of values
    output = np.matmul(weights, V)
    return output, weights

# Example: 3 tokens, embedding dim 4
Q = np.random.randn(3, 4)  # Query: what am I looking for?
K = np.random.randn(3, 4)  # Key: what do I contain?
V = np.random.randn(3, 4)  # Value: what information do I provide?
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention weights:\n{weights.round(3)}")
# Each row shows how much each token attends to every other token
```

The Transformer architecture is arguably the most impactful machine learning innovation of the last decade. It powers every major LLM (Claude, GPT, Gemini, LLaMA), computer vision models (ViT), protein structure prediction (AlphaFold), speech recognition (Whisper), and code generation (Codex, StarCoder). The key insight, that attention can replace recurrence for sequence modeling, unlocked massive parallelism in training (GPUs process all tokens simultaneously) and enabled scaling to hundreds of billions of parameters. Researchers continue to improve the architecture: FlashAttention reduces memory usage, sparse attention patterns extend context lengths, and mixture-of-experts (MoE) architectures activate only a subset of parameters per token. Understanding the Transformer is essential for anyone working with or building on top of AI systems.
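As a rough illustration of the MoE idea mentioned above, a learned gate routes each token to only its top-k experts, so most of the layer's parameters stay idle per token. This is a generic sketch of top-k routing, not any specific model's router; all names are illustrative.

```python
import numpy as np

def moe_layer(x, gate_W, experts, k=2):
    """Route each token to its top-k experts; only those experts are evaluated."""
    logits = x @ gate_W                          # (n_tokens, n_experts) gate scores
    top_k = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        chosen = logits[t, top_k[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                     # softmax over the chosen experts only
        for g, e in zip(gates, top_k[t]):
            W1, W2 = experts[e]                  # tiny ReLU FFN as the "expert"
            out[t] += g * (np.maximum(token @ W1, 0) @ W2)
    return out

rng = np.random.default_rng(1)
d, n_experts = 8, 4
experts = [(rng.normal(0, 0.1, (d, 2 * d)), rng.normal(0, 0.1, (2 * d, d)))
           for _ in range(n_experts)]
gate_W = rng.normal(0, 0.1, (d, n_experts))
x = rng.normal(size=(6, d))
y = moe_layer(x, gate_W, experts, k=2)
print(y.shape)  # (6, 8): each token used only 2 of the 4 experts
```

The appeal is that total parameter count grows with the number of experts while per-token compute stays roughly constant.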