Chapters

All 16 chapters of How LLMs Work — from tokenization and embeddings to attention, training, and the full model architecture, each taught with a real LLM running live in your browser.

1. What is an LLM? — The big picture: an LLM is one function (tokens in, a score for every candidate next token out) run in a loop.
2. Tokenization — What is a token? Watch Qwen's BPE tokenizer slice a string into sub-words.
3. Embeddings — Tokens become vectors. A PCA scatter of the model's actual embedding matrix.
4. Self-attention — softmax(QKᵀ / √d) V. The mechanism that lets every token look at every other one.
5. Multi-head & GQA — Why heads exist, and how Qwen3.5 shares KV across them with grouped-query attention.
6. Positional encoding (RoPE) — How the model knows token order, visualized as a rotation per dimension pair.
7. RMSNorm — Why normalize? Pre- and post-norm activation distributions for a real layer.
8. MLP block — Gated MLP and residual connections — the model's per-token feed-forward step.
9. Full transformer block — Attention + Norm + MLP + Residual. The 3D rotatable stack overview.
10. LM head & weight tying — How last_hidden becomes logits — one matmul, and the same matrix as the embedding.
11. Sampling — Logits → softmax → token. Live top-k bar chart, with temperature and top-p sliders.
12. KV cache & hybrid attention — Why inference is fast, and how Qwen3.5 interleaves linear and full attention.
13. Training & teacher forcing — How the weights got there: next-token cross-entropy, parallel positions, ground-truth inputs.
14. Scaling & regularization — LR warmup + cosine decay, gradient clipping, weight decay: the engineering that makes training converge.
15. From base model to assistant — Pretraining, instruction tuning, and chat templates — how autocomplete becomes a helpful assistant.
16. The whole model, end to end — Every piece on one interactive map — then the honest limits of what the finished model can do.
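
For a taste of what the chapters build toward, the sampling step summarized in chapter 11 (logits → softmax → token, with temperature and top-p) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not code from the course or from Qwen: the function names `softmax`, `top_p_filter`, and `sample_token` and the four toy logits are hypothetical, and a real model would produce one logit per vocabulary entry.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose total mass reaches top_p.

    Returns a dict {token_id: renormalized probability}.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Logits -> softmax -> top-p filter -> one sampled token id."""
    candidates = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# Toy logits for a hypothetical 4-token vocabulary.
toy_logits = [2.0, 1.0, 0.1, -1.0]
token_id = sample_token(toy_logits, temperature=0.8, top_p=0.9, rng=random.Random(0))
```

Running `sample_token` in a loop, appending each sampled token to the input, is the generation loop described in chapter 1; here the model itself is replaced by fixed toy logits.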