Chapters
All 16 chapters of How LLMs Work — from tokenization and embeddings to attention, training, and the full model architecture, each taught with a real LLM running live in your browser.
- 1. What is an LLM? — The big picture: an LLM is one function (tokens in, a score for every candidate next token out) run in a loop.
- 2. Tokenization — What is a token? Watch Qwen's BPE tokenizer slice a string into sub-words.
- 3. Embeddings — Tokens become vectors. A PCA scatter of the model's actual embedding matrix.
- 4. Self-attention — softmax(QKᵀ / √d) V. The mechanism that lets each token look at the other tokens in the context (in a decoder-only LLM, only the ones before it).
- 5. Multi-head & GQA — Why heads exist, and how Qwen3.5 shares KV across them with grouped-query attention.
- 6. Positional encoding (RoPE) — How the model knows token order, visualized as a rotation per dimension pair.
- 7. RMSNorm — Why normalize? Pre- and post-norm activation distributions for a real layer.
- 8. MLP block — Gated MLP and residual connections — the model's per-token feed-forward step.
- 9. Full transformer block — Attention + Norm + MLP + Residual. The 3D rotatable stack overview.
- 10. LM head & weight tying — How last_hidden becomes logits: one matmul that reuses the embedding matrix.
- 11. Sampling — Logits → softmax → token. Live top-k bar chart, with temperature and top-p sliders.
- 12. KV cache & hybrid attention — Why inference is fast, and how Qwen3.5 interleaves linear and full attention.
- 13. Training & teacher forcing — How the weights got there: next-token cross-entropy, parallel positions, ground-truth inputs.
- 14. Scaling & regularization — LR warmup + cosine, gradient clipping, weight decay — the engineering that makes training converge.
- 15. From base model to assistant — Pretraining, instruction tuning, and chat templates — how autocomplete becomes a helpful assistant.
- 16. The whole model, end to end — Every piece on one interactive map — then the honest limits of what the finished model can do.
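To give a feel for how chapters 1 and 11 fit together, here is a minimal sketch of the "tokens in, scores out, run in a loop" idea with temperature, top-k, and top-p sampling. It is an illustration under stated assumptions, not the course's own code: `toy_model`, `sample_next_token`, `filter_top_k_top_p`, and `generate` are hypothetical names, the toy model returns fixed scores instead of running a real LLM, and top-p is applied to the original probabilities after the top-k cut (real libraries may renormalize between the two steps).

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities. Assumes temperature > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=1.0):
    """Keep the top_k most likely tokens, then the shortest prefix of them
    whose probability mass reaches top_p, and renormalize what is left."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Logits -> softmax -> filtered distribution -> one sampled token id."""
    probs = softmax(logits, temperature)
    candidates = filter_top_k_top_p(probs, top_k, top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

def generate(model, token_ids, max_new_tokens, **sampling):
    """The loop: score every candidate next token, pick one, append, repeat."""
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        logits = model(token_ids)  # one score per vocabulary entry
        token_ids.append(sample_next_token(logits, **sampling))
    return token_ids

def toy_model(token_ids):
    """Hypothetical stand-in for an LLM: a 4-token vocabulary with fixed scores."""
    return [0.0, 1.0, 2.0, 3.0]

if __name__ == "__main__":
    # With top_k=1 the choice is greedy, so the output is deterministic.
    print(generate(toy_model, [0], max_new_tokens=3, top_k=1))
    # With a higher temperature and top_p, other tokens can appear.
    print(generate(toy_model, [0], max_new_tokens=5, temperature=1.5, top_p=0.9))
```

Setting `top_k=1` reduces sampling to greedy decoding, which is a convenient way to check the loop before experimenting with the temperature and top-p settings.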