Read a representative LLM training config (LR, warmup, clip, weight decay) and explain what each knob is preventing from going wrong.

Cross-entropy + AdamW on a deep transformer isn't enough — the loss diverges, the gradients explode, or the model overfits without specific engineering tricks.


Glossary · 6 terms

AdamW: The dominant optimizer for LLM pretraining. Adam (adaptive per-parameter step) plus decoupled weight decay (the W).
learning rate warmup: Linear ramp from 0 to lr_max over the first ~1-10% of training. Lets the optimizer's moving averages stabilize before taking large steps.
cosine decay: After warmup, the LR follows half a cosine wave from lr_max down to lr_min over the remaining steps. Standard for LLM pretraining.
gradient clipping: If ||g|| > c (typically 1.0), rescale g by c/||g||. Caps the gradient's magnitude when a bad batch produces a huge gradient.
weight decay: A pull on the weights toward zero, applied either as a penalty proportional to ||θ||^2 added to the loss or, in AdamW, as a small fraction of each weight subtracted directly at every step. Acts as a regulariser.
dropout: Randomly zero a fraction of activations during training to prevent over-reliance on any one feature. Common in older models, often unused in modern LLM pretraining where dataset diversity does the regularisation.

Scaling & regularization: making the training loop actually converge

Chapter 13 reduced training to one line of cross-entropy. That description is correct, and it is spectacularly insufficient as a recipe. A 24-layer transformer trained on hundreds of billions of tokens with naive SGD can fail in several ways: it can diverge in the first 100 steps; its loss can spike to NaN (Not-a-Number: the arithmetic blew up past what a float can hold) from a single bad batch; or it can settle into a minimum that generalizes poorly (does well on the training text but badly on text it never saw). The fix is a small handful of engineering tricks. None of them is part of the model architecture, but virtually every modern training run relies on them.

First, what does “scale” even mean here? The word in the title refers to parameter count. A straight line y = a·x + b has two parameters. This model has nearly a billion. GPT-3 (2020) already had 175 billion, and the largest models today are reported to reach into the trillions. The ladder below puts all of them on one log axis so the gulf is legible, and so it is honest about where this model actually sits.

The parameter ladder (log scale), running from fewer parameters to more parameters. Highlighted rung: ≈ 205× this model.

The jump from GPT-2 to GPT-3 was almost pure scale: the same recipe, ~100× the parameters, and the output went from broken-but-grammatical to startlingly coherent.

Read this as raw parameter count only. GPT-3 is a dense transformer: every layer uses softmax attention. This model is a hybrid, with just 6 full-attention layers and 18 GatedDeltaNet (linear-recurrent) layers. So a bigger bar does not mean “more of the same attention”; it is a different shape of network, not a scaled-up copy.

Here is why this rung lives in this chapter: piling on parameters does not automatically overfit. Bigger models trained on enough text generalise better, not worse — which is exactly why the weight-decay and dropout knobs below are a live question, not an obvious win. You tune regularisation knowing the model is already huge and the data is already enormous.

Before the tricks, one precondition the whole scaling story rests on: transformers are worth scaling because their training compute is almost entirely matrix multiplication over every position at once. An RNN must finish token i before it can touch token i+1; a transformer processes the whole sequence in one parallel pass — the causal mask is what keeps that pass honest — and giant batched matmuls are exactly the workload GPUs are built for. (Our model's GatedDeltaNet layers do run token-by-token at decode time; the parallelism win is about training and prefill.) That's the real purchase the architecture made: not any single clever behavior, but computation cheap enough to scale until clever behavior emerges.

The optimizer: AdamW, not SGD

Plain stochastic gradient descent (θ := θ - η·g: nudge each weight θ against its gradient g, scaled by a learning rate η) doesn't work well for transformers. Different parameters see gradients of wildly different magnitudes, so a single global step size is either too big for the loud parameters or too small for the quiet ones. Adam tracks per-parameter running averages of g and g², then divides the averaged gradient by the square root of the averaged g², so each parameter's step is rescaled by its own gradient history rather than one shared rate. A “running average” (the paper calls them moments) is nothing fancy: avg ← 0.9·avg + 0.1·g, that is, keep 90% of the previous step's estimate and blend in 10% of the new gradient. AdamW adds decoupled weight decay: instead of penalising ||θ||² through the loss, the optimizer subtracts a small fraction of θ from itself at every step. Why pull weights toward zero at all? Big weights let one feature shout over everything; keeping them small forces the model to spread its evidence across many features. This is the standard recipe for modern LLM pretraining.
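To make the update concrete, here is a minimal sketch of one AdamW step in plain Python. It is an illustration under stated assumptions, not the code of any particular library: the function name adamw_step and the flat-list parameter layout are hypothetical, the hyperparameter defaults are borrowed from the representative config later in this chapter, and the bias-correction lines come from the standard Adam formulation rather than from the text above.

```python
import math

def adamw_step(theta, grad, m, v, step, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """Apply one AdamW update to a flat list of weights (hypothetical helper).

    step is 1-indexed. Returns the new (theta, m, v) lists.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        # Running averages of g and g^2 (the two "moments").
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction (standard Adam detail, assumed here): the averages
        # start at zero, so early estimates are scaled back up.
        m_hat = m_i / (1.0 - beta1 ** step)
        v_hat = v_i / (1.0 - beta2 ** step)
        # Decoupled weight decay: shrink the weight directly, not via the loss.
        p = p - lr * weight_decay * p
        # Adam step: averaged gradient divided by the root of the averaged g^2.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Two weights whose gradients differ by a factor of 100,000.
theta, m, v = adamw_step([1.0, 1.0], [100.0, 0.001], [0.0, 0.0], [0.0, 0.0], step=1)
print(theta)
```

On this first step both weights move by almost exactly the same amount even though their gradients differ by a factor of 100,000, which is the per-parameter rescaling at work. Plain SGD with the same learning rate would move the first weight 100,000 times farther than the second.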

The learning-rate schedule: warmup + cosine

Here's one of the most widely used tricks in LLM training. The learning rate is not a constant; it follows a hill shape:

Linear warmup from 0 to lr_max over the first ~1-10% of training.
Cosine decay from lr_max down to lr_min (≈ 1e-5) over the rest.

Warmup exists because AdamW's running averages need a few hundred steps of gradient history to be meaningful; taking a full-size step on step 1 is essentially shooting in a random direction. Cosine decay exists because late training benefits from progressively smaller steps — the model is closer to a minimum and large steps would bounce it out.

Learning-rate schedule — linear warmup + cosine decay

The standard recipe: linear warmup for ~10% of steps from 0 to lr_max, then cosine decay down to ~1e-5. The warmup keeps early gradients from blowing up while the optimizer's moving averages haven't filled in yet; the decay lets late training settle into a low-loss plateau.

peak LR (lr_max): 3.0e-4
warmup steps (10% of total): 1,000
step cursor: step 2,500 · lr = 2.81e-4

Slide warmup to 0 and watch the curve start at lr_max — that's what happens without warmup. Slide the peak LR much higher (e.g. 1e-3) and you can see why early training would diverge without gradient clipping: any large step on a fresh model takes the weights somewhere they can't recover from.
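The schedule in the widget can be sketched in a few lines of Python. This is a minimal version under assumptions: the function name lr_at is hypothetical, the 10,000-step total is inferred from the widget's "1,000 warmup steps (10% of total)", and the decay is taken to bottom out at lr_min = 1e-5 as described above.

```python
import math

def lr_at(step, total_steps=10_000, warmup_steps=1_000,
          lr_max=3e-4, lr_min=1e-5):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to lr_max.
        return lr_max * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half a cosine wave from lr_max down to lr_min.
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

for step in (0, 500, 1_000, 2_500, 10_000):
    print(step, f"{lr_at(step):.2e}")

# No warmup: the very first step already runs at the peak rate.
print(f"{lr_at(0, warmup_steps=0):.2e}")
```

With these settings the sketch gives lr ≈ 2.81e-4 at step 2,500, matching the step cursor, and setting warmup_steps=0 makes step 0 start at lr_max.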

Warmup vs no warmup — the loss curve

aggressive peak LR 1e-3 (stress test) · 10,000 steps

A large enough learning rate applied to near-random initial weights can produce an enormous first step that sends the loss up instead of down. Warmup ramps the LR from 0 up to lr_max over the first ~1–10% of training so those early steps are gentle; cosine decay then takes over. The no-warmup curve below is a deliberately aggressive case — peak LR 1e-3, well above the 3e-4 the schedule above uses — to make the blow-up visible; gentler runs may wobble and recover rather than diverge outright. It is the shape behind many a “loss exploded on step 1” story.

$$\mathrm{lr}(t) = \mathrm{lr}_{\mathrm{max}} \cdot \frac{t}{W} \qquad (t < W)$$

With warmup: loss decays smoothly from 6.5 toward a plateau near 2.0. Early steps are tiny while AdamW's moving averages fill in, so nothing blows up.

Illustrative — both curves are scripted formulas (exponential decay vs a clamped blow-up), not live training output. The loss axis is compressed for legibility: a true random-init cross-entropy for a vocab this large is ≈ ln(248,320) ≈ 12.4, not 6.5.
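The 12.4 figure is quick to check. The sketch below assumes a freshly initialised model spreads probability roughly uniformly over the vocabulary, so the correct token gets probability 1/V and the cross-entropy is ln(V). The variable names are illustrative.

```python
import math

vocab_size = 248_320  # vocabulary size quoted in the note above

# Uniform guess: every token gets probability 1 / vocab_size.
p_correct = 1.0 / vocab_size

# Cross-entropy for one token is -ln(probability of the correct token).
uniform_loss = -math.log(p_correct)
print(f"{uniform_loss:.2f}")
```

Any starting loss far above this value suggests something other than ordinary random initialisation is going on, which is why ln(V) is a handy sanity check on the first logged step.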

Gradient clipping: the one-line guardrail

A single bad batch — say, a piece of text that's all whitespace, or a tokenizer edge case — can produce enormous gradients. Without protection, AdamW dutifully takes a giant step in that direction and the loss jumps from 2.5 to 8.0, and the model needs hundreds of steps to recover (if it recovers at all).

Clip-by-norm is the standard answer. Compute one global L2 norm across all gradients in the model, and if it exceeds the clip threshold c (typically 1.0), rescale every component by c / ||g||. Direction preserved, magnitude capped, training continues smoothly. The widget below lets you tweak both knobs.

Gradient clipping — rescale when the norm exceeds the threshold

After backprop computes per-parameter gradients, take their global L2 norm ||g|| — square every component, sum, square-root. If it exceeds the clip threshold c, rescale every component by c / ||g||. The direction is preserved; only the magnitude is capped. Worked tiny: g = [3, 4] → ||g|| = √(9 + 16) = 5; clip to c = 1.0 → multiply by 1/5 = 0.2 → [0.6, 0.8] — same direction, one-fifth the length.
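The same rule as a minimal Python sketch. The function name clip_by_global_norm and the list-of-lists layout (one inner list per parameter tensor) are assumptions made for illustration; real frameworks ship their own equivalents.

```python
import math

def clip_by_global_norm(grads, c=1.0):
    """Rescale all gradients together so their global L2 norm is at most c.

    grads is a list of lists, one inner list per parameter tensor.
    Returns (clipped_grads, original_norm).
    """
    # One norm over every component of every tensor.
    norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if norm <= c:
        # Under the threshold: gradients pass through unchanged.
        return [list(tensor) for tensor in grads], norm
    scale = c / norm
    return [[g * scale for g in tensor] for tensor in grads], norm

# The worked example: g = [3, 4], split across two "tensors".
clipped, norm = clip_by_global_norm([[3.0], [4.0]], c=1.0)
print(norm, clipped)

# The eight sample values from the widget panel stay under c = 1.0.
small = [[0.40, -0.20, 0.10, -0.35], [0.25, 0.15, -0.30, 0.20]]
unchanged, small_norm = clip_by_global_norm(small, c=1.0)
print(round(small_norm, 2), unchanged == small)
```

Because the norm is computed once over the whole model, every tensor is scaled by the same factor, which is what keeps the overall direction intact.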

raw gradient (per param): 0.40, -0.20, 0.10, -0.35, 0.25, 0.15, -0.30, 0.20 (||g|| = 0.74)
after clip-norm: 0.40, -0.20, 0.10, -0.35, 0.25, 0.15, -0.30, 0.20 (||g_clipped|| = 0.74)
raw gradient scale: ×0.05
clip threshold c: 1.00

Slide the gradient scale up — past ~0.07 you'll see ||g|| exceed the threshold and the clipped panel turn amber. Without this guardrail, a single bad batch (mid-sequence loss spike) can drive a 24-layer stack's weights into a regime training can't recover from. c = 1.0 is the LLM-pretraining default.

Regularization: dropout's slow exit

The original Transformer paper used dropout aggressively — every attention layer, every MLP, every residual connection. Modern LLM pretraining configs typically set dropout to 0. Two reasons:

Pretraining is data-rich. A model sees each token roughly once across a multi-trillion-token corpus, so there is little opportunity for the classic "memorize the training set" failure mode that dropout exists to prevent.
Weight decay covers most of the same ground. AdamW's decay term pulls weights toward zero, preventing any one feature from dominating.

Fine-tuning is a different story — a small curated dataset can be overfit (the model memorizes those specific examples instead of learning patterns that transfer), and dropout often reappears at non-zero values (typically 0.05-0.1) in the fine-tuning recipe.
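For reference, dropout itself is only a few lines. The sketch below uses the common "inverted" form, which is an assumption beyond what the text states: surviving activations are scaled by 1/(1-p) during training so their expected value is unchanged and nothing needs adjusting at inference. The function name and signature are illustrative.

```python
import random

def dropout(activations, p=0.1, training=True, rng=None):
    """Inverted dropout on a flat list of activations; requires 0 <= p < 1."""
    if not training or p == 0.0:
        # Inference, or the pretraining setting dropout = 0.0: a no-op.
        return list(activations)
    if rng is None:
        rng = random.Random()
    keep = 1.0 - p
    # Keep each activation with probability `keep`, scaled up to compensate.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

x = [1.0] * 10_000
pretrain_out = dropout(x, p=0.0)                         # pretraining config
finetune_out = dropout(x, p=0.1, rng=random.Random(0))   # fine-tuning range
mean_after = sum(finetune_out) / len(finetune_out)
print(pretrain_out == x, round(mean_after, 2))
```

With p = 0.0 the function returns its input untouched, which is all a pretraining config with dropout disabled amounts to.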

Reading a representative config

The combination of tricks above turns a training run from "diverges immediately" into "converges at all." A representative pretraining config — generic, not any specific model's published settings — looks roughly like:

optimizer:   AdamW(beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1)
lr:          3e-4 peak, 2000 warmup steps, cosine decay to 1e-5
grad_clip:   1.0  (clip by global norm)
dropout:     0.0  (pretraining)
batch_size:  4M tokens (gradient accumulation across many devices)
seq_len:     8192
total_steps: 500,000

Two of those numbers are worth multiplying out. A “4M-token batch” is 512 sequences × 8,192 tokens = 4,194,304 ≈ 4M tokens per optimizer step. And the whole run is 4M × 500,000 steps ≈ 2 trillion tokens — those are the “trillions” the training chapter kept mentioning.
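The multiplication is spelled out below as a small Python check. The variable names are illustrative; the 512-sequence figure is the one used in the paragraph above, and the other values are copied from the config.

```python
sequences_per_step = 512
seq_len = 8_192
total_steps = 500_000
warmup_steps = 2_000

# Tokens consumed by one optimizer step.
tokens_per_step = sequences_per_step * seq_len

# Tokens consumed by the whole run.
total_tokens = tokens_per_step * total_steps

# Share of the run spent in warmup.
warmup_fraction = warmup_steps / total_steps

print(tokens_per_step)
print(f"{total_tokens:.3e}")
print(f"{warmup_fraction:.1%}")
```

The same arithmetic shows that 2,000 warmup steps is 0.4% of this run, shorter than the ~1-10% rule of thumb quoted earlier. Very long runs often warm up for a fixed few thousand steps rather than a fixed percentage.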

The less-obvious knobs in that block, in plain terms:

beta1 = 0.9, beta2 = 0.95 — how fast AdamW's two running averages forget. beta1 smooths the gradient g; beta2 smooths the squared gradient g². Higher = longer memory.
eps = 1e-8 — a tiny constant added in the denominator so the step never divides by zero when a parameter's g² average is near zero.
weight_decay = 0.1 — the strength of the pull-toward-zero described above; 0.1 is a typical pretraining value.
gradient accumulation — sum the gradients from several small batches before taking one optimizer step, so a handful of GPUs can imitate one enormous 4M-token batch they could never fit in memory at once.
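A small sketch of why accumulation works, under simplifying assumptions: a one-weight linear model with a mean-squared-error loss stands in for the LLM, the helper names grad_mse and accumulated_grad are hypothetical, and every micro-batch has the same size. Each micro-batch gradient is scaled by 1/num_micro before being summed, so the accumulated result matches the gradient of the mean loss over the full batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate scaled micro-batch gradients; assumes an even split."""
    num_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(num_micro):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Only one micro-batch needs to be "in memory" at a time.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = grad_mse(w, xs, ys)                               # one big batch
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)  # four small ones
print(abs(full - accum) < 1e-9)
```

The optimizer step happens once, after the last micro-batch, so AdamW sees the same gradient it would have seen from the full batch.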

Every one of those lines is a guardrail against a specific failure mode discovered the hard way during the last decade of LLM training. The architecture is the model; this is the recipe that makes the architecture finishable.

Where to learn more

This chapter is deliberately the lightest one in the course — you can train an LLM without internalising every detail here, but you can't read a research paper without recognising these knobs. For the trainable side of this codebase, see @mlx-node/trl (GRPO and SFT) and crates/mlx-tui (the mlx-train TUI binary) — they implement the same recipe on Apple Silicon.

Engineering takeaways

A deep transformer doesn't train with a fixed learning rate — warmup-then-cosine is the standard schedule and the reason early loss curves stay sane.
Gradient clipping is a one-line guardrail that turns "model diverged on batch 3,247" into a small bump in the loss curve.
Dropout is mostly gone from modern LLM pretraining; weight decay + dataset scale + early-stopping discipline replaced it.

Try this

In the LR widget, set warmup to 0 and peak LR to ~1e-3, then look at the start of the curve. What would happen to a fresh model if you started training at that LR? Why does the cosine half of the schedule exist?

Quick check

1. Why does LLM training start with a learning-rate warmup?

It gives the GPU time to warm up before the real training starts.
The optimizer's running averages of gradient and squared gradient haven't filled in yet, so large steps early on cause divergence.
Warmup is a regulatory requirement for foundation-model training.

2. What does 'clip gradient norm to 1.0' do?

If ||g|| > 1.0, rescale every component by 1.0 / ||g||. Direction preserved, magnitude capped.
Set any individual gradient component to ±1.0 if it exceeds that range.
Drop the largest-magnitude gradient component entirely.

3. Why don't modern LLM pretraining configs use dropout the way 2017-era models did?

Dropout is mathematically equivalent to RMSNorm; modern models use one or the other.
Pretraining on trillions of unique tokens is so data-rich that the model has no opportunity to overfit any single example; dropout becomes redundant and slows training.
Dropout breaks the residual stream.