Explain what an LLM is actually trained to do: predict every next token in parallel under a causal mask, against ground-truth labels.
We've watched a trained model run. Where did all the weights come from? What was the model optimized to do?
Glossary · 10 terms
- next-token prediction
- The training task: given tokens t_1..t_i, predict t_{i+1}. Repeated at every position of every training sequence.
- gradient descent
- The update rule behind training: the gradient points uphill — the direction that increases the loss fastest — so each weight is nudged the opposite way (downhill): w ← w − lr·gradient. The learning rate (lr) is the step size: too small crawls, too large overshoots and diverges. AdamW runs this per-parameter for all ~800M weights at once.
- batch
- Several to many training sequences processed together in one step (real pretraining batches are commonly hundreds of sequences / millions of tokens, assembled via gradient accumulation — see Chapter 14). The loss is averaged across them, so one weight update reflects many examples instead of one — steadier gradients, better hardware use.
- cross-entropy loss
- Loss for a single prediction: −log p(target). Summed or averaged across positions and batch elements to get the total loss.
- teacher forcing
- During training, the input at position i+1 is the true token t_{i+1} (the target that position i was asked to predict), not the token the model predicted at position i. Every position therefore sees correct context, so the model learns from real text instead of its own (initially terrible) guesses.
- one training step
- Forward pass on a batch of sequences → compute loss → backpropagate → optimizer updates weights. Repeated hundreds of thousands to a few million times over a pretraining run (each step chews through millions of tokens, so the token count reaches into the trillions even though the step count doesn't).
- parallel positions
- Thanks to the causal mask, every position in a sequence can be trained in a single forward pass — the model never sees the future, so the predictions stay valid.
- pretraining vs fine-tuning
- Pretraining: next-token prediction on trillions of tokens of generic text. Fine-tuning: same objective, smaller curated dataset, typically post-pretraining.
- AdamW
- The optimizer almost universally used for LLM training. It keeps a per-parameter running average of the gradient and its square (the "moments") so each weight gets an adaptive step size, with weight decay applied separately from the gradient.
- backpropagation (autograd)
- The algorithm that computes the gradient of the scalar loss with respect to every weight by applying the chain rule backward through the forward pass. The playground below runs real autograd on WebGPU — the same value_and_grad primitive the production trainer uses.
Training: what an LLM is actually optimized to do
So far this course has been about inference: a trained model takes a prompt and emits one token at a time. But where did the weights come from? What did the optimization actually solve for? The answer is surprisingly simple, and it's the same answer for every modern decoder LLM — GPT, Llama, Qwen, Gemma, Mistral. They all train on the same objective.
What “training” even means
Before the formula, the picture. Think of the model as a panel of tunable knobs. A wall thermostat has one: turn it up, the room gets warmer; turn it down, it cools. You find the right setting by nudging — a little too cold, bump it up; overshoot, ease it back — until the room feels right. Training is that same loop, with one difference: instead of a human reading the room, an automatic procedure reads how wrong the model's predictions are and nudges every knob a hair in the direction that makes them less wrong. Nobody hand-picks the values; they are found by repeating that nudge.
The catch is the number of knobs. Fit a straight line to a scatter of points and you are tuning two knobs — a slope and an intercept. Step up the ladder and the count explodes: the Qwen3.5 model running in your browser has 852,985,920 of them, each a single number that backpropagation will nudge. (For a sense of how far this ladder goes, GPT-3 carried 175 billion knobs — about 200× more; the per-knob nudge is identical, there are just vastly more of them, which is the whole story of scaling.) The error that those nudges are driving down — the single number the whole procedure reads to decide which way to turn each knob — is exactly the −log p below.
The objective in one line
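Written out as a sketch in LaTeX, assuming a training sequence t_1, ..., t_{N+1} and model parameters θ (this notation is chosen here for illustration and is not taken from the course), the per-sequence objective reads:

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(t_{i+1} \mid t_1, \dots, t_i\right)
```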
That's it. For every position in every sequence, predict the next token; minimize the negative log probability of the true next token. Averaged over positions within a sequence (as shown) and averaged again over the sequences in the batch (real pretraining batches are commonly hundreds of sequences / millions of tokens — see Chapter 14), this is what the optimizer pushes down. No reward model, no human in the loop (at pretraining time) — just enormous quantities of text and a shift-by-one target.
The per-position loss −log p(target) is called cross-entropy. When p(target) is close to 1 the loss is close to 0; when p(target) is small the loss blows up. Why the log specifically? Two reasons. First, it turns “multiply the probability of every token across the whole sequence” into “add up a per-token cost” (a product of thousands of numbers below 1 would underflow to zero; a sum of logs stays well-behaved). Second, because −log p → ∞ as p → 0, a confidently-wrong prediction is punished arbitrarily hard: the model is penalized most exactly when it is both wrong and certain. The optimizer's job is to push the model's probability mass onto the actual next token.
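Both properties of the log can be checked numerically. The following is a minimal sketch in plain Python; the 4-token vocabulary, the logit values, and the helper names `log_softmax` and `cross_entropy` are assumptions made up for illustration, not the course's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target):
    """Per-position loss: -log p(target), computed from raw logits."""
    return -log_softmax(logits)[target]

# Hypothetical 4-token vocabulary; index 2 is the true next token.
confident_right = cross_entropy([0.0, 0.0, 8.0, 0.0], target=2)  # close to 0
confident_wrong = cross_entropy([8.0, 0.0, 0.0, 0.0], target=2)  # large

# Why sum logs instead of multiplying probabilities: 5,000 tokens at p = 0.5.
product = 1.0
for _ in range(5000):
    product *= 0.5                      # underflows to exactly 0.0 in float64
sum_of_logs = 5000 * -math.log(0.5)     # the same quantity in log space stays finite
```

The confidently-wrong case costs roughly as much as the size of the logit gap, while the product of probabilities has already collapsed to zero by the time the sum of logs is still an ordinary number.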
So where does the loss start? At initialization the weights are random, so every one of the 248,320 vocabulary tokens gets roughly the same probability — about 1/248,320. Plug that in: −ln(1/248,320) = ln(248,320) ≈ 12.4. That is the loss before the model has learned anything, and every drop below it is learning. The live training playground on this page starts the same way, just with a tiny character vocabulary: its default corpus has only 3 distinct characters, so its curve starts near ln(3) ≈ 1.1 (the two 11-character corpora start near ln(11) ≈ 2.4).
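The starting-loss arithmetic above can be reproduced in a few lines. This is a small sketch; `uniform_loss` is a hypothetical helper name, and the vocabulary sizes are the ones quoted in the text.

```python
import math

def uniform_loss(vocab_size):
    """Cross-entropy when every token gets probability 1 / vocab_size."""
    return -math.log(1.0 / vocab_size)

qwen_start = uniform_loss(248_320)     # vocabulary size quoted in the text
three_char_start = uniform_loss(3)     # playground default corpus
eleven_char_start = uniform_loss(11)   # the two 11-character corpora
```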
Teacher forcing — the trick that makes it parallel
A naive interpretation of "predict the next token" might run the model token by token: generate prediction 1, feed it back, generate prediction 2, and so on — exactly how inference works. That would be brutally slow at training time, and worse: the model would be learning from its own (initially terrible) predictions.
The fix is teacher forcing. Instead of feeding the model's own prediction back as the next input, feed the true token. That's all teacher forcing does: it supplies ground-truth context at every position, so the model learns from correct text rather than its own initially-terrible guesses. The parallelism comes from a separate mechanism: the causal mask (Chapter 4). Because each position can only attend backward, every position's loss can be computed in a single forward pass without any position peeking at its own target. Ground-truth context (teacher forcing) + single-pass parallelism (causal mask) together let all N positions train at once.
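The two ingredients are simple data manipulations. Here is a minimal sketch, assuming a made-up six-token example; `shift_by_one` and `causal_mask` are hypothetical helper names, and a real trainer would do the same thing on integer token ids and tensors.

```python
def shift_by_one(tokens):
    """Split one sequence into (inputs, targets) for teacher forcing."""
    return tokens[:-1], tokens[1:]

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = ["to", " be", " or", " not", " to", " be"]   # hypothetical example
inputs, targets = shift_by_one(tokens)
mask = causal_mask(len(inputs))

# Each row pairs a true input token with the true token it must predict.
pairs = list(zip(inputs, targets))
```

Every entry of `pairs` is an independent training example, and row i of `mask` guarantees that position i cannot see the token it is being asked to predict.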
Read the loss bars against the probabilities above them — the bridge is just −ln: −ln(0.78) = 0.25 (easy position, tiny bar) but −ln(0.28) = 1.27 (hard position) — low probability blows the bar up.
Two crucial observations. One: every position is trained simultaneously, and the causal mask is the only thing that keeps that valid. Two: the input fed to position i+1 is the true token t_{i+1} (the target of position i), not the model's prediction. That's "teacher forcing": during training, the model never has to recover from its own mistakes. (At inference, of course, it does, which is one small but real reason long-form generation sometimes drifts.)
Illustrative — the per-position probabilities here are hand-picked to show the shape of the loss, not live output from the model.
Illustrative — the divergence is scripted, not live output from the model.
Teacher forcing means training never lets the model see its own mistakes: at every position the input is the true previous token. But at inference the model must consume its own outputs, so one off-gold choice shifts the context onto a path it was never trained on, and small errors compound into drift. That gap between the teacher-forced training distribution and the free-running inference distribution is exposure bias.
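A toy makes the compounding visible. In this sketch the "model" is a hypothetical lookup table with a single wrong entry (after "c" it says "x" where the gold text says "d"); nothing here comes from the real model, and the names `MODEL`, `GOLD`, `teacher_forced`, and `free_running` are invented for illustration.

```python
# Hypothetical toy "model": a next-token lookup table with one learned mistake.
MODEL = {"a": "b", "b": "c", "c": "x", "d": "e", "e": "f", "x": "y", "y": "z"}
GOLD = ["a", "b", "c", "d", "e", "f"]

def teacher_forced(gold):
    """Predict each next token from the TRUE previous token."""
    return [MODEL[tok] for tok in gold[:-1]]

def free_running(first, steps):
    """Predict each next token from the model's OWN previous output."""
    out, tok = [], first
    for _ in range(steps):
        tok = MODEL[tok]
        out.append(tok)
    return out

def errors(pred, gold):
    """Count predictions that differ from the gold continuation."""
    return sum(p != g for p, g in zip(pred, gold[1:]))

tf_pred = teacher_forced(GOLD)                   # one wrong position
fr_pred = free_running(GOLD[0], len(GOLD) - 1)   # wrong from the mistake onward
```

Under teacher forcing the single bad entry costs one position, because the next input is reset to the gold token. Free-running, the same bad entry pushes the context to "x", a state the gold text never visits, and every later prediction is off.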
Inference and training are the same forward pass
A consequence worth pausing on: there is no separate "training-time" architecture. The same forward pass — embedding lookup, 24 transformer layers each mixing tokens (6 via full attention, 18 via GatedDeltaNet) and then transforming them with an MLP, RMSNorm at the top, LM head matmul — is what runs during both training and inference. Training just does it on a whole sequence at once and scores the output against ground truth; inference does it one token at a time and samples from the output.
This symmetry is part of why the inference patterns from this course generalize to training. The KV cache (Chapter 12) is an inference-only optimization, but the operations being cached are exactly the same operations that run during training, just without the cache.
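The shared forward pass can be sketched with a toy stand-in. Below, a hypothetical 3-token bigram table plays the role of the whole network (a real model replaces the lookup with embeddings, token-mixing layers, and an LM head); `forward`, `training_loss`, and `generate` are illustrative names, and generation recomputes everything each step, i.e. with no cache.

```python
import math

# Hypothetical bigram "weights": LOGITS[prev][next] over a 3-token vocabulary.
LOGITS = [[0.0, 2.0, 0.0],   # after token 0 prefer token 1
          [0.0, 0.0, 2.0],   # after token 1 prefer token 2
          [2.0, 0.0, 0.0]]   # after token 2 prefer token 0

def forward(token_ids):
    """One forward pass: next-token log-probabilities at every position."""
    out = []
    for t in token_ids:
        row = LOGITS[t]
        log_z = math.log(sum(math.exp(x) for x in row))
        out.append([x - log_z for x in row])
    return out

def training_loss(token_ids):
    """Training use: score all positions at once against the shifted targets."""
    log_probs = forward(token_ids[:-1])
    targets = token_ids[1:]
    return -sum(lp[t] for lp, t in zip(log_probs, targets)) / len(targets)

def generate(prompt_ids, n_new):
    """Inference use: the same forward(), one new token at a time (greedy)."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        last = forward(ids)[-1]
        ids.append(max(range(len(last)), key=last.__getitem__))
    return ids

likely = training_loss([0, 1, 2, 0])     # sequence the table prefers: low loss
unlikely = training_loss([0, 0, 0, 0])   # sequence it disprefers: high loss
sample = generate([0], 3)
```

Only the callers differ: `training_loss` reads every position of `forward` and compares against known targets, while `generate` reads just the last position and appends its choice.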
Pretraining vs fine-tuning vs instruct/RLHF
The objective above describes pretraining: trillions of tokens of generic web text, books, and code, with the model learning to predict the next token in arbitrary contexts. This is where Qwen3.5's ~0.8B parameters get most of their knowledge — and where almost all the GPU-hours of LLM development go.
Fine-tuning uses the same loss, the same forward pass — just a different, smaller, curated dataset (e.g. instruction-following dialogues). The model continues to predict the next token; the dataset is what shifts.
RLHF, DPO, and other preference-based methods diverge from the simple cross-entropy story: they score model outputs against human preferences or a reward model, and the loss is no longer plain next-token cross-entropy. Those are out of scope for this chapter; the important point is that the base behavior of every one of these post-training stages still rests on the next-token cross-entropy foundation laid here.
What the optimizer actually does
The loss is one scalar. Backpropagation — working the chain rule backward through the forward pass — computes its gradient with respect to every weight in the model (~800M of them for Qwen3.5-0.8B). Here is the whole algorithm without the calculus — the loss flows back down the stack one block at a time:
Forward pass done (faint arrows going up): the whole model boiled the batch down to one number, the loss. Now that number flows back down the same blocks.
That question-and-answer hand-off is the chain rule in action — each block only needs to know how its own output moved the loss, never the whole model at once. Backprop only computes the gradients; the optimizer (AdamW) is the separate step that turns them into actual weight nudges.
That single backward pass is architecture-blind. The chain rule does not care that 6 of Qwen3.5's 24 layers run softmax attention while the other 18 are linear-recurrent GatedDeltaNet — every weight in both kinds gets its gradient from the same downhill walk, which is exactly why a single AdamW loop trains the two flavors side by side without ever knowing which is which.
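The block-by-block hand-off can be written out by hand for a stack small enough to check. This is a sketch under loud assumptions: two scalar "blocks" that each multiply by one weight, and a squared-error loss instead of cross-entropy to keep the derivatives short. Real autograd does the same bookkeeping automatically.

```python
def forward(x, w1, w2, target):
    """Two-block toy stack followed by a squared-error loss."""
    h = w1 * x                        # block 1
    y = w2 * h                        # block 2
    loss = 0.5 * (y - target) ** 2
    return h, y, loss

def backward(x, w1, w2, target):
    """Chain rule applied top-down; each block only uses the gradient handed to it."""
    h, y, _ = forward(x, w1, w2, target)
    d_y = y - target                  # dLoss/dy
    d_w2 = d_y * h                    # block 2: gradient for its own weight
    d_h = d_y * w2                    # block 2: what it passes down to block 1
    d_w1 = d_h * x                    # block 1: gradient for its own weight
    return d_w1, d_w2

def numeric_grad(f, w, eps=1e-6):
    """Central finite difference, used only to check the hand-derived gradients."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

x, w1, w2, target = 1.5, 0.8, -0.4, 2.0   # arbitrary illustrative values
d_w1, d_w2 = backward(x, w1, w2, target)
check_w1 = numeric_grad(lambda w: forward(x, w, w2, target)[2], w1)
check_w2 = numeric_grad(lambda w: forward(x, w1, w, target)[2], w2)
```

Block 1 never looks at the loss directly; it only multiplies the `d_h` it receives by its own local derivative, which is why the same procedure works no matter what kind of block sits above it.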
Once every block has its gradient, an optimizer (almost always AdamW, broken down in Chapter 14) takes a small step in the direction that lowers the loss. Strip it down to a single weight and the rule is easy to see: the gradient points uphill, so step the opposite way — and the size of that step (the learning rate) has a sweet spot.
The gold tangent is the gradient L'(w) — its sign says which way is uphill, so the step (the pink arrow) moves w the opposite way, downhill. A tiny lr barely moves and crawls; a good one drops to the bottom in a few steps; push lr past 2.0 and each step overshoots farther than the last — the weight bounces out of the bowl and the loss explodes.
Illustrative — a 1-parameter cartoon. A real model has ~800M weights, and the same rule (step each weight downhill by its own gradient) runs on all of them at once — which is exactly what AdamW automates per parameter.
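The three learning-rate regimes are easy to reproduce. The cartoon's exact curve is not given in the text, so this sketch assumes the bowl L(w) = w²/2, chosen because its gradient is simply w and its divergence threshold lands at lr = 2.0, matching the caption; `descend` is a hypothetical helper name.

```python
def descend(lr, w0=1.0, steps=20):
    """Gradient descent on the assumed bowl L(w) = w**2 / 2, where L'(w) = w."""
    w = w0
    for _ in range(steps):
        grad = w               # L'(w): positive means uphill is to the right
        w = w - lr * grad      # step the opposite way, scaled by the learning rate
    return w

crawl = descend(lr=0.01)     # barely moves away from w0
good = descend(lr=0.5)       # shrinks quickly toward the minimum at 0
diverge = descend(lr=2.1)    # |w| grows with every step
```

On this bowl each step multiplies w by (1 − lr), so the weight shrinks whenever 0 < lr < 2 and grows in magnitude once lr exceeds 2.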
Repeat that per-weight step for all ~800M weights at once, for several million to several trillion tokens. The next chapter covers the engineering tricks — warmup, cosine decay, gradient clipping, dropout — that make this loop numerically stable across a stack this deep.
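As a preview of what "per parameter" means, here is a minimal single-weight sketch of the standard AdamW update. The hyperparameter defaults are common illustrative values, not the settings of any particular training run, and `adamw_step` is a hypothetical function name.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # running average of its square
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * weight_decay * w               # weight decay, kept apart from the gradient
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step from w = 1.0 with two very different gradient magnitudes (no decay).
w_small, _, _ = adamw_step(1.0, grad=1.0, m=0.0, v=0.0, t=1, weight_decay=0.0)
w_large, _, _ = adamw_step(1.0, grad=100.0, m=0.0, v=0.0, t=1, weight_decay=0.0)
```

Both calls move the weight by almost exactly `lr`, even though one gradient is 100 times larger: dividing by the root of the squared-gradient average is what gives each weight its own adaptive step size.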
- The entire training objective is one line: minimize −log p(next_token) at every position of every sequence.
- The causal mask is what makes training parallel: every position computes its loss in one forward pass without peeking at the future. Teacher forcing supplies the ground-truth input at each position, so those parallel predictions learn from correct context.
- There is no separate "training mode" architecture — the same forward pass you run at inference is run during training, just scored against ground truth.
In the animation, position 5 (the last input " mat") shows "no target". Why is the last position never trained, and what would happen if you tried to train it anyway?