Chapter 11 · Sampling

Take a vector of logits, apply temperature and top-p, and reason about why each knob shifts the resulting distribution.

Every forward pass ends with logits; the model still needs a rule for turning that vector into one concrete next token.

7 min
forward pass: Tokenize → Embedding lookup → × 24 layers → Final RMSNorm → LM head → Sampling
Glossary · 7 terms
logit
One unbounded real-valued score per vocab entry. The raw output of the model before any normalization.
softmax
exp(l_i / T) / sum_j exp(l_j / T) — turns logits into a probability distribution that sums to 1.
temperature (T)
Sharpness knob applied before softmax. T<1 sharpens (closer to greedy); T>1 flattens (more diverse).
greedy decoding
Pick argmax(logits) every step. Deterministic and reproducible; tends to fall into repetition loops.
top-p (nucleus)
Keep the smallest set of tokens whose probabilities sum to >= p, renormalize, sample from those. Trims the long tail.
top-K sampling
A sampling rule: keep only the K highest-probability tokens, renormalize, sample from those. A fixed-size cousin of top-p.
top-16 capture (this demo)
Display truncation, not sampling: the inspector records the 16 highest logits per step so the widget has something to draw. The model itself still sampled over the full vocabulary.

Sampling: turning logits into the next token

Every forward pass through Qwen3.5 ends the same way: a vector of logits — one real number per token in the model's vocabulary (~248k). A logit is an unbounded score: how strongly the model recommends that token as the next one. Sampling is the step that turns that vector into a single concrete choice.

Why not just pick the maximum?

The simplest rule is greedy decoding: take argmax(logits) at every step — argmax just means "pick the position with the largest value", i.e. the single highest-scoring token. It's deterministic and reproducible, but it has a famous failure mode — repetition. The instant the model finds a phrase whose continuation it's confident about, it keeps re-entering the same loop, because that loop is always the locally-best choice. Greedy decoding also throws away a lot of information: if two tokens have nearly equal logits, picking one and ignoring the other is a fragile tie-break.
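A minimal sketch of one greedy step in plain Python. The four-token vocabulary and its logits are made up for illustration; they are not model output.

```python
def argmax(values):
    """Return the index of the largest value (the first one wins on ties)."""
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:
            best = i
    return best

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["sunny", "cloudy", "warm", "cold"]
logits = [3.10, 3.09, 1.90, 1.20]

greedy_token = vocab[argmax(logits)]
print(greedy_token)  # sunny
```

Here "cloudy" trails by only 0.01 and still never gets picked, which is the fragile tie-break described above.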

Softmax: logits → probabilities

To sample, we first convert logits into a proper probability distribution with the softmax function:

  p_i = exp(l_i / T) / sum_j exp(l_j / T)

Two things are happening here. The exponential makes every value positive; the division by the sum makes them sum to 1. Each logit is also divided by a temperature T before the exponential — the same T appears in every term, top and bottom. Temperature acts as a sharpness knob:

  • T < 1 sharpens the distribution — high-logit tokens dominate even more, so output looks more like greedy.
  • T > 1 flattens the distribution — small differences in logits are washed out, so output is more diverse but less coherent.
  • T = 0 collapses to greedy (argmax). The widget clamps to a tiny positive value to avoid dividing by zero, which is numerically equivalent.
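The three cases above can be sketched as one small function. This is an illustrative implementation, not the widget's code: the clamp value `min_temperature` is an assumption (the text only says "a tiny positive value"), and subtracting the maximum before exponentiating is a standard stability trick that leaves the result unchanged.

```python
import math

def softmax(logits, temperature=1.0, min_temperature=1e-6):
    """Temperature-scaled softmax over a list of logits."""
    # Assumed clamp: avoids dividing by zero when temperature is 0.
    t = max(temperature, min_temperature)
    scaled = [l / t for l in logits]
    # Shift by the max so exp() never overflows; the ratio is unaffected.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, for illustration only.
logits = [2.0, 1.0, 0.0]
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
```

The top token's share shrinks as T grows, and at T = 0 the clamped result puts all the mass on the argmax.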

Why this exact function?

Softmax is really a smooth, differentiable stand-in for argmax. Where argmax snaps hard to the single largest logit, softmax slides between the candidates. Feed it [5.0, 4.9, 1.0] and you get roughly [0.52, 0.47, 0.01]: the two close leaders share the mass almost evenly while the third stays negligible. Widen the gap to [8.0, 4.9, 1.0] and the output sharpens to about [0.96, 0.04, 0.00] — nearly a hard argmax, which is exactly the limit greedy decoding (T = 0) walks toward.

That smoothness is also why softmax turns up far beyond sampling: because it has a clean gradient everywhere, it is the function the model is trained against, and it is the same normalization that turns attention scores into weights over the keys.
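For reference, the first worked example above, [5.0, 4.9, 1.0] at T = 1, can be checked by hand (values rounded to two decimals):

```latex
\begin{aligned}
e^{5.0} &\approx 148.41, \qquad e^{4.9} \approx 134.29, \qquad e^{1.0} \approx 2.72,
\qquad \textstyle\sum_j e^{l_j} \approx 285.42 \\
p &\approx \left[\tfrac{148.41}{285.42},\ \tfrac{134.29}{285.42},\ \tfrac{2.72}{285.42}\right]
\approx [0.52,\ 0.47,\ 0.01]
\end{aligned}
```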

To see why the distribution comes out the way it does, here is the computation broken into its four mechanical phases — raw logits, divide by T, exponentiate, normalize — looping one step at a time. Use the T buttons to run the same eight logits at 0.2, 0.7 and 2.0: the normalize-phase bars spike onto one token at 0.2 and flatten across all eight at 2.0.

Softmax with temperature, phase by phase
Temperature
sharpens — closer to greedy

z — the model’s unbounded scores, straight from the LM head.

token     raw logit
sunny       3.10
cloudy      2.40
warm        1.90
cold        1.20
nice        0.90
rainy       0.40
mild       -0.20
grey       -0.80
Winning token: sunny (argmax — unchanged by these monotonic steps)
Raw logits can be negative, so there is nothing to draw as bars yet — exponentiating (step 3) makes every value positive.

Illustrative eight-token logits, scripted to show the four phases — not live output from the model. Bars appear once the values are non-negative (after exponentiating); every number shown is the true per-phase value. Switch T and re-watch the normalize phase: at 0.2 nearly all mass piles onto the top token, at 2.0 it spreads across all eight.
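The same four phases can be reproduced outside the widget. The sketch below uses the eight illustrative logits from the table above; the function name `four_phases` is a placeholder chosen for this example.

```python
import math

# The eight illustrative logits from the widget (not live model output).
tokens = ["sunny", "cloudy", "warm", "cold", "nice", "rainy", "mild", "grey"]
raw = [3.10, 2.40, 1.90, 1.20, 0.90, 0.40, -0.20, -0.80]  # phase 1

def four_phases(logits, temperature):
    """Return the intermediate values of phases 2, 3 and 4."""
    scaled = [l / temperature for l in logits]   # phase 2: divide by T
    exps = [math.exp(s) for s in scaled]         # phase 3: exponentiate
    total = sum(exps)
    probs = [e / total for e in exps]            # phase 4: normalize
    return scaled, exps, probs

results = {}
for t in (0.2, 0.7, 2.0):
    _, _, probs = four_phases(raw, t)
    results[t] = probs
    top = max(range(len(probs)), key=probs.__getitem__)
    print(f"T={t}: top={tokens[top]} p_top={probs[top]:.3f} p_min={min(probs):.4f}")
```

The winning token is the same at every temperature; only how much mass it holds changes.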

Same seed, three temperatures

The bars above are the cause; here is the effect. We gave the real Qwen3.5-0.8B checkpoint — the same one this playground runs — the same seed text three times and only changed the temperature. At T = 0 the sampler is greedy: every step takes the single most probable token, so the output is identical on every rerun and tends toward the safest, blandest phrasing — notice how quickly it starts orbiting the same few words. At T = 0.7 the distribution is sharpened but not collapsed, and the sampled story — this run, at least — reads naturally; reruns differ.

The T = 1.5 run shows why high temperature degrades so fast: generation is a loop, and every sampled token is appended to the context that produces the next distribution. One improbable pick would be survivable. At high temperature, though, improbable picks pile up, each one dragging the context further from anything the model has seen; in this run the errors compound until the output is word salad.
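The feedback loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_logits` is a hand-written lookup table that only looks at the last token, whereas a real model scores the whole context over its full vocabulary.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "mat", "."]

def toy_logits(context):
    """Hypothetical stand-in for a model: scores depend on the last token only."""
    table = {
        "the": [0.0, 3.0, 0.5, 2.5, 0.0],
        "cat": [0.5, 0.0, 3.0, 0.0, 1.0],
        "sat": [2.5, 0.0, 0.0, 1.0, 1.5],
        "mat": [0.5, 0.0, 0.5, 0.0, 3.0],
        ".":   [3.0, 0.5, 0.0, 0.0, 0.0],
    }
    return table[context[-1]]

def sample(logits, temperature, rng):
    """Greedy at T <= 0, otherwise draw from the temperature-scaled softmax."""
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized is fine for choices()
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(seed_tokens, steps, temperature, rng):
    context = list(seed_tokens)
    for _ in range(steps):
        idx = sample(toy_logits(context), temperature, rng)
        context.append(VOCAB[idx])  # the pick becomes input to the next step
    return context

rng = random.Random(0)
greedy = generate(["the"], 6, 0.0, rng)
hot = generate(["the"], 6, 1.5, rng)
print(" ".join(greedy))
print(" ".join(hot))
```

Even this toy shows the greedy loop: at T = 0 it cycles through the same three words, because each step's locally best choice leads straight back into the cycle.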

Same seed, three temperatures · real model output
One recorded run each · 70 new tokens · this exact Qwen3.5-0.8B checkpoint
T = 0 · greedy
Seed: Once upon a time, there was a very special kind of animal called a "pocketed" animal. These animals are very special because they have a special way of living. They live in a very special place called a "pocketed" habitat. Imagine a pocketed animal as a little explorer who lives in a very special place. This place is called a "pocket

Deterministic — rerun it and you get this exact text again. Safe but circling: "very special" four times, and "pocketed" is already a loop.

T = 0.7 · sampled
Seed: Once upon a time, there was a curious child named Ethan. One sunny morning, Ethan decided to explore the world around him and he found a strange object. Ethan picked up a wooden block and put it in a circular hole. He used a different material for the block, which made it feel like a solid, but he didn't know why. Then, he put the

Mild randomness is enough to break out of the safest rut — a named character, a small plot. Still coherent; a rerun gives a different continuation (this is one sample, not a guarantee).

T = 1.5 · too hot
Seed: Once upon a time, there was a boy named Benobils who lived loving today though I Homer gonna doodproof way since Dad Giovanni will offer him up to USA pilgrimage I let Homework set come don accept Pescidio submit these"` Do español Paramal Decindo Unless Lupens Journals few holders Bel En долга Cuban Borders programa greatly lario routine gym training regulwoothy Numero dai training So much

Each improbable pick becomes context for the next step, so the damage compounds — within a couple of lines it has drifted through three languages into word salad.

Real pre-recorded output, not scripted: each continuation comes from running this exact Qwen3.5-0.8B checkpoint locally through the native mlx-node Qwen35Model.generate() raw-completion API (no chat template) with maxNewTokens = 70 and pure temperature sampling — no top-p or top-K truncation (top-p = 1.0, top-K off). The T = 0 run is greedy and fully reproducible; the T = 0.7 and T = 1.5 runs are single samples and would differ on a rerun.

Top-p (nucleus) sampling

Even with a sensible temperature, the long tail of the vocabulary still has tiny non-zero probability mass — and once in a while the sampler will land there. Most of those tail tokens are nonsense in context. Top-p sampling (also called nucleus sampling) trims the tail:

  • Sort tokens by probability, descending.
  • Walk the sorted list and keep accumulating probability until the cumulative sum reaches p.
  • Throw everything after that away.
  • Renormalize the survivors so they sum to 1 again.
  • Sample from this nucleus: draw a random number r between 0 and 1, walk down the list adding up probabilities, and stop at the first token whose running total passes r — bigger slices get hit more often.
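The five steps above can be sketched in a few lines. This is one straightforward way to write it, not a reference implementation; the function names and the four-token distribution are made up for illustration.

```python
import random

def top_p_filter(probs, p):
    """Return {index: renormalized prob} for the smallest top set with mass >= p."""
    # Step 1: sort indices by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    # Steps 2 and 3: accumulate until the running sum reaches p, drop the rest.
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Step 4: renormalize the survivors so they sum to 1.
    return {i: probs[i] / cumulative for i in kept}

def sample_from(nucleus, rng):
    """Step 5: draw r in [0, 1) and stop at the first running total that passes it."""
    r = rng.random()
    running = 0.0
    chosen = None
    for index, prob in nucleus.items():
        chosen = index
        running += prob
        if r < running:
            break
    return chosen  # falls back to the last kept token on floating-point round-off

# Hypothetical distribution, for illustration only.
probs = [0.50, 0.30, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.9)
rng = random.Random(0)
draws = {sample_from(nucleus, rng) for _ in range(2000)}
print(nucleus)
print(draws)
```

With p = 0.9 the first three tokens are kept (their mass is 0.95) and index 3 can no longer be drawn.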

A common default is T = 0.7 with top-p = 0.9: the temperature leaves the model some room to be creative, and top-p removes the low-probability tail from the candidates, so the sampler cannot land there. The widget lets you sweep both knobs over a captured run and see what would have happened.

Slide the cutoff below and watch the nucleus form: tokens are kept top-down until their cumulative probability reaches p, the tail is discarded, and the survivors are renormalized so they sum to 1 again.

Nucleus (top-p) cutoff
0.90
Nucleus: 6 of 10 tokens
Mass kept before renormalizing: 0.925 (cutoff: smallest set with cumulative ≥ 0.90)
Original — sorted, tail dimmed past the cutoff
token      prob    cumulative (Σ)
·store     0.330   0.33
·park      0.220   0.55
·beach     0.140   0.69
·gym       0.110   0.80
·office    0.070   0.87
·movies    0.055   0.93
- - - - - - cutoff (p = 0.90) - - - - - -
·airport   0.035   0.96
·doctor    0.025   0.99
·bank      0.010   1.00
·moon      0.005   1.00
The dashed line is the cutoff: everything below it is the discarded tail.
Nucleus — renormalized to sum to 1, then sampled
token      renormalized prob
·store     0.357
·park      0.238
·beach     0.151
·gym       0.119
·office    0.076
·movies    0.059
·airport   discarded
·doctor    discarded
·bank      discarded
·moon      discarded
Survivors rescaled by ÷ 0.925 so they sum to 1 — that rescale is why the kept bars jump up.

Illustrative ten-token distribution, sorted and scripted — not live output from the model.
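The widget's numbers at the 0.90 cutoff can be reproduced directly from the ten illustrative probabilities. The helper name `nucleus_of` is a placeholder for this sketch.

```python
# The ten-token illustrative distribution from the widget, already sorted.
dist = [("store", 0.330), ("park", 0.220), ("beach", 0.140), ("gym", 0.110),
        ("office", 0.070), ("movies", 0.055), ("airport", 0.035),
        ("doctor", 0.025), ("bank", 0.010), ("moon", 0.005)]

def nucleus_of(sorted_dist, p):
    """Keep tokens top-down until the cumulative mass reaches p, then rescale."""
    kept, mass = [], 0.0
    for token, prob in sorted_dist:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return [(token, prob / mass) for token, prob in kept], mass

kept, mass = nucleus_of(dist, 0.90)
print(len(kept), round(mass, 3))            # 6 0.925
for token, prob in kept:
    print(f"{token:8s}{prob:.3f}")          # store 0.357, park 0.238, ...
```

Trying other cutoffs shows how the nucleus grows and shrinks: p = 0.5 keeps two tokens, p = 0.95 keeps seven.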

How the widget works

Pressing Run generates 6 tokens with the inspector capturing the top-16 raw logits at every step — the capture itself always runs at temperature=0, i.e. greedy. The temperature and top-p sliders then re-apply softmax + truncation to the cached logits — no re-running the model. Because dividing every logit by the same T can't change their rank order, and top-p can never drop the #1-ranked token, the highlighted bar is guaranteed to stay the tallest bar under every slider setting — it's always the model's greedy pick for that step. What actually changes as you move the sliders is the confidence gap to the runner-up bars: a wide gap means a real sampler would almost always agree with this greedy pick, a narrow gap means it would often disagree and land on a different token.
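The invariance claim is easy to check on cached logits. The sketch below is an illustration of the idea, not the widget's source: the 16 logit values are invented, and `reshape` is a hypothetical name for "re-apply softmax with temperature, then top-p".

```python
import math

def reshape(logits, temperature, top_p):
    """Softmax with temperature, then top-p; returns one bar height per logit."""
    t = max(temperature, 1e-6)  # assumed clamp so T = 0 does not divide by zero
    m = max(logits)
    exps = [math.exp((l - m) / t) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    bars = [0.0] * len(probs)   # dropped tokens get a zero-height bar
    for i in kept:
        bars[i] = probs[i] / mass
    return bars

def gap(bars):
    """Height difference between the tallest bar and the runner-up."""
    top, runner_up = sorted(bars, reverse=True)[:2]
    return top - runner_up

# Hypothetical cached top-16 logits for one step.
cached = [5.2, 7.9, 9.1, 6.8, 8.7, 4.1, 6.1, 5.9,
          7.2, 3.5, 5.0, 4.7, 6.5, 4.5, 5.6, 3.8]
greedy = max(range(len(cached)), key=cached.__getitem__)

tallest = set()
for t in (0.0, 0.2, 0.7, 1.0, 1.5, 2.0):
    for p in (0.3, 0.6, 0.9, 1.0):
        bars = reshape(cached, t, p)
        tallest.add(max(range(len(bars)), key=bars.__getitem__))

print(tallest == {greedy})
print(gap(reshape(cached, 0.2, 1.0)), gap(reshape(cached, 2.0, 1.0)))
```

The tallest bar never moves off the greedy index, while the gap to the runner-up is wide at T = 0.2 and narrow at T = 2.0.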

One caveat on the heights: the bars are softmax/top-p renormalized over only the captured top-16 logits, not the full 248,320-token vocabulary. The omitted tail still carries real probability mass, so every bar reads a little higher than its true full-vocab probability — most noticeably at high temperature, where the model spreads more mass into that discarded tail.
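A small experiment makes the size of that inflation concrete. The setup is an assumption made for illustration: a 5,000-entry "vocabulary" of Gaussian random logits stands in for the real one, so the printed numbers say nothing about Qwen3.5 itself, only about the direction of the effect.

```python
import math
import random

def softmax(logits, temperature=1.0):
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
# Hypothetical full-vocabulary logits (the real vocabulary is far larger).
full_logits = [rng.gauss(0.0, 2.0) for _ in range(5000)]
top16 = sorted(full_logits, reverse=True)[:16]

def inflation(temperature):
    """Ratio of the top bar as displayed (top-16 only) to its full-vocab value."""
    true_p = max(softmax(full_logits, temperature))
    shown_p = max(softmax(top16, temperature))
    return shown_p / true_p

for t in (0.5, 1.0, 2.0):
    print(f"T={t}: displayed bar is {inflation(t):.2f}x the full-vocab probability")
```

The ratio is always at least 1, and it grows with temperature because a flatter distribution puts more mass in the tail that the top-16 capture leaves out.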

When sampling goes wrong

Two failure modes are worth seeing side-by-side: a greedy (T = 0) run that loops, and a high-temperature run that turns into gibberish. The "Sweet spot" panel shows a sensible production setting.

Sampling failure modes · same prompt, three regimes
Pre-recorded continuations · 10 tokens each
Prompt: "Once upon a time, in a forest far away"
Greedy
T = 0, top-p = 1.0
Once upon a time, in a forest far away there lived a small forest there lived a small forest

Repetition trap. The model finds a high-confidence phrase and loops back into it.

Too hot
T = 2.0
Once upon a time, in a forest far away a frgg moo whirr the of bicycle banana very

Gibberish. The distribution is so flat that the sampler lands on rare tokens far more often than it should (flatter, though not literally uniform).

Sweet spot
T = 0.7, top-p = 0.9
Once upon a time, in a forest far away a small village where everyone knew each other and shared their stories

Coherent yet varied. Top-p trims the absurd tail, temperature keeps it from collapsing.

Production LLM serving typically lands somewhere around T = 0.7-1.0 with top-p = 0.9 (or a moderate top-K). The two knobs do different jobs: temperature reshapes the whole distribution, top-p trims the long tail. Together they avoid the greedy loop and the hot-gibberish failure modes you see above.
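The top-K alternative mentioned in passing differs from top-p in one practical way: its candidate set has a fixed size, while the nucleus adapts to how confident the step is. A minimal sketch of that contrast, with invented distributions and placeholder function names:

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    mass = sum(probs[i] for i in order)
    return {i: probs[i] / mass for i in order}

def top_p_size(probs, p):
    """Number of tokens in the smallest top set whose mass reaches p."""
    mass = 0.0
    for n, prob in enumerate(sorted(probs, reverse=True), start=1):
        mass += prob
        if mass >= p:
            return n
    return len(probs)

# Hypothetical steps: one where the model is confident, one where it is not.
confident = [0.92, 0.04, 0.03, 0.01]
uncertain = [0.30, 0.25, 0.25, 0.20]

for name, probs in (("confident", confident), ("uncertain", uncertain)):
    print(name,
          "top-K=2 keeps", len(top_k_filter(probs, 2)),
          "| top-p=0.9 keeps", top_p_size(probs, 0.9))
```

Top-K = 2 keeps two candidates in both cases; top-p = 0.9 keeps one on the confident step and all four on the uncertain one.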

Tip: try a low temperature (0.2) on a confident step versus a high temperature (1.5) on the same step. The visible "shape" of the bar chart is the entire reason sampling parameters matter for an LLM.

Engineering takeaways
  • Logits become probabilities only after softmax. Within one step their order is what matters — the largest logit is the greedy pick — but the absolute values aren't comparable across different runs or models.
  • Temperature reshapes the distribution; top-p truncates the tail. A common default is T=0.7 with top_p=0.9.
  • The highlighted bar is always the model's greedy (argmax) pick for that step — the captured run uses temperature=0, and rescaling or trimming afterward can never change which bar is tallest. What the sliders do move is the confidence gap to the runner-up bars: a wide gap means a real sampler would almost always agree with greedy, a narrow gap means it would often pick something else.
Try this

After the auto-run, leave temperature at 1.0 and slowly lower top-p from 1.0 to 0.3 while watching the bar chart for step 1. At what point does the chart visibly collapse to one or two bars, and what does the renormalized height tell you?

Quick check
1. Why doesn't production LLM serving just always use greedy decoding?
2. What does setting temperature T = 0.2 do to the probability distribution?
3. Top-p sampling with p = 0.9 means:
