Multi-token prediction

Go deeper · Chapter 11, Sampling — how a model gets more than one token out of a single forward pass.

Sampling, from the last chapter, turns one row of logits into one token. To produce the next token you append it to the prompt and run the entire model again — all 24 layers, every weight. A 200-token reply is 200 full forward passes, strictly one after another, because each token depends on the one before it. That is the autoregressive tax, and it is the single biggest reason generation feels slow.
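That loop can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `toy_forward` function in place of the real model; it only shows that every new token costs one more full pass.

```python
from typing import Callable, List, Tuple


def toy_forward(tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for one full forward pass (not Qwen3.5).

    Returns logits over a 5-token vocabulary that favour (last token + 1).
    """
    vocab = 5
    favoured = (tokens[-1] + 1) % vocab
    return [1.0 if i == favoured else 0.0 for i in range(vocab)]


def generate(
    forward: Callable[[List[int]], List[float]],
    prompt: List[int],
    n_new: int,
) -> Tuple[List[int], int]:
    """Plain autoregressive decoding: one full forward pass per new token."""
    tokens = list(prompt)
    passes = 0
    for _ in range(n_new):
        logits = forward(tokens)  # the whole model runs again
        passes += 1
        # Greedy pick for simplicity; any sampler from the last chapter fits here.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)  # the next pass depends on this token
    return tokens, passes


tokens, passes = generate(toy_forward, prompt=[0], n_new=200)
print(passes)  # one pass per generated token
```

The loop cannot be parallelised as written, because each call to `forward` needs the token the previous call produced.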

What makes it hurt is what each pass spends its time on. Producing one token touches every weight in the model exactly once, so a decode step is memory-bound: the GPU spends most of the step just reading the billion-ish parameters out of memory, and the actual math for that single token is a rounding error on top. The chip is mostly idle, waiting on memory. So here is the tempting question: if we are going to pay to stream all those weights in anyway, can we get more than one token for that price?
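A back-of-the-envelope estimate makes the imbalance concrete. The sketch below assumes one billion parameters stored in bf16 (2 bytes each) and roughly two floating-point operations per weight per token; the bandwidth and compute figures are hypothetical round numbers, not measurements of any particular chip.

```python
from typing import Tuple


def decode_step_estimate(
    n_params: float,
    bytes_per_param: float,
    mem_bandwidth: float,  # bytes per second (hypothetical hardware figure)
    peak_flops: float,     # floating-point operations per second (hypothetical)
) -> Tuple[float, float]:
    """Rough time to read the weights vs. time to do the math for one token."""
    bytes_read = n_params * bytes_per_param  # every weight is read once
    flops = 2 * n_params                     # about one multiply-add per weight
    return bytes_read / mem_bandwidth, flops / peak_flops


# Assumed numbers: 1e9 parameters, bf16, 400 GB/s memory, 20 TFLOP/s compute.
t_mem, t_math = decode_step_estimate(1e9, 2, 400e9, 20e12)
print(f"reading weights: {t_mem * 1e3:.2f} ms, math: {t_math * 1e3:.2f} ms")
```

Under these made-up figures, reading the weights takes tens of times longer than the arithmetic, which is the slack speculative decoding tries to use.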

The bet: draft, then verify

Speculative decoding is the trick that says yes. The idea is to guess the next few tokens cheaply, then have the real model check all the guesses at once in a single forward pass. Checking a batch of D guesses costs about the same as producing one token normally, because it is still one pass over the weights, so every guess that survives is a token you got for free. Guess well and you emit several tokens per pass; guess badly and you have wasted a little cheap drafting, nothing more.

Classically you need a second, smaller “draft model” to do the guessing — an extra model to train, load, and keep in sync. Multi-token prediction (MTP) removes that cost: the model drafts for itself. Qwen3.5 ships one small extra module — the MTP head — whose only job is to propose the next token while the main model is still finishing the current one.

The MTP head Qwen3.5 carries

It is deliberately tiny, because it borrows almost everything from the main model. It takes two inputs: the main model's last hidden state for the current position, and the embedding of the token that was just emitted — and crucially that embedding comes from the same embedding matrix the main model uses (mtp_use_dedicated_embeddings: false). It normalizes both, fuses them with a single projection, runs them through one transformer layer, and reads out a token through the same tied LM head the main model uses. That is the whole head:

Inside one MTP module

Shares the main model's embedding + LM head, so the head is tiny — one transformer layer, three norms, one projection.

Dashed boxes are shared with the main model — no new weights of their own.

config.json: mtp_num_hidden_layers: 1 · mtp_use_dedicated_embeddings: false

Two details matter for later. First, the head's one layer is a full-attention layer — not one of the 18 GatedDeltaNet layers — and it keeps its own little attention cache; it never touches the main model's recurrent state. Second, because the embedding and the output head are shared, the only new weights are three RMSNorms, one projection, and that single layer. In the config this is one line — mtp_num_hidden_layers: 1 — a single module that predicts one token beyond the next.
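The data flow through the head can be sketched in plain Python. This is an illustration under stated assumptions, not the shipped implementation: the function and parameter names are hypothetical, the concatenation order and the position of the third norm are guesses, and the transformer layer is passed in as an opaque callable rather than implemented.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def rms_norm(x: Vector, weight: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm: scale x to unit root-mean-square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def mtp_head_logits(
    hidden: Vector,        # main model's last hidden state, size d
    token_id: int,         # token that was just emitted
    embed: Matrix,         # shared embedding matrix, vocab x d
    norm_h: Vector,        # new weights: norm for the hidden state
    norm_e: Vector,        # new weights: norm for the embedding
    norm_out: Vector,      # new weights: final norm (placement assumed)
    proj: Matrix,          # new weights: the single projection, d x 2d
    layer: Callable[[Vector], Vector],  # new weights: one transformer layer
) -> Vector:
    h = rms_norm(hidden, norm_h)
    e = rms_norm(embed[token_id], norm_e)  # looked up in the shared matrix
    fused = matvec(proj, e + h)            # list concatenation, then 2d -> d
    out = rms_norm(layer(fused), norm_out)
    return matvec(embed, out)              # tied LM head reuses `embed`
```

Only `norm_h`, `norm_e`, `norm_out`, `proj`, and whatever sits inside `layer` are new parameters here; `embed` is used twice and owned by the main model.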

The loop: draft $D$, verify once, accept what matches

Now put the head to work. One decode step becomes three moves. Draft: the cheap MTP head proposes D candidate tokens, one after another (Qwen3.5's default depth is D = 1, though an engine can raise it). Verify: the full model takes the window [last committed token, d₁, …, d_D] and runs it through in one batched forward pass, getting its own opinion at every position at once. Accept: walk the draft left to right and keep each guess that matches what the full model would have picked; at the first mismatch, emit a correction instead and stop — at temperature 0 that correction is simply the full model's top token, while with sampling it is drawn from an adjusted distribution built from both models' probabilities, not literally the target's own pick. Step through it:

Interactive: speculative decoding (draft, verify, accept). Steps: commit, draft, verify, tally. Illustrative tokens showing the concept, not this model's real output.

We have committed 5 tokens so far. Plain decoding would now run the full model once to get exactly one more token. Speculative decoding tries to get several.

Lossless: same distribution as plain decoding. Best case: D + 1 = 4 tokens per pass. Worst case: 1 token plus a small draft tax.

Measured in this project's native engine: ~1.06–1.46× faster, ~1.3–1.46 tokens per verify pass at temperature 0 — the gain depends on how predictable the text is. Qwen3.5's default draft depth is 1.

Count the payoff. If K of the D drafts are accepted, the step emits K + 1 tokens — the K good guesses plus one token the verify pass produced for free (the correction at the first miss, or a bonus token past the end if all drafts were right). So one pass over the weights yields anywhere from 1 token (everything rejected — the same single full-model pass as plain decoding, plus the small wasted cost of the draft) up to D + 1.
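The three moves and the K + 1 count fit in one short function. This is a sketch for temperature 0 only, with hypothetical toy functions standing in for the full model and the MTP head; a real engine would get all the verify picks from one batched forward pass, which the list comprehension below merely simulates.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(
    target_next: NextToken,  # full model's top pick given a context
    draft_next: NextToken,   # cheap drafter's guess given a context
    committed: List[int],
    depth: int,
) -> List[int]:
    """One greedy draft/verify/accept step. Returns the tokens emitted."""
    # Draft: propose `depth` tokens, one after another.
    drafts: List[int] = []
    ctx = list(committed)
    for _ in range(depth):
        guess = draft_next(ctx)
        drafts.append(guess)
        ctx.append(guess)

    # Verify: the full model's pick at every position of the window.
    picks = [target_next(list(committed) + drafts[:i]) for i in range(depth + 1)]

    # Accept: keep matching drafts; at the first miss emit the correction.
    emitted: List[int] = []
    for guess, pick in zip(drafts, picks):
        if guess != pick:
            emitted.append(pick)
            return emitted
        emitted.append(guess)
    emitted.append(picks[depth])  # all drafts right: bonus token past the end
    return emitted


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # hypothetical "full model": counts upward


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical drafter: agrees with the target except right after a 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(toy_target, toy_draft, [0], depth=3))  # all accepted
print(speculative_step(toy_target, toy_draft, [2], depth=3))  # one accepted
print(speculative_step(toy_target, toy_draft, [3], depth=3))  # none accepted
```

The three calls emit D + 1, 2, and 1 tokens respectively, and in every case the emitted tokens are exactly what `toy_target` would have produced on its own.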

Why it is exactly lossless

Here is the part that makes speculative decoding honest rather than a quality shortcut: at temperature 0, a draft is kept only when it matches the full model's own top pick at that position. With sampling, the accept test is different but just as strict: it compares the target and draft probabilities for that exact drafted token, accepts with probability $\min(1, p_{\text{target}} / p_{\text{draft}})$, and on a reject resamples from a corrective residual distribution. Either way the verify pass uses the real model's real distribution; the MTP head never gets the last word. With the proper acceptance rule (the speculative sampling/decoding algorithm from Leviathan et al. and, independently, Chen et al., both 2023), the tokens you emit follow exactly the same distribution as plain one-at-a-time decoding: the same probabilities at every step and the same temperature behaviour, just produced in fewer passes. At temperature 0 that means the identical text; with sampling, the identical odds.
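The sampling-time rule is small enough to write out and check. The sketch below handles a single drafted token, with `p` as the target distribution and `q` as the draft distribution; the function names are illustrative. `emitted_distribution` computes exactly what comes out when the draft is sampled from `q`, so the lossless claim can be verified numerically rather than taken on faith.

```python
import random
from typing import List, Sequence


def residual(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Corrective distribution used after a reject: normalised max(0, p - q)."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p equals q nothing is ever rejected, so this fallback is never sampled.
    return [d / total for d in diff] if total > 0 else list(p)


def accept_or_resample(
    drafted: int, p: Sequence[float], q: Sequence[float], rng: random.Random
) -> int:
    """Keep the drafted token with probability min(1, p/q), else resample."""
    # q[drafted] > 0 because the token was sampled from q.
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    return rng.choices(range(len(p)), weights=residual(p, q))[0]


def emitted_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Exact distribution of the emitted token when the draft comes from q."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]  # q_i * min(1, p_i / q_i)
    p_reject = 1.0 - sum(accepted)
    return [a + p_reject * r for a, r in zip(accepted, residual(p, q))]


p = [0.6, 0.3, 0.1]  # target probabilities (made-up example)
q = [0.2, 0.5, 0.3]  # draft probabilities (made-up example)
print(emitted_distribution(p, q))
```

The printed distribution matches `p` up to floating-point rounding, whatever `q` is: a poor drafter lowers the acceptance rate but does not shift the output.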

That is also why a wrong guess is cheap: a rejected draft costs only the little bit of MTP compute that produced it. The expensive verify pass was going to happen regardless — it is the same forward pass plain decoding would have run to produce that one token. Speculative decoding is, at worst, plain decoding with a small drafting tax; at best, a multiple of the speed.
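The range between those two extremes can be estimated with a simple model. The sketch below assumes each draft is accepted independently with the same probability `alpha`, that a verify pass costs the same as a plain decode step, and that each drafted token costs a fraction `draft_cost` of a full pass. Real acceptance is neither independent nor constant, so treat the numbers as a rough guide.

```python
def expected_tokens_per_pass(alpha: float, depth: int) -> float:
    """Expected K + 1 when each of `depth` drafts is accepted with prob. alpha.

    The chance that the first k drafts all survive is alpha**k, so the
    expectation is 1 + alpha + alpha**2 + ... + alpha**depth.
    """
    return sum(alpha ** k for k in range(depth + 1))


def estimated_speedup(alpha: float, depth: int, draft_cost: float) -> float:
    """Tokens per step divided by the step's cost relative to plain decoding."""
    return expected_tokens_per_pass(alpha, depth) / (1.0 + depth * draft_cost)


# Depth 1, as in Qwen3.5's default: tokens per pass is simply 1 + alpha.
for alpha in (0.0, 0.3, 0.46, 1.0):
    print(alpha, expected_tokens_per_pass(alpha, depth=1))
```

Read against this model, the 1.3 to 1.46 tokens per verify pass quoted earlier at depth 1 would correspond to an acceptance probability of roughly 0.3 to 0.46.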

Where the idea came from

MTP did not arrive fully formed. Three steps got us here, and the design Qwen3.5 ships is the third:

The MTP lineage: three milestones

Meta, 2024 (parallel)
n = 4 parallel heads
independent, no causal chain between them
training: auxiliary task
inference: self-speculative, up to 3× faster
+12% HumanEval / +17% MBPP at 13B
arXiv 2404.19737

DeepSeek-V3, 2024 (sequential)
D = 1 (one extra token)
sequential, keeps the causal chain
shares embedding + output head
train loss λ·(1/D)·ΣLᵏ, λ: 0.3→0.1
inference: discard, or speculative-decode
arXiv 2412.19437

Qwen3.5 / Qwen3-Next, 2025 (sequential), the model in this course
one shared sequential module, depth 1
mtp_num_hidden_layers: 1
shares the tied embedding (mtp_use_dedicated_embeddings: false)
boosts pretraining efficiency AND inference speed
runs in vLLM / SGLang / this project's native engine

The field moved from parallel independent heads (Meta) to one sequential module that keeps the causal chain (DeepSeek → Qwen).

Meta's 2024 version bolted four parallel heads onto a shared trunk, each predicting a different future token at the same time — great as a training signal, but the heads are independent, so they cannot condition on each other. DeepSeek-V3 changed the shape to a single sequential module that predicts one extra token while keeping the causal chain — the draft of token t+2 gets to see the chosen t+1. Qwen3.5 (built on the Qwen3-Next architecture) adopts exactly that sequential, weight-sharing design: one module, depth one.

It earns its keep twice

The same head pays off at two completely different times. During training, asking each position to predict the next two tokens instead of one is a denser learning signal — more to learn from every token of data. Meta reported their multi-token models solving 12% more HumanEval and 17% more MBPP problems at 13B; DeepSeek-V3 folds an MTP loss into pretraining with a weight that starts at $λ = 0.3$ and drops to $λ = 0.1$ for the final stretch of tokens:

$$L_{\text{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} L_{\text{MTP}}^{k}$$
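As a quick numeric check of that weighting, here is a minimal sketch. The per-depth loss values are made up, and adding the result to the ordinary next-token loss is how an auxiliary objective is normally combined, not a detail taken from this chapter.

```python
from typing import Sequence


def mtp_loss(per_depth_losses: Sequence[float], lam: float) -> float:
    """lambda / D times the sum of the per-depth MTP losses (D = their count)."""
    depth = len(per_depth_losses)
    return lam / depth * sum(per_depth_losses)


def total_loss(main_loss: float, per_depth_losses: Sequence[float], lam: float) -> float:
    """Next-token loss plus the weighted MTP term (assumed combination)."""
    return main_loss + mtp_loss(per_depth_losses, lam)


# With D = 1 the average is trivial, so lambda alone sets the weight.
print(mtp_loss([2.0], lam=0.3))  # early in training
print(mtp_loss([2.0], lam=0.1))  # final stretch
```

Dividing by D keeps the MTP term on the same scale however many depths are predicted, so λ alone controls how much the extra signal counts.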

Qwen describes its own MTP as boosting both pretraining efficiency and inference speed. During inference, the very same weights become the free self-drafter from the loop above. One module, two jobs: a better-trained model, and a faster one.

A zoo of drafters

Everything above used the MTP head as the source of the cheap draft. But the draft → verify → accept skeleton doesn't care where the guess comes from — only that it's cheap, and that the verify pass keeps the lossless guarantee. Swap out the drafter and you get a whole family of speculative decoding, all sharing the exact same verify pass you just stepped through.

There's a draft-target setup (a second, smaller model writes the draft — easiest to bolt on, but you now run two models); Medusa (grow a few extra prediction heads on the model itself, no separate model); EAGLE (a purpose-built tiny model fed the target's hidden states — the go-to for general chat); and n-gram / lookahead decoding (no draft model at all — just look up what usually follows, which wins big on code and predictable syntax). The MTP head you just met is the one Qwen3.5 ships. Pick a drafter and compare what changes — and what doesn't:

Interactive: pick your drafter (the speculative decoding family). The skeleton is the same every time; only the drafter box changes. Selected here: the MTP head (you are here, Qwen3.5).

draft source: the model's own built-in MTP head
typical draft length: 1 token (Qwen3.5 default depth)
best for: what this course already covered

The model drafts for itself through its built-in MTP head: no second model, no training. This is exactly the draft → verify → accept loop the course already walked through, and the variant this project's native engine actually runs.

The in-browser demo runs none of these — it does plain autoregressive decode, one token per pass. The project's native engine runs the MTP variant (~1.06–1.46× there). Read this selector as the menu a production server picks from — n-gram for code, EAGLE for general chat — all sharing the same lossless verify the course already proved.
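Of the drafters on that menu, the n-gram one is simple enough to sketch in full. The version below is a simplified illustration with a hypothetical function name, not the algorithm any particular server ships: it finds the most recent earlier occurrence of the last `n` tokens in the context and proposes whatever followed them.

```python
from typing import List, Sequence


def ngram_draft(context: Sequence[int], n: int, max_draft: int) -> List[int]:
    """Draft by lookup: reuse what followed the last n tokens earlier on."""
    if len(context) <= n:
        return []
    key = list(context[-n:])
    # Scan backwards so the most recent earlier match wins; the starting
    # index skips the trailing n-gram itself.
    for start in range(len(context) - n - 1, -1, -1):
        if list(context[start:start + n]) == key:
            return list(context[start + n:start + n + max_draft])
    return []  # no match: fall back to plain decoding for this step


# Token ids standing in for repetitive text such as code.
print(ngram_draft([1, 2, 3, 4, 1, 2], n=2, max_draft=3))
```

The drafts then go through the same verify pass as any other, so a bad lookup costs a little wasted work and nothing else.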

Three rules hold across the whole zoo. They are all lossless — at temperature 0 verify keeps a draft only when it matches the real model's own top pick, and with sampling it runs the same probability-ratio accept/resample test from above, so every variant changes speed, never the answer. They all help most at low batch sizes, where the GPU has spare compute to run the verify pass for free; at high batch the chip is already saturated (the batching sub-chapter shows why), so servers turn speculation off. And higher temperature makes the next token harder to predict, which lowers the acceptance rate. Speculation buys speed when you're idle and the text is predictable, and quietly steps aside when it isn't.

Does your in-browser Qwen use it?

No, and as usual the honest answer is the interesting one. The architecture is real: the config for this exact model says mtp_num_hidden_layers: 1, so a full Qwen3.5 checkpoint does define the head you just saw. But two things keep it off the path in this demo. The bf16 checkpoint the browser downloads was converted without the MTP weights (of its 474 tensors, zero are mtp.*), and the WebGPU decode loop here is plain autoregressive, one token per pass, with no draft/verify machinery at all. (This demo is not alone: stock Hugging Face Transformers also discards the MTP weights; today it is mainly inference engines like vLLM, SGLang, and this project's own native Metal engine that actually run the loop, with the native engine measured at roughly 1.06–1.46× faster depending on how predictable the text is.)

So multi-token prediction sits with the vision encoder and the other parts of the full checkpoint that the architecture defines but this browser tour does not walk: real in Qwen3.5, switched off here on purpose. You can even see it greyed out in the architecture map — the “MTP head” node, off the route a token actually travels in this demo.