Knowledge distillation

Go deeper · Chapter 15, From base model to assistant — training a small model to imitate a bigger one, instead of (or in addition to) imitating humans.

Same recipe, a different source for the label

The post-training chapter described instruction tuning (SFT) as the same next-token loss the model has always used, just pointed at a small, curated dataset of (instruction, response) pairs — usually written or vetted by humans. Knowledge distillation keeps that exact setup and changes one thing: instead of a human writing the response, a bigger, more capable teacher model writes it (or, in the classic version below, supplies the whole target distribution). The student — the model actually being trained, and usually much smaller than the teacher — learns to imitate the teacher instead of, or in addition to, a human annotator.

It's still exactly the loss the training chapter introduced: $-\log p(\text{target token})$, averaged over every scored position. At initialization that's $\ln(248{,}320) \approx 12.4$ nats, the loss of uniform guessing over this model's vocabulary. Distillation never touches that formula. It only changes what counts as the target.
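As a quick check of that arithmetic, here is a minimal sketch in plain Python. The helper name `mean_nll` and the example probabilities are made up for illustration; only the vocabulary size comes from the text.

```python
import math

VOCAB_SIZE = 248_320  # vocabulary size quoted in the text above

def mean_nll(target_probs):
    """Average of -log p(target token) over the scored positions, in nats."""
    return sum(-math.log(p) for p in target_probs) / len(target_probs)

# A model that guesses uniformly gives every token probability 1 / VOCAB_SIZE,
# so its loss is ln(VOCAB_SIZE) no matter how many positions are scored.
uniform_loss = mean_nll([1.0 / VOCAB_SIZE] * 5)
print(f"uniform-guess loss: {uniform_loss:.2f} nats")

# Hypothetical probabilities a partly trained model might give the correct tokens.
trained_loss = mean_nll([0.9, 0.5, 0.7])
print(f"example trained loss: {trained_loss:.2f} nats")
```

Whatever wrote the target text (a human or a teacher model), the function is the same; only the tokens being scored change.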

The classic recipe: soft targets and "dark knowledge"

The technique is older than the LLM era — Geoffrey Hinton, Oriol Vinyals, and Jeff Dean introduced it in 2015 under the name distillation (building on an earlier idea from Rich Caruana and collaborators, who matched raw logits directly — a special case of Hinton's method in the high-temperature limit, as the paper proves). Their insight: a trained classifier's softmax assigns some probability to every class, even the wrong ones, and those small probabilities are not noise — they encode how the model tends to generalize. Their own example, quoted directly:

"An image of a BMW, for example, may only have a very small chance of being mistaken for a garbage truck, but that mistake is still many times more probable than mistaking it for a carrot." — Hinton, Vinyals & Dean, Distilling the Knowledge in a Neural Network, arXiv:1503.02531, §1

A one-hot label ("this is a BMW") throws that pattern away entirely. To keep it, soften the teacher's softmax with a temperature $T$ :

$$q_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

where $z_i$ is the logit for class $i$ and $q_i$ is the resulting softened probability.

At $T = 1$ this is ordinary softmax; raising $T$ flattens the distribution so the near-miss (garbage truck) and the far-miss (carrot) separate from each other instead of both rounding to zero. The student is then trained on a weighted mix of two cross-entropies: one against the teacher's soft targets at that same high $T$ , and a lighter one against the ordinary hard label at $T = 1$ . One detail worth keeping: because the soft-target term's gradient shrinks like $1/ T^{2}$ , Hinton's paper multiplies that term by $T^{2}$ so its contribution doesn't quietly vanish as $T$ is raised.
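The weighted mix described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's code: the function names, the default `T = 4.0`, and the mixing weight `alpha = 0.9` are hypothetical choices, and the logits in the usage lines are invented.

```python
import math

def softmax(logits, T=1.0):
    """Temperature softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """-sum_i target_i * log(predicted_i), in nats."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.9):
    # Soft term: student and teacher are both softened at the same T.
    soft = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    # Hard term: ordinary cross-entropy against the true label at T = 1.
    hard = -math.log(softmax(student_logits, 1.0)[hard_label])
    # T**2 compensates for the soft term's gradient shrinking like 1 / T**2.
    return alpha * (T ** 2) * soft + (1 - alpha) * hard

# Invented logits: one teacher, one untrained student, one student near the teacher.
teacher = [6.0, 2.5, 0.3, -0.3]
student_far = [1.0, 1.0, 1.0, 1.0]
student_close = [5.5, 2.0, 0.5, 0.0]
loss_far = distillation_loss(student_far, teacher, hard_label=0)
loss_close = distillation_loss(student_close, teacher, hard_label=0)
print(loss_far > loss_close)
```

The student whose logits sit closer to the teacher's gets the lower loss, which is the signal the soft targets are meant to supply.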

The results were not subtle. On MNIST, a small network with two hidden layers of 800 units, trained the ordinary way, made 146 test errors; the identical network, regularized solely by adding the task of matching a larger, dropout-regularized network's soft targets (at $T = 20$), made 74, roughly half. On a commercial speech model (~85M parameters, about 2,000 hours of audio), a 10-model ensemble improved frame accuracy from 58.9% to 61.1%; a single model distilled from that ensemble reached 60.8%. The paper reports this captured "more than 80%" of the ensemble's gain, in one model instead of ten.
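Both ratios can be recomputed directly from the figures quoted above (the rounding is ours):

$$\frac{60.8 - 58.9}{61.1 - 58.9} = \frac{1.9}{2.2} \approx 0.86, \qquad \frac{74}{146} \approx 0.51$$

So the distilled speech model recovers about 86% of the ensemble's improvement, consistent with the paper's "more than 80%", and the distilled MNIST network makes about half as many errors.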

The modern recipe: just SFT, with a smarter source of text

A prominent recent LLM distillation effort drops the soft-distribution machinery entirely. DeepSeek's R1 paper distills its own long chain-of-thought reasoning into six much smaller dense models (Qwen2.5-Math-1.5B/7B, Qwen2.5-14B/32B, Llama-3.1-8B, Llama-3.3-70B) using 800,000 (prompt, response) pairs where every response was generated by DeepSeek-R1 itself (about 600K reasoning traces plus 200K general-purpose examples). The method, stated verbatim in the paper's appendix: "we apply only SFT and do not include an RL stage." No temperature, no KL divergence, no teacher logits at all: the teacher's generated text is simply treated as an ordinary SFT target, exactly the recipe from the top of this page.

The headline result argues distillation isn't just cheaper than training a small reasoner from scratch: it can be better. Directly reinforcement-learning a Qwen2.5-32B base model with no distillation ("Qwen2.5-32B-Zero") reaches AIME 2024 pass@1 of 47.0. Simply fine-tuning that same-size base model on DeepSeek-R1's generated responses (plain SFT, no RL at all) reaches 72.6. Even the smallest distilled model, DeepSeek-R1-Distill-Qwen-1.5B, beats GPT-4o and Claude-3.5-Sonnet on that benchmark. Compare both eras side by side:

Two eras of distillation, same teacher → student idea

Both trainings below fine-tune a small STUDENT model on a big TEACHER model’s output instead of (or in addition to) a human-written label. What differs is what counts as the "label."

Classic · Hinton, Vinyals & Dean, 2015

Soft targets at a shared temperature

Soften the teacher’s softmax with a temperature T, then train the student to match that SAME softened distribution at that SAME T (plus a lighter cross-entropy on the true hard label at T = 1). As T rises, the wrong-answer probabilities grow from vanishing to visible. For example:

Temperature: T = 6

BMW (correct): 99.7%
garbage truck: 0.292%
carrot: 0.007%
sofa: 0.003%


Logits (60, 25, 3, −3) are invented for illustration — the paper describes this example qualitatively and publishes no logit values for it. At T = 1 the distribution is nearly one-hot on BMW, and the wrong answers are pinned near zero — too small to carry a learnable signal. Raising T inflates them to a visible, comparable size, exposing the relative structure among them — the "dark knowledge" the student also learns to imitate — even as the softening pulls every probability, right answer included, toward the uniform distribution. Because the soft-target gradient scales as 1/T², Hinton’s loss multiplies that term by T² so its weight stays constant as T changes.
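The percentages in this panel can be reproduced from those invented logits with a short plain-Python sketch. The `softmax` helper is ours, and the temperatures 1 and 20 are extra points chosen only for comparison.

```python
import math

def softmax(logits, T):
    """Temperature softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["BMW", "garbage truck", "carrot", "sofa"]
logits = [60.0, 25.0, 3.0, -3.0]  # the invented illustration logits from the text

for T in (1, 6, 20):
    probs = softmax(logits, T)
    row = ", ".join(f"{name}: {p:.3%}" for name, p in zip(labels, probs))
    print(f"T = {T:>2}: {row}")
```

At T = 6 the printed row rounds to the 99.7% / 0.292% / 0.007% / 0.003% shown in the panel. At T = 1 the three wrong answers are far too small to register at this precision, and at T = 20 they are all clearly visible.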

Modern · DeepSeek-R1-Distill, 2025

Plain SFT on teacher-written text

No soft distribution, no KL divergence at all — just ordinary next-token cross-entropy, the same loss the training chapter introduced, on (prompt, response) pairs where the RESPONSE was generated by the teacher instead of a human annotator.

Prompt: "Prove that √2 is irrational."

DeepSeek-R1’s response → SFT target for the student:

<think> Assume, for contradiction, that √2 = p/q in lowest terms… </think>

So √2 cannot be rational. ∎

Every one of these tokens gets an ordinary −log p(token) loss — exactly like SFT, just with a bigger, smarter source of training text.
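A minimal sketch of that per-token loss, in plain Python. It assumes the common convention of scoring only the response tokens and skipping the prompt, which the text above does not spell out for this recipe; `sft_loss` and all the probabilities are hypothetical.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean -log p(token) over the positions flagged as response tokens.

    token_probs[i] is the probability the student gave the i-th token of the
    sequence; is_response[i] says whether that position is scored.
    """
    scored = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    return sum(scored) / len(scored)

# Hypothetical sequence: 3 prompt tokens followed by 4 teacher-written tokens.
token_probs = [0.20, 0.10, 0.30, 0.60, 0.80, 0.50, 0.90]
is_response = [False, False, False, True, True, True, True]

loss = sft_loss(token_probs, is_response)
print(f"student loss on the teacher's response: {loss:.3f} nats")
```

Nothing in the function knows or cares that a teacher model wrote the response; that is the whole point of the modern recipe.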

AIME 2024 pass@1: distillation vs. direct RL, same size

Qwen2.5-32B-Zero (direct RL, no distillation): 47.0
DeepSeek-R1-Distill-Qwen-32B (SFT-distilled from R1, no RL stage): 72.6

DeepSeek-R1 paper, Appendix F.1 (arXiv:2501.12948): “we apply only SFT and do not include an RL stage” for the distilled models.

Both are called "distillation" for the same reason — a smaller student learns from a bigger teacher instead of only from humans. What changes across the two eras is exactly one thing: whether the target is a temperature-softened probability distribution, or the teacher’s own generated text treated as an ordinary SFT example.

Not mutually exclusive: Qwen's own strong-to-weak distillation

The two recipes above aren't rivals. The Qwen3 Technical Report (Qwen3 being the generation before the one this course uses) documents a pipeline that applies both, one after the other, to Qwen3's small dense and MoE models. First comes an off-policy phase: plain SFT on responses generated by a larger teacher (Qwen3-32B or Qwen3-235B-A22B), which is the modern recipe above. Then comes an on-policy phase: the small student generates its own rollouts and is fine-tuned to minimize the KL divergence between its own logits and the teacher's. That is the same match-the-teacher's-distribution idea as Hinton's 2015 method, run on the student's own samples instead of a fixed dataset. Qwen reports this reached higher benchmark scores than their full four-stage training pipeline while using "only 1/10 of the GPU hours." It's a real, documented example of the classic and modern recipes this page just walked through coexisting in one training run.
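The on-policy objective can be sketched in plain Python as a per-position KL divergence along a rollout the student generated itself. Several things here are assumptions rather than details from the report: the KL direction (student relative to teacher), the simple mean over positions, the function names, and the toy logits.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def on_policy_distill_loss(student_logits_per_pos, teacher_logits_per_pos):
    """Mean per-position KL(student || teacher) along one student rollout.

    Both arguments hold one logit vector per generated position; the teacher's
    logits are assumed to be computed on the tokens the student sampled.
    """
    kls = [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits_per_pos, teacher_logits_per_pos)
    ]
    return sum(kls) / len(kls)

# Hypothetical 3-token rollout over a 4-token vocabulary.
student = [[2.0, 0.5, 0.1, -1.0], [0.3, 1.8, 0.0, 0.2], [1.0, 1.0, 1.0, 1.0]]
teacher = [[2.5, 0.4, 0.0, -1.2], [0.1, 2.2, -0.3, 0.0], [3.0, 0.0, 0.0, 0.0]]

loss = on_policy_distill_loss(student, teacher)
print(f"mean per-token KL: {loss:.4f} nats")
```

The loss is zero only when the student's distribution matches the teacher's at every position, so unlike plain SFT it uses the teacher's whole distribution rather than a single sampled token.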

Does your in-browser Qwen do this?

Unconfirmed — and worth saying plainly rather than assuming it by analogy. The paragraph above is about Qwen3, the previous generation. For Qwen3.5 — the Qwen3-Next-based family this course actually teaches, including the 0.8B model running in this tab — no official source turned up any mention of distillation. The Qwen3.5-0.8B model card, the QwenLM/Qwen3.5 GitHub README, and the flagship qwen.ai announcement all describe this family's training as pretraining followed by "Scaled Reinforcement Learning" — reinforcement learning run across large numbers of agent environments — with no teacher model, no "strong-to-weak" pipeline, and no distillation mentioned anywhere for Qwen3.5 specifically. So the honest answer is: Qwen has publicly documented doing exactly what this page describes, for its previous generation's small models. Whether the same trick, or some variant of it, touched this specific 0.8B checkpoint is not something any public source confirms — so this course does not claim it.