Evaluation: perplexity, benchmarks, and how much to trust them

Go deeper · Chapter 16, The whole model — the finished model exists now; how do you put a number on how good it is?

Chapter 13 built the training objective from -log p(target): the model's loss is the negative log-probability it assigns the real next token, averaged over every position. That number is not just a training signal; it is also the only ingredient evaluation needs. Perplexity is that same loss, exponentiated:

$$\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t} \log p_{\theta}(x_{i} \mid x_{<i})\right) = \exp(\text{cross-entropy loss})$$

which the Hugging Face Transformers docs describe as “the exponentiation of the cross-entropy between the data and model predictions.” It has a clean reading: a perplexity of N means the model is, on average, as uncertain as if it were guessing uniformly among N equally likely options. Predict every token with total confidence and perplexity is 1; guess uniformly between 2 options and it is 2; be utterly lost across the whole vocabulary and perplexity approaches the vocabulary size itself.
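To make the formula concrete, here is a minimal sketch that turns a list of per-token log-probabilities into a perplexity. The function name `perplexity` and the toy inputs are illustrative assumptions, not part of any library; in practice the log-probabilities would come from the model's output on held-out text.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities log p(x_i | x_<i), one per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Mean negative log-likelihood is the cross-entropy loss in nats.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy case 1: the model puts probability 1 on the correct token every time.
certain = [math.log(1.0)] * 5
# Toy case 2: the model splits its belief evenly between 2 options every time.
coin_flip = [math.log(0.5)] * 5

print(perplexity(certain))
print(perplexity(coin_flip))
```

The two toy cases are the two ends of the intuition: certainty gives a perplexity of 1, and an even split between two options gives a perplexity of about 2.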

This exact model's own number, exponentiated

Chapter 13 already worked this out: at initialization every one of this model's 248,320 vocabulary tokens gets roughly the same probability, so the loss starts at −ln(1/248,320) = ln(248,320) ≈ 12.4 nats. Exponentiate that and you get back exactly 248,320 — the vocabulary size itself. That is not a coincidence: the perplexity of a perfectly uniform guess over V outcomes is always exactly V, because cross-entropy over a uniform distribution is ln V and exp(ln V) = V. So before a single gradient step, this exact model's perplexity over its own vocabulary is ≈ 248,320 — and every nat the loss drops during training is this number shrinking. Two real, published anchors show where a well-trained model ends up: GPT-3 175B reached a perplexity of 20.50 on Penn Treebank (versus a prior GPT-2-era best of 35.8), and 1.92 on LAMBADA next-word prediction with few-shot prompting (versus a prior best of 8.63) — both from the original GPT-3 paper. GPT-2's second-largest model (762M parameters — 1.5B is the largest) reported a WikiText-2 perplexity of 19.93. Same formula throughout the ladder below — from this model's own ≈248,320 at initialization down to GPT-3's 1.92, five orders of magnitude of improvement:

[Figure: Perplexity compresses as loss drops (log scale). Axis labels: confident (low perplexity), confused (high perplexity).]

Perplexity = exp(cross-entropy loss). A uniform guess over V outcomes always has perplexity exactly V — so this model’s own ln(248,320) ≈ 12.4-nat init loss from Chapter 13 IS this top rung, just exponentiated. Every nat the loss drops during training is this bar shrinking.
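The same ladder can be read in the other direction: take each perplexity quoted above and convert it back into a loss in nats with a natural log. This is a small arithmetic sketch using only the numbers from the text. The variable names are illustrative, and the rows come from different datasets and tokenizers, so they mark a scale rather than a like-for-like ranking.

```python
import math

VOCAB_SIZE = 248_320  # this model's vocabulary size, as given in the text

# Perplexities quoted in the text, largest first.
ladder = {
    "uniform guess at initialization": float(VOCAB_SIZE),
    "GPT-3 175B, Penn Treebank": 20.50,
    "GPT-2 762M, WikiText-2": 19.93,
    "GPT-3 175B, LAMBADA few-shot": 1.92,
}

for name, ppl in ladder.items():
    loss_nats = math.log(ppl)  # invert PPL = exp(loss)
    print(f"{name:34s} PPL {ppl:>12,.2f}   loss {loss_nats:6.3f} nats")

init_loss = math.log(VOCAB_SIZE)
# How many powers of ten separate the top rung from the bottom one.
orders_of_magnitude = math.log10(VOCAB_SIZE / 1.92)
print(f"init loss = {init_loss:.2f} nats, span = {orders_of_magnitude:.1f} orders of magnitude")
```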

Standard benchmarks: fixed questions, fixed answers

Perplexity is cheap to compute and needs no held-out task design, but it only measures how well the model predicts generic text — it says nothing about whether the model can answer a trivia question, write working code, or solve a math problem. For that, the field leans on benchmarks: fixed sets of questions with fixed, checkable answers.

MMLU (Massive Multitask Language Understanding, Hendrycks et al. 2020) is 57 school- and professional-level subjects — history, law, medicine, physics, and more — packaged as 15,908 four-way multiple-choice questions. It is scored as plain multiple-choice accuracy against a 25% random-guess floor; GPT-3 175B reached 43.9% few-shot, while smaller GPT-3 sizes barely cleared chance (24.9–26.0%).
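Multiple-choice accuracy is simple enough to sketch in a few lines. The example below assumes the evaluation harness produces one score per answer letter (for instance a log-probability) and picks the highest. That scoring scheme, the helper names, and the four made-up questions are all assumptions for illustration, not a description of any particular harness.

```python
CHANCE_FLOOR = 0.25  # four options, so random guessing scores 25%


def pick_option(option_scores):
    """Return the answer letter with the highest score."""
    return max(option_scores, key=option_scores.get)


def accuracy(predictions, gold):
    """Fraction of questions where the predicted letter equals the gold letter."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("need equally sized, non-empty lists")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical per-option log-probabilities for four made-up questions.
question_scores = [
    {"A": -2.1, "B": -0.4, "C": -3.0, "D": -2.7},
    {"A": -1.9, "B": -1.8, "C": -0.2, "D": -2.2},
    {"A": -0.9, "B": -1.1, "C": -1.4, "D": -1.0},
    {"A": -1.5, "B": -1.6, "C": -1.2, "D": -1.3},
]
gold = ["B", "C", "D", "C"]

predictions = [pick_option(scores) for scores in question_scores]
acc = accuracy(predictions, gold)
print(predictions, acc, "above chance" if acc > CHANCE_FLOOR else "at or below chance")
```

Any reported accuracy only means something relative to the floor: a score within a point or so of 25% is indistinguishable from guessing.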

HumanEval (Chen et al. 2021, the paper that introduced OpenAI Codex) is 164 hand-written Python problems: given a function signature and a docstring, complete the body, then run real unit tests against it (7.7 tests per problem on average). Because sampling is random, a model gets k attempts and the metric is pass@k: the fraction of problems solved by at least one of k samples. Estimating that from a single batch of k samples is noisy, so the paper draws n ≥ k samples per problem and uses an unbiased estimator:

$$\text{pass@}k := \mathbb{E}_{\text{Problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

where n samples are drawn per problem and c of them pass every test. Codex reached pass@1 = 28.8% and pass@100 = 70.2% — while plain GPT-3, with no code fine-tuning, solved 0%.
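The estimator is one minus the probability that a random draw of k samples, out of the n generated, contains no passing sample: 1 - C(n-c, k) / C(n, k). Here is a minimal sketch using exact integer binomials from the standard library. The function names and the per-problem counts are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them pass, budget k."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, which gives pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, 10 samples each.
results = [(10, 3), (10, 0), (10, 10)]

print(pass_at_k(10, 3, 1))         # equals c / n when k = 1
print(pass_at_k(10, 3, 5))         # more attempts, higher chance of at least one pass
print(mean_pass_at_k(results, 1))
```

With k = 1 the estimator reduces to the plain pass rate c / n, which is a quick sanity check on any implementation.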

GSM8K (Cobbe et al. 2021) is 8.5K grade-school word problems (7.5K for training, 1K held out for testing) — “Natalia sold clips to 48 friends in April, and half as many in May…” — graded by exact match on the final numeric answer, not pass@k: there is no code to run, just one number to check. The paper's headline result is about a verifier, not raw scale: a 6B model fine-tuned to solve problems, paired with a second 6B model trained to verify solutions and re-rank 100 sampled candidates, slightly outperforms a fine-tuned 175B model sampling once with no verifier at all — a gain the paper describes as “approximately equivalent to a 30x model size increase.”
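Exact-match grading can be sketched as follows. GSM8K reference solutions end with the final answer after a `####` marker. Taking the last number in the model's output as its answer is an assumption of this sketch, as are the helper names and the made-up example problem.

```python
import re

# A number with optional sign, thousands separators, and decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text):
    """Last number mentioned in the text, with commas removed, or None."""
    matches = NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None


def reference_answer(solution):
    """The final answer that follows the '####' marker in a reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def exact_match(model_output, solution):
    """True when the model's final number equals the reference number."""
    predicted = final_number(model_output)
    if predicted is None:
        return False
    return float(predicted) == float(reference_answer(solution))


# Hypothetical problem: 3 apples plus 4 apples.
solution = "She has 3 + 4 = 7 apples.\n#### 7"

print(exact_match("3 plus 4 makes 7, so the answer is 7.", solution))
print(exact_match("I think the answer is 8.", solution))
print(exact_match("I am not sure.", solution))
```

Only the final number is compared, so a correct answer reached through flawed reasoning still counts, and a correct chain of reasoning with a slip on the last step does not.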

LLM-as-judge: when there is no single right answer

None of the above works for open-ended output — “write me a birthday message” has no fixed correct string to match. Zheng et al. (2023) proposed the fix the industry now leans on: have a strong LLM (they used GPT-4) read two candidate responses and judge which is better, the same way a human rater would. Their MT-Bench (a fixed set of multi-turn questions) and Chatbot Arena (crowdsourced head-to-head battles) are built on exactly this idea, and GPT-4-as-judge reached >80% agreement with human preferences — comparable to how often two humans agree with each other.

The paper is equally clear about where an LLM judge goes wrong, naming three specific biases: position bias (favoring a response because of where it appears, most often whichever is shown first), verbosity bias (favoring the longer answer, regardless of whether it's better), and self-enhancement bias (a judge favoring answers it generated itself). LLM-as-judge is powerful precisely because it scales to open-ended text that fixed-answer benchmarks can't touch, but it inherits a model's own blind spots as the price of that flexibility.
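Position bias in particular lends itself to a simple check: ask the judge twice with the two answers swapped, and only count a win when both verdicts agree. The sketch below assumes a hypothetical `judge(question, first, second)` callable that returns "first", "second", or "tie". The two toy judges stand in for a real model call and are deliberately biased to show what the swap does and does not catch.

```python
def judge_both_orders(judge, question, answer_a, answer_b):
    """Return "A", "B", or "tie"; a win counts only if it survives swapping positions."""
    verdict_ab = judge(question, answer_a, answer_b)  # A shown first
    verdict_ba = judge(question, answer_b, answer_a)  # B shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"
    return "tie"


def always_first(question, first, second):
    """Toy judge with pure position bias: whatever is shown first wins."""
    return "first"


def longer_wins(question, first, second):
    """Toy judge with pure verbosity bias: the longer answer wins."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


short, long_ = "Happy birthday!", "Happy birthday! " * 5

print(judge_both_orders(always_first, "Write a birthday message", short, long_))
print(judge_both_orders(longer_wins, "Write a birthday message", short, long_))
```

The position-biased judge collapses to a tie under the swap, while the verbosity-biased judge still picks the longer answer consistently. Swapping order neutralizes one bias and leaves the others untouched.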

Contamination: why none of this is trustworthy at face value

Every benchmark above is public text on the internet, and every modern model is pretrained on a huge crawl of the internet. If a benchmark's questions — or worse, its answers — leaked into pretraining data, a model can score well by having memorized the test, not by reasoning its way to the answer. This is contamination, and catching it has gotten stricter over time.

GPT-3's own paper used a fairly loose check: 13-gram overlap between an eval example and the training crawl. By that method, it flagged over 90% of QuAC, SQuADv2, and DROP examples as contaminated — a striking number that shows just how much public benchmark text ends up baked into a large-enough web crawl. GPT-4's technical report tightened the bar to a 50-character substring check: strip spaces and symbols from both eval and training text, then see if any of three random 50-character chunks from an eval example shows up verbatim in training data. That stricter test found real leakage — OpenAI re-scored exams with the contaminated questions removed, and dropped an entire benchmark, BIG-bench, from its reported results once contamination in it was confirmed. A newer method skips the training corpus altogether: Min-K% Prob (Shi et al. 2023) looks only at the model itself, flagging a sequence as likely-memorized when its least-probable tokens are still suspiciously high-probability — a genuinely unseen example should contain a few real surprises, and a memorized one usually doesn't.
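Two of the checks above are compact enough to sketch. The first follows the 50-character substring description: strip spaces and symbols, sample three 50-character chunks from the eval example, and look for any of them verbatim in training text. The second follows the Min-K% Prob idea: average the log-probabilities of the least likely fraction of tokens. Function names, the handling of examples shorter than 50 characters, the default `k=0.2`, and the toy inputs are assumptions of this sketch. Neither function is the labs' actual code, and a real check would run against an indexed corpus rather than one string.

```python
import random


def normalize(text):
    """Strip spaces and symbols, keeping only letters and digits."""
    return "".join(ch for ch in text if ch.isalnum())


def substring_contaminated(eval_text, training_text, chunk_len=50, n_chunks=3, rng=None):
    """True if any sampled chunk of the eval example appears verbatim in training text."""
    rng = rng or random.Random(0)
    eval_norm, train_norm = normalize(eval_text), normalize(training_text)
    if not eval_norm:
        return False
    if len(eval_norm) <= chunk_len:
        return eval_norm in train_norm  # short example: compare the whole thing
    for _ in range(n_chunks):
        start = rng.randrange(len(eval_norm) - chunk_len + 1)
        if eval_norm[start:start + chunk_len] in train_norm:
            return True
    return False


def min_k_prob(token_log_probs, k=0.2):
    """Mean log-probability of the k fraction of least likely tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    n_lowest = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n_lowest]
    return sum(lowest) / n_lowest


eval_example = "The quick brown fox jumps over the lazy dog while the cat sleeps on the warm windowsill."
leaked_crawl = "Intro. " + eval_example.replace(" ", "  ") + " Outro."
clean_crawl = "A completely different passage about gradient descent, learning rates, and warmup schedules."

print(substring_contaminated(eval_example, leaked_crawl))
print(substring_contaminated(eval_example, clean_crawl))

# Hypothetical per-token log-probabilities: no real surprises vs. a few real ones.
memorized_like = [-0.1, -0.2, -0.05, -0.1, -0.15]
unseen_like = [-0.1, -6.0, -0.05, -0.1, -4.0]
print(min_k_prob(memorized_like), min_k_prob(unseen_like))
```

A higher Min-K% score means even the hardest tokens were easy for the model, which is the signature of memorization. Where to put the decision threshold depends on the model and data and is not something this sketch settles.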

So read every benchmark number in this sub-chapter, and every benchmark number you see anywhere, with that asterisk attached: it is a real, checkable score on a real test — and also a number that could be partly measuring memorization rather than capability, to a degree nobody outside the lab that trained the model can fully audit.

Does your in-browser Qwen have benchmark numbers?

Partly. Qwen3.5-0.8B's official Hugging Face model card publishes real, dated scores across a family of knowledge and reasoning benchmarks — MMLU-Pro, MMLU-Redux, C-Eval, SuperGPQA, GPQA, and IFEval among them — most reported twice (once for the model's default non-thinking mode and once for thinking mode, the <think> block from the post-training chapter), with one exception: GPQA, where the card reports only a thinking-mode score. What it does not publish, for this exact 0.8B checkpoint, is a GSM8K score, a HumanEval score, or a classic (non-Pro, non-Redux) MMLU score — and a direct search turns up no independent number for any of those three, either. Here is the card's own table, gaps included:

What Qwen3.5-0.8B's own model card actually reports

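| Benchmark | Non-thinking | Thinking |
| --- | --- | --- |
| MMLU-Pro | 29.7 | 42.3 |
| MMLU-Redux | 48.5 | 59.5 |
| C-Eval | 46.4 | 50.5 |
| SuperGPQA | 16.9 | 21.3 |
| GPQA | not reported for this mode | 11.9 |
| IFEval | 52.1 | 44.0 |
| GSM8K | not published for this checkpoint | not published for this checkpoint |
| HumanEval | not published for this checkpoint | not published for this checkpoint |
| MMLU (classic) | not published for this checkpoint | not published for this checkpoint |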

Read the rows literally, including the odd one: on IFEval the card reports non-thinking scoring higher than thinking (52.1 vs 44.0) — thinking mode is not a strict upgrade on every metric, and the card does not smooth that over. And three benchmarks you might expect — GSM8K, HumanEval, and classic MMLU — simply are not in the table for this exact checkpoint. Not every model gets evaluated on every benchmark; that gap is honest data, not a hidden bad score.

Source: huggingface.co/Qwen/Qwen3.5-0.8B — read directly, nothing extrapolated.

That gap is the honest answer, and this course is not going to paper over it with an invented number. Not every checkpoint gets run through every benchmark: smaller, faster-shipping models in a family are frequently evaluated on a narrower slice than the flagship release. Reporting “not measured” is a more trustworthy state than reporting a number nobody checked.
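If you want to work with the card's numbers in code, the same principle applies: store a missing score as a missing value, never as zero. Here is a minimal sketch that holds the scores quoted above and computes the thinking-mode gain only where both modes were reported. The dictionary layout and the helper name are illustrative choices.

```python
# (non-thinking, thinking) scores as quoted from the model card; None = not reported.
scores = {
    "MMLU-Pro": (29.7, 42.3),
    "MMLU-Redux": (48.5, 59.5),
    "C-Eval": (46.4, 50.5),
    "SuperGPQA": (16.9, 21.3),
    "GPQA": (None, 11.9),
    "IFEval": (52.1, 44.0),
    "GSM8K": (None, None),
    "HumanEval": (None, None),
    "MMLU (classic)": (None, None),
}


def thinking_gain(non_thinking, thinking):
    """Thinking minus non-thinking, or None when either score is missing."""
    if non_thinking is None or thinking is None:
        return None
    return round(thinking - non_thinking, 1)


gains = {name: thinking_gain(*pair) for name, pair in scores.items()}

for name, gain in gains.items():
    label = "not comparable" if gain is None else f"{gain:+.1f}"
    print(f"{name:16s} {label}")
```

Treating a gap as zero would make GSM8K and HumanEval look like failures and would invent a GPQA gain that the card never reported.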