Reasoning & test-time compute

Go deeper · Chapter 11, Sampling — Chapter 15 showed you the <think> pill; here is why making a model “think longer” before it answers works at all, and how it leans on the very sampling knobs from this chapter.

A model answering in one shot spends a fixed amount of compute per token, no matter how hard the question is. Chain-of-thought (CoT) prompting — Wei et al., 2022 — found something almost embarrassingly simple: show a frozen model a few worked examples that write out intermediate steps before the final answer, and it starts writing out its own intermediate steps, getting much better at arithmetic, logic, and multi-step word problems. No weights changed. An 8-exemplar CoT prompt pushed a 540B PaLM model to a new state-of-the-art on GSM8K math problems — beating even a fine-tuned GPT-3 with an external calculator/verifier bolted on.
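As a minimal sketch of what a few-shot CoT prompt looks like in practice, the snippet below assembles one from worked exemplars. The exemplars, the `Q:`/`A:` layout, and the `build_cot_prompt` helper are illustrative assumptions, not the exact prompt from Wei et al.

```python
# Illustrative exemplars (made up for this sketch): each one writes out
# its intermediate steps before stating the final answer.
EXEMPLARS = [
    {
        "question": "A shelf holds 4 rows of 6 books. 5 books are removed. How many books remain?",
        "reasoning": "4 rows of 6 books is 24 books. 24 - 5 = 19.",
        "answer": "19",
    },
    {
        "question": "Tickets cost 8 dollars each. Mia buys 3 and pays with a 50 dollar bill. What is her change?",
        "reasoning": "3 tickets at 8 dollars is 24 dollars. 50 - 24 = 26.",
        "answer": "26",
    },
]


def build_cot_prompt(question, exemplars=EXEMPLARS):
    """Concatenate worked exemplars, then the new question with an open 'A:'."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # Leaving 'A:' open invites the model to continue in the exemplars' style.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


prompt = build_cot_prompt(
    "A baker makes 3 trays of 12 rolls and sells 20. How many rolls are left?"
)
print(prompt)
```

The model's weights are untouched; the only thing that changes is the text it is asked to continue.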

One chain isn't enough — vote across many

CoT gives you one reasoning chain. Self-consistency (Wang et al., 2022) asks a sharper question: why trust just one? Sample $n$ independent chains at temperature $T > 0$ — literally the sampling temperature from earlier in this chapter — let each one reach its own final answer, and report whichever answer the most chains agree on:

$$\hat{y} = \arg\max_{a} \sum_{i=1}^{n} \mathbb{1}[y_{i} = a]$$

This is a direct, concrete use of this chapter's own machinery: at $T = 0$ (greedy) every “resample” is the identical chain, so there is nothing to vote on — self-consistency needs the randomness that temperature sampling supplies to produce genuinely different chains in the first place. Wang et al. measured real gains from this alone: +17.9 points on GSM8K, +11.0 on SVAMP, +12.2 on AQuA, +6.4 on StrategyQA, +3.9 on ARC-challenge, all over greedy chain-of-thought. Watch it recover the right answer from a mixed batch of samples, including one that makes the exact same mistake greedy decoding would have committed to:

Self-consistency: many samples, one vote

Problem: Roger has 5 tennis balls. He buys 2 more cans of tennis balls, 3 balls per can. How many does he have now?

Correct answer: 11

Sample 1

5 already owned + 2×3 bought = 5 + 6 = 11.

✓ 11

Sample 2

2 cans of 3 balls is 6. 6 + the 5 he already had = 11.

✓ 11

Sample 3

5 + 2 = 7 groups of balls, 7 × 3 = 21.

✕ 21

Sample 4

2×3 = 6 new balls. Plus the 5 he had: 11.

✓ 11

Sample 5

2×3 = 5 (slipped). 5 + 5 = 10.

✕ 10

Votes on the final answer

11: 3 of 5 samples (majority)

21: 1 of 5 samples

10: 1 of 5 samples

Majority: 3 of 5 paths land on 11, the correct answer, even though 2 of the 5 individual samples are wrong, and one of them makes the same slip that greedy decoding would have committed to. Voting over independent samples cancels out an error that any single chain, including the greedy one, would have been stuck with.

Illustrative scripted example, not a live model run. The five sampled paths are independent draws at T = 0.7 from the exact same prompt — only where the sampler happens to land differs.
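The vote itself is only a few lines of code. Here is a minimal sketch that tallies the five final answers from the scripted example; the `self_consistency` helper name is an assumption, and extracting the final answer from each chain is taken as already done.

```python
from collections import Counter


def self_consistency(final_answers):
    """Return (winning answer, vote counts) for a list of per-chain final answers."""
    votes = Counter(final_answers)
    # most_common(1) gives the single most frequent answer and its count.
    winner, _count = votes.most_common(1)[0]
    return winner, votes


# Final answers of the five scripted samples, in order.
samples = [11, 11, 21, 11, 10]
winner, votes = self_consistency(samples)
print(winner, dict(votes))  # 11 {11: 3, 21: 1, 10: 1}
```

With a real model, `samples` would come from running the same prompt several times at $T > 0$ and parsing each chain's final answer.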

From a prompting trick to a trained, scalable axis

Chapter 14's size ladder plotted accuracy against parameter count: bigger model, more train-time compute, better score. In September 2024, OpenAI's o1 put a second axis on the chart: test-time compute. o1 is trained with large-scale reinforcement learning to produce a long internal chain of thought (billed as output tokens but hidden from the response) before answering, and its accuracy climbs log-linearly with both more RL at training time and more thinking tokens at inference time. The jump is large: on AIME 2024, GPT-4o scores around 12% versus o1's 74% with one sample, 83% with a self-consistency-style consensus over 64 samples, and 93% when re-ranking 1,000 samples. o1 also reaches the 89th percentile on Codeforces and above-PhD-level scores on GPQA science questions. Snell et al. (2024) formalized why this works: test-time compute is an independent, prompt-adaptive scaling axis, and a compute-optimal way of spending it can match a model $14\times$ larger on problems within the smaller model's reach, beating naive best-of-$N$ sampling by over $4\times$ in compute efficiency.
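Consensus and re-ranking spend the same samples in different ways. The sketch below contrasts them on a toy batch. The scores stand in for a hypothetical learned scorer and are made up, and nothing here reflects how o1's actual scoring function works.

```python
from collections import defaultdict


def consensus(candidates):
    """Majority vote: the answer given by the most samples."""
    counts = defaultdict(int)
    for answer, _score in candidates:
        counts[answer] += 1
    return max(counts, key=counts.get)


def rerank_best_of_n(candidates):
    """Best-of-N: the answer of the single highest-scored sample."""
    answer, _score = max(candidates, key=lambda c: c[1])
    return answer


# (final answer, scorer output in [0, 1]); every value here is invented.
candidates = [("21", 0.30), ("21", 0.35), ("11", 0.90), ("10", 0.20), ("21", 0.25)]

print(consensus(candidates))         # 21
print(rerank_best_of_n(candidates))  # 11
```

In this invented batch the most common answer is wrong, so the vote fails while a good scorer still surfaces the single correct chain. A poor scorer would reverse the outcome, which is why the quality of the scorer matters as much as the number of samples.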

Reasoning from pure RL: DeepSeek-R1

DeepSeek-R1 (Jan 2025) pushed the training side further: DeepSeek-R1-Zero learns to reason (long chains, self-reflection, checking its own work) from pure reinforcement learning on top of DeepSeek-V3-Base, with zero supervised warm-start, using GRPO (Group Relative Policy Optimization) and only rule-based accuracy/format rewards, no learned reward model at all. On AIME 2024, pass@1 climbs from 15.6% (the base model) to 71.0% after RL, and to 86.7% with the same self-consistency-style majority vote over 64 samples you just met above, matching OpenAI's o1-0912. The shipped DeepSeek-R1 adds a small cold-start supervised dataset before RL purely to fix readability and language-mixing, reaching 79.8% pass@1 on the same benchmark and o1-parity.
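The two ingredients named above, a rule-based reward and a group-relative advantage, can be sketched in a few lines. This is a toy illustration under stated assumptions: the reward weights (1.0 for accuracy, 0.5 for format), the answer-matching regex, and the `eps` term are all invented for the example, and the full GRPO objective (clipping, KL penalty) is omitted.

```python
import re
import statistics


def rule_based_reward(completion, reference):
    """Toy reward: 1.0 for a correct final answer plus 0.5 for a well-formed think block.

    The weights and the answer pattern are assumptions for this sketch.
    """
    format_ok = (
        re.fullmatch(r"<think>.*</think>.*", completion, flags=re.DOTALL) is not None
    )
    match = re.search(r"answer is (-?\d+)", completion)
    correct = match is not None and match.group(1) == reference
    return 1.0 * correct + 0.5 * format_ok


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group: several sampled completions for the same prompt.
group = [
    "<think>2*3=6, 5+6=11</think> The answer is 11",
    "<think>5+2=7, 7*3=21</think> The answer is 21",
    "The answer is 11",
    "<think>2*3=6, 6+5=11</think> The answer is 11",
]
rewards = [rule_based_reward(c, "11") for c in group]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.5, 0.5, 1.0, 1.5]
print(advantages)
```

Because each completion is scored relative to its own group's mean, no learned reward model or separate value network is needed to decide which samples to reinforce.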

Doing it on a budget: `s1` and budget forcing

You don't need R1-scale RL to get an R1-shaped curve. The s1 paper (Jan 2025) fine-tuned Qwen2.5-32B-Instruct on just 1,000 curated examples (s1K) and added budget forcing: truncate the model's thinking early to force an answer, or, if it tries to stop too soon, suppress the end-of-thinking token and literally append the word “Wait” to make it keep reasoning. That single lever raised AIME24 accuracy from 50% to 57% on the same model, and the resulting s1-32B beat o1-preview on competition math by up to 27 points, no RL required. Notice what budget forcing is actually doing at the token level: it directly edits the length and content of the same <think>...</think> block Chapter 15 showed you being emitted token by token. Test-time compute is made literal as “how many tokens do we let the reasoning block run before we force it to stop.”
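Both halves of budget forcing fit in one decoding loop. The sketch below is a toy under stated assumptions: `next_token` is a hypothetical callable standing in for a real model, tokens are plain strings, and the `min_tokens`, `max_tokens`, and `max_waits` parameters are names invented here rather than taken from the s1 code.

```python
END_THINK = "</think>"


def budget_forced_thinking(next_token, prompt_tokens, min_tokens, max_tokens, max_waits=2):
    """Generate a thinking block whose length is steered by a token budget.

    next_token is a hypothetical callable: it takes the tokens so far and
    returns the model's next token as a string.
    """
    thinking = []
    waits = 0
    while len(thinking) < max_tokens:
        tok = next_token(list(prompt_tokens) + thinking)
        if tok == END_THINK:
            if len(thinking) >= min_tokens or waits >= max_waits:
                break
            # The model tried to stop too early: drop the end token, append "Wait".
            tok = "Wait"
            waits += 1
        thinking.append(tok)
    # Either the model stopped on its own or the budget ran out: close the block.
    return thinking + [END_THINK]


def toy_model(tokens):
    """Stand-in model: emits 'step' twice after each restart, then tries to stop."""
    since_restart = 0
    for tok in reversed(tokens):
        if tok in ("Wait", "<think>"):
            break
        since_restart += 1
    return "step" if since_restart < 2 else END_THINK


forced = budget_forced_thinking(toy_model, ["<think>"], min_tokens=5, max_tokens=10)
truncated = budget_forced_thinking(toy_model, ["<think>"], min_tokens=0, max_tokens=1)
print(forced)     # ['step', 'step', 'Wait', 'step', 'step', '</think>']
print(truncated)  # ['step', '</think>']
```

The first call extends a block that wanted to end after two tokens; the second cuts a block off after one. The same loop therefore covers both directions of the lever.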

Does your in-browser Qwen do this?

Partly, and the gaps are worth naming precisely. Qwen3.5's own model card discloses only one vague sentence about its reasoning training, “Scalable RL Generalization: reinforcement learning scaled across million-agent environments with progressively complex task distributions”, with no GRPO name, no reward description, and no verifier-pair counts. That is a real step down in disclosure from the predecessor architecture's own Qwen3-Next blog, which published GRPO by name and an exact count (3,995 query-verifier pairs) for its Reasoning-RL stage. What the card does say plainly: the flagship Qwen3.5-397B-A17B defaults to thinking mode on, but the model running in this browser, Qwen3.5-0.8B, defaults to thinking mode off, the opposite default. The vendor also adds an explicit warning that forcing thinking mode on for this smallest checkpoint makes it prone to reasoning loops that fail to terminate. So the honest picture has three parts. The same <think> machinery from Chapter 15 is real and present in this checkpoint. Chain of thought and self-consistency are general techniques you could apply to any model, including this one, by sampling it several times yourself. But the vendor's own advice for the specific 0.8B model running in this tab is to leave thinking off.
