Scaling laws: how many tokens for how many parameters?

Go deeper · Chapter 14, Scaling & regularization — the chapter's ladder counted parameters; this is the other half of the compute budget: how much data those parameters should actually see.

The missing axis

The parameter ladder earlier in this chapter put this model, GPT-2 XL, and GPT-3 on one log axis of raw parameter count. That answers “how big is the model,” but a model in isolation is not a recipe — training it also means choosing how many tokens to run it over. Get that second number wrong and a bigger model can end up worse than a smaller one trained properly, for the same compute bill. Two research groups, two years apart, gave opposite answers to “how many tokens, given a fixed compute budget?” — and the correction between them is one of the most consequential results in LLM training history.

Kaplan 2020: grow the model, not the data

In 2020, OpenAI's Kaplan et al. (“Scaling Laws for Neural Language Models,” arXiv:2001.08361) ran a huge sweep of models and found that loss falls off as a clean power law in each of three things taken separately — parameters N, training tokens D, and compute C:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076$$

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095$$

Bigger exponent means the loss is more sensitive to that axis, so at first glance data (0.095) looks like the better lever than parameters (0.076). But that is not how Kaplan et al. read their own result. Fitting a third curve — loss versus compute, C — and asking how to split a fixed compute budget between a bigger model and more data, they found the optimal split is lopsided:

$$N_{\text{optimal}} \propto C^{0.73}$$

For every $10\times$ increase in compute, model size should grow by $10^{0.73} \approx 5.4\times$, while the data (and the number of training steps) grows far more modestly: the leftover exponent is only 0.27. Their own words for the takeaway: train very large models on a relatively modest amount of data, and stop well before convergence. Given a fixed compute budget, undertrain a big model rather than fully train a small one. This was the field's working assumption for nearly two years, and it is a large part of why models like GPT-3 (175B parameters) were trained on data corpora that, in hindsight, were too small for their size.
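The exponents above can be turned into concrete growth factors with a few lines of arithmetic. The following is a minimal sketch that uses only the exponents quoted in this section; the function names are illustrative, and the data exponent is taken as the leftover 1 − 0.73 = 0.27 described in the text.

```python
# Exponents as quoted in the text (Kaplan et al., 2020).
ALPHA_N = 0.076                  # loss vs. parameters
ALPHA_D = 0.095                  # loss vs. training tokens
N_EXPONENT = 0.73                # N_optimal is proportional to C^0.73
D_EXPONENT = 1.0 - N_EXPONENT    # the "leftover" exponent, 0.27


def loss_multiplier(scale_factor: float, alpha: float) -> float:
    """Factor by which loss changes when one axis grows by scale_factor.

    Follows from L(X) = (X_c / X)^alpha: the constant X_c cancels in the ratio.
    """
    return scale_factor ** (-alpha)


def kaplan_split(compute_factor: float) -> tuple:
    """Growth factors (model, data) for a given growth in compute."""
    return compute_factor ** N_EXPONENT, compute_factor ** D_EXPONENT


n_growth, d_growth = kaplan_split(10.0)
print(f"10x compute -> model x{n_growth:.2f}, data x{d_growth:.2f}")
print(f"10x params  -> loss x{loss_multiplier(10.0, ALPHA_N):.3f}")
print(f"10x tokens  -> loss x{loss_multiplier(10.0, ALPHA_D):.3f}")
```

The two growth factors multiply back to the compute factor, which is a quick sanity check that the split accounts for the whole budget.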

Chinchilla 2022: grow them equally

In 2022, DeepMind's Hoffmann et al. (“Training Compute-Optimal Large Language Models,” arXiv:2203.15556 — the paper that introduced Chinchilla) reran the experiment at far larger scale: 400+ models, from 70M to 16B parameters, on 5B to 500B tokens. Their correction to Kaplan's conclusion, quoted directly from the abstract:

“for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.”

In symbols, both grow at roughly the same rate with compute (N ∝ C^0.5 and D ∝ C^0.5), not Kaplan's model-heavy C^0.73 / C^0.27 split. To test the prediction, they built Chinchilla: 70B parameters trained on 1.4 trillion tokens, using the same training compute budget as DeepMind's own earlier Gopher model, which had 280B parameters but only 300B tokens. Same compute bill, radically different split. Chinchilla won decisively: for instance, its 67.5% average on MMLU was, per the paper's abstract, a greater than 7% improvement over Gopher, which has four times the parameter count. Bigger was not better; better-fed was better.
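A rough way to see that the two runs cost about the same is the common rule of thumb that training takes about 6 FLOPs per parameter per token. That rule is an assumption of this sketch, not a figure from this section, and it is coarse enough that the two estimates agree only approximately. The helper names are illustrative.

```python
import math


def approx_training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token.

    This approximation is an assumption of the sketch, not a number from the text.
    """
    return 6.0 * params * tokens


def equal_split(compute_factor: float) -> tuple:
    """Chinchilla-style split: N and D both grow as compute_factor^0.5."""
    growth = math.sqrt(compute_factor)
    return growth, growth


# Parameter and token counts as stated in the text.
gopher_flops = approx_training_flops(280e9, 300e9)
chinchilla_flops = approx_training_flops(70e9, 1.4e12)

print(f"Gopher     ~ {gopher_flops:.2e} FLOPs")
print(f"Chinchilla ~ {chinchilla_flops:.2e} FLOPs")
print(f"ratio      ~ {chinchilla_flops / gopher_flops:.2f}")
print(f"4x compute -> model x{equal_split(4.0)[0]:.1f}, data x{equal_split(4.0)[1]:.1f}")
```

Under this approximation the two budgets land within about 20% of each other, while the parameter counts differ by a factor of four.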

Do the arithmetic on Chinchilla's own headline numbers and a famous ratio falls out:

$$\frac{1.4 \times 10^{12}\ \text{tokens}}{70 \times 10^{9}\ \text{params}} = 20\ \text{tokens per parameter}$$

That is not a sentence quoted from the paper — it is the ratio implied by Chinchilla's own reported 70B/1.4T figures, and it is how the result is near-universally cited: train on roughly 20 tokens for every parameter. Gopher, for comparison, sat at only 300e9 / 280e9 ≈ 1.07 tokens/parameter — badly under-fed relative to its size, exactly Kaplan's prescription and exactly what Chinchilla showed was a mistake.
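The same ratio can be computed for any published parameter and token count. A minimal sketch, using only the Chinchilla and Gopher figures quoted above (the function name is illustrative):

```python
def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


chinchilla_ratio = tokens_per_param(1.4e12, 70e9)
gopher_ratio = tokens_per_param(300e9, 280e9)

print(f"Chinchilla: {chinchilla_ratio:.2f} tokens/param")
print(f"Gopher:     {gopher_ratio:.2f} tokens/param")
print(f"Chinchilla was fed {chinchilla_ratio / gopher_ratio:.1f}x more tokens per parameter")
```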

Doing the arithmetic for this course's model

This model has 852,985,920 parameters. If it were trained at the Chinchilla-optimal ~20:1 ratio, the token budget would be:

$$852{,}985{,}920 \times 20 \approx 17.06 \times 10^{9}\ \text{tokens} = 17.06\text{B tokens}$$

Read that number carefully: it is this section's own multiplication, not a disclosed fact about this checkpoint. No primary source (not a Qwen blog post, not a technical report, not this repo's config) states how many tokens Qwen3.5-0.8B was actually pretrained on. The nearest concrete, verifiable token count in the Qwen family is for a different, earlier model generation: the Qwen3 Technical Report (arXiv:2505.09388) states that the non-hybrid Qwen3 family was pretrained on roughly 36 trillion tokens. That figure belongs to a different model family and cannot be assumed to transfer to this 0.8B hybrid checkpoint; it is shown here only so you know what is and is not public.
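For readers who want to reproduce the hypothetical budget, here is a minimal sketch in exact integer arithmetic. The 20:1 ratio is the one implied by Chinchilla's reported figures; the result is a what-if, not a statement about how this checkpoint was trained, and the function name is illustrative.

```python
PARAMS = 852_985_920       # parameter count stated in the text
TOKENS_PER_PARAM = 20      # ratio implied by Chinchilla's 70B / 1.4T figures


def chinchilla_token_budget(params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Hypothetical token budget at a fixed tokens-per-parameter ratio."""
    return params * ratio


budget = chinchilla_token_budget(PARAMS)
# Hypothetical figure only: no published token count exists for this model.
print(f"{budget:,} tokens (~{budget / 1e9:.2f}B)")  # 17,059,718,400 tokens (~17.06B)
```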

Real models overtrain on purpose — Llama 3

Chinchilla answers “what minimizes training compute for a target loss.” It does not answer “what minimizes cost once the model is deployed and answering millions of queries a day.” Meta's Llama 3 (“The Llama 3 Herd of Models,” arXiv:2407.21783) makes that second objective explicit, in its own words:

“While our scaling laws suggest our flagship model is an approximately compute-optimal size for our training budget, we also train our smaller models for much longer than is compute-optimal. The resulting models perform better than compute-optimal models at the same inference budget.”

Do the same tokens-per-parameter arithmetic on Llama 3 8B, pretrained on “over 15T tokens” per Meta's own release announcement:

$$\frac{15 \times 10^{12}\ \text{tokens}}{8 \times 10^{9}\ \text{params}} = 1{,}875\ \text{tokens per parameter}$$

Chinchilla-optimal for an 8B model would be 8e9 × 20 = 160B tokens. Llama 3 8B used nearly a hundred times that: 1,875 / 20 ≈ 94× the Chinchilla-optimal ratio. It was trained far past the point of minimizing training compute, on purpose, because a smaller model that answers a query slightly better per FLOP is enormously cheaper to serve at scale than a larger, more “compute-optimal” one would be. Contrast that with the flagship: the 405B model was pretrained on 15.6T tokens at 3.8×10²⁵ FLOPs (≈38.5 tokens/param), and the paper's own scaling-law fit predicted an optimum of 402B params / 16.55T tokens for that exact compute budget (≈41 tokens/param at that scale). The flagship landed almost exactly where its own math said to land, while the smaller siblings were deliberately pushed far beyond theirs.
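The Llama 3 numbers in this section can be checked the same way. The sketch below uses only the token and parameter counts quoted above. The final compute estimate relies on the common rule of thumb of about 6 FLOPs per parameter per token, which is an assumption of this sketch and not taken from the Llama 3 paper. Function names are illustrative.

```python
CHINCHILLA_RATIO = 20.0  # tokens per parameter implied by Chinchilla's figures


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def overtraining_factor(tokens: float, params: float) -> float:
    """How many times past the ~20:1 ratio a training run went."""
    return tokens_per_param(tokens, params) / CHINCHILLA_RATIO


small_ratio = tokens_per_param(15e12, 8e9)               # Llama 3 8B
small_factor = overtraining_factor(15e12, 8e9)
flagship_actual = tokens_per_param(15.6e12, 405e9)       # Llama 3 405B as trained
flagship_predicted = tokens_per_param(16.55e12, 402e9)   # the paper's fitted optimum

# Assumed rule of thumb: training FLOPs ~ 6 * params * tokens.
flagship_flops = 6.0 * 405e9 * 15.6e12

print(f"8B:   {small_ratio:,.0f} tokens/param, {small_factor:.2f}x the 20:1 ratio")
print(f"405B: {flagship_actual:.1f} actual vs {flagship_predicted:.1f} predicted tokens/param")
print(f"405B estimated compute: {flagship_flops:.2e} FLOPs")
```

The rule-of-thumb estimate for the flagship comes out close to the 3.8×10²⁵ FLOPs quoted in the text, which suggests the quoted token count, parameter count, and compute figure are mutually consistent.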

Training tokens vs. parameters (log-log)

Legend: published, measured training run · this course's model, hypothetical (not a verified figure)

Chinchilla itself sits exactly on the reference line (20 tokens per parameter) by construction. Gopher, matched to the same training compute but given far less data, falls well below it. Llama 3 8B sits far above it on purpose, for the inference-cost reasons given in the Llama 3 section above. Llama 3 405B lands just above it, close to the ratio its own scaling-law fit actually predicted for that compute budget.
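Where each point falls relative to the reference line can be expressed as a vertical offset on the log axis. The sketch below assumes the reference line is the 20-tokens-per-parameter line and measures each model's distance from it in decades (powers of ten). The entry for this course's model uses the hypothetical budget from earlier in this section, not a measured value.

```python
import math

REFERENCE_RATIO = 20.0  # assumed slope of the chart's reference line

# (parameters, training tokens) as quoted in this section.
MODELS = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 3 8B": (8e9, 15e12),
    "Llama 3 405B": (405e9, 15.6e12),
    # Hypothetical: 852,985,920 params x 20, not a published token count.
    "This course's model (hypothetical)": (852_985_920, 17_059_718_400),
}


def decades_from_line(params: float, tokens: float) -> float:
    """Vertical log10 distance from the reference line; positive means above."""
    return math.log10(tokens / (REFERENCE_RATIO * params))


for name, (params, tokens) in MODELS.items():
    offset = decades_from_line(params, tokens)
    position = "on" if abs(offset) < 1e-9 else ("above" if offset > 0 else "below")
    print(f"{name:36s} {offset:+.2f} decades ({position} the line)")
```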

Does your in-browser Qwen follow either recipe?

Unknown — and that is the honest answer, not a dodge. The theory above is real and independently verified from primary sources: Kaplan's power laws and model-heavy split, Chinchilla's equal-growth correction and its ~20:1 ratio, Llama 3's deliberate overtraining for cheaper inference. What is not verifiable is which of these recipes — or something else entirely — actually produced this specific 852,985,920-parameter checkpoint. Chapter 13 already said this model saw “trillions of tokens of generic web text, books, and code” during pretraining — deliberately without a precise count, because none is public. This sub-chapter cannot responsibly put a real number where Qwen has not published one. The 17.06B-token figure above is a hypothetical, clearly marked as such in the chart — useful for understanding what “Chinchilla-optimal at this size” would even mean, not a fact about this model's training run.