Pretraining data curation

Go deeper · Chapter 13, Training & teacher forcing — the last chapter said “trillions of tokens of generic web text, books, and code.” Here is what actually goes into that clause.

The training objective from the last chapter — minimize -log p(next token) at every position — is the same one line for every modern LLM. What differs, and what quietly does most of the work, is what text you feed it. Scrape the raw web and train on it as-is and you get a worse model than if you first clean it — not because the objective changes, but because raw web text is full of problems the objective cannot see past on its own: it is riddled with near-duplicates, much of it is boilerplate or spam, and it is not sourced or mixed the way a curated corpus is. This sub-chapter unpacks the three pillars behind that one glossed-over clause: deduplication, quality filtering, and sourcing & mixture.

Deduplication: the web repeats itself more than you'd think

Common Crawl and similar web scrapes are not a clean sample of unique documents. The same article gets syndicated across dozens of news mirrors, the same boilerplate legal notice appears on thousands of unrelated sites, and templated pages differ by only a sentence. Lee, Ippolito, Nystrom, Zhang, Eck, Callison-Burch, and Carlini measured exactly how much of this survives into standard training corpora (Deduplicating Training Data Makes Language Models Better, 2021): 3.04% of C4, 13.63% of RealNews, 4.86% of LM1B, and 0.39% of Wiki-40B are near-duplicate documents. In one extreme case, a single 61-word sentence in C4 was found repeated over 60,000 times.

Why bother removing it? Because a model trained on heavily duplicated text doesn't just waste compute re-learning the same sentence; it starts to memorize it. Lee et al. found that models trained on deduplicated data emit verbatim memorized training text roughly 10× less often. Deduplication also shrank the datasets by up to 19% and reduced accidental train/test overlap, which affected over 4% of the validation examples in standard datasets and had been inflating benchmark scores.

The paper uses two complementary techniques. Exact-substring deduplication builds a suffix array over the entire corpus and removes any verbatim span of 50+ BPE tokens that occurs in more than one document, which catches copy-pasted boilerplate and quoted blocks. Near-duplicate deduplication (“NearDup”) is the harder problem: two documents that are almost the same but not byte-identical (a reprinted article with a different headline, a templated page with one field changed). Lee et al.'s tool is MinHash: represent each document as a set of overlapping 5-grams, hash that set into a compact signature (here, 9,000 hash values grouped into 450 bands of 20), and flag two documents as near-duplicates when enough of their signature matches, corresponding to a Jaccard similarity of roughly $J(A, B) = \frac{|A \cap B|}{|A \cup B|} \gtrsim 0.8$. MinHash-based LSH (locality-sensitive hashing) is now the industry-standard way to deduplicate web-scale text, reused, with its own tuned parameters, by essentially every modern open pipeline, FineWeb included.
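To make the MinHash idea concrete, here is a minimal sketch in plain Python. It is not Lee et al.'s implementation: the word-level shingling, the choice of BLAKE2b as the hash family, the 256-value signature, and the two example sentences are all assumptions made for illustration. It shows only the core property, that the fraction of matching signature positions approximates the Jaccard similarity of the two 5-gram sets.

```python
import hashlib


def shingles(text, n=5):
    """Set of overlapping word n-grams (word 5-grams by default)."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    """One member of a seeded hash family (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingle_set, num_hashes=256):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical near-duplicates: a reprint with one word changed.
original = ("the city council voted on tuesday to approve the new budget "
            "for road repairs and public transit")
reprint = ("the city council voted on tuesday to approve the new budget "
           "for road repairs and local transit")

set_a, set_b = shingles(original), shingles(reprint)
print("exact Jaccard:    ", jaccard(set_a, set_b))
print("MinHash estimate: ", estimated_jaccard(minhash_signature(set_a),
                                              minhash_signature(set_b)))
```

The point of the signature is that it has a fixed size regardless of document length, so signatures can be compared, or bucketed with LSH, without ever comparing full documents pairwise.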

Quality filtering: two classic recipes

Deduplication removes repeated text; quality filtering removes text that is unique but bad — link farms, keyword-stuffed spam, auto-generated boilerplate, pages that are mostly navigation menus. Two approaches have dominated for years, and both are still in active use today.

Classifier-based filtering trains a small model to distinguish “text that looks like the good stuff” from raw web crawl. GPT-3's paper describes the canonical version: a logistic-regression classifier trained with curated corpora (WebText, Wikipedia, a books corpus) as positive examples against raw Common Crawl as negative examples, then used to resample Common Crawl toward documents that score more Wikipedia/book-like. The same idea resurfaces four years later as FineWeb-Edu (2024): Llama-3-70B-Instruct was prompted to rate 500k web pages from 0 to 5 for educational value, and a lightweight classifier was distilled from those ratings to score FineWeb's full 15T-token corpus at scale, producing a 1.3T-token subset that substantially outperforms the unfiltered corpus on knowledge-heavy benchmarks like MMLU.
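The classifier recipe can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not GPT-3's or FineWeb-Edu's actual classifier: the bag-of-words features, the learning rate and epoch count, and the six training sentences are all hypothetical. What it does preserve is the shape of the method: curated text as positives, raw crawl as negatives, and a logistic-regression score used to rank unseen documents.

```python
import math
from collections import Counter


def features(text):
    """Bag-of-words counts, normalized by document length."""
    words = text.lower().split()
    total = max(len(words), 1)
    return {w: c / total for w, c in Counter(words).items()}


def predict(weights, bias, feats):
    """Logistic-regression probability that a document looks 'curated'."""
    z = bias + sum(weights.get(w, 0.0) * v for w, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=100, lr=0.5):
    """Stochastic gradient descent on log loss; examples are (text, label)."""
    weights, bias = {}, 0.0
    data = [(features(text), label) for text, label in examples]
    for _ in range(epochs):
        for feats, label in data:
            error = predict(weights, bias, feats) - label
            bias -= lr * error
            for w, v in feats.items():
                weights[w] = weights.get(w, 0.0) - lr * error * v
    return weights, bias


# Hypothetical stand-ins for "curated corpora" (label 1) and "raw crawl" (label 0).
curated = [
    "the mitochondrion is the organelle that produces energy for the cell",
    "the treaty was signed in 1648 and ended the thirty years war",
    "a prime number has exactly two positive divisors one and itself",
]
raw_crawl = [
    "click here buy cheap deals click here free shipping buy now",
    "home login register cart checkout home login sitemap contact",
    "best price best deals free free free click now subscribe",
]
weights, bias = train([(t, 1) for t in curated] + [(t, 0) for t in raw_crawl])


def quality_score(text):
    """Higher means more like the curated positives."""
    return predict(weights, bias, features(text))


# Rank unseen documents from most to least curated-looking.
candidates = ["click here buy cheap deals now", "the cell is a prime example of energy"]
for doc in sorted(candidates, key=quality_score, reverse=True):
    print(round(quality_score(doc), 3), doc)
```

A real pipeline would then keep or resample documents according to this score rather than simply printing the ranking.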

Perplexity-based filtering takes a different, cheaper approach: score every paragraph by how “surprising” it looks to a small language model trained on known-good text, and discard the high-perplexity (surprising, likely boilerplate-or-spam) tail. CCNet is the widely-cited example — a 5-gram KenLM model trained per-language on that language's Wikipedia, used to score crawled paragraphs and keep the low-perplexity, Wikipedia-like ones, later reused as a filtering stage in RedPajama and Dolma. (This project could only confirm CCNet's method via secondary sources describing the paper, not the original paper's text directly — worth flagging, since every other claim on this page traces to a primary source we fetched ourselves.)
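A minimal sketch of the perplexity idea follows. It swaps CCNet's 5-gram KenLM model for an add-one-smoothed word unigram model so that the example stays self-contained; the reference text, the candidate paragraphs, and the rule of keeping the lowest-perplexity half are all assumptions made for illustration.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one-smoothed word unigram model (toy stand-in for a 5-gram KenLM model)."""

    def __init__(self, reference_text):
        words = reference_text.lower().split()
        self.counts = Counter(words)
        self.total = len(words)
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen words

    def log_prob(self, word):
        return math.log((self.counts[word] + 1) / (self.total + self.vocab_size))

    def perplexity(self, text):
        """exp of the average negative log-probability per word."""
        words = text.lower().split()
        if not words:
            return float("inf")
        return math.exp(-sum(self.log_prob(w) for w in words) / len(words))


def keep_lowest_perplexity(lm, paragraphs, keep_fraction=0.5):
    """Rank paragraphs by perplexity and keep the least surprising fraction."""
    ranked = sorted(paragraphs, key=lm.perplexity)
    return ranked[:max(1, int(len(ranked) * keep_fraction))]


# Hypothetical "known-good" reference text (the role Wikipedia plays in CCNet).
reference = ("the cell is the basic unit of life energy in the cell is produced "
             "by the mitochondrion the river flows from the mountains to the sea")
lm = UnigramLM(reference)

paragraphs = [
    "buy cheap pills online now",
    "the mitochondrion is in the cell",
    "click subscribe win prizes today",
    "the river flows to the sea",
]
for p in keep_lowest_perplexity(lm, paragraphs):
    print(round(lm.perplexity(p), 1), p)
```

A unigram model ignores word order entirely, so it is far weaker than a 5-gram model; the filtering logic around it is the part that carries over.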

Sourcing & mixture: Common Crawl is the base, not the whole recipe

Nearly every open pretraining corpus starts from the same place: Common Crawl, a public, continuously-updated scrape of the web, released as monthly snapshots. What separates a raw Common Crawl dump from a usable pretraining dataset is exactly the pipeline above — and FineWeb (Penedo et al., 2024) is the cleanest fully-documented worked example: URL blocklist filtering, text extraction (via the trafilatura library), fastText language identification, Gopher/C4-style plus custom heuristic quality filters, and finally MinHash near-duplicate dedup — turning 96 raw Common Crawl snapshots into a 15-trillion-token curated English dataset. Toggle to the “Mixture” view below to see the second half of the sourcing story: labs don't train on web text alone, they deliberately blend it with books, code, and reference text in fixed, non-uniform ratios — LLaMA's published mixture table is the clearest cited example, up-sampling scarce high-quality sources (Wikipedia, books, ArXiv) across multiple epochs while running the much larger Common Crawl slice for barely more than one.
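The link between mixture ratios and epochs is simple arithmetic: a source's epoch count is its share of the training budget divided by its size. The sketch below uses made-up numbers throughout (the source names, sizes, fractions, and the 1,000B-token budget are hypothetical, not LLaMA's published table) to show how a small source with a modest share ends up repeated several times while a large web slice is seen barely more than once.

```python
import random

# Hypothetical mixture: fraction of training tokens drawn from each source.
mixture = {"web": 0.80, "code": 0.10, "books": 0.05, "wikipedia": 0.05}

# Hypothetical source sizes, in billions of tokens.
source_tokens_b = {"web": 700.0, "code": 150.0, "books": 25.0, "wikipedia": 20.0}

TOTAL_TRAINING_TOKENS_B = 1000.0  # hypothetical training budget


def epochs_per_source(mixture, source_tokens, total_training_tokens):
    """epochs = (mixture fraction * training budget) / source size."""
    return {
        name: mixture[name] * total_training_tokens / source_tokens[name]
        for name in mixture
    }


def sample_sources(mixture, num_documents, seed=0):
    """Pick the source of each training document in proportion to the mixture."""
    rng = random.Random(seed)
    names = list(mixture)
    return rng.choices(names, weights=[mixture[n] for n in names], k=num_documents)


epochs = epochs_per_source(mixture, source_tokens_b, TOTAL_TRAINING_TOKENS_B)
for name, value in epochs.items():
    print(f"{name:10s} {mixture[name]:.0%} of tokens -> {value:.2f} epochs")

draws = sample_sources(mixture, 10_000)
print({name: draws.count(name) for name in mixture})
```

With these made-up numbers the small Wikipedia-like source is repeated 2.5 times while the web slice is seen about 1.14 times, which is the up-sampling pattern the paragraph above describes.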

From raw web crawl to a curated pretraining set

A real, documented pipeline (FineWeb) that turns raw Common Crawl into a curated text dataset. Only the two endpoints — 96 snapshots in, the final token counts out — are measured numbers; the bar width at every stage in between is an illustrative taper, not a published per-stage yield.

1. Raw Common Crawl: 96 monthly web-crawl snapshots, the base layer under almost every open pretraining corpus. (96 snapshots · measured / cited)

2. URL filtering: drop domains on an adult-content blocklist, before any text is even pulled out.

3. Extraction + language ID: trafilatura extracts text from raw HTML; a fastText classifier keeps only English pages scoring ≥ 0.65.

4. Quality heuristics: Gopher/C4-style + custom filters that drop boilerplate, cookie-notice lines, and pages that are mostly unpunctuated or duplicated lines.

5. MinHash dedup: 5-grams → 112 hashes split into 14 buckets of 8 → drop documents that are ≥ 75% similar to one already kept.

6. FineWeb (final): the curated, deduplicated English web-text dataset. (15T tokens · measured / cited)

7. + classifier filter → FineWeb-Edu: a Llama-3-70B-distilled educational-value scorer keeps only the highest-scoring subset. (1.3T tokens · measured / cited)

Stages 2 to 5: illustrative taper, not a measured yield.
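The MinHash dedup stage's parameters (112 hashes split into 14 buckets of 8) can be turned into a detection curve. The sketch below assumes the standard LSH banding rule, in which two documents become a candidate pair if all 8 hashes agree in at least one of the 14 buckets, and that each individual hash agrees with probability equal to the Jaccard similarity. The function name is hypothetical.

```python
def match_probability(similarity, buckets=14, hashes_per_bucket=8):
    """P(at least one bucket matches in full) = 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity ** hashes_per_bucket) ** buckets


# Under these assumptions the curve is a steep S-shape around the target
# similarity: about 0.77 at a similarity of 0.75.
for s in (0.50, 0.60, 0.70, 0.75, 0.80, 0.90):
    print(f"similarity {s:.2f} -> flagged with probability {match_probability(s):.3f}")
```

This is why the bucket layout is a tuning knob: more hashes per bucket make the filter stricter, while more buckets make it more permissive.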

Why dedup matters — measured, not illustrative

Lee et al. (2021) measured how much of four standard training corpora is near-duplicate text before any dedup runs. It is a lot:

RealNews: 13.63%
LM1B: 4.86%
C4: 3.04%
Wiki-40B: 0.39%

Training on the undeduplicated version made models emit memorized training text roughly 10× more often — one 61-word C4 sentence was found duplicated over 60,000 times.

Read the pipeline view as a taper, not a spreadsheet: only the entry (96 snapshots) and the two exits (15T tokens for FineWeb, 1.3T for the classifier-filtered FineWeb-Edu subset) are numbers FineWeb actually publishes. The dedup statistics underneath it, by contrast, are real measurements from Lee et al. — shown on their own row so the illustrative taper and the measured facts are never mistaken for each other.

Does your in-browser Qwen tell you how its training data was cleaned?

No — and this time the honest answer is less about this specific browser demo and more about the entire industry it comes from. The Qwen3 Technical Report does disclose real numbers about scale: ≈36 trillion tokens across three pretraining stages (Stage 1, general, >30T tokens; Stage 2, reasoning/knowledge boost, +≈5T more; Stage 3, long-context, hundreds of billions more at 32,768-token sequences), plus broad category labels — STEM, code, books, multilingual text, and synthetic data extracted and generated with the help of Qwen2.5-VL (for OCR'ing PDF-like documents), Qwen2.5-Math, and Qwen2.5-Coder. What it does not disclose is any of the machinery on this page: no deduplication method or tooling is named, no quality-filtering methodology (classifier or perplexity) is described, and there is no percentage breakdown of web-vs-books-vs-code-vs-academic composition — nothing resembling the LLaMA table above.

It gets murkier still for the exact model this course runs. As of 2026-07-01, no dedicated Qwen3.5 or Qwen3-Next technical report could be found on arXiv that covers the plain hybrid text/GatedDeltaNet line this browser tab actually runs, as opposed to the separate multimodal line. Two 2026 Qwen papers exist, and neither closes that gap. The Qwen3.5-Omni Technical Report covers an architecturally different omnimodal (text+vision+audio) model and discloses data scale (100M+ hours of audio-visual data) but, again, zero curation methodology. The earlier Qwen3-Coder-Next Technical Report (arXiv:2603.00729, Feb 2026) builds on a Qwen3-Next base but is scoped to agentic coding capability: mid-training and RL on top of an existing base, not a pretraining-data disclosure. None of this is a guess dressed up as a fact: it is the honest result of actually searching and coming up empty.

An independent audit backs this up. Stanford CRFM's Foundation Model Transparency Index (December 2025 company report) scores Alibaba/Qwen at 0 — not disclosed — on methods for acquiring pretraining data, public dataset sources used, crawler identity and robots.txt opt-out handling, licensed-data specifics, and language/domain composition percentages. That is not a Qwen-specific failing: OpenAI, Anthropic, Google, and Meta withhold the same source-level detail for their own frontier releases. So the honest picture is this: everything on this page above the fold — dedup, quality filtering, Common Crawl, FineWeb, LLaMA's mixture — is real, documented, and verifiable, because it comes from labs and papers that chose to publish it. The recipe actually used to clean the training corpus behind the weights running in your browser right now — undisclosed for this exact checkpoint, though likely a similar order of magnitude to the ~36 trillion tokens Qwen3 discloses — is, like almost every other frontier model's, proprietary.