Chapter 2 · Tokenization

Explain why an LLM splits text into sub-word tokens instead of characters or words.

Models need a finite vocabulary that balances sequence length against vocabulary size.

Forward pass: Tokenize → Embedding lookup → × 24 layers → Final RMSNorm → LM head → Sampling
Glossary · 6 terms
token
The unit a transformer actually reads — an integer id pointing into the model's vocabulary.
BPE
Byte-Pair Encoding: start from bytes, repeatedly merge the most frequent adjacent pair until the vocabulary is the target size.
sub-word
A token that is smaller than a word but larger than a character, e.g. "ization" or " the".
vocabulary
The full set of tokens the model knows. Qwen3.5 ships with roughly 248,320 entries.
special token
A reserved token like <|im_start|> that marks structure (chat turns, end-of-text) rather than literal text.
chars/token
Characters divided by tokens — a quick proxy for how efficiently a string fits the tokenizer's vocabulary.

Tokenization: turning text into the model's vocabulary

Before a transformer can do anything with your sentence, the sentence has to be cut into pieces the model knows. Those pieces are called tokens, and the rule that decides where the cuts go is called the tokenizer. Tokens are not characters, and they are not words — they are something in between, chosen by an algorithm called Byte-Pair Encoding (BPE). The widget on the right runs Qwen3.5's exact tokenizer in your browser. Type, watch the chips refresh.

Why sub-words?

Tokenizers sit in an awkward middle. If we made every word a token, the vocabulary would need millions of entries to cover the long tail of names, typos and made-up words — and the embedding matrix (the big lookup table holding one vector per vocabulary entry — next chapter) would balloon along with it. If we went the other way and made every character a token, the model would have to chew through sequences five-to-ten times longer than the same text in words, and every layer's cost is at least linear in sequence length.

One fixed sentence makes the trade-off concrete. Take "The quick brown fox jumps over the lazy dog": as characters it is 43 tokens; as whole words, 9; a sub-word tokenizer lands at exactly 9 — as short as words, but built from a vocabulary that never runs out of entries.
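The character and word counts above can be checked directly. This is a minimal sketch that uses plain Python string operations rather than a real tokenizer, so it covers only the two extremes (one token per character, one token per word); the sub-word count comes from the tokenizer itself and is not reproduced here.

```python
sentence = "The quick brown fox jumps over the lazy dog"

char_tokens = list(sentence)    # one token per character, spaces included
word_tokens = sentence.split()  # one token per whitespace-separated word

print(len(char_tokens))  # 43
print(len(word_tokens))  # 9
```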

BPE finds a middle path: start from raw bytes (or characters), then repeatedly merge the most frequent adjacent pair into a new symbol until the vocabulary reaches the size you want. Common words like "the" end up as a single token; rarer words like "Antidisestablishmentarianism" decompose into shorter, familiar pieces. The tokenizer doesn't know what a word means; it only knows which byte sequences are statistically frequent in the training corpus. Because the base alphabet is the 256 possible byte values, the tokenizer can represent any input, including Unicode characters, emoji, or scripts it never saw in training. There is therefore no "unknown" or out-of-vocabulary token; unfamiliar text just breaks into more, smaller (byte-level) tokens.
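The merge loop described above can be sketched in a few lines. This is a toy illustration, not Qwen3.5's training code: it works on characters instead of bytes, the corpus and its word frequencies are made up, and the names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical helpers.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} toy corpus."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent adjacent pair
        merges.append(best)
        words = {merge_pair(symbols, best): freq for symbols, freq in words.items()}
    return merges, words

# Made-up corpus: word -> number of occurrences (an assumption for illustration).
toy_corpus = {"the": 10, "then": 4, "other": 3, "this": 2}
merges, words = train_bpe(toy_corpus, num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

After two merges the frequent word "the" is already a single symbol, while the rarer "this" is still split into "th", "i", "s". A real tokenizer runs the same loop over bytes and a far larger corpus.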

What Qwen3.5 does specifically

Qwen3.5 ships with a BPE tokenizer of roughly 248,320 entries. A few quirks are worth knowing:

  • Spaces live inside tokens. The BPE algorithm treats a space-prefixed word as a different symbol from the bare word, so "cat" at the start of a sentence and " cat" mid-sentence are usually two different tokens. The chips on the right mark a leading space with a small dot (·) so the space stays visible.
  • Special tokens mark structure. Sequences like <|im_start|> and <|im_end|> aren't ordinary text; they're chat-turn boundaries used by Qwen3.5's chat template. The widget renders any token whose decoded text contains <|…|> with a dashed border so you can spot them.
  • Multilingual coverage isn't uniform. Latin-script words are usually one or two tokens; CJK characters often land as one token each, but unusual scripts can decompose into byte-level fragments. Try the multilingual example to see the difference.
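The two display conventions in the list above (a dot for a leading space, a marker check for special tokens) are easy to mimic. This is a sketch of what such rendering logic might look like, not the widget's actual code; `render_chip` and `looks_special` are hypothetical names.

```python
import re

# Matches decoded text shaped like <|im_start|> or <|im_end|>.
SPECIAL_PATTERN = re.compile(r"<\|.+?\|>")

def render_chip(token_text):
    """Show a leading space as a visible middle dot."""
    if token_text.startswith(" "):
        return "·" + token_text[1:]
    return token_text

def looks_special(token_text):
    """Return True if the decoded token text contains a <|...|> marker."""
    return SPECIAL_PATTERN.search(token_text) is not None

print(render_chip("cat"))             # cat
print(render_chip(" cat"))            # ·cat
print(looks_special("<|im_start|>"))  # True
print(looks_special(" cat"))          # False
```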

Why this matters

Every downstream cost a language model has is paid per token: API billing, context-window limits (the cap on how many tokens the model can hold at once), KV-cache memory (the per-token working memory kept during generation — later chapter), and the wall-clock latency of generation. A prompt that looks short to a human can be long to the model if it uses rare punctuation, code, or languages the tokenizer didn't see much of in training. The characters-per-token ratio in the stats footer is the simplest way to feel this: prose in well-supported languages usually lands near 4 chars/token, code closer to 2-3, and an emoji-heavy or rare-script string can drop below 1.
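The rule-of-thumb ratios above can be turned into a rough token budget. This is a back-of-the-envelope sketch only: the 12,000-character input is made up, the ratios are the approximate figures quoted above, and `estimate_tokens` is a hypothetical helper. For real budgeting, count with the model's own tokenizer.

```python
import math

def estimate_tokens(char_count, chars_per_token):
    """Rough token estimate from a character count and an assumed chars/token ratio."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(char_count / chars_per_token)

# The same 12,000 characters under two assumed ratios.
prose_tokens = estimate_tokens(12_000, 4.0)  # prose in a well-supported language
code_tokens = estimate_tokens(12_000, 2.5)   # source code

print(prose_tokens)  # 3000
print(code_tokens)   # 4800
```

The same number of characters costs 60% more tokens as code than as prose under these assumptions, which is why character counts alone are a poor guide to cost.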

Different model families use different tokenizers, and the same string can come out at radically different lengths. That's why "tokens" is the lingua franca of LLM cost and capacity, not "characters" or "words."

Watching BPE build a token

The widget below walks through a stylised version of BPE on the input "unbelievable". Each step merges one adjacent pair of fragments into a new symbol — the same operation BPE training runs many millions of times against a corpus to learn its merge order.

BPE merge walkthrough · input unbelievable
Step 1 of 8
unbelievable
Start at the byte level. Every character is its own fragment — 12 in total.
BPE training counts every adjacent pair across the whole corpus, then always merges the most frequent one.
12 fragments

Note: this merge order is illustrative; it teaches the pattern of merging adjacent pairs. The real Qwen3.5 vocabulary was learned from a massive training corpus, and its merges differ in both order and final token boundaries.
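The same kind of walkthrough can be reproduced in code by applying an ordered merge list to "unbelievable". The merge list below is invented for illustration; it is neither the widget's order nor Qwen3.5's learned merges, and `apply_merges` is a hypothetical helper.

```python
def apply_merges(text, merges):
    """Apply each (left, right) merge in order; return the fragments after every step."""
    fragments = list(text)  # start with one fragment per character
    history = [fragments]
    for left, right in merges:
        merged, i = [], 0
        while i < len(fragments):
            if i + 1 < len(fragments) and fragments[i] == left and fragments[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(fragments[i])
                i += 1
        fragments = merged
        history.append(fragments)
    return history

# Hypothetical merge order, chosen only to show the mechanics.
hypothetical_merges = [
    ("l", "e"), ("b", "le"), ("a", "ble"),
    ("u", "n"), ("b", "e"), ("be", "l"), ("i", "e"),
]

history = apply_merges("unbelievable", hypothetical_merges)
for step, fragments in enumerate(history, start=1):
    print(step, len(fragments), fragments)
# The last line printed is: 8 5 ['un', 'bel', 'ie', 'v', 'able']
```

Each merge removes exactly one fragment here, so seven merges take the word from 12 fragments down to 5.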

Tokenization quirks worth seeing

Most surprises beginners hit with LLM cost or behaviour trace back to one of these. Each row shows a small input alongside the tokens it usually decomposes into.

Tricky tokenization cases
Token strings are plausible; exact ids vary by tokenizer release.
Leading whitespace
"cat"
cat

Bare "cat" is one token. Compare with " cat" below — they are different vocabulary entries.

Leading whitespace
" cat"
·cat

The leading space is part of the token. " cat" is its own id, not the same as "cat".

Contraction
"can't"
can't

BPE splits the apostrophe off — "'t" is a frequent enough suffix to earn its own token.

Expanded form
"can not"
can·not

Two whole words, two tokens. That is the same count as "can't", but the model sees very different ids.

URL
"https://example.com"
https://example.com

URLs fragment by punctuation. The model sees scheme, separator, host, TLD as distinct pieces.

Code
"const x = 42;"
const·x·=·42;

Code merges common keywords ("const") but splits punctuation. Whitespace is absorbed into the following token.

CJK
"你好"
你好

CJK characters usually land as one token each, but common bigrams like "你好" can merge into a single token too — much like English "the" earning its own id.

Devanagari
"नमस्ते"
स्ते

Rarer scripts decompose into more, smaller pieces than English: here 6 characters become 4 tokens. Each piece is still a legible sub-syllable, not an opaque byte fragment. The chars/token ratio drops to 1.5, well below the roughly 4 typical of English prose.

Number with commas
"1,000,000"
1,000,000

Numbers are split into individual digits — there is no "000" token. The commas stay separate too, so the model reasons over numbers one digit at a time.
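The digit-by-digit behaviour in the last case can be approximated with a regular expression. This is a sketch that assumes a pre-tokenization rule isolating every digit; it is not Qwen3.5's actual pre-tokenizer pattern, and `split_digits` is a hypothetical helper.

```python
import re

def split_digits(text):
    """Split so every digit stands alone; runs of non-digits stay together."""
    return re.findall(r"\d|\D+", text)

pieces = split_digits("1,000,000")
print(pieces)       # ['1', ',', '0', '0', '0', ',', '0', '0', '0']
print(len(pieces))  # 9
```

Under this assumption a nine-character number costs nine tokens, a chars/token ratio of exactly 1.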

Multilingual token map · one sentence, six languages
Captured from Qwen3.5's tokenizer.json; ids omitted.
English
The cat sat on the mat
The·cat·sat·on·the·mat
6 tokens
中文 (Chinese)
猫坐在垫子上
坐在子上
4 tokens
日本語 (Japanese)
猫がマットの上に座った
マットの上にった
6 tokens
العربية (Arabic, RTL)
جلس القط على الحصيرة
جلس·القط·على·الحصيرة
5 tokens
हिन्दी (Hindi)
बिल्ली चटाई पर बैठ गई
िल्ली·चटा·पर·बैठ·गई
9 tokens
Español (Spanish)
El gato se sentó en la alfombra
El·gato·se·sentó·en·la·alfombra
9 tokens

The same sentence costs a different number of tokens in each language — but it is not simply “English is cheapest.” Qwen3.5's 248,320-entry vocabulary has learned whole-word merges for high-resource languages, so Chinese here is actually the most compact (4 tokens), while Spanish and Hindi cost the most (9). Token counts only compare within this one tokenizer.
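Combining the token counts from the map with each sentence's length gives a chars/token figure per language. This sketch hard-codes the counts listed above instead of running a tokenizer, and it measures length in Unicode code points via Python's `len`, which is an assumption about how "characters" are counted.

```python
# (sentence, token count) pairs copied from the token map above.
token_map = {
    "English": ("The cat sat on the mat", 6),
    "Chinese": ("猫坐在垫子上", 4),
    "Japanese": ("猫がマットの上に座った", 6),
    "Arabic": ("جلس القط على الحصيرة", 5),
    "Hindi": ("बिल्ली चटाई पर बैठ गई", 9),
    "Spanish": ("El gato se sentó en la alfombra", 9),
}

# Characters (code points) per token for each language.
ratios = {lang: len(text) / tokens for lang, (text, tokens) in token_map.items()}

fewest_tokens = min(token_map, key=lambda lang: token_map[lang][1])

for lang, ratio in ratios.items():
    print(f"{lang}: {token_map[lang][1]} tokens, {ratio:.2f} chars/token")
print("Fewest tokens:", fewest_tokens)
```

Chinese uses the fewest tokens but also has a low chars/token ratio, because each character carries more meaning. The ratio measures fit to the vocabulary, not how much is being said.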

When text falls outside those learned merges — emoji, or rarely-seen scripts — the tokenizer drops to raw bytes: each emoji like 🦫 is 3 tokens, and Amharic runs about 1.7 tokens per character.

Actual Qwen3.5-0.8B tokenizer output, captured offline — exact token splits (token ids omitted), not a live run in your browser.
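The byte fallback mentioned above starts from UTF-8, where different characters occupy different numbers of bytes. This is a minimal sketch using only Python's built-in encoding; `utf8_bytes` is a hypothetical helper, and the byte count is only an upper bound on tokens, since learned merges can join some bytes (a 4-byte emoji can come out as 3 tokens).

```python
def utf8_bytes(text):
    """Return the UTF-8 byte values (0-255) that a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))

for sample in ["a", "ሰ", "🦫"]:
    print(repr(sample), len(utf8_bytes(sample)), "bytes")

# Any string round-trips through its bytes, which is why no "unknown" token is needed.
restored = bytes(utf8_bytes("🦫")).decode("utf-8")
print(restored == "🦫")  # True
```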

In the next chapter (Embeddings) we'll see how each of these token ids becomes a high-dimensional vector that the model can actually compute with — the bridge between integers and the continuous space attention operates in.

Engineering takeaways
  • Cost scales with tokens, not characters: billing, KV-cache, and latency are all per-token.
  • A leading space is part of the token — "cat" and " cat" are usually different vocabulary entries.
  • Different model families tokenize the same string differently, so token counts only compare within one tokenizer.
Try this

Using the tokenizer demo, find one sentence that gives you the highest chars/token ratio you can hit, and a different sentence that gives you the lowest.

Quick check
1. Why do modern LLMs use sub-word tokens instead of words?
2. What does a higher chars/token ratio usually mean?
3. Are the tokens for "cat" at the start of a sentence and " cat" mid-sentence the same id?
