Explain why an LLM splits text into sub-word tokens instead of characters or words.

Models need a finite vocabulary that balances sequence length against vocabulary size.

7 min

Forward pass: Tokenize → Embedding lookup → × 24 layers → Final RMSNorm → LM head → Sampling

Glossary · 6 terms

token: The unit a transformer actually reads — an integer id pointing into the model's vocabulary.
BPE: Byte-Pair Encoding: start from bytes, repeatedly merge the most frequent adjacent pair until the vocabulary is the target size.
sub-word: A token that is smaller than a word but larger than a character, e.g. "ization" or " the".
vocabulary: The full set of tokens the model knows. Qwen3.5 ships about 248,320 entries.
special token: A reserved token like <|im_start|> that marks structure (chat turns, end-of-text) rather than literal text.
chars/token: Characters divided by tokens — a quick proxy for how efficiently a string fits the tokenizer's vocabulary.

Tokenization: turning text into the model's vocabulary

Before a transformer can do anything with your sentence, the sentence has to be cut into pieces the model knows. Those pieces are called tokens, and the rule that decides where the cuts go is called the tokenizer. Tokens are not characters, and they are not words — they are something in between, chosen by an algorithm called Byte-Pair Encoding (BPE). The widget on the right runs Qwen3.5's exact tokenizer in your browser. Type, watch the chips refresh.

Why sub-words?

Tokenizers sit in an awkward middle. If we made every word a token, the vocabulary would need millions of entries to cover the long tail of names, typos and made-up words, and the embedding matrix (the big lookup table holding one vector per vocabulary entry; see the next chapter) would balloon along with it. If we went the other way and made every character a token, the model would have to chew through sequences roughly five times longer than the same English text in words, and every layer's cost is at least linear in sequence length.
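To put rough numbers on that ballooning, here is a back-of-the-envelope sketch. The hidden size of 1,024 and the 2,000,000-word vocabulary are illustrative assumptions, not Qwen3.5's published figures; only the 248,320-entry vocabulary comes from this chapter.

```python
# Back-of-the-envelope embedding-table sizes.
# ASSUMPTIONS: HIDDEN_SIZE and WORD_VOCAB are made-up illustrative values.
HIDDEN_SIZE = 1024        # hypothetical vector width per vocabulary entry
SUBWORD_VOCAB = 248_320   # vocabulary size quoted in this chapter
WORD_VOCAB = 2_000_000    # hypothetical word-level vocabulary


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """The table stores one vector of hidden_size numbers per vocabulary entry."""
    return vocab_size * hidden_size


subword_params = embedding_params(SUBWORD_VOCAB, HIDDEN_SIZE)
word_params = embedding_params(WORD_VOCAB, HIDDEN_SIZE)

print(f"sub-word table: {subword_params:,} parameters")  # 254,279,680
print(f"word table:     {word_params:,} parameters")     # 2,048,000,000
```

Under these assumed numbers the word-level table is about eight times larger, and it still could not represent a word it had never seen.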

One fixed sentence makes the trade-off concrete. Take "The quick brown fox jumps over the lazy dog": as characters it is 43 tokens; as whole words, 9; a sub-word tokenizer also lands at 9. That is as short as words, but built from a fixed vocabulary that can still encode any input.
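The character and word counts are easy to check yourself. This minimal sketch uses plain Python string operations as stand-ins for a character-level and a word-level tokenizer; it does not run a real sub-word tokenizer.

```python
sentence = "The quick brown fox jumps over the lazy dog"

char_tokens = list(sentence)    # one token per character, spaces included
word_tokens = sentence.split()  # one token per whitespace-separated word

print(len(char_tokens))  # 43
print(len(word_tokens))  # 9
print(round(len(char_tokens) / len(word_tokens), 1))  # 4.8
```

The last line is the sequence-length penalty of going character-level on this sentence: a little under five times as many tokens.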

BPE finds a middle path: start from raw bytes (or characters), then repeatedly merge the most frequent adjacent pair into a new symbol until the vocabulary is the size you want. Common words like the end up as a single token; rarer words like Antidisestablishmentarianism decompose into shorter familiar pieces. The tokenizer doesn't know what a word means — it only knows what byte sequences are statistically frequent in the training corpus. Because that base alphabet is the 256 possible byte values, the tokenizer can represent literally any input — any Unicode character, emoji, or script, even ones it never saw in training — so there is no "unknown" or out-of-vocabulary token; unfamiliar text just breaks into more, smaller (byte) tokens.
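The merge loop described above can be sketched in a few lines. This is a toy trainer under simplifying assumptions, not Qwen3.5's actual training code: the corpus is eight made-up words, pairs are only counted inside words, and ties are broken arbitrarily but deterministically. The names train_bpe and merge_pair are hypothetical helpers.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to num_merges merges, starting from single bytes."""
    # Start from raw bytes: every word becomes a tuple of 1-byte symbols.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8")) for word in corpus
    )
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair (ties broken by the pair itself).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


toy_corpus = ["the"] * 5 + ["then"] * 2 + ["this"]
merges = train_bpe(toy_corpus, num_merges=2)
print(merges)  # [(b't', b'h'), (b'th', b'e')]
```

On this toy corpus "t"+"h" is the most frequent pair, so it merges first; "th"+"e" follows, which is how a common word like "the" ends up as a single symbol.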

What Qwen3.5 does specifically

Qwen3.5 ships with a BPE tokenizer of roughly 248,320 entries. A few quirks are worth knowing:

Spaces live inside tokens. The tokenizer treats a space-prefixed word as a different symbol from the bare word, so "cat" at the start of a sentence and " cat" mid-sentence are usually two different tokens. The chips on the right mark a leading space with a small dot (·) so the space stays visible.
Special tokens mark structure. Sequences like <|im_start|> and <|im_end|> aren't ordinary text: they're chat-turn boundaries used by Qwen3.5's chat template. The widget renders any token whose decoded text contains <|…|> with a dashed border so you can spot them.
Multilingual coverage isn't uniform. Latin-script words are usually one or two tokens; CJK characters often land as one token each, but unusual scripts can decompose into byte-level fragments. Try the multilingual example to see the difference.
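The first two quirks can be illustrated with a small splitter. This is a simplified stand-in, not Qwen3.5's actual rule: the two regular expressions and the pre_split helper are assumptions made up for this sketch. It keeps anything shaped like <|name|> whole and attaches a single leading space to the word that follows it.

```python
import re

# ASSUMPTION: simplified patterns for illustration only.
SPECIAL = re.compile(r"(<\|[^|]+\|>)")  # markers shaped like <|name|>
PIECE = re.compile(r" ?\S+|\s+")        # optional leading space + non-space run


def pre_split(text):
    """Split text into special markers and space-prefixed pieces."""
    chunks = []
    for part in SPECIAL.split(text):
        if not part:
            continue
        if SPECIAL.fullmatch(part):
            chunks.append(part)                  # keep the marker intact
        else:
            chunks.extend(PIECE.findall(part))   # space stays with its word
    return chunks


pieces = pre_split("<|im_start|>user\nThe cat sat<|im_end|>")
# Show leading spaces as a dot, the way the chips do.
print([p.replace(" ", "·") for p in pieces])
# ['<|im_start|>', 'user', '\n', 'The', '·cat', '·sat', '<|im_end|>']
```

Note that "The" has no dot while "·cat" does: the same letters with and without a leading space are different strings before BPE ever sees them.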

Why this matters

Every downstream cost a language model has is paid per token: API billing, context-window limits (the cap on how many tokens the model can hold at once), KV-cache memory (the per-token working memory kept during generation — later chapter), and the wall-clock latency of generation. A prompt that looks short to a human can be long to the model if it uses rare punctuation, code, or languages the tokenizer didn't see much of in training. The characters-per-token ratio in the stats footer is the simplest way to feel this: prose in well-supported languages usually lands near 4 chars/token, code closer to 2-3, and an emoji-heavy or rare-script string can drop below 1.
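The ratio is simple arithmetic, and inverting it gives a quick budget estimate. In this sketch chars_per_token and estimate_tokens are hypothetical helpers, and the ratios passed in are the rough rules of thumb quoted above, not measurements.

```python
import math


def chars_per_token(text: str, token_count: int) -> float:
    """Characters (Unicode code points) divided by tokens."""
    return len(text) / token_count


def estimate_tokens(char_count: int, ratio: float) -> int:
    """Rough token estimate from a character count and an assumed ratio."""
    return math.ceil(char_count / ratio)


fox = "The quick brown fox jumps over the lazy dog"
print(round(chars_per_token(fox, 9), 2))  # 4.78 (9 tokens, as counted earlier)

# The same 2,000 characters under two assumed ratios:
print(estimate_tokens(2000, 4.0))  # 500, prose-like text
print(estimate_tokens(2000, 2.5))  # 800, code-like text
```

The estimate is only a planning aid; the real count comes from running the actual tokenizer on the actual text.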

Different model families use different tokenizers, and the same string can come out at radically different lengths. That's why "tokens" is the lingua franca of LLM cost and capacity, not "characters" or "words."

Watching BPE build a token

The widget below walks through a stylised version of BPE on the input "unbelievable". Each step merges one adjacent pair of fragments into a new symbol. BPE training runs the same operation against a corpus once per learned vocabulary entry (hundreds of thousands of times for a vocabulary of Qwen3.5's size) to learn its merge order.

BPE merge walkthrough · input unbelievable

Step 1 of 8

unbelievable

Start at the byte level. Every character is its own fragment — 12 in total.

BPE training counts every adjacent pair across the whole corpus, then always merges the most frequent one.

12 fragments

Note: this merge order is illustrative; it teaches the pattern of merging adjacent pairs. The real Qwen3.5 vocabulary was learned from a massive training corpus, and its merges differ in both order and final token boundaries.
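The same kind of walkthrough can be replayed in code. The merge list below is a hypothetical order invented for this sketch (seven merges, giving eight steps); it is not necessarily the order the widget uses and certainly not Qwen3.5's learned merges.

```python
def merge_once(fragments, pair):
    """Merge every adjacent occurrence of `pair` into one fragment."""
    left, right = pair
    out, i = [], 0
    while i < len(fragments):
        if i + 1 < len(fragments) and fragments[i] == left and fragments[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(fragments[i])
            i += 1
    return out


# ASSUMPTION: a made-up merge order, chosen only to illustrate the mechanics.
MERGES = [
    ("l", "e"), ("b", "le"), ("a", "ble"), ("u", "n"),
    ("b", "e"), ("l", "i"), ("e", "v"),
]

fragments = list("unbelievable")  # step 1: every character on its own
states = [fragments]
for pair in MERGES:
    fragments = merge_once(fragments, pair)
    states.append(fragments)

for step, state in enumerate(states, start=1):
    print(f"Step {step}: {' '.join(state)}  ({len(state)} fragments)")
# Step 1: u n b e l i e v a b l e  (12 fragments)
# ...
# Step 8: un be li ev able  (5 fragments)
```

Each merge removes exactly one fragment here because each pair occurs once in the word; on a real corpus a single merge rewrites every occurrence of that pair at once.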

Tokenization quirks worth seeing

Most surprises beginners hit with LLM cost or behaviour trace back to one of these. Each row shows a small input alongside the tokens it usually decomposes into.

Tricky tokenization cases

Token strings are plausible; exact ids vary by tokenizer release.

Leading whitespace: "cat"
Leading whitespace: " cat"
Contraction: "can't"
Expanded form: "can not"
URL: "https://example.com"
Code: "const x = 42;"
CJK: "你好"
Devanagari: "नमस्ते"
Number with commas: "1,000,000"

Multilingual token map — one sentence, six scripts

Captured from Qwen3.5's tokenizer.json; ids omitted.

English: The cat sat on the mat
中文 (Chinese): 猫坐在垫子上
日本語 (Japanese): 猫がマットの上に座った
العربية (Arabic, RTL): جلس القط على الحصيرة
हिन्दी (Hindi): बिल्ली चटाई पर बैठ गई
Español (Spanish): El gato se sentó en la alfombra

The same sentence costs a different number of tokens in each language — but it is not simply “English is cheapest.” Qwen3.5's 248,320-entry vocabulary has learned whole-word merges for high-resource languages, so Chinese here is actually the most compact (4 tokens), while Spanish and Hindi cost the most (9). Token counts only compare within this one tokenizer.

When text falls outside those learned merges — emoji, or rarely-seen scripts — the tokenizer drops to raw bytes: each emoji like 🦫 is 3 tokens, and Amharic runs about 1.7 tokens per character.
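The raw bytes the tokenizer falls back to are ordinary UTF-8, which you can inspect directly. This sketch only shows the byte level; it does not run Qwen3.5's tokenizer, and the sample strings are arbitrary choices for illustration.

```python
samples = {
    "Latin": "cat",
    "Chinese": "猫",
    "Amharic": "ሰላም",
    "Emoji": "🦫",
}

for name, text in samples.items():
    raw = text.encode("utf-8")
    # Every value is in 0..255, so a 256-symbol base alphabet covers all of it.
    assert all(0 <= b <= 255 for b in raw)
    print(f"{name}: {len(text)} chars -> {len(raw)} bytes {list(raw)}")

emoji_bytes = "🦫".encode("utf-8")
print(len(emoji_bytes))              # 4
print(emoji_bytes.decode("utf-8"))   # the bytes round-trip back to the emoji
```

The beaver emoji is 4 bytes, so the 3 tokens quoted above would mean at least one pair of those bytes was merged. Each Amharic character here is 3 bytes, which is the worst case the per-character token count can approach when few merges apply.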

Actual Qwen3.5-0.8B tokenizer output, captured offline — exact token splits (token ids omitted), not a live run in your browser.

In the next chapter (Embeddings) we'll see how each of these token ids becomes a high-dimensional vector that the model can actually compute with — the bridge between integers and the continuous space attention operates in.

Engineering takeaways

Cost scales with tokens, not characters: billing, KV-cache, and latency are all per-token.
A leading space is part of the token — "cat" and " cat" are usually different vocabulary entries.
Different model families tokenize the same string differently, so token counts only compare within one tokenizer.

Try this

Using the tokenizer demo, find one sentence that gives you the highest chars/token ratio you can hit, and a different sentence that gives you the lowest.

Quick check

1. Why do modern LLMs use sub-word tokens instead of words?

Sub-words are always shorter than words.
A word-level vocabulary would need to enumerate every name, typo, and rare word, which makes the embedding matrix (the per-token lookup table; see the next chapter) impractically large.
Word-level tokenizers cannot represent punctuation, so they are unusable.

2. What does a higher chars/token ratio usually mean?

The string uses common patterns the tokenizer learned well, so each token covers more characters.
The model will hallucinate more on that input.
The text is in a language the model cannot understand.

3. Are the tokens for "cat" at the start of a sentence and " cat" mid-sentence the same id?

Yes: BPE strips whitespace before encoding.
No: BPE treats the leading space as part of the token, so they are usually two different vocabulary entries.