Explain how an integer token id becomes a high-dimensional vector and why nearby vectors mean related tokens.
The transformer needs a continuous representation of each token so it can express similarity and gradients can flow.
Glossary · 7 terms
- embedding: The vector a token id is mapped to; a row of the embedding matrix that the rest of the network operates on.
- embedding matrix: Shape [vocab_size, hidden_dim]. For Qwen3.5-0.8B that is roughly 248,320 by 1,024, about 254M parameters.
- hidden_dim: The width of every token's vector as it flows through the model. 1,024 for Qwen3.5-0.8B.
- tied embeddings: Sharing the input embedding matrix with the output lm_head so the same vector space reads the prompt and writes the next-token logit.
- PCA: Principal component analysis. Projects high-dimensional points onto the two directions of greatest variance to make a 2D plot.
- cosine similarity: Dot product of two vectors divided by their norms. Measures direction agreement and ignores magnitude; near 1 means very similar.
- semantic direction: A vector difference like E(woman) − E(man) that approximately encodes one feature (gender, plurality, country-of) and can be added to or subtracted from other embeddings.
Embeddings: turning tokens into vectors
Tokenization gave the model a sequence of integers — one id per token. But integers carry no meaning the model can compute with: id 1234 isn't bigger, smaller, or more similar to 1235 in any useful way. The next step is to map each id into a continuous vector the rest of the network can do math on. That vector is the token's embedding.
Why vectors?
A continuous representation lets the model express degrees of similarity. The key idea is that every vector has a direction in space, and tokens that mean similar things end up pointing in similar directions, so "related" becomes something the model can measure as an angle. The famous (if idealized) word2vec example, king − man + woman ≈ queen, showed that learned embeddings can even capture relationships as directions you can do arithmetic on. Modern LLM embeddings have much more entangled structure than word2vec did (one model serves dozens of tasks across many languages and code styles), so the clean analogy arithmetic is weaker. The underlying idea is the same: nearby vectors mean related tokens, and the transformer layers spend the forward pass moving those vectors around in contextually useful ways.
"Nearby" here means pointing the same way, not "a short ruler-distance apart". Before we project real 1024-dim vectors down to a picture, build the core intuition in a flat 2-D toy you can grab: meaning lives in a vector's direction, and the angle between two vectors measures how related they are.
Two small pieces of math do all the measuring, and both fit on one line. The dot product is "multiply matching slots, add up": for a = [3, 4] and b = [4, 3], a·b = 3×4 + 4×3 = 24. The norm ‖a‖ is the vector's length: ‖a‖ = √(3² + 4²) = 5, and ‖b‖ = 5 too. Cosine similarity is just the dot product with the lengths divided out: cos = 24 / (5 × 5) = 0.96 — these two vectors point almost the same way. That one recipe — dot product over norms — is every similarity number in this chapter.
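The arithmetic above can be checked in a few lines. This is a minimal sketch using only the standard library; the helper names `dot`, `norm`, and `cosine` are illustrative, not taken from any particular library.

```python
import math

def dot(a, b):
    """Multiply matching slots, then add up."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Length of the vector: square root of its dot product with itself."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Dot product with the two lengths divided out."""
    return dot(a, b) / (norm(a) * norm(b))

a = [3.0, 4.0]
b = [4.0, 3.0]

print(dot(a, b))         # 24.0
print(norm(a), norm(b))  # 5.0 5.0
print(cosine(a, b))      # 0.96
```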
Slide the length and watch cos θ below stay put — magnitude doesn’t change the angle.
Swing the probe to line up with cat → cos θ → 1 (“related”). Turn it to a right angle → cos θ → 0 (“unrelated”). Point it the opposite way → cos θ → −1 (“contrary”). The length never enters the cosine — which is exactly why embedding similarity uses the angle, not straight-line (Euclidean) distance.
Illustrative only — these are hand-placed 2-D teaching arrows, NOT real embeddings. Qwen3.5-0.8B's vectors live in 1,024 dimensions; this flat picture exists to build the intuition.
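The same experiment can be sketched in code, assuming NumPy is available. The 2-D arrows below are made-up teaching values, just like the ones in the picture. Each probe is compared with `cat` twice: once by cosine and once by straight-line distance, to show that only the distance reacts to length.

```python
import numpy as np

def cosine(a, b):
    """Dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = np.array([2.0, 1.0])           # hypothetical hand-placed 2-D arrow
stretched = 3.0 * cat                # same direction, three times longer
right_angle = np.array([-1.0, 2.0])  # perpendicular to cat
opposite = -0.5 * cat                # points the other way, half as long

probes = {"stretched": stretched, "right_angle": right_angle, "opposite": opposite}
for name, probe in probes.items():
    cos = cosine(cat, probe)
    dist = float(np.linalg.norm(cat - probe))  # Euclidean distance
    print(f"{name:12s} cos={cos:+.3f}  distance={dist:.3f}")
```

The stretched probe scores a cosine of 1 even though it sits far from `cat` by ruler distance, which is the behaviour the slider demonstrates.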
The embedding table
The embedding lookup is a row gather: given an integer id, take row id of the embedding matrix. (It is mathematically equivalent to multiplying a one-hot vector by the matrix, but no arithmetic is actually needed.) The matrix has shape [vocab_size, hidden_dim]. For Qwen3.5-0.8B that's about 248,320 × 1,024 ≈ 254 million parameters in this one table, roughly a third of the whole model (a parameter is one learned number; the whole model has about 853 million of them).
Worth pausing on: all 254 million of those numbers started life as random noise. Nobody sat down and assigned tiger a row near lion, drew a gender axis, or wired up king − man + woman ≈ queen. Every cluster, every semantic direction, and every analogy you'll poke at in this chapter is an emergent by-product of fitting one objective — predict the next token — across a huge amount of text. The geometry is a fossil left behind by training, not a dictionary anyone hand-built.
Here is that lookup as a picture you can step through. One token id goes in, one row comes out — nothing is computed along the way:
The tokenizer hands the model one integer: 9059 (the token "·cat"). The embedding table is just a tall matrix — 248,320 rows, one per vocabulary entry.
The eight floats are illustrative stand-ins — the live demo on the right fetches the real 1,024-dim rows from the loaded model. The mechanism is the lesson: token id in, one row of the table out.
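A minimal sketch of the lookup, assuming NumPy and a tiny random table in place of the real one. The sizes (10 rows, 4 columns) and the token ids are made up for illustration. It shows that indexing rows gives the same result as a one-hot matrix multiply, and it reproduces the parameter count quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the real table in the text is 248,320 x 1,024.
vocab_size, hidden_dim = 10, 4
E = rng.normal(size=(vocab_size, hidden_dim)).astype(np.float32)

token_ids = np.array([7, 2, 7])  # hypothetical ids from a tokenizer

# Row gather: pick row `id` for each token. No arithmetic happens.
gathered = E[token_ids]          # shape (3, 4)

# Equivalent view: one-hot rows times the table.
one_hot = np.eye(vocab_size, dtype=np.float32)[token_ids]  # shape (3, 10)
multiplied = one_hot @ E

print(gathered.shape)                     # (3, 4)
print(np.allclose(gathered, multiplied))  # True

# Size of the full-scale table from the text.
param_count = 248_320 * 1_024
print(param_count)                        # 254279680
```

The same id always returns the same row: positions 0 and 2 of `gathered` are identical because both hold id 7.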
Modern LLMs (Qwen3.5 included) often tie the input embedding to the output unembedding (the lm_head that turns the final hidden state back into logits over the vocabulary — we cover that in the LM-head chapter; for now just read it as "the layer that scores the next token"). Tying these two matrices saves parameters, and forces the representations to be useful in both directions: the same vector space that reads the prompt also writes the next token's logit. The widget on the right plots rows from this exact shared matrix.
| Parameter | Value |
| --- | --- |
| vocab_size | 248,320 |
| hidden_dim | 1,024 |
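Weight tying can be sketched with a toy table, assuming NumPy. Here the hidden state is simply the input embedding itself, a placeholder for whatever the transformer layers and final normalization would produce in a real model. The bias-free head and the names `embed` and `lm_head` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_dim = 10, 4   # toy sizes, not the real 248,320 x 1,024
embed = rng.normal(size=(vocab_size, hidden_dim))
lm_head = embed                  # tied: the same array, no second copy

token_id = 3
x = embed[token_id]              # read side: id -> vector

# Placeholder: a real model would run x through its layers here.
hidden = x

# Write side: one score per vocabulary entry, using the same rows.
logits = lm_head @ hidden        # shape (vocab_size,)

print(logits.shape)              # (10,)
print(lm_head is embed)          # True
```

Because both sides share one array, tying saves a second `vocab_size × hidden_dim` matrix, and any update to a row changes how that token is read and how it is scored.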
PCA: a 2D window into a 1024-dim space
Think of photographing a 3-D sculpture: you turn it to the angle that spreads its features out the most so you can see the structure. PCA picks that "best camera angle" for the 1024-dimensional embedding space — and, like a photo, it flattens away depth, so on-screen distance is a rough guide, not an exact ruler.
We can't see 1024 dimensions, so we project. Principal component analysis finds the directions of greatest variance in the data and projects onto the top two (or three). The result is the best linear flattening possible: the projected points preserve as much of the original spread as a 2D picture can.
The catch is that "as much as possible" is still very little. The top two components of a 1024-dim cloud typically explain just a few percent of the total variance — the rest is in the directions we threw away. So PCA is great for spotting clusters (tokens with similar overall direction in the embedding space land near each other), but two axes can't carry 1024 axes' worth of geometry.
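One common way to compute such a projection is via the singular value decomposition of the centred data. The sketch below assumes NumPy and uses random stand-in vectors (110 points in 64 dimensions) rather than real embeddings, so the explained-variance number it prints says nothing about Qwen3.5-0.8B. It only shows where that number comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 110 "words" in 64 dimensions (not real embeddings).
X = rng.normal(size=(110, 64))

# 1. Centre each dimension so variance is measured around the mean.
Xc = X - X.mean(axis=0)

# 2. SVD: rows of Vt are the principal directions, sorted by variance.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Project onto the top two directions to get plot coordinates.
coords_2d = Xc @ Vt[:2].T        # shape (110, 2)

# 4. Share of total variance carried by each component.
explained = S**2 / np.sum(S**2)
top2 = float(explained[:2].sum())

print(coords_2d.shape)
print(f"variance kept by 2 components: {top2:.1%}")
```

Whatever the data, `1 - top2` is the share of spread the picture cannot show, which is why on-screen distances should be read loosely.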
What you should see
Click Load embeddings. The widget tokenizes ~110 words from six categories (animals, numbers, colors, countries, verbs, food), looks up each word's first-token embedding through the model worker, and runs PCA in your browser. Each word is plotted by its first sub-word token only, so a word that tokenizes into several pieces is represented here by just its opening fragment. Expect to see:
- Clusters by category. Animals near animals, numbers near numbers, etc. The clusters are usually clear despite living in a tiny 2D slice of the 1024-dim space.
- Sub-structure within clusters. Numbers often arrange themselves along a rough gradient. Colors split into warm/cool sub-regions. Food separates savoury from sweet.
- Nearest neighbours that look semantic. Click any point: the top-5 neighbours in the full hidden-dim space appear highlighted, and they're usually category-mates plus a few surprising near-misses across categories.
Why cosine, not Euclidean?
Cosine similarity measures the angle between two vectors, which is what encodes semantic direction in embedding space. Euclidean distance measures the full vector difference, which conflates direction with magnitude. The magnitudes that come out of training depend on optimization dynamics and aren't semantically meaningful: two synonyms can have very different norms but nearly the same direction. That's why cosine is the standard for embedding similarity. Some retrieval systems do use L2 distance on pre-normalised vectors, which produces the same ranking, because for unit vectors the squared L2 distance equals 2 − 2·cos θ.
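The relationship between cosine and L2 on normalised vectors can be checked numerically. This sketch assumes NumPy and random stand-in vectors with deliberately varied lengths; nothing here comes from the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 random vectors in 16 dims, each scaled to a different length.
E = rng.normal(size=(50, 16)) * rng.uniform(0.5, 5.0, size=(50, 1))

# Normalise every row to unit length.
unit = E / np.linalg.norm(E, axis=1, keepdims=True)

query = unit[0]
cos = unit @ query                          # cosine to every row
l2 = np.linalg.norm(unit - query, axis=1)   # L2 distance on unit vectors

# For unit vectors: squared distance = 2 - 2 * cosine.
print(np.allclose(l2**2, 2 - 2 * cos))      # True

# So sorting by highest cosine and by lowest distance agree.
by_cos = np.argsort(-cos)
by_l2 = np.argsort(l2)
print(np.array_equal(by_cos, by_l2))        # True
```

Without the normalisation step the two orderings generally differ, because the raw lengths leak into the distance.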
The scatter is for orientation, not measurement
One caveat governs everything you see in the scatter: it projects 1024-dim vectors onto just two axes, so it throws away 1022 dimensions of structure. Two points that look close on screen might be far apart in the real space, and vice versa. The colour groups still cluster (the dominant axes of variance often pick those up), but the distances lie — so never read similarity off the picture. That is exactly why the nearest-neighbour browser computes cosine similarity in the full 1024-dim space, not the projected 2D coordinates. Treat the scatter as a navigation aid for finding clusters, not a ruler.
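A nearest-neighbour lookup in the full space might look like the sketch below. It assumes NumPy; the function name `top_k_cosine` and the two synthetic clusters are illustrative and are not the widget's actual code.

```python
import numpy as np

def top_k_cosine(E, query_idx, k=5):
    """Return (row index, cosine) for the k rows most similar to row query_idx."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf            # exclude the query itself
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

rng = np.random.default_rng(0)

# Two synthetic "categories": 6 noisy copies of each of two centres, 32 dims.
centres = rng.normal(size=(2, 32))
E = np.vstack([
    centres[0] + 0.1 * rng.normal(size=(6, 32)),  # rows 0..5
    centres[1] + 0.1 * rng.normal(size=(6, 32)),  # rows 6..11
])

neighbours = top_k_cosine(E, query_idx=0, k=5)
for idx, sim in neighbours:
    print(idx, round(sim, 3))
```

All five neighbours of row 0 come from its own cluster (rows 1 to 5), and the similarity is computed on all 32 coordinates, never on a 2-D projection.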
Cosine similarity, side by side
The panel below skips the projection entirely: each bar shows a ballpark cosine similarity for this model family, while the live scatter on the right uses the real embeddings.
Values are illustrative; real measurements may vary by about ±0.1 between model builds. The teaching point is the *ordering*: identical > synonym > related > antonym > cross-language > unrelated.
Directions carry meaning
So far we've treated similarity as a single number between two whole vectors. But the space has finer structure: differences between embeddings can act like features. If E(woman) − E(man) roughly captures "the gender axis", then adding that difference to another vector should move it along the same axis — which is exactly the classic word2vec trick: E(king) − E(man) + E(woman) should land near queen. Subtract out "male", add in "female", and the royalty part comes along for the ride.
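The arithmetic can be sketched with hand-built toy vectors, assuming NumPy. The four coordinates below (royalty, gender, youth, fruit) are invented so that the analogy works perfectly. Real embeddings have no such labelled axes, and the result there is only approximate.

```python
import numpy as np

# Hypothetical axes: [royalty, gender, youth, fruit]
toy = {
    "king":     np.array([1.0,  1.0, 0.0, 0.0]),
    "queen":    np.array([1.0, -1.0, 0.0, 0.0]),
    "man":      np.array([0.0,  1.0, 0.0, 0.0]),
    "woman":    np.array([0.0, -1.0, 0.0, 0.0]),
    "prince":   np.array([1.0,  1.0, 1.0, 0.0]),
    "princess": np.array([1.0, -1.0, 1.0, 0.0]),
    "apple":    np.array([0.0,  0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, table):
    """Rank words by cosine to E(a) - E(b) + E(c), excluding the inputs."""
    target = table[a] - table[b] + table[c]
    scores = {w: cosine(target, v) for w, v in table.items() if w not in (a, b, c)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = analogy("king", "man", "woman", toy)
for word, score in ranked[:2]:
    print(word, round(score, 3))
```

Here `queen` ranks first and `princess` second. Excluding the three input words from the candidates is a common convention, since the unmodified `king` would otherwise tend to score highly.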
Does this still work in a 2026 LLM's embedding matrix? We measured it on this exact checkpoint, and the honest answer is "yes, with an asterisk". ·queen is already the #4 raw neighbor of king, behind the casing and plural variants ·King, King (no leading space), and ·kings (the · marks a leading space), so the arithmetic doesn't summon queen from nowhere. What it really does is promote queen to #1, past every king-variant: the direction strips the "male" component enough that the female counterpart wins. The same trick generalizes: E(Paris) − E(France) + E(Japan) puts Tokyo at #1, and because the 248,320-token vocabulary is multilingual, 东京 and 日本 show up in the same top-10 as Osaka and Kyoto.
Not every direction is so photogenic. A plurality direction E(cats) − E(cat) does exist — plural nouns dot positively with it, singulars hover near zero — but the magnitudes are faint, and the hoped-for "one < two < three" trend is barely visible. And the directions themselves are only loosely consistent: woman − man and queen − king agree at cosine 0.292, far from parallel. That's expected — this matrix was trained to help predict the next token (with input and output weights tied, doing double duty), not to pass analogy quizzes. Linear feature directions emerge as a side effect, smudged by everything else the one matrix has to encode.
Top-10 cosine neighbours of E(king) − E(man) + E(woman):
- #1 ·queen (0.578)
- #2 queen (0.493)
- #3 ·kings (0.487)
- #4 ·KING (0.469)
- #5 ·princess (0.466)
- #6 Queen (0.465)
- #7 King (0.457)
- #8 國王 (0.433)
- #9 ·Queen (0.424)
- #10 ·wanita (0.423)
Honesty check: ·queen is already king's #4 raw neighbor — behind the casing/plural variants ·King, King (no leading space), ·kings. What − man + woman really does is promote queen to #1, past every king-variant. Flip the toggle above to compare.
Take plur = E(cats) − E(cat) and measure its cosine against other words' embeddings. Plural nouns land consistently positive, singulars near zero or negative, but the magnitudes are tiny.
Numbers one → five: the hoped-for "increasing plurality" trend is barely there.
Computed offline from this checkpoint's embed_tokens.weight ([248,320 × 1,024]): raw input embeddings before any transformer layer, fp32 cosines over the full vocabulary, input words excluded from neighbor lists. The arithmetic is approximate, not exact — this matrix was trained for next-token prediction with tied input/output weights, not for analogies. One symptom: cosine(woman − man, queen − king) = 0.292, so the two "gender directions" are only loosely parallel.
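Both measurements described above reduce to a cosine involving a difference vector. The sketch below assumes NumPy and uses invented 4-D toy vectors, so its numbers do not reproduce the 0.292 figure or any other value from the checkpoint. It only shows the shape of the computation.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical axes: [royalty, gender, number, misc]
toy = {
    "man":   np.array([0.0,  1.0, 0.0,  0.2]),
    "woman": np.array([0.0, -1.0, 0.0,  0.1]),
    "king":  np.array([1.0,  1.0, 0.0, -0.3]),
    "queen": np.array([1.0, -0.6, 0.0,  0.4]),
    "cat":   np.array([0.0,  0.0, 0.0,  1.0]),
    "cats":  np.array([0.0,  0.0, 0.8,  1.1]),
    "dog":   np.array([0.0,  0.1, 0.0,  0.9]),
    "dogs":  np.array([0.0,  0.1, 0.7,  1.0]),
}

# How parallel are two estimates of the "same" direction?
gender_a = toy["woman"] - toy["man"]
gender_b = toy["queen"] - toy["king"]
agreement = cosine(gender_a, gender_b)
print("direction agreement:", round(agreement, 3))

# Probe other words against a plurality direction.
plur = toy["cats"] - toy["cat"]
plural_score = cosine(plur, toy["dogs"])
singular_score = cosine(plur, toy["dog"])
print("dogs:", round(plural_score, 3), "dog:", round(singular_score, 3))
```

An agreement of exactly 1 would mean the two difference vectors are parallel; anything lower means each pair carries extra, pair-specific components along with the shared feature.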
Why so many dimensions?
If we tried to do this in 3D, almost every concept would collide. There simply isn't room for "animals", "countries", "colors", "verbs", "syntactic role", "register", "language", and a dozen other axes of meaning to vary independently in a three-axis space. Qwen3.5-0.8B uses 1,024 embedding dimensions, which gives the model a generous amount of "room" — each independent feature can occupy its own direction without crowding the others.
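One way to get a feel for the extra room is to compare how strongly random directions overlap in 3 versus 1,024 dimensions. This sketch assumes NumPy and uses random Gaussian vectors as a stand-in; it is not a measurement on real embeddings, and the helper name `mean_abs_cosine` is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=2000):
    """Average |cosine| between pairs of random directions in `dim` dimensions."""
    a = rng.normal(size=(n_pairs, dim))
    b = rng.normal(size=(n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.mean(np.abs(cos)))

low = mean_abs_cosine(3)
high = mean_abs_cosine(1024)

print(f"3 dims:    mean |cos| = {low:.3f}")
print(f"1024 dims: mean |cos| = {high:.3f}")
```

In 3 dimensions two random directions overlap substantially on average, while in 1,024 dimensions they are close to perpendicular, so many more features can point in nearly independent directions.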
Up next (Self-attention) we'll watch how these vectors actually move as the transformer pulls information between positions — the operation that gives "the cat sat on the mat" a different meaning from "the mat sat on the cat".
- The embedding matrix is roughly a third of a small model's parameters — it is not a free lookup table.
- PCA is good for spotting clusters but its distances lie, so compute similarity in the full hidden_dim space.
- Tied input/output embeddings force one matrix to do double duty, which constrains what the geometry can encode.
Click Load embeddings, then click an animal like 'tiger'. Then add the custom word 'banana' with the input box. Which existing category does its top-5 nearest-neighbour list lean toward, and why?