Explain how the last hidden state becomes a vector of logits, and why for Qwen3.5 the LM head is literally the embedding matrix re-used.

Chapter 9 ends with a hidden state; chapter 11 starts with logits. Something has to turn one into the other.

6 min

Forward pass: Tokenize → Embedding lookup → × 24 layers → Final RMSNorm → LM head → Sampling

Glossary · 6 terms

LM head: The final linear projection of the model. Multiplies the last hidden state by a [d, V] matrix to get one logit per vocab token.
logit: One unbounded real-valued score per vocab entry — higher means the model favors that token more. They aren't probabilities yet; the sampling chapter turns them into probabilities with softmax.
weight tying: Reusing embed_tokens.weight as the LM head, so the input lookup and output projection share the same tensor (saves a large fraction of params on small models).
vocabulary size (V): Number of distinct tokens the tokenizer can emit. Qwen3.5 uses V = 248,320.
hidden dim (d): Width of the residual stream. Qwen3.5-0.8B uses d = 1024 — every token carries 1024 floats through every layer.
inner product: Sum of element-wise products. logit_j = <last_hidden, W[j, :]>: the model's score for token j is how well the hidden state aligns with that token's row of the LM head.

The LM head: hidden state → logits

Chapter 9 left us at the top of the residual stream: a single hidden vector of width d = 1024 per token, after the final RMSNorm. Chapter 11 (sampling) will start from a vector of logits — one real-valued score per vocab token. The step that connects them is one matmul, and it has a name: the LM head.

One line of math

logits = h_{last} \cdot W_{lm}^{⊤}

h_last is the last token's hidden state — shape [1, 1024] when the model is generating one token at a time (called decode — the next chapters cover it). W_lm is the LM head weight matrix — shape [V, d] = [248,320, 1024] for Qwen3.5-0.8B. The transpose makes it [1024, 248,320]; multiplying gives an output of shape [1, 248,320] — one logit per token in the vocabulary.
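The shape bookkeeping above can be sketched in plain Python at toy scale. The sizes here (d = 4, V = 6) and the random weight values are made up for illustration, and the helper name `lm_head` is hypothetical rather than a library function.

```python
import random

D = 4   # toy hidden dim (the real model uses d = 1024)
V = 6   # toy vocab size (the real model uses V = 248,320)

random.seed(0)
# W_lm has shape [V, d]: one row of width d per vocab token (random stand-in values).
w_lm = [[random.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(V)]
# h_last has shape [d]: the last token's hidden state (random stand-in values).
h_last = [random.uniform(-1.0, 1.0) for _ in range(D)]


def lm_head(h, w):
    """Compute h @ w.T: one inner product per row of w, giving one logit per row."""
    return [sum(h_i * w_ji for h_i, w_ji in zip(h, row)) for row in w]


logits = lm_head(h_last, w_lm)
print(len(h_last), "->", len(logits))  # hidden width in, one logit per vocab token out
```

Swapping in D = 1024 and V = 248,320 changes nothing about the logic, only the cost: the output is still one score per row of the weight matrix.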

One question worth pausing on: the stack produced a hidden vector for every token in the prompt, so why does only the last one go into the LM head? Because each position predicts its own next token — and when the model is generating, the only prediction it needs is the one after the final token. The other positions' predictions aren't wasted: training scores all of them at once, as the training chapter shows.
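A small sketch of that point, using made-up numbers: the stack yields one hidden vector per prompt position, but decode only projects the final one. The toy sizes and the `project` helper are assumptions for illustration.

```python
# Toy hidden states for a 3-token prompt, width d = 2 (made-up values).
hidden_states = [
    [0.5, -1.0],   # position 0
    [1.5, 0.25],   # position 1
    [-0.5, 2.0],   # position 2 (the last prompt token)
]

# Toy LM head weight, shape [V, d] with V = 3 (made-up values).
w_lm = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def project(h, w):
    """One logit per vocab row: the inner product of h with that row."""
    return [sum(a * b for a, b in zip(h, row)) for row in w]


# Decode: only the last position's prediction is needed.
next_token_logits = project(hidden_states[-1], w_lm)

# Training-style view: every position predicts its own next token.
all_logits = [project(h, w_lm) for h in hidden_states]

print(next_token_logits)
```

The last row of `all_logits` is the same list of scores as `next_token_logits`; generation simply skips computing the other rows.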

The animation walks the matmul cell by cell. The scan beam highlights one output column at a time, with the matrix column that produces it lit up beside it. Every output entry is one independent inner product — which is also why this op parallelizes so cleanly on a GPU. The matrix is drawn transposed, so each token's fingerprint appears as a column — the label under the beam tracks which token's column is being scored.

The final matmul — hidden state → logits

One matrix-vector product at the very top of the stack: logits = last_hidden @ embed_tokens.weight.T. For Qwen3.5-0.8B that means a [1, 1024] vector multiplied by a [1024, 248320] matrix → a [1, 248320] output, one score per vocab token. The scan beam shows which output column is being produced.

Every output entry is one inner product: logit_j = sum_i (last_hidden[i] · W[i, j]). Each column of this [d, V] matrix — equivalently, a row of the untransposed [V, d] weight — is the "fingerprint" of one vocab token; when that fingerprint points in roughly the same direction as the hidden state, the logit for that token is high.

Illustrative/schematic — the 12×36 grid and bar heights are stand-ins for the real [1024 × 248,320] matmul, not actual model logits.

What each output cell means

Think of each row of W_lm as a learned "fingerprint" for one vocab token. The logit for token j is the inner product of h_last with that row:

logit_{j} = ⟨ h_{last}, W_{lm} [j, :]⟩

So high logit ⇔ hidden state points in roughly the same direction as that token's row. The geometry of the hidden state is what determines the next-token distribution, and the LM head's rows are the dictionary the model uses to read out that geometry.
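To make the "fingerprint" reading concrete, here is a toy sketch with a three-token vocabulary. The token strings, the row values, and the hidden state are invented for illustration; none of them come from the model.

```python
# Hypothetical 3-token vocab with hand-picked rows of W_lm (shape [V, d], d = 3).
vocab = [" mat", " dog", " the"]
w_lm = [
    [1.0, 0.0, 0.0],   # row for " mat"
    [0.0, 1.0, 0.0],   # row for " dog"
    [0.0, 0.0, 1.0],   # row for " the"
]

# A made-up hidden state that points mostly along the " mat" row.
h_last = [0.9, 0.1, -0.4]

# logit_j = <h_last, W_lm[j, :]>
logits = [sum(h * w for h, w in zip(h_last, row)) for row in w_lm]

# The row with the largest inner product gets the highest logit.
best = max(range(len(vocab)), key=lambda j: logits[j])
for token, logit in zip(vocab, logits):
    print(f"{token!r}: {logit:+.2f}")
print("highest logit:", repr(vocab[best]))
```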

The inner product unpacks as h · w = |h|·|w|·cos θ, where cos θ is the cosine of the angle between the two vectors — it runs from +1 when they point the same way, through 0 at a right angle, to −1 when they point opposite ways. With every token's row at a comparable length, the logit is really just a readout of the angle between the hidden state and that row. Toggle the three cases below — aligned, orthogonal, opposed — and watch the per-dimension products (which sum to the logit) and the cosine move together.

A logit is an inner product

Every vocabulary token owns a row w_t in the output matrix (the tied embedding); its logit is the dot product of that row with the final hidden state h, and the softmax (shown elsewhere in the course) turns the whole vector of logits into probabilities. All three candidates here have the same length |w| = 1.5, so the only thing that moves the logit is the angle to h.

logit_{t} = h \cdot w_{t} = \sum_{d} h_{d} w_{t, d} = ∥ h ∥ ∥ w_{t} ∥ cos θ

Per-dimension contribution h_d · w_d (legend: positive / negative), the eight bars: +0.817, +0.161, +0.494, +0.040, +0.363, +0.252, +0.010, +0.091

Σ_d h_d·w_d = logit = +2.230 — the eight bars above sum to this single score.

cos θ = (h·w)/(|h||w|) = +1.000, θ ≈ 0° (|h| ≈ 1.487, |w| = 1.5)

Illustrative 8-dimensional toy vectors (the real hidden state is 1024-dim) — not live output from the model.
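The same three cases can be reproduced in a few lines of plain Python. The 2-D vectors below are made up for illustration (they are not the 8-dimensional toy vectors from the widget); only the shared row length |w| = 1.5 is taken from the text, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def rescale(v, length):
    """Return v scaled to the given Euclidean length."""
    n = norm(v)
    return [x * length / n for x in v]


# Made-up 2-D hidden state (the real one is 1024-dim).
h = [3.0, 4.0]

# Three candidate rows, all with length 1.5, differing only in direction.
candidates = {
    "aligned": rescale([3.0, 4.0], 1.5),
    "orthogonal": rescale([-4.0, 3.0], 1.5),
    "opposed": rescale([-3.0, -4.0], 1.5),
}

results = {}
for name, w in candidates.items():
    logit = dot(h, w)                          # the score the LM head would emit
    cos_theta = logit / (norm(h) * norm(w))    # the angle part of that score
    results[name] = (logit, cos_theta)
    print(f"{name:>10}: logit = {logit:+.3f}, cos = {cos_theta:+.3f}")
```

Because |h| and |w| are fixed across the three cases, the logit and the cosine differ only by the constant factor |h|·|w|, which is the sense in which the logit "reads out" the angle.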

This is why you'll sometimes see tokens you didn't expect near the top of the top-K: their rows happen to be well aligned with the hidden state, even if the model wouldn't end up sampling them. The top-K panel on the right will show this concretely once you hit Run.

Weight tying: the same matrix, used twice

Here's the surprising part. The matrix W_lm in the formula above isn't a separately learned tensor. For Qwen3.5 (and most modern decoder LLMs below 7B parameters), it is literally the embedding matrix from chapter 3: the same [248,320, 1024] grid of floats that mapped token ids to vectors at the bottom of the stack is reused (transposed) at the top.

Weight tying — one matrix, used twice

Qwen3.5-0.8B (like most small modern decoder LLMs) sets tie_word_embeddings = true. That means the embedding matrix at the input and the LM head at the output are literally the same tensor in memory: the same [248,320 × 1024] grid of floats, used once for id → vector and once (transposed) for vector → vocab scores.

Step 1: embedding lookup — read row of embed_tokens.weight

Parameter savings: the matrix is 248,320 × 1024 ≈ 254.3M floats. Tying skips a second copy at the LM head, a ~254M-parameter reduction for a model that ships at about 0.85B parameters. An untied version would be roughly 30% larger, and all of that is avoided just by reusing the dictionary.

Conceptually tying says: the same dictionary that maps a token id to its incoming representation also maps an outgoing representation back to a vocab score. Reading and writing share one alphabet. Not every model ties — large GPT-style models sometimes keep them separate for a small quality win — but for sub-billion-parameter models, tying is the standard.

Chapter 3 noted this saves a second copy — here’s that copy, drawn.

Tied (what Qwen3.5-0.8B ships)

embedding table	254,279,680
second copy (LM head)	bought once, used twice
model total	852,985,920

Untied (hypothetical)

embedding table	254,279,680
second copy (LM head)	+254,279,680
embedding subtotal	508,559,360
model total	1,107,265,600

That one 254,279,680-float table is 29.8% of the whole model.

Scale sets the stakes. GPT-3’s table was 50,257 × 12,288 = 617,558,016 (≈617.6M), about 2.4× the size of this one, and GPT-3 tied its embeddings too. Leaving a table that big untied would have meant a second copy and ≈1,235,116,032 (~1.24B) parameters on lookup tables alone. The larger the table (V × d), the more parameters tying saves in absolute terms, which is why tying appears at both ends of this range, from 0.8B up to 175B.
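The arithmetic in the tables and the GPT-3 comparison above can be checked directly. Every number below is copied from the text; the variable names are just labels chosen for this sketch.

```python
# Figures quoted in the text above.
V, D = 248_320, 1024
TIED_TOTAL = 852_985_920            # Qwen3.5-0.8B model total with tying, as stated above

table = V * D                       # embedding table parameters
untied_total = TIED_TOTAL + table   # hypothetical total with a separate LM head copy
share = table / TIED_TOTAL          # fraction of the tied model taken up by the table

gpt3_table = 50_257 * 12_288        # GPT-3 table dimensions quoted above
ratio = gpt3_table / table          # how much bigger GPT-3's table is

print(f"table        = {table:,}")
print(f"untied total = {untied_total:,}")
print(f"share        = {share:.1%}")
print(f"GPT-3 table  = {gpt3_table:,} ({ratio:.1f}x)")
print(f"GPT-3 untied = {2 * gpt3_table:,}")
```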

This is called weight tying, controlled by tie_word_embeddings = true in the config. There's a clean intuition behind it: the same "alphabet" the model uses to read tokens in should also be the alphabet it uses to write them out. If row j of embed_tokens.weight is what "the" looks like as a hidden vector, then the LM head should give "the" a high score whenever the residual stream looks like that row — and row_j · h is exactly that test.
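Here is a minimal sketch of what "the same tensor, used twice" means, in plain Python rather than a real framework. The class `TinyTiedModel`, its attribute names, and the 3 × 2 table are hypothetical; a real model has 24 layers between the lookup and the projection, so the hidden state is not simply the raw embedding row as it is here.

```python
class TinyTiedModel:
    """Hypothetical toy model: one weight table serves as both embedding and LM head."""

    def __init__(self, table):
        self.embed_weight = table      # shape [V, d]
        self.lm_head_weight = table    # the same object, not a copy (this is the tying)

    def embed(self, token_id):
        # Input side: id -> vector is a row lookup.
        return self.embed_weight[token_id]

    def lm_head(self, h):
        # Output side: vector -> vocab scores is h @ W.T, one inner product per row.
        return [sum(a * b for a, b in zip(h, row)) for row in self.lm_head_weight]


# Made-up table with V = 3 rows of equal length, d = 2.
model = TinyTiedModel([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# If the residual stream looks like row j, token j gets the highest score.
h = model.embed(1)
scores = model.lm_head(h)
print(scores)
print(model.lm_head_weight is model.embed_weight)
```

Because both attributes point at one object, there is only one table to store and only one table for training to update, which is where the parameter savings come from.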

The practical win: a sub-billion-parameter model saves ~254M parameters (close to a third of its total) by not duplicating the matrix. The trade-off: a small quality loss at very large scale, which is why GPT-style models beyond a few billion parameters sometimes untie the head. For Qwen3.5-0.8B, the savings dominate.

What you'll see on the right

The panel runs one inspector call: tokenize the prompt, run a forward pass, capture the top-K logits at the last position, render them as bars. Each row is one of the 12 highest-logit vocab tokens; the row highlighted with the primary accent is the token the model actually sampled (greedy at this stage — chapter 11 will introduce temperature and top-p).

What to notice: the panel runs softmax over those logits so the bar heights read as probabilities (each panel's bars sum to 1). But the raw scores the LM head emits are logits — arbitrary real numbers that can be negative and whose magnitudes aren't comparable across prompts. Chapter 11 is where temperature and top-p reshape that softmax into the model's actual choice.
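The logits-to-bars step can be sketched as follows. This assumes the panel takes the softmax over only the K logits it displays (which is what makes the bars sum to 1); the function name `top_k_softmax` and the six logits are made up for illustration.

```python
import math


def top_k_softmax(logits, k):
    """Return (index, probability) pairs for the k largest logits.

    The softmax is taken over the k kept logits only, so the returned
    probabilities sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    m = max(logits[j] for j in top)                  # subtract the max for numerical stability
    exps = [math.exp(logits[j] - m) for j in top]
    total = sum(exps)
    return [(j, e / total) for j, e in zip(top, exps)]


# Made-up logits for a 6-token vocab; note that raw logits can be negative.
logits = [2.0, -1.5, 0.5, 3.0, -0.25, 1.0]
bars = top_k_softmax(logits, k=3)
greedy_pick = bars[0][0]   # greedy decoding takes the highest-logit token

for j, p in bars:
    print(f"token {j}: p = {p:.3f}")
```

Adding the same constant to every logit leaves these probabilities unchanged, which is one reason raw logit magnitudes are hard to compare across prompts.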

Engineering takeaways

The LM head is just one matmul: last_hidden @ embed_tokens.weight.T → logits. Every vocab entry gets one score.
Each row of the weight matrix is a learned "fingerprint" for one vocab token; the highest logit goes to the token whose row has the largest inner product with the hidden state (for rows of comparable length, the best-aligned one).
Qwen3.5 ties embed_tokens.weight with lm_head.weight — the same ~254M-float tensor is used at the input lookup AND the output projection.

Try this

Hit Run on 'The cat sat on the' and look at the top-K logits panel. The highest bar is the model's pick. Now find a token further down the top-K whose text doesn't look 'like a real continuation' to you (e.g. a pronoun like ' his' turning up in a noun slot). What does its presence in the top-K tell you about the LM head's geometry?

Quick check

1. What is the shape of the LM head matrix for Qwen3.5-0.8B?

[1024, 1024]: one hidden vector per token.
[248320, 1024]: one row per vocab token, of width equal to the hidden dim.
[248320, 248320]: a vocab × vocab transition matrix.

2. What does 'tie_word_embeddings = true' mean in Qwen3.5's config?

The embedding values are clamped to [-1, 1] for stability.
The same tensor is used for both the input embedding lookup and the output LM head; no separate lm_head.weight is allocated.
The model can only emit tokens it was previously shown.

3. Which statement about the LM head's rows is true?

Each row is the average hidden state of every prompt that ever produced that token.
Each row is a learned vector: token j's logit is the inner product of that row with the last hidden state.
Each row is a one-hot vector for one vocab id.