Chapter 4 · Self-attention

Read a softmax(QK^T/sqrt(d))V attention map and explain what every cell, row, and column means.

Without attention, a transformer has no way to let one token's representation depend on what came before it.

9 min
Forward pass: Tokenize → Embedding lookup → × 24 layers → Final RMSNorm → LM head → Sampling
Glossary · 7 terms
query (Q)
A learned projection of a token saying 'what am I looking for?' — one query vector per token, per head.
key (K)
A learned projection saying 'what do I offer if you look at me?' — compared with Q via dot product.
value (V)
A learned projection that carries the actual content a token contributes when it is attended to.
attention pattern
One row of softmax(QK^T/sqrt(d)): a probability distribution over the current and earlier positions (0..i) for a single query token.
scaled dot product
QK^T divided by sqrt(d_head). The scale keeps softmax in a regime where gradients don't vanish as d_head grows.
causal mask
Setting the upper triangle of the score matrix to -inf so position i can only attend to positions 0..i.
self-attention vs cross-attention
Self-attention (this chapter) draws Q, K and V from one running sequence — tokens look at each other. Cross-attention is the sibling from machine translation: queries come from one sequence, keys/values from another. Our text-only decoder path only ever does self-attention.

Self-attention: how every token looks at every other token

Up to this chapter we've turned text into tokens (chapter 2) and tokens into vectors (chapter 3). The interesting question is now: how does a token know about the rest of the sentence? The answer modern LLMs use is self-attention.

The intuition

Intuitively, each token looks over all the earlier tokens and pulls in the information it needs — like re-reading the sentence and focusing on the few words that matter for what comes next.

Imagine you're the model and someone hands you the prompt "The cat sat on the ___" and asks for the next word. To guess well you have to look back at the earlier tokens: cat tells you the subject is an animal, sat tells you it's resting on something, on the tells you a noun is coming next. Self-attention is a learned version of that "look back": for every token in the sequence, the model decides how much attention to pay to each token it is allowed to see. The panel on the right runs exactly this prompt and shows you the attention pattern.

Same word, same starting vector

Here's the cleanest way to see why that look-back is necessary. The embedding lookup from chapter 3 is context-free: it's a table indexed by token id, so the word bank fetches the same row of embed_tokens.weight whether the sentence is about money, an aircraft turn, or a riverside picnic. At the moment the last token enters the layer stack, the model literally cannot tell those meanings apart — the three starting vectors are byte-identical.
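To make "context-free" concrete, here is a minimal sketch of a table lookup. The table is random and tiny, and every token id except 5883 is made up for illustration; only the indexing behaviour matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the real table is far larger.
vocab_size, d_model = 6000, 8
embed_tokens = rng.standard_normal((vocab_size, d_model))  # stand-in for embed_tokens.weight

BANK = 5883  # token id of "·bank" quoted in this chapter

# Three prompts as token-id lists. Every id except the last is invented.
prompts = {
    "money":     [11, 402, 77, BANK],
    "flying":    [950, 13, 2048, BANK],
    "riverside": [5, 321, 64, BANK],
}

# The lookup is plain row indexing: no context goes in, so none comes out.
last_vectors = {name: embed_tokens[ids][-1] for name, ids in prompts.items()}

same = all(np.array_equal(v, embed_tokens[BANK]) for v in last_vectors.values())
print(same)  # True
```

Whatever precedes the last token, the vector at that position is the same row of the table. Any difference in meaning has to be added later by the layer stack.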

Pulling those meanings apart is the whole job of the layers above, and you can watch the result at the other end of the pipeline: feed this model three prompts that all end in the very same token and its next-token bets diverge completely. One caveat to keep you honest — in this model the disambiguation work is shared by the 6 full-attention layers and the 18 GatedDeltaNet linear-attention layers (chapter 12), so "attention did it" is shorthand for "the context-mixing stack did it."

One token, three contexts · real model output
Top-5 next-token predictions · this exact Qwen3.5-0.8B checkpoint
·bank · token id 5883

All three prompts end in the byte-identical token ·bank — the embedding lookup returns the same row of embed_tokens.weight every time, so the layer stack starts from the same vector at that position.

money
My rent is paid straight from the·bank
  1. ·account  0.351
  2. .  0.275
  3. ,  0.241
  4. ·and  0.036
  5. ·in  0.022

The model lines up ·account at #1 — it read the “rent is paid from” context, not the word alone.

flying
The aircraft rolled into a 30-degree·bank
  1. ·angle  0.277
  2. ,  0.188
  3. ·at  0.160
  4. .  0.126
  5. ·and  0.111

·angle jumps to #1 (“bank angle” is the aviation phrase) — a continuation the other contexts never surface.

riverside
They had a picnic on the river·bank
  1. .  0.468
  2. ,  0.224
  3. ·and  0.051
  4. ·in  0.049
  5. ·where  0.040

Here the model mostly thinks the sentence is complete (47% for “.”), with location words like ·where further down.

Real recorded output, not scripted: each list was captured by running these exact prompts (raw completion, no chat template) through the LM-head inspector of this playground — the same Qwen3.5-0.8B checkpoint the rest of the course runs — which captures the top-12 next-token logits at the final position; the probabilities shown are the softmax over those captured logits, as the inspector displays them. The “same starting vector” claim is by construction: the embedding lookup is a table indexed by token id, and all three prompts end in token id 5883.

Q, K, V — three projections of the same vector

Each token comes in as a vector x (its embedding, after earlier layers have refined it). The model linearly projects x three different ways (a "projection" here just means multiplying the vector by a learned matrix to get a new vector):

  • Q = x · Wq — the query: what is this token looking for?
  • K = x · Wk — the key: what does this token offer if you look at it?
  • V = x · Wv — the value: what should this token contribute when it gets looked at?

Each is a vector of length d_head (the per-head dimension — the same number the original "Attention Is All You Need" paper calls d_k and the model's config calls head_dim). Crucially these aren't three different tokens — they're three different views of the same token, learned separately so attention can do something more interesting than just "compare embeddings." In this model: x is 1,024 numbers → 8 query vectors of 256 each, plus 2 key/value pairs of 256 (the 8-vs-2 trick is GQA, next chapter).

One more thing a beginner has to get right: the same three matrices Wq, Wk, Wv are applied to every token at every position — they're learned once and shared across the whole sequence, so the projection itself carries no notion of where a token sits; positional information is injected separately (by RoPE, chapter 6).
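The shapes quoted above can be sketched in NumPy. This is an illustration under assumptions, not the model's code: the weight matrices are random stand-ins, and any bias, normalisation, or gating the real layer may apply is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, head_dim = 1024, 256      # sizes quoted in this chapter
n_q_heads, n_kv_heads = 8, 2       # 8 query heads, 2 key/value pairs
seq_len = 5                        # e.g. "The cat sat on the"

# Random stand-ins for the learned matrices (real weights come from the checkpoint).
Wq = rng.standard_normal((d_model, n_q_heads * head_dim)) / np.sqrt(d_model)
Wk = rng.standard_normal((d_model, n_kv_heads * head_dim)) / np.sqrt(d_model)
Wv = rng.standard_normal((d_model, n_kv_heads * head_dim)) / np.sqrt(d_model)

x = rng.standard_normal((seq_len, d_model))   # one row per token

# One matmul projects every position with the same matrix.
Q = (x @ Wq).reshape(seq_len, n_q_heads, head_dim)
K = (x @ Wk).reshape(seq_len, n_kv_heads, head_dim)
V = (x @ Wv).reshape(seq_len, n_kv_heads, head_dim)

print(Q.shape, K.shape, V.shape)   # (5, 8, 256) (5, 2, 256) (5, 2, 256)

# Position-blindness: shuffling the tokens just shuffles the rows of Q.
perm = rng.permutation(seq_len)
Q_perm = (x[perm] @ Wq).reshape(seq_len, n_q_heads, head_dim)
print(np.allclose(Q_perm, Q[perm]))  # True
```

The second check is the "no notion of where a token sits" point in code: the projection of a token depends only on its vector, never on its row index.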

A naming note while it's fresh: because Q, K and V all come from the one running sequence, this whole chapter is self-attention. Its sibling, cross-attention, draws queries from one sequence and keys/values from another; the original use was machine translation, where a decoder query reads an encoder's keys and values. Cross-attention doesn't appear anywhere in this model family, though, not even in the full multimodal checkpoint. Vision fuses in via early fusion instead: projected image-patch tokens get spliced directly into the same self-attention sequence as the text tokens. Vision input is off in this demo, and none of the 6 full-attention layers here do cross-attention.

The formula

The core operation, for a single head, is one line of math:

Attention(Q, K, V) = softmax(Q · Kᵀ / √d_head) · V

Let's read it left to right. Q · Kᵀ (the ᵀ flips rows and columns — notation for "dot every query with every key") is a [seq_len, seq_len] matrix: cell (i, j) is the dot product of token i's query with token j's key — a raw score for "how well does i's question match j's offer?"

We divide by √d_head for a numerical reason: as d_head grows, the typical magnitude of a dot product grows too, and without scaling the softmax would saturate (one cell near 1.0, the rest near 0.0) and gradients would vanish (a gradient is the training signal that tunes the weights; it is covered in the Training chapter, so don't worry if it's fuzzy here). If the components of a query and a key are roughly independent with unit variance, a random dot product has standard deviation ≈ √d_head, so dividing by √d_head keeps the scores in a regime where softmax is informative.
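The √d_head figure is easy to check empirically. The sketch below assumes query and key components are independent standard normals, which is an idealisation rather than a property of trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

spread = {}
for d_head in (4, 16, 64, 256):
    q = rng.standard_normal((n_samples, d_head))
    k = rng.standard_normal((n_samples, d_head))
    dots = np.sum(q * k, axis=1)               # one random dot product per sample
    raw_std = float(dots.std())
    scaled_std = float((dots / np.sqrt(d_head)).std())
    spread[d_head] = (raw_std, scaled_std)
    print(f"d_head={d_head:4d}  sqrt={np.sqrt(d_head):5.1f}  "
          f"raw std={raw_std:6.2f}  scaled std={scaled_std:4.2f}")
```

Under that assumption the raw standard deviation should track the sqrt column closely, while the scaled one should stay near 1 at every width.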

The widget below makes that one sentence tangible. Drag the head dimension and watch the same set of key alignments: without the scale, the distribution collapses toward a single key as d_head grows; with the ÷√d_head scale it stays exactly the same at every width.

Why divide by √d_k
Same key alignments, swept across head_dim (slider stops: 4, 16, 64, 256)

         without ÷√d    with ÷√d
key 0    1.000          0.523
key 1    <0.001         0.235
key 2    <0.001         0.117
key 3    <0.001         0.071
key 4    <0.001         0.035
key 5    <0.001         0.019
top      1.000          0.523

Without ÷√d: softmax(U · √d); peaks higher as d grows, near one-hot at 256.
With ÷√d: softmax(U · √d / √d) = softmax(U); identical at every d.

As d_k grows, a random dot product's spread grows like √d_k, so the unscaled scores blow up and softmax collapses toward one key — vanishing gradients. Dividing by √d_k cancels exactly that growth, so the scaled distribution is unchanged by the head width. Qwen3.5-0.8B uses head_dim = 256, so it divides every score by √256 = 16.

Illustrative — the six key alignments are fixed synthetic numbers scaled by √d, not live output from the model.
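The widget's behaviour can be approximated offline. The underlying alignments U are not given in the text, so this sketch back-derives a stand-in from the displayed scaled distribution, using the fact that softmax(log p) = p. Treat U here as an assumption, not the widget's actual numbers.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Scaled distribution as displayed by the widget.
p_scaled = np.array([0.523, 0.235, 0.117, 0.071, 0.035, 0.019])
U = np.log(p_scaled)         # assumed stand-in for the widget's alignments

top_unscaled, top_scaled = {}, {}
for d in (4, 16, 64, 256):
    raw = U * np.sqrt(d)                                      # scores grow like sqrt(d)
    top_unscaled[d] = float(softmax(raw).max())               # no correction
    top_scaled[d] = float(softmax(raw / np.sqrt(d)).max())    # with the 1/sqrt(d) scale
    print(d, round(top_unscaled[d], 3), round(top_scaled[d], 3))
```

With this stand-in, the scaled top weight stays at 0.523 for every d, while the unscaled top weight climbs toward 1.000 as d reaches 256.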

softmax turns each row of the score matrix into a probability distribution: every row sums to 1. That's the per-token attention pattern. Multiplying that [seq_len, seq_len] distribution by V mixes the values together, weighted by the pattern — and that mixture is what flows out of the attention layer as the new representation of token i.

One thing that's easy to picture wrong: that mixed value vector is not a wholesale replacement for the token. It's a small correction that gets added back onto the vector the token carried in — attention writes a delta onto the running representation, it doesn't overwrite it. That running representation has a name, the residual stream, and the add-don't-replace rule is the seam every block in this model is wired around; the full transformer block chapter shows the wiring.

That [seq_len, seq_len] matrix is also where attention gets expensive. Three optional deep-dives go one level further — into the memory cost, how to put a number on it, and the trick that sidesteps it. All are skippable on a first read.

Causal masking — and why it's the load-bearing wall of decoder LLMs

Self-attention as defined so far is bidirectional: every token can see every other token, earlier or later. For an encoder (think BERT and fill-in-the-blank tasks) that's fine. For a generative LLM it's a catastrophe. The whole point of training is to predict the next token from what came before. If the model is allowed to look at what comes after while training, it can short-circuit the entire problem by copying.

So before the softmax we set the upper-triangle of QKᵀ to -∞ (equivalently, add a mask matrix that's 0 on and below the diagonal and -∞ above, then softmax each row). After softmax those cells become exactly 0 — token i can only attend to keys 0 … i. The heatmap on the right is lower-triangular for exactly this reason. The bottom row — the last token of the prompt — is the only row that sees the whole prompt, which is why that row produces the next-token prediction.
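A minimal NumPy sketch of that masking step follows. The scores are random placeholders and `causal_softmax` is a name chosen here for illustration.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax with everything above the diagonal masked out."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True where j > i
    masked = np.where(future, -np.inf, scores)            # future keys get -inf
    masked = masked - masked.max(axis=1, keepdims=True)   # stabilise exp
    weights = np.exp(masked)                              # exp(-inf) is exactly 0
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 5))   # placeholder for Q @ K.T / sqrt(d_head)
attn = causal_softmax(scores)

print(np.round(attn, 2))
```

Row 0 comes out as [1, 0, 0, 0, 0] whatever the scores are, because the first token has only itself to look at. Every later row spreads its weight over positions 0 … i.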

The toggle widget below makes this concrete: flip the mask off, and watch the model's attention for the very first token reach forward into the future of the sequence.

Why we mask — flip it off and watch the model cheat
Causal mask: ON

The matrix below is softmax(QKᵀ / √d) for the same five fixed scores as the widget above. Toggle the mask off and watch the highlighted row — the query for "The" — start attending to tokens that haven't happened yet.

attention weights (causal mask applied)

        The    ·cat   ·sat   ·on    ·the
The     1.00   0.00   0.00   0.00   0.00
·cat    0.23   0.77   0.00   0.00   0.00
·sat    0.07   0.29   0.64   0.00   0.00
·on     0.22   0.12   0.18   0.48   0.00
·the    0.20   0.12   0.06   0.17   0.45
Mask ON. Row 0 (the query for "The") has only one position to look at — itself. Its prediction must rely on what "The" alone tells the model. That's the regime the model trains in: every next-token prediction is made before the answer is visible.

That single triangular mask is the only thing keeping training honest. It costs nothing at inference (the upper triangle is never even computed for decode steps) and everything pedagogically — it's the structural difference between encoder and decoder transformers.

A subtle but important consequence: the mask is what makes parallel training possible. With the mask in place, you can shove the entire sequence into the model in one forward pass and compute the loss at every position simultaneously — each position's prediction is independent of the others because none of them can see ahead. Without the mask, you'd have to feed tokens one by one. The triangular sparsity pattern is the single architectural decision that makes decoder-only training tractable at scale.
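One way to see why a single pass is safe: with the mask on, the output at position i from a full-sequence pass matches what you would get by feeding only tokens 0 … i. The sketch below checks that for one head with random Q, K, V. The names and sizes are illustrative.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head masked attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))

full = causal_attention(Q, K, V)                  # all 5 positions in one pass
prefix = causal_attention(Q[:3], K[:3], V[:3])    # only the first 3 tokens

print(np.allclose(full[:3], prefix))  # True
```

Appending tokens 3 and 4 changes nothing for positions 0 to 2, so computing every position at once gives the same answers as growing the sequence one token at a time.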

What each cell in the heatmap means

The heatmap on the right is exactly the softmax(QKᵀ/√d) matrix for one layer and one head:

  • rows are queries (the token doing the looking)
  • columns are keys (the token being looked at)
  • a bright cell at (i, j) means "when computing the next representation of token i, the model pulls a lot from token j's value"
  • every row sums to 1; the upper triangle is 0 because of causal masking
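The reading rules above can be applied mechanically. This sketch uses the illustrative 5×5 matrix from the mask widget earlier in the chapter (fixed numbers, not live model output) and asks which key the last query leans on most.

```python
import numpy as np

tokens = ["The", "·cat", "·sat", "·on", "·the"]

# Rows are queries, columns are keys (values copied from the mask widget).
attn = np.array([
    [1.00, 0.00, 0.00, 0.00, 0.00],
    [0.23, 0.77, 0.00, 0.00, 0.00],
    [0.07, 0.29, 0.64, 0.00, 0.00],
    [0.22, 0.12, 0.18, 0.48, 0.00],
    [0.20, 0.12, 0.06, 0.17, 0.45],
])

row_sums = attn.sum(axis=1)                              # each should be 1
upper_is_zero = bool(np.all(np.triu(attn, k=1) == 0))    # causal mask

i = len(tokens) - 1                # bottom row: the last token of the prompt
j = int(attn[i].argmax())          # brightest cell in that row
print(f"query {tokens[i]!r} pulls most from key {tokens[j]!r} ({attn[i, j]:.2f})")
# query '·the' pulls most from key '·the' (0.45)
```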

Look at the long edges the fan-out graph draws under the heatmap: a single full-attention step can route a far-back token's content straight to the current position, no matter how many tokens sit in between. Distance is free here — only relevance costs, because the score is a dot product, not a function of how far apart two tokens are. (This free-reach claim is about the 6 full-attention layers; the 18 GatedDeltaNet layers reach back through a running state instead, and the KV-cache chapter covers why the model still needs both.)

Why multiple heads, multiple layers

Different heads end up specializing — one might track "the previous token," another "the start of the sentence," another "syntactically related noun." Stacking attention layers lets later layers compose these patterns: chapter 5 looks at heads, chapter 9 at the full stack.

One subtlety about this model: it is hybrid. Only every 4th layer — 6 of its 24 — uses the full softmax self-attention you have been reading about. The other 18 use a cheaper linear-attention variant (GatedDeltaNet) that keeps a fixed-size running state instead of attending over all past tokens (covered in the KV-cache chapter). The live heatmap on the right reflects that: scrub the layer slider onto one of the 6 full-attention layers to see real scores; land on a linear layer and the panel just notes that it does not expose a softmax attention matrix.

A note on the head slider: this model has 8 query heads but only 2 key/value groups that those heads share — grouped-query attention (GQA), which the next chapter unpacks. For now, just read each head as its own attention pattern.

Working a tiny example by hand

The matrix algebra is easier to feel with concrete numbers. The widget below walks two tokens, "The" and "cat", through every step of the attention formula. All numbers are computed from fixed Q, K, V values; step through and watch the score matrix, scale, mask, softmax, and final output appear.

Toy 2-token attention · hand-worked numbers
1. Embeddings

Two tokens, each a tiny 2-d vector.

x[The] = [1.00, 0.20]
x[cat] = [0.30, 0.90]
2. Q = x · W_Q, K = x · W_K, V = x · W_V

W_Q and W_K are identity; W_V is non-identity so the final mix is visible.

Q[The] = [1.00, 0.20]   Q[cat] = [0.30, 0.90]
K[The] = [1.00, 0.20]   K[cat] = [0.30, 0.90]
V[The] = [0.54, -0.16]   V[cat] = [0.33, 0.54]
3. Scores = Q · K^T

Each cell (i, j) is the dot product of Q[i] with K[j] — multiply matching slots, add up. One cell expanded: Q[cat]·K[The] = 0.3×1.0 + 0.9×0.2 = 0.48.

          K[The]   K[cat]
Q[The]    1.04     0.48
Q[cat]    0.48     0.90
4. Scale by sqrt(d_k)

d_k = 2, sqrt(d_k) = 1.414. Divide every cell.

          K[The]   K[cat]
Q[The]    0.74     0.34
Q[cat]    0.34     0.64
5. Mask (causal)

Token "The" cannot see "cat" (it comes later) — set the upper-right cell to -inf so it disappears after softmax.

          K[The]   K[cat]
Q[The]    0.74     -∞
Q[cat]    0.34     0.64
6. Softmax per row

Each row becomes a probability distribution. Row "The" has only one valid key, so it attends 100% to itself. Row "cat" splits its attention 0.43 / 0.57 between the two tokens.

          K[The]   K[cat]
Q[The]    1.00     0.00
Q[cat]    0.43     0.57
7. Output = attn · V

Mix the value vectors using the attention weights. "The" ends up exactly V[The]; "cat" is a weighted blend of V[The] and V[cat].

out[The] = [0.54, -0.16]
out[cat] = [0.42, 0.24]
Each row of step 6 sums to 1 — that's softmax.
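The seven steps above can be reproduced in a few lines of NumPy. The walkthrough lists the value vectors but not W_V itself, so this sketch plugs V in directly rather than guessing the matrix.

```python
import numpy as np

# Step 1: embeddings
x = np.array([[1.00, 0.20],     # The
              [0.30, 0.90]])    # cat

# Step 2: W_Q and W_K are identity; V is taken as listed in the walkthrough.
W_Q = np.eye(2)
W_K = np.eye(2)
Q, K = x @ W_Q, x @ W_K
V = np.array([[0.54, -0.16],
              [0.33,  0.54]])

scores = Q @ K.T                                    # step 3
scaled = scores / np.sqrt(2)                        # step 4, d_k = 2
future = np.triu(np.ones((2, 2), dtype=bool), k=1)
masked = np.where(future, -np.inf, scaled)          # step 5
weights = np.exp(masked - masked.max(axis=1, keepdims=True))
attn = weights / weights.sum(axis=1, keepdims=True) # step 6
out = attn @ V                                      # step 7

print(np.round(scores, 2))  # rows: [1.04 0.48] and [0.48 0.9]
print(np.round(attn, 2))    # rows: [1. 0.] and [0.43 0.57]
print(np.round(out, 2))     # rows: [0.54 -0.16] and [0.42 0.24]
```

Changing any entry of x and rerunning is a quick way to see how the 0.43 / 0.57 split in the "cat" row responds to the scores.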

The causal mask, cell by cell

A second view of the same mask: a 5×5 grid where you can click any cell to see what "i sees 0..i" means concretely for this prompt.

Causal mask · 5 × 5
Green = allowed (i >= j) · gray = masked (i < j); rendered here as ✓ = allowed, ✗ = masked

             K 0    K 1    K 2    K 3    K 4
             The    ·cat   ·sat   ·on    ·the
Q 0  The     ✓      ✗      ✗      ✗      ✗
Q 1  ·cat    ✓      ✓      ✗      ✗      ✗
Q 2  ·sat    ✓      ✓      ✓      ✗      ✗
Q 3  ·on     ✓      ✓      ✓      ✓      ✗
Q 4  ·the    ✓      ✓      ✓      ✓      ✓
Row 4 attends to col 0: "the" can see "The".

At training time, the mask makes teacher forcing safe: we feed the whole sentence in parallel and predict each next token, but no position can see its own future answer. Without the mask, next-token prediction would be a trivial copy.

At inference time, the same mask is still applied even though token N + 1 physically doesn't exist yet — this keeps the math identical between train and serve, which is what makes KV caching (chapter 12) valid.

Hit Run on the right. The model's predicted next token appears at the end of the prompt with a rainbow shimmer, and the heatmap shows the real post-softmax attention scores from the forward pass that produced it. Edit the prompt or change layer / head to watch the pattern (and the prediction) change.

Engineering takeaways
  • Each row of the heatmap is one query token's attention distribution; rows always sum to 1 after softmax.
  • The lower-triangular shape isn't decorative — it's the causal mask, the only thing keeping training honest and the reason all positions can be trained in parallel.
  • The scaling by sqrt(d_head) is what lets attention scale to wide heads without softmax collapsing to one cell.
Try this

Hit Run on the default 'The cat sat on the' prompt, then in the heatmap pick the bottom row (the last token before the prediction). Which earlier token gets the most attention? What does that tell you about the head you are looking at?

Quick check
1. What do the values in a single row of the attention heatmap sum to?
2. Why is the score matrix divided by sqrt(d_head) before softmax?
3. Why is the attention heatmap lower-triangular instead of a full square?
