Read a transformer layer end-to-end: pre-norm, attention, residual, pre-norm, MLP, residual, repeat.
Each sub-block was introduced in isolation; you still need to see how they wire together and how the stack composes.
Glossary · 6 terms
- decoder layer
- One pre-norm block: x_attn = x + Attention(RMSNorm(x)); x_out = x_attn + MLP(RMSNorm(x_attn)). Qwen3.5-0.8B stacks 24 of these.
- residual stream
- The un-normalized hidden state running the full depth of the tower. Every block reads from it and writes a delta back.
- sub-block
- Either attention or the MLP. Each layer contains exactly two sub-blocks, each wrapped in its own norm + residual.
- hybrid attention
- Qwen3.5's recipe: every fourth layer uses softmax attention, the rest use a linear variant (GatedDeltaNet).
- linear attention
- An approximate attention that runs in fixed memory per token. Trades some exact-recall for very cheap long-context decode.
- capture point
- A named place inside a layer (pre_attn_input, mlp_output, ...) where the inspector reads activation stats out.
The full transformer block
The last five chapters introduced individual ingredients — attention, multi-head & GQA, RoPE, RMSNorm, the gated MLP — as if each lived in isolation. They do not. Every one of Qwen3.5-0.8B's 24 decoder layers stitches them together in the same fixed pattern, then the model stacks that pattern 24 times. This chapter zooms out: one layer as a whole, and a stack of them as a tower.
Pre-norm: norm, sub-block, residual
Qwen3.5 follows the modern pre-norm recipe. Inside each layer there are two sub-blocks (attention and MLP), and each is wrapped the same way: normalize the input, run the sub-block on the normalized copy, add the result back into the residual stream.
Two RMSNorms, two residual adds, one attention, one MLP — that is the full layer. The 3D widget below shows it as a tall, narrow tower: the residual stream runs from bottom to top, and each "ring" you can see in the tower is one of those 24 layers.
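The wiring above can be sketched in a few lines of plain Python. This is a minimal illustration of the pre-norm pattern only, not Qwen3.5's implementation: `toy_attention` and `toy_mlp` are hypothetical stand-ins with no learned weights, `rms_norm` assumes unit gain, and the 4-wide vectors are made up for the example.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm with unit gain (assumption): divide by the root-mean-square of x."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def add(a, b):
    """Element-wise residual add."""
    return [p + q for p, q in zip(a, b)]

def toy_attention(seq):
    """Hypothetical stand-in for attention: each position averages itself
    and all earlier positions (causal), so information moves across tokens."""
    out = []
    for t in range(len(seq)):
        window = seq[: t + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def toy_mlp(x):
    """Hypothetical stand-in for the gated MLP: a fixed per-token nonlinearity."""
    return [0.1 * math.tanh(v) for v in x]

def decoder_layer(seq):
    # Sub-block 1: x_attn = x + Attention(RMSNorm(x))
    attn_out = toy_attention([rms_norm(x) for x in seq])
    x_attn = [add(x, d) for x, d in zip(seq, attn_out)]
    # Sub-block 2: x_out = x_attn + MLP(RMSNorm(x_attn))
    return [add(x, toy_mlp(rms_norm(x))) for x in x_attn]

def run_stack(seq, num_layers=24):
    """Apply the same layer shape num_layers times."""
    for _ in range(num_layers):
        seq = decoder_layer(seq)
    return seq

tokens = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.4, 0.1, -0.3], [0.0, -1.0, 0.5, 0.5]]
out = run_stack(tokens)
print(len(out), len(out[0]))  # shape is preserved: 3 tokens, 4 slots each
```

Note that the normalized copy is only ever fed to the sub-block; the value added back to is always the un-normalized stream.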
What each piece contributes
- Attention (chapter 4) — the only place in the layer where information moves across tokens. Without it, every position would evolve independently.
- Multi-head & GQA (chapter 5) — attention is run in parallel by many heads sharing a smaller pool of K/V projections, so the model can look at several patterns at once without inflating the KV cache (a KV cache stores each token's keys and values so they aren't recomputed every step — its own chapter comes later).
- RoPE (chapter 6) — the rotation applied to Q and K inside attention. It's how the model knows token 5 is later than token 3 without a separate position embedding.
- RMSNorm (chapter 7) — applied before each sub-block. It collapses the magnitude of the input to roughly sqrt(hidden_dim) ≈ 32 so the attention scores and MLP activations don't explode.
- MLP (chapter 8) — a per-token feed-forward with the SwiGLU gated nonlinearity. It holds one of the largest blocks of the model's parameters, and is where per-token feature computation happens.
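The sqrt(hidden_dim) figure in the RMSNorm bullet can be checked numerically. The sketch below assumes a hidden size of 1024 (the value implied by the ≈ 32 figure) and unit gain weights; a trained model's learned gains would shift the result somewhat.

```python
import math
import random

def rms_norm(x, eps=1e-6):
    """RMSNorm with unit gain (assumption): divide by the root-mean-square of x."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def l2(x):
    """Vector length: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in x))

random.seed(0)
hidden_dim = 1024  # assumed; sqrt(1024) = 32

results = []
for scale in (0.1, 1.0, 100.0):
    # Inputs of very different magnitude...
    x = [random.gauss(0.0, scale) for _ in range(hidden_dim)]
    # ...all come out of the norm with length close to sqrt(hidden_dim).
    results.append((scale, l2(x), l2(rms_norm(x))))

for scale, before, after in results:
    print(f"scale={scale:>6}: L2 before={before:10.2f}  after={after:6.2f}")
```

Whatever the input scale, the normalized length lands near 32, which is why the sub-block always sees inputs of a predictable size.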
Each of Qwen3.5's 24 layers has the same six-step body. The 3D tower to the right gives you the stack view; this strip gives you the anatomy view, with layers 1–22 collapsed because they all look exactly like layer 0.
The residual stream
The single most important and least-named concept in a decoder LLM is the residual stream: a per-token vector that enters the layer stack from the embedding, gets written into by every sub-block, and is read at the top by the LM head — the final projection that turns the vector into next-token scores (next chapter). Nothing inside the layer replaces it — both attention and MLP results are added back. Every operation is literally h := h + Δ.
Concretely, with a toy 4-slot stream: say h = [1.2, −0.4, 0.8, 0.3] and the attention sub-block computes Δ = [0.1, 0.3, −0.1, 0.0]. The new stream is h + Δ = [1.3, −0.1, 0.7, 0.3]. The write only nudges a few slots — the last one passes through untouched, and the rest keep most of their old value.
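The same toy write, spelled out as code. The numbers are the ones from the paragraph above; the rounding is only there to hide floating-point noise.

```python
import math

h = [1.2, -0.4, 0.8, 0.3]       # residual stream before the write
delta = [0.1, 0.3, -0.1, 0.0]   # what the attention sub-block computed

# The write is an element-wise add: h := h + delta
h_new = [round(a + b, 6) for a, b in zip(h, delta)]
print(h_new)  # [1.3, -0.1, 0.7, 0.3]

# Slots whose delta is zero pass through unchanged
untouched = [i for i, d in enumerate(delta) if d == 0.0]
print(untouched)  # [3]

def l2(x):
    """Vector length: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in x))

print(round(l2(h), 3), round(l2(h_new), 3))
```

In this particular toy case the vector's length dips slightly (from about 1.53 to about 1.51), a reminder that an add does not by itself guarantee growth; the upward trend shows up over many writes.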
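The animation below shows the same thing over a longer run: a single hidden vector, two layers' worth of writes (attention and MLP, twice), and the LM head at the top reading the accumulated total. The bar grows with every write because the animation draws each one as a pure addition; in a real model the magnitude trends upward the same way, although a single write can leave it flat or slightly lower.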
The hidden state for one token, drawn as a column that grows as each sub-block writes a correction into it. Attention writes from the left, MLP writes from the right. Nothing ever overwrites — every operation is h := h + Δ. Two layers shown; Qwen3.5-0.8B does this 24 times.
Three things to notice. First, nothing is ever overwritten: no operation in the layer replaces the stream, and even a write with negative components arrives as an add. Second, the same column is read and written by every sub-block; it's a single running address space the whole network shares. Third, the LM head reads the stream at the top of the stack, so every layer's contribution is visible to the final prediction, not just the last one's.
Illustrative — heights are a synthetic stand-in for the residual-stream magnitude; Qwen3.5-0.8B does this 24 times. Not live output from the model.
Two consequences fall out. First, gradients: backprop can flow straight through the identity path even in a 24-layer stack, which is a large part of why deep transformers train at all. Without the residual, the gradient (the training signal that adjusts the weights, covered in the Training chapter) is multiplied by each layer's transformation (its Jacobian; don't worry about that term here) on the way back, and across a stack this deep it tends to vanish or explode: it shrinks to almost nothing or blows up. The residual's identity path is what lets it survive. Second, magnitudes: the residual stream tends to grow in L2 (the vector's length: the square root of the sum of its squared components) with depth, since each layer adds a non-zero contribution, while every individual sub-block's output stays small. The 3D widget on the right colors each ring by the per-token L2 of the layer output. You can see the magnitude generally climb as you scan up the tower (the trend is upward, though any single layer can dip), roughly mirroring the bar growing in the animation.
Mental model: think of the residual stream as a piece of working memory that the network shares across all 24 layers. Each layer doesn't transform the representation; it contributes to a shared running sum. The "thinking" is the accumulation.
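A synthetic sketch of the running-sum picture, under loud assumptions: the deltas here are random Gaussian vectors standing in for sub-block outputs, the 64-wide stream and the 0.3 delta scale are arbitrary, and real writes are neither random nor independent. It only illustrates how many small adds can accumulate into a larger total.

```python
import math
import random

def l2(x):
    """Vector length: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in x))

random.seed(0)
dim, num_layers = 64, 24  # dim is arbitrary; 24 matches the layer count

# Start from a stand-in for the embedding of one token.
h = [random.gauss(0.0, 1.0) for _ in range(dim)]
norms = [l2(h)]

for layer in range(num_layers):
    for _sub_block in ("attention", "mlp"):
        # Hypothetical small write; each one is added, never assigned.
        delta = [random.gauss(0.0, 0.3) for _ in range(dim)]
        h = [a + b for a, b in zip(h, delta)]
    norms.append(l2(h))

for layer in (0, 6, 12, 18, 24):
    print(f"after {layer:2d} layers: L2 = {norms[layer]:.2f}")
```

Each individual delta is small next to the stream, yet the accumulated length after 24 layers is clearly larger than at the start, with occasional dips along the way.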
Reading the tower as a climb in meaning
The 3D widget colors each ring by raw magnitude (per-token L2), and that is genuinely all the inspector measures. But there is a softer, harder-to-pin-down story layered on top of the same tower: what the residual stream is carrying tends to shift as you climb. Probing studies of standard transformers suggest a rough trend — early layers lean toward local, surface-level structure (which token sits next to which, basic grammar), while later layers lean toward more abstract, sentence-level meaning. Treat that as loose intuition, not a measurement: it is a statistical tendency people have read out of trained networks, not a label stamped on any specific layer. And it is even fuzzier here, because 18 of our 24 layers are GatedDeltaNet rather than the softmax-attention stacks that most of that literature studied — so the early→grammar, late→abstract picture is at best a hand-wave for this particular model, not a per-layer guarantee.
You already saw the climb in action in the attention chapter. The polysemy demo there fed three prompts that all end in the byte-identical token ·bank (finance, aviation, riverside), so the embedding lookup hands the stack the same starting vector at that final position, yet the model's next-token predictions split apart completely. That is the same divergence, now read as a climb up the tower: the three copies enter at the bottom identical, and each layer's write nudges them a little further apart as context accumulates, until by the top they point at ·account, ·angle, and a plain sentence end. Nothing new is claimed here; it is the depth angle on a divergence you have already seen.
Hybrid attention: not every layer is "full"
Qwen3.5 is a hybrid stack. Most layers use a fast linear-attention variant called GatedDeltaNet; only every fourth layer (indices 3, 7, 11, 15, 19, 23, 0-indexed, on the 0.8B model) uses the classic softmax(QKᵀ/√d) formulation from chapter 4. Linear-attention layers do almost all the across-token mixing under the tight memory budget of long contexts; full-attention layers appear at fixed intervals to do the heavy modelling that linear attention can't — e.g. looking back at one exact earlier token; chapter 12 shows a linear layer failing at exactly this. The 3D widget highlights full-attention layers in a warmer color so you can see the interleave at a glance. Chapter 12 will dig into how this hybrid recipe works and why it's the new default for high-throughput open LLMs.
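The interleave described above reduces to a one-line rule. The sketch below derives the schedule from the "every fourth layer" description; the names `layer_kind` and `FULL_ATTENTION_EVERY` are made up for illustration and are not taken from any model config.

```python
NUM_LAYERS = 24
FULL_ATTENTION_EVERY = 4  # hypothetical name for the "every fourth layer" rule

def layer_kind(index):
    """Return 'full' for softmax-attention layers, 'linear' for GatedDeltaNet (0-indexed)."""
    return "full" if (index + 1) % FULL_ATTENTION_EVERY == 0 else "linear"

schedule = [layer_kind(i) for i in range(NUM_LAYERS)]
full_layers = [i for i, kind in enumerate(schedule) if kind == "full"]

print(full_layers)                                       # [3, 7, 11, 15, 19, 23]
print(schedule.count("linear"), schedule.count("full"))  # 18 6
```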
Bars use a log scale (512 KB → 512 MB) so the fixed state and the growing cache both fit; the byte counts above each bar are exact, and the cache's count grows linearly with tokens while the fixed state's does not change. At 256 tokens the two are equal.
Only 6 of 24 layers pay this growing-cache cost; the other 18 are GatedDeltaNet with constant memory. That 3:1 split (18 linear : 6 full) is why a long context stays affordable. Chapter 12 (KV cache & hybrid attention) digs into how this hybrid recipe keeps total memory in check.
Illustrative — bytes are computed from Qwen3.5-0.8B's exact dimensions (head_dim 256, 2 KV heads) and the linear layer's fixed state size; not live output from the model.
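The per-layer arithmetic behind those bars can be reproduced as a rough sketch. Two values below are assumptions rather than facts from the text: 2 bytes per stored value (a 16-bit format), and a 512 KB fixed state for a linear layer, inferred from the caption's statement that the two bars are equal at 256 tokens. The head_dim and KV-head counts are the ones quoted above.

```python
HEAD_DIM = 256
KV_HEADS = 2
BYTES_PER_VALUE = 2              # assumption: 16-bit storage
FIXED_STATE_BYTES = 512 * 1024   # assumption: inferred from the 256-token crossover
FULL_LAYERS, LINEAR_LAYERS = 6, 18

def kv_cache_bytes(tokens):
    """KV cache for ONE full-attention layer: keys and values for every token."""
    return 2 * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * tokens

def total_bytes(tokens):
    """All 24 layers: 6 growing caches plus 18 constant-size states."""
    return FULL_LAYERS * kv_cache_bytes(tokens) + LINEAR_LAYERS * FIXED_STATE_BYTES

for tokens in (256, 4096, 262144):
    per_layer_kb = kv_cache_bytes(tokens) / 1024
    total_mb = total_bytes(tokens) / (1024 * 1024)
    print(f"{tokens:>7} tokens: {per_layer_kb:>9.0f} KB per full layer, {total_mb:>8.1f} MB total")
```

Under these assumptions one full-attention layer costs 2048 bytes per token, so it matches the fixed state at 256 tokens and reaches 512 MB at 262,144 tokens, the two ends of the caption's scale.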
What the widget shows
Press Run to capture all seven hidden-state stats per layer for a single forward pass. The 3D tower renders 24 stacked tiles — one per layer — colored by per-token L2 of the layer output. Drag to rotate, scroll to zoom, click a layer to inspect its stats in the side panel. The mini bar chart inside the panel walks the seven capture points within the selected layer so you can trace the signal from the layer's input through the two sub-blocks and out the top.
- Every layer is the same shape — two pre-norm + sub-block + residual pairs — and that shape stacks 24 times.
- Residual stream magnitude grows with depth even though each individual sub-block's output stays small.
- Qwen3.5 isn't a pure softmax-attention transformer: only 6 of its 24 layers use softmax attention; the rest run linear attention.
After the auto-run, drag the 3D tower to face you, then click layer 3 and layer 4. How does the 'Layer 3 - Full attention' label change vs 'Layer 4 - Linear attention', and how do the warm/cool ring colors compare across the tower?