Read a transformer layer end-to-end: pre-norm, attention, residual, pre-norm, MLP, residual, repeat.

Each sub-block was introduced in isolation; you still need to see how they wire together and how the stack composes.

8 min

Forward pass: Tokenize → Embedding lookup → × 24 layers → Final RMSNorm → LM head → Sampling

Glossary · 6 terms

decoder layer: One pre-norm block: x_attn = x + Attention(RMSNorm(x)); x_out = x_attn + MLP(RMSNorm(x_attn)). Qwen3.5-0.8B stacks 24 of these.
residual stream: The un-normalized hidden state running the full depth of the tower. Every block reads from it and writes a delta back.
sub-block: Either attention or the MLP. Each layer contains exactly two sub-blocks, each wrapped in its own norm + residual.
hybrid attention: Qwen3.5's recipe: every fourth layer uses softmax attention, the rest use a linear variant (GatedDeltaNet).
linear attention: An approximate attention that runs in fixed memory per token. Trades some exact-recall for very cheap long-context decode.
capture point: A named place inside a layer (pre_attn_input, mlp_output, ...) where the inspector reads activation stats out.

The full transformer block

The last five chapters introduced individual ingredients — attention, multi-head & GQA, RoPE, RMSNorm, the gated MLP — as if each lived in isolation. They do not. Every one of Qwen3.5-0.8B's 24 decoder layers stitches them together in the same fixed pattern, then the model stacks that pattern 24 times. This chapter zooms out: one layer as a whole, and a stack of them as a tower.

Pre-norm: norm, sub-block, residual

Qwen3.5 follows the modern pre-norm recipe. Inside each layer there are two sub-blocks (attention and MLP), and each is wrapped the same way: normalize the input, run the sub-block on the normalized copy, add the result back into the residual stream.

x_{attn} = x + \mathrm{Attention}(\mathrm{RMSNorm}(x))
x_{out} = x_{attn} + \mathrm{MLP}(\mathrm{RMSNorm}(x_{attn}))

Two RMSNorms, two residual adds, one attention, one MLP — that is the full layer. The 3D widget below shows it as a tall, narrow tower: the residual stream runs from bottom to top, and each "ring" you can see in the tower is one of those 24 layers.
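The two equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not Qwen3.5's implementation: `toy_attention` and `toy_mlp` are hypothetical stand-ins that only keep the shapes right, the RMSNorm omits its learned gain, and the vector is a single token's hidden state.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale x so its root-mean-square is about 1 (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def add(a, b):
    """Element-wise residual add."""
    return [u + v for u, v in zip(a, b)]

def toy_attention(x):
    # Hypothetical stand-in: real attention mixes information across tokens.
    return [0.1 * v for v in x]

def toy_mlp(x):
    # Hypothetical stand-in: the real MLP is a gated SwiGLU feed-forward.
    return [0.05 * v * v for v in x]

def decoder_layer(x, attention=toy_attention, mlp=toy_mlp):
    """One pre-norm layer: two (norm -> sub-block -> residual add) pairs."""
    x_attn = add(x, attention(rms_norm(x)))     # x_attn = x + Attention(RMSNorm(x))
    x_out = add(x_attn, mlp(rms_norm(x_attn)))  # x_out = x_attn + MLP(RMSNorm(x_attn))
    return x_out

def decoder_stack(x, n_layers=24):
    """Apply the same layer body n_layers times."""
    for _ in range(n_layers):
        x = decoder_layer(x)
    return x

print(decoder_layer([1.2, -0.4, 0.8, 0.3]))
```

Note that each sub-block only ever sees the normalized copy, while the add always lands on the un-normalized `x`. If both sub-blocks returned zeros, the layer would be an exact identity.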

What each piece contributes

Attention (chapter 4) — the only place in the layer where information moves across tokens. Without it, every position would evolve independently.
Multi-head & GQA (chapter 5) — attention is run in parallel by many heads sharing a smaller pool of K/V projections, so the model can look at several patterns at once without inflating the KV cache (a KV cache stores each token's keys and values so they aren't recomputed every step — its own chapter comes later).
RoPE (chapter 6) — the rotation applied to Q and K inside attention. It's how the model knows token 5 is later than token 3 without a separate position embedding.
RMSNorm (chapter 7) — applied before each sub-block. It rescales the input to an RMS of about 1 before the learned gain, which puts its L2 norm near sqrt(hidden_dim) ≈ 32, so the attention scores and MLP activations don't explode.
MLP (chapter 8) — a per-token feed-forward with the SwiGLU gated nonlinearity. It holds one of the largest blocks of the model's parameters, and is where per-token feature computation happens.
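To make the RMSNorm bullet concrete, here is a small check that normalization lands on the same magnitude no matter how large or small the input is. It is a sketch under assumptions: `hidden_dim = 1024` is inferred from the "sqrt(hidden_dim) ≈ 32" figure above, the inputs are random numbers rather than real activations, and the learned gain is left out.

```python
import math
import random

def rms_norm(x, eps=1e-6):
    """Divide by the root-mean-square of x (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def l2(x):
    """Vector length: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in x))

hidden_dim = 1024  # assumption: inferred from sqrt(hidden_dim) ≈ 32
random.seed(0)
small = [random.gauss(0.0, 0.1) for _ in range(hidden_dim)]   # tiny input
large = [random.gauss(0.0, 50.0) for _ in range(hidden_dim)]  # huge input

# The raw L2 norms differ by orders of magnitude; the normalized ones both sit near 32.
print(l2(small), l2(rms_norm(small)))
print(l2(large), l2(rms_norm(large)))
```

An RMS of 1 across 1024 components means the sum of squares is about 1024, and the square root of that is 32.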

Anatomy of one layer · 24× stacked

24 layers, identical body

Each of Qwen3.5's 24 layers has the same six-step body. The 3D tower to the right gives you the stack view; this strip gives you the anatomy view, with layers 1–22 collapsed because they all look exactly like layer 0.

Layer 0 (input): x_in → RMSNorm → Attention (multi-head + RoPE) → + → RMSNorm → MLP (gated · SwiGLU) → + → x_out, with a skip path around each norm + sub-block pair. Generic anatomy placeholder: Qwen3.5's actual layer 0 is GatedDeltaNet, not full attention.

Layers 1 … 22: structure same as above · 22 more.

Layer 23 (final, feeds into Final RMSNorm + LM head): x_in → RMSNorm → Attention (multi-head + RoPE) → + → RMSNorm → MLP (gated · SwiGLU) → + → x_out, with the same skip paths.

The two + circles are the two residual adds — that's the highway the rest of the course keeps pointing at. The dashed arc is the skip path: the input jumps over the norm + sub-block untouched and meets the sub-block's output at the + — two paths merging, not one more step in a chain. Note that linear-attention layers (most of Qwen3.5's stack) swap the green attention box for GatedDeltaNet; the rest of the body is the same.

The residual stream

The single most important and least-named concept in a decoder LLM is the residual stream: a per-token vector that enters the layer stack from the embedding, gets written into by every sub-block, and is read at the top by the LM head — the final projection that turns the vector into next-token scores (next chapter). Nothing inside the layer replaces it — both attention and MLP results are added back. Every operation is literally h := h + Δ.

Concretely, with a toy 4-slot stream: say h = [1.2, −0.4, 0.8, 0.3] and the attention sub-block computes Δ = [0.1, 0.3, −0.1, 0.0]. The new stream is h + Δ = [1.3, −0.1, 0.7, 0.3]. The write only nudges a few slots — the last one passes through untouched, and the rest keep most of their old value.
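The same toy write, as code. The helper name `residual_write` is made up for this illustration, and the rounding is only there to hide floating-point noise in the printout.

```python
def residual_write(h, delta):
    """Add a sub-block's output into the stream: h := h + delta."""
    return [round(a + b, 6) for a, b in zip(h, delta)]

h = [1.2, -0.4, 0.8, 0.3]       # toy 4-slot residual stream
delta = [0.1, 0.3, -0.1, 0.0]   # what the attention sub-block computed

print(residual_write(h, delta))  # [1.3, -0.1, 0.7, 0.3]
```

The old `h` is never discarded. The sub-block has no way to set a slot to a new value outright; it can only add to what is already there.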

The animation below shows it concretely: a single hidden vector, two layers' worth of writes (attention and MLP, twice), and the LM head at the top reading the accumulated total. The bar grows because the animation draws every write as an add on top of what is already there; nothing is ever overwritten.

The residual stream — one vector, written into 2× per layer

The hidden state for one token, drawn as a column that grows as each sub-block writes a correction into it. Attention writes from the left, MLP writes from the right. Nothing ever overwrites — every operation is h := h + Δ. Two layers shown; Qwen3.5-0.8B does this 24 times.


Three things to notice. First, the stream never gets narrower: its width is fixed, and no operation in the layer overwrites or replaces it (a write can carry negative components, as in the toy example above, but it is always added). Second, the same column is read and written by every sub-block; it's a single running address space the whole network shares. Third, the LM head at the top reads the top of the stream, so every layer's contribution is visible to the final prediction, not just the last one's.

Illustrative — heights are a synthetic stand-in for the residual-stream magnitude; Qwen3.5-0.8B does this 24 times. Not live output from the model.

Two consequences fall out.

First, gradients: backprop can flow straight through the identity path even in a 24-layer stack, which is a large part of why deep transformers train at all. Without the residual, the gradient (the training signal that adjusts the weights, covered in the Training chapter) is multiplied by each layer's transformation (its Jacobian; don't worry about that term here) on the way back, and across a stack this deep the product tends to vanish or explode. The residual's identity path is what lets it survive.

Second, magnitudes: the residual stream tends to grow in L2 (the vector's length: the square root of the sum of its squared components) with depth, since each layer adds a non-zero contribution, while every individual sub-block's output stays small. The 3D widget on the right colors each ring by the per-token L2 of the layer output. You can see the magnitude generally climb as you scan up the tower (the trend is upward, though any single layer can dip), roughly mirroring the bar growing in the animation.
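The gradient argument can be made concrete with a deliberately crude scalar caricature. Assume every layer's sub-block has the same small local derivative `j` (a made-up value, not something measured from the model). A plain chain multiplies by `j` at each layer; a residual layer `x + f(x)` multiplies by `1 + j`, because the identity path contributes the 1.

```python
def chain_gradient(local_derivs, residual):
    """Multiply per-layer derivatives, back to front, for a scalar toy network."""
    grad = 1.0
    for j in local_derivs:
        # d/dx [x + f(x)] = 1 + f'(x)   versus   d/dx [f(x)] = f'(x)
        grad *= (1.0 + j) if residual else j
    return grad

derivs = [0.01] * 24  # hypothetical: each sub-block has a small local derivative

print(chain_gradient(derivs, residual=False))  # vanishingly small
print(chain_gradient(derivs, residual=True))   # stays of order 1
```

Real Jacobians are matrices and differ from layer to layer, so treat this only as intuition for why the identity term keeps the product from collapsing.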

Mental model: think of the residual stream as a piece of working memory that the network shares across all 24 layers. Each layer doesn't transform the representation; it contributes to a shared running sum. The "thinking" is the accumulation.
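One way to see the "running sum" picture: if every write is an add, the vector at the top of the tower equals the embedding plus the sum of all writes. The sketch below checks that identity with synthetic numbers. The deltas are random placeholders drawn up front, whereas in a real model each write depends on the stream it just read; the small `DIM` is chosen only for readability.

```python
import math
import random

def run_stream(h0, deltas):
    """Apply each write in order: h := h + delta."""
    h = list(h0)
    for delta in deltas:
        h = [a + b for a, b in zip(h, delta)]
    return h

random.seed(0)
DIM, N_LAYERS = 8, 24
h0 = [random.gauss(0.0, 1.0) for _ in range(DIM)]
# Two writes per layer (attention, then MLP): 48 synthetic deltas in total.
deltas = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(2 * N_LAYERS)]

h_top = run_stream(h0, deltas)
total_write = [sum(d[i] for d in deltas) for i in range(DIM)]
shortcut = [a + b for a, b in zip(h0, total_write)]

# The layer-by-layer result matches "embedding + sum of every write".
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(h_top, shortcut)))  # True
```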

Reading the tower as a climb in meaning

The 3D widget colors each ring by raw magnitude (per-token L2), and that is genuinely all the inspector measures. But there is a softer, harder-to-pin-down story layered on top of the same tower: what the residual stream is carrying tends to shift as you climb. Probing studies of standard transformers suggest a rough trend — early layers lean toward local, surface-level structure (which token sits next to which, basic grammar), while later layers lean toward more abstract, sentence-level meaning. Treat that as loose intuition, not a measurement: it is a statistical tendency people have read out of trained networks, not a label stamped on any specific layer. And it is even fuzzier here, because 18 of our 24 layers are GatedDeltaNet rather than the softmax-attention stacks that most of that literature studied — so the early→grammar, late→abstract picture is at best a hand-wave for this particular model, not a per-layer guarantee.

You already saw the climb in action, one chapter back. The polysemy demo from the attention chapter fed three prompts that all end in the byte-identical token ·bank — finance, aviation, riverside — so the embedding lookup hands the stack the same starting vector at that final position, yet the model's next-token predictions split apart completely. That is the same divergence, now read as a climb up the tower: the three copies enter at the bottom identical, and each layer's write nudges them a little further apart as context accumulates, until by the top they point at ·account, ·angle, and a plain sentence end. Nothing new is claimed here — it is the depth angle on a divergence you have already seen.

Hybrid attention: not every layer is "full"

Qwen3.5 is a hybrid stack. Most layers use a fast linear-attention variant called GatedDeltaNet; only every fourth layer (indices 3, 7, 11, 15, 19, 23, 0-indexed, on the 0.8B model) uses the classic softmax(QKᵀ/√d) formulation from chapter 4. Linear-attention layers do almost all the across-token mixing under the tight memory budget of long contexts; full-attention layers appear at fixed intervals to do the heavy modelling that linear attention can't — e.g. looking back at one exact earlier token; chapter 12 shows a linear layer failing at exactly this. The 3D widget highlights full-attention layers in a warmer color so you can see the interleave at a glance. Chapter 12 will dig into how this hybrid recipe works and why it's the new default for high-throughput open LLMs.
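The interleave is easy to reproduce. The rule below ("every fourth layer, counting from 1") is an assumption chosen because it yields exactly the indices listed above; the real model configuration may store the layout differently, for example as an explicit per-layer list.

```python
N_LAYERS = 24
FULL_ATTENTION_EVERY = 4  # assumption: matches the indices quoted in the text

def layer_kind(i):
    """Return 'full' for softmax-attention layers, 'linear' for GatedDeltaNet (0-indexed)."""
    return "full" if (i + 1) % FULL_ATTENTION_EVERY == 0 else "linear"

full = [i for i in range(N_LAYERS) if layer_kind(i) == "full"]
linear = [i for i in range(N_LAYERS) if layer_kind(i) == "linear"]

print(full)                    # [3, 7, 11, 15, 19, 23]
print(len(linear), len(full))  # 18 6
```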

Per-layer memory · fixed state vs growing cache

head_dim 256 · 2 KV heads · bf16

Sequence length (tokens): 16.4k (16,384). Slider stops: 256, 1.0k, 4.1k, 16.4k, 65.5k, 262.1k.

GatedDeltaNet layer, fixed-size state: 512.0 KB. Recurrent state, independent of sequence length (does not move as you drag the slider).

Full-attention layer, KV cache: 32.0 MB. 16,384 × 2048 bytes/token; grows linearly with context.

Bars use a log scale (512 KB → 512 MB) so the fixed state and the growing cache both fit; the byte counts above each bar are exact and grow linearly with tokens. At 256 tokens the two are equal.

At 16.4k tokens — full-attn KV: 32.0 MB (grows with context) vs GDN state: 512.0 KB (constant). The full-attention layer needs 64.0× the memory of one GatedDeltaNet layer.

Only 6 of 24 layers pay this growing-cache cost; the other 18 are GatedDeltaNet with constant memory. That 3:1 split (18 linear : 6 full) is why a long context stays affordable. Chapter 12 (KV cache & hybrid attention) digs into how this hybrid recipe keeps total memory in check.

Illustrative — bytes are computed from Qwen3.5-0.8B's exact dimensions (head_dim 256, 2 KV heads) and the linear layer's fixed state size; not live output from the model.
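The widget's numbers can be reproduced by hand. The sketch below uses only the figures quoted above (head_dim 256, 2 KV heads, bf16, a 512 KB fixed state) and reads KB and MB as 1024-based units, which is what makes 16,384 × 2048 bytes come out to exactly 32.0 MB. The whole-stack total assumes both figures are per layer, as the widget title suggests.

```python
HEAD_DIM = 256
N_KV_HEADS = 2
BYTES_PER_VALUE = 2            # bf16
GDN_STATE_BYTES = 512 * 1024   # fixed state size quoted by the widget
N_FULL, N_LINEAR = 6, 18       # layer split of the 24-layer stack

def kv_cache_bytes(tokens):
    """Per-layer KV cache: keys and values for every KV head, per token."""
    return tokens * 2 * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

def stack_bytes(tokens):
    """Whole-stack total, assuming both figures above are per layer."""
    return N_FULL * kv_cache_bytes(tokens) + N_LINEAR * GDN_STATE_BYTES

for tokens in (256, 16_384, 262_144):
    kv = kv_cache_bytes(tokens)
    print(tokens, kv / 2**20, "MB per full layer,", kv / GDN_STATE_BYTES, "x the GDN state")

print(stack_bytes(16_384) / 2**20, "MB across all 24 layers at 16,384 tokens")
```

At 16,384 tokens the six full-attention layers account for 192 MB of the 201 MB total, and the eighteen GatedDeltaNet layers for only 9 MB.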

What the widget shows

Press Run to capture all seven hidden-state stats per layer for a single forward pass. The 3D tower renders 24 stacked tiles — one per layer — colored by per-token L2 of the layer output. Drag to rotate, scroll to zoom, click a layer to inspect its stats in the side panel. The mini bar chart inside the panel walks the seven capture points within the selected layer so you can trace the signal from the layer's input through the two sub-blocks and out the top.

Engineering takeaways

Every layer is the same shape — two pre-norm + sub-block + residual pairs — and that shape stacks 24 times.
Residual stream magnitude grows with depth even though each individual sub-block's output stays small.
Qwen3.5 isn't a pure softmax-attention stack: only 6 of its 24 layers use softmax attention; the rest run linear attention.

Try this

After the auto-run, drag the 3D tower to face you, then click layer 3 and layer 4. How does the 'Layer 3 - Full attention' label change vs 'Layer 4 - Linear attention', and how do the warm/cool ring colors compare across the tower?

Quick check

1. In a Qwen3.5 layer (pre-norm), what is the actual order of operations?

RMSNorm -> Attention -> add to residual; RMSNorm -> MLP -> add to residual.
Attention -> RMSNorm -> MLP -> RMSNorm -> residual at the end.
Attention -> MLP -> single RMSNorm at the end.

2. How many of Qwen3.5-0.8B's 24 layers use the softmax attention you read about in chapter 4?

All 24.
6 — one in every four.
12 — every other layer.

3. Why does the per-token L2 of the residual stream grow as you scan up the tower?

Each layer adds its sub-block outputs into the residual stream; magnitudes accumulate even though each individual write is small.
Later layers use larger learned gains in RMSNorm.
The model upcasts to higher precision at deeper layers.
