Chapter 8 · MLP block

Describe how a SwiGLU gated MLP transforms a single token's hidden state and how it relates to the residual stream.

Attention mixes information across tokens but cannot do per-token feature computation; the MLP is where that work lives.

8 min
forward pass: Tokenize → Embedding lookup → × 24 layers → Final RMSNorm → LM head → Sampling
Glossary · 6 terms
MLP block
The per-token feed-forward sub-block inside a transformer layer. Runs the same little network on every position.
SwiGLU
Gated MLP variant: silu(gate_proj(x)) ⊙ up_proj(x), then projected back. The ⊙ is element-wise multiplication.
SiLU
Activation z * sigmoid(z). Smooth, near-linear for large positive z, suppresses negatives — acts as a soft gate.
intermediate_dim
The widened scratch space inside the MLP. Usually around 3-4x hidden_dim. 3584 for Qwen3.5-0.8B (vs 1024 hidden).
residual stream
The un-normalized hidden state running the full depth of the model that each block reads from and adds back into.
residual connection
The x_in + part of x_out = x_in + sub_block(norm(x_in)). Lets gradients skip the sub-block and stay healthy.

The MLP block: per-token feed-forward

First, the name. MLP stands for multi-layer perceptron — the plainest neural network there is: a matrix-multiply, then a squash function (the "activation"), then another matrix-multiply. "Feed-forward" is just another name for the same thing — numbers flow straight through, no loops.

Attention is the part of a transformer that mixes information across tokens: every position can read from every other position via softmax(QKᵀ)·V. The MLP block is the opposite — it touches every token in isolation, running the same little neural network on each position's hidden vector and writing the result back to that same position. Across-token mixing happens in attention; per-token computation happens in the MLP. A transformer layer alternates between the two.
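A minimal sketch of what "touches every token in isolation" means. The function `toy_mlp` is a hypothetical stand-in (not Qwen3.5's real MLP); the point is only that the same function is applied to each position separately, so changing one token cannot affect another position's output.

```python
# Toy illustration of per-token computation.
# `toy_mlp` is a hypothetical stand-in, not the real SwiGLU block.

def toy_mlp(vec):
    """Same function for every position; it sees only one token's vector."""
    return [max(0.0, 2.0 * v - 1.0) for v in vec]

def apply_per_token(seq):
    # No position reads from any other position.
    return [toy_mlp(vec) for vec in seq]

seq = [[0.5, 1.0], [2.0, -1.0], [0.0, 3.0]]
out = apply_per_token(seq)

# Change token 0 only, then recompute.
seq_changed = [[9.0, 9.0]] + seq[1:]
out_changed = apply_per_token(seq_changed)

# Positions 1 and 2 are untouched by the change at position 0.
unchanged_elsewhere = out_changed[1:] == out[1:]
print(unchanged_elsewhere)
```

An attention layer would fail this check by design, since every position there can read from other positions.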

Pre-MLP norm, then a gated MLP, then a residual add

Like its attention sibling, the MLP sub-block in Qwen3.5 is wrapped in a pre-norm + residual pattern:

  x_out = x_in + mlp(norm(x_in))

The residual connection is the x_in + part. It exists for two reasons. First, gradients: backprop can flow straight through the identity path, so even a 24-layer stack stays trainable. Second, semantics: think of the residual stream as a highway that runs the full depth of the model (you'll see the residual stream drawn in the next chapter, the full transformer block). Each block reads from the highway, computes a small contribution, and adds it back. The model's prediction is the accumulation of every block's contribution — not the output of the last block alone.
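The pre-norm + residual pattern can be sketched in a few lines. This assumes the norm is an RMSNorm with its learned scale omitted, and `tiny_sub_block` is a hypothetical placeholder for the MLP; neither is taken from the actual model code.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without the learned scale (omitted here as a simplification)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def residual_block(x_in, sub_block):
    """x_out = x_in + sub_block(norm(x_in))"""
    delta = sub_block(rms_norm(x_in))
    # The sub-block's output is added to the stream, never written over it.
    return [a + d for a, d in zip(x_in, delta)]

def tiny_sub_block(h):
    # Hypothetical stand-in for the MLP: returns a small scaled copy.
    return [0.1 * v for v in h]

x_in = [3.0, -4.0, 0.0, 0.0]
x_out = residual_block(x_in, tiny_sub_block)
print(x_out)
```

If the sub-block returned all zeros, `x_out` would equal `x_in` exactly: that identity path is what gradients can flow through.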

One more thing before the formulas: why have a squash function at all? Because stacking matrix-multiplies with nothing between them buys nothing — A·(B·x) is just (A·B)·x, two linear maps collapsed into one. The squash between them is what stops the collapse and lets the block compute genuinely new things.
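The collapse argument is easy to check numerically. The matrices below are arbitrary toy values chosen for illustration, and ReLU is used as the squash only because it is the simplest one to read.

```python
def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(v):
    return [max(0.0, t) for t in v]

A = [[1.0, -1.0], [2.0, 0.0]]
B = [[0.0, 1.0], [-1.0, 3.0]]
x = [1.0, -2.0]

stacked = matvec(A, matvec(B, x))            # A·(B·x)
collapsed = matvec(matmul(A, B), x)          # (A·B)·x, a single linear map
with_squash = matvec(A, relu(matvec(B, x)))  # squash between the two maps

print(stacked, collapsed, with_squash)
```

The first two results agree, so the second matrix bought nothing; the third differs because the squash sits between the two maps.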

What "gated MLP" actually means

Qwen3.5 (like Llama, Mistral, and most modern open LLMs) uses the SwiGLU-style gated MLP. It has three linear projections per layer:

  • gate_proj: hidden_dim → intermediate_dim. Its output is run through a SiLU activation (also called Swish): silu(z) = z · sigmoid(z). SiLU is smooth and nearly linear for large positive z, but suppresses negative values — a soft gate.
  • up_proj: hidden_dim → intermediate_dim. This is the "value" path — the actual features being passed through.
  • down_proj: intermediate_dim → hidden_dim. Projects the element-wise product back to the residual stream's width.

Put together, the block computes:

  mlp(x) = down_proj( silu(gate_proj(x)) ⊙ up_proj(x) )

Read this expression carefully. The ⊙ is element-wise, not matrix multiplication (multiply matching entries position by position): each of the intermediate_dim features in silu(gate_proj(x)) multiplies the corresponding feature in up_proj(x). The gate projection decides how much of each feature passes through; the up projection supplies the value. The two are combined per feature, then collapsed back to hidden_dim by down_proj.

Walk one feature by hand. Say the gate path produces z = 2 and the up path produces 0.5. Then silu(2) = 2 / (1 + e⁻²) ≈ 1.76, so the feature that passes through is 1.76 × 0.5 ≈ 0.88. A strongly negative gate would crush the same value toward zero instead.
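The three projections and the hand-worked feature can be put together in a small sketch. The sizes (hidden_dim = 2, intermediate_dim = 3) and all weight values are toy assumptions chosen so that the first feature reproduces the z = 2, up = 0.5 example; bias terms are assumed absent.

```python
import math

def silu(z):
    """silu(z) = z * sigmoid(z)"""
    return z / (1.0 + math.exp(-z))

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def swiglu_mlp(x, w_gate, w_up, w_down):
    gate = [silu(z) for z in matvec(w_gate, x)]  # hidden -> intermediate, then SiLU
    up = matvec(w_up, x)                         # hidden -> intermediate (value path)
    gated = [g * u for g, u in zip(gate, up)]    # element-wise product
    return matvec(w_down, gated)                 # intermediate -> hidden

# Toy weights: 3 intermediate features, 2 hidden dims (illustrative values only).
w_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w_up = [[0.25, 0.0], [1.0, 1.0], [0.5, 0.5]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

x = [2.0, 0.0]
y = swiglu_mlp(x, w_gate, w_up, w_down)

# Feature 0 has gate input z = 2 and up value 0.5, as in the hand calculation.
print(round(silu(2.0), 2), round(y[0], 2))
```

Feature 2 gets a gate input of -2, so its up value of 1.0 is scaled down to roughly -0.24 before down_proj sees it.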

The intermediate dimension is where the model has its "scratch space" — around 3.5× the hidden dim here (Qwen3.5-0.8B uses 3584 for a 1024-dim hidden state; 3–4× is common across models). These three matrices are one of the largest blocks of parameters in the model: 3 matrices × 1024 × 3584 numbers each, × 24 layers = 264,241,152 ≈ 264M — about a third of the ~0.8B total, roughly on par with the embedding table.
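The parameter count above is plain arithmetic and can be reproduced directly. The sketch assumes the three projections carry no bias terms, which is what the "3 matrices × 1024 × 3584" figure implies.

```python
# Dimensions quoted in the text for Qwen3.5-0.8B.
hidden_dim = 1024
intermediate_dim = 3584
num_layers = 24

per_matrix = hidden_dim * intermediate_dim  # one projection (no bias assumed)
per_layer = 3 * per_matrix                  # gate_proj + up_proj + down_proj
total = per_layer * num_layers

print(per_matrix, per_layer, total)
```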

Same MLP, drawn as neurons

The matmul view above tells you the dimensions. The neuron view tells you the topology: two parallel wide projections (the gate and the value), an element-wise product, and a narrow projection back. Symbolic widths shown — Qwen3.5-0.8B uses 1024 → 3584 → 1024.

[Schematic: hidden in → intermediate (gate_proj and up_proj in parallel) → silu(gate) ⊙ up → down_proj → hidden out. 6 / 12 nodes shown; the model uses 1024 / 3584.]
Two parallel projections widen hidden → intermediate. Their element-wise product is the gated activation — the gate branch decides which features survive, the up branch carries the values. down_proj collapses back. Together these three matrices are about a third of Qwen3.5-0.8B (≈264M), roughly on par with the embedding table.
SiLU vs ReLU — the soft gate
[Interactive plot: SiLU and ReLU over z ∈ [−6, 6]. Readout at z = −1.0: SiLU(z) = −0.269, ReLU(z) = 0.000, gap (SiLU − ReLU) = −0.269.]
At z = -1.0 (negative), ReLU hard-zeros to 0.000, but SiLU leaks -0.269 — the soft gate still passes a little signal.

SiLU is smooth and dips slightly negative for z < 0 (minimum ≈ −0.278 at z ≈ −1.278), so it acts as a soft gate: gradients keep flowing for small negative inputs, unlike ReLU's hard zero. Qwen3.5-0.8B uses SiLU inside its SwiGLU MLP.

Illustrative — SiLU and ReLU are plotted from their exact formulas; Qwen3.5-0.8B's MLP uses SiLU. Not live output from the model.
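The plotted values can be recomputed from the formulas. The sketch below evaluates both activations at z = −1.0 and locates SiLU's minimum with a simple grid scan; the 0.001 grid step is an arbitrary choice for illustration.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

z = -1.0
gap = silu(z) - relu(z)

# Coarse scan of [-6, 6] for the lowest point of SiLU.
grid = [i / 1000.0 for i in range(-6000, 6001)]
z_min = min(grid, key=silu)

print(round(silu(z), 3), relu(z), round(gap, 3))
print(z_min, round(silu(z_min), 3))
```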

Why gated — and why it isn't any bigger

You might wonder why three matrices. The classic feed-forward block, the one in the original Transformer, used just two: an up-projection, a fixed nonlinearity (ReLU in the original paper; GELU in later variants like BERT), then a down-projection. SwiGLU keeps the up and down matrices but adds a third, the gate, and replaces the fixed nonlinearity with a learned, multiplicative one. Instead of applying the same threshold to every feature, the gate lets the network decide, per token and per feature, how much of each up-projected value survives. That input-dependent gating gives the block more flexibility than a fixed activation, and in practice it tends to train to lower loss.

The natural objection: doesn't a third matrix cost 50% more parameters? At the classic 4× intermediate it would — three 4×-wide matrices come to 12 h² against the plain block's 8 h². The conventional fix (Llama, Mistral) shrinks the intermediate to about ⅔ of that (≈ 8⁄3 · hidden), so three narrower matrices land right back at 8 h² — the break-even the diagram below shows. Toggle it to watch the bar re-proportion while its total length stays fixed. (Qwen3.5 doesn't shrink quite that far — see the note under the diagram.)

[Interactive diagram: Three matrices, same budget. Gated form: down( SiLU(gate·x) ⊙ up·x ), 3 matrices. Parameters: classic 4× intermediate = 8 h²; ≈ 8 h² either way.]

The bar shows the idealized break-even: three matrices shrunk to 8⁄3 · hidden weigh the same 8 h² as a plain FFN's two 4×-wide matrices, which is how the gate gets added "for free." That's the trade-off Llama and Mistral take. Real checkpoints often run a touch wider: Qwen3.5-0.8B uses a 3584 intermediate (3.5× its 1024 hidden), so its three matrices total ≈ 10.5 h², about 30% more than a plain block. The gate is cheap, not literally free.
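The budget comparison reduces to counting matrices times width, in units of h². A short sketch using exact fractions (the helper name `ffn_params_in_h2` is made up for this example):

```python
from fractions import Fraction

def ffn_params_in_h2(num_matrices, width_multiple):
    """Parameter count in units of h^2: each matrix is h x (width_multiple * h)."""
    return num_matrices * Fraction(width_multiple)

classic = ffn_params_in_h2(2, 4)                    # plain FFN, 4x wide
gated_4x = ffn_params_in_h2(3, 4)                   # gate added, width unchanged
gated_shrunk = ffn_params_in_h2(3, Fraction(8, 3))  # width cut to 8/3
qwen = ffn_params_in_h2(3, Fraction(3584, 1024))    # 3.5x, figures from the text

overhead = qwen / classic
print(classic, gated_4x, gated_shrunk, float(qwen), float(overhead))
```

The last ratio comes out at 21/16, which is the "about 30% more" quoted above.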

So much for the plumbing — here is what those intermediate features end up doing once trained:

What one MLP neuron fires on · three illustrative features

Each of the 3584 intermediate features acts like a tiny detector: its gate opens on tokens that match its pattern and stays shut on everything else. When it fires, its slice of down_proj adds that feature's signature to the token's residual stream. Darker = stronger firing.

"fires on city names"
I flew from Paris to Tokyo last week
writes its pattern into the residual stream
"fires on negation"
The food was not bad at all
writes its pattern into the residual stream
"fires inside code"
Try this loop: for i in range(10):
writes its pattern into the residual stream

Illustrative — real features are messier and discovered, not designed. Nobody assigns a neuron to "city names"; training finds whatever detectors lower the loss, many fire on several unrelated things at once, and a clean one-neuron-one-concept story is the exception, not the rule.
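The phrase "its slice of down_proj adds that feature's signature" has a concrete reading: the MLP output is a sum, over intermediate features, of each gated activation times the matching column of down_proj. A toy check of that identity, with made-up activations and weights:

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def column(m, i):
    return [row[i] for row in m]

# Hypothetical gated activations for 3 intermediate features (feature 1 is shut).
gated = [0.88, 0.0, -0.24]

# Toy down_proj, stored as hidden_dim rows x intermediate_dim columns.
w_down = [[1.0, 0.5, 0.0],
          [0.0, 2.0, 1.0]]

# Each feature writes activation * (its column of down_proj).
contributions = [[a * c for c in column(w_down, i)] for i, a in enumerate(gated)]
summed = [sum(vals) for vals in zip(*contributions)]

direct = matvec(w_down, gated)
print(contributions, summed, direct)
```

A feature whose gate stays shut contributes a zero vector, so it leaves the residual stream untouched for that token.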

The MLP contributes a small correction

Here is the surprising part: the MLP's raw output is much smaller than the residual stream it writes into. Because every block adds its output into the stream and never overwrites it, those contributions accumulate layer over layer, so the residual stream's per-token L2 norm grows even though each individual write is small. (The L2 norm is just the vector's length: the square root of the sum of its squared entries.) Each MLP's output is a small delta on top, a correction rather than a replacement. Watch the chart below: the MLP output's per-token L2 is a fraction of the layer's full output L2. The bulk of the magnitude in the residual stream came from the input, not from this layer's MLP.

This is the "highway" intuition made quantitative. The all-layers strip shows that the size of the per-layer MLP contribution varies across depth: some layers contribute more than others, but no single layer dominates. Predictions emerge from the sum of many small writes — which is exactly the design choice that makes deep transformers learnable.
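A synthetic toy can show how many small additive writes build up a larger stream. Everything here is an assumption for illustration: the writes are random directions with a fixed norm of 0.2, the starting vector has norm 1, and the dimension is 256. None of it is measured from Qwen3.5.

```python
import math
import random

def l2(v):
    """L2 norm: square root of the sum of squared entries."""
    return math.sqrt(sum(t * t for t in v))

def random_direction(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = l2(v)
    return [t / n for t in v]

rng = random.Random(0)
dim, num_layers, write_norm = 256, 24, 0.2  # synthetic settings, not model values

stream = random_direction(dim, rng)  # stand-in for the embedding, norm 1
ratios = []
for _ in range(num_layers):
    write = [write_norm * t for t in random_direction(dim, rng)]
    stream = [s + w for s, w in zip(stream, write)]  # add, never overwrite
    ratios.append(l2(write) / l2(stream))            # write size vs. stream size

print(round(l2(stream), 2), round(max(ratios), 2))
```

Real MLP writes are not random directions, so the actual growth curve will differ; the sketch only illustrates that each write can stay a fraction of the stream while the stream itself grows.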

One unifying point for Qwen3.5's hybrid stack: every layer — full-attention or linear — has this same MLP block. Only the token-mixing half differs between the two layer kinds; the SwiGLU MLP that follows it is identical everywhere.

Engineering takeaways
  • The MLP touches every token in isolation — there is no cross-token information flow inside an MLP block.
  • gate_proj/up_proj/down_proj are one of the largest parameter blocks — about a third of Qwen3.5-0.8B (≈264M), roughly on par with the embedding table.
  • Each MLP writes a small correction to the residual stream; the prediction is the sum of many small writes, not the last block's output.
Try this

After the auto-run, look at the 'MLP contribution across all layers' strip. Click the layer with the largest bar and the layer with one of the smallest non-empty bars. How does the 'MLP / layer-output ratio' percentage compare between them?

Quick check
1. In SwiGLU, what kind of multiplication is the ⊙ between silu(gate_proj(x)) and up_proj(x)?
2. How does a single MLP block's output magnitude usually compare to the residual stream it writes into?
3. Why is the MLP one of the largest blocks of parameters in the model?
