Explain what RMSNorm does to a hidden vector and why pre-norm makes deep transformers trainable.

Each layer adds its token-mixer (attention or GatedDeltaNet) output and MLP output back into the same residual stream — without normalization the magnitudes drift across 24 layers, softmaxes saturate, and training breaks. RMSNorm's whole job is to put the brakes on.

7 min

Forward pass: Tokenize → Embedding lookup → × 24 layers → Final RMSNorm → LM head → Sampling

Glossary · 8 terms

RMSNorm: Divide x by sqrt(mean(x^2) + eps), then rescale element-wise by a learned gain g. No mean-centering, no bias.
LayerNorm: The older sibling: subtracts mean(x) first, then divides by std, with both gain and bias. RMSNorm drops the centering and bias.
pre-norm: Architecture choice: normalize the residual stream before each sub-block. Residual stream itself stays un-normalized.
post-norm: The original 2017 transformer pattern: residual first, norm last. Easier to interpret, much harder to train deeply.
variance / std: Variance = the average squared distance from the mean. Std (standard deviation) = the square root of the variance. LayerNorm divides by std; RMSNorm skips the mean entirely.
L2 norm: sqrt(sum of squares) of a vector. A standard magnitude proxy; RMSNorm makes the per-token L2 land near sqrt(hidden_dim).
learned gain (g): A per-feature scale RMSNorm applies after dividing by RMS. The only learned parameter the normalizer carries.
gradient path / identity path: The route the training signal takes backward through the network. The residual highway is an identity path because it passes that signal through unchanged (multiplies by 1), so it doesn't shrink toward zero (vanish) across many layers.

RMSNorm: keeping activations in check

In short: each of Qwen3.5's 24 layers adds its token-mixer (attention or GatedDeltaNet) output and MLP output back into the same residual stream. Without something holding magnitudes in check, the hidden vector grows until softmaxes saturate and gradient updates stop being meaningful. RMSNorm is the cheap, almost-stateless trick that puts the brakes on. The chart on the right shows a real prompt: the residual stream itself still climbs roughly an order of magnitude over depth (the visible bottom line), but the normalized input that each sub-block actually reads is held flat near √hidden_dim by RMSNorm (pink line).

Before the formula, feel the problem the brakes solve. “Softmaxes saturate” sounds abstract — this widget makes it concrete. Push a token's magnitude up and watch a downstream softmax collapse from a spread-out distribution onto a single option, going blind to the rest:

When magnitude drifts, softmax saturates

Same score pattern, swept by overall magnitude

Six fixed scores (one per token). Their pattern never changes — only the overall magnitude does. Drag it up and the downstream softmax stops spreading attention and collapses onto a single token.

Per-token magnitude (L2): 32 (logit scale 1.0×)
Slider labels: √1024 ≈ 32 · informative | drifted · saturated
Current state: informative (attention is spread), top = 0.325

token A: 0.325
token B: 0.218
token C: 0.161
token D: 0.132
token E: 0.098
token F: 0.066

Bigger activation magnitude scales every score up by the same factor, which makes softmax spike onto the single largest option and ignore the rest — the distribution saturates. A near-one-hot distribution carries almost no information about how the other options compare, and the gradient into them all but vanishes. RMSNorm pins each token's magnitude near √hidden_dim ≈ 32 (here √1024 for Qwen3.5-0.8B), keeping the downstream distribution informative instead of letting drift collapse it.

Illustrative — a schematic six-way softmax over fixed synthetic scores, not a literal model readout. The real saturation this prevents happens in the attention scores. The softmax itself is exact: scaling the scores by a larger factor genuinely sharpens it toward one-hot.
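The sharpening effect can be reproduced in a few lines. This is a minimal sketch, not the widget's code: the six scores are hypothetical values chosen so that the 1.0× distribution comes out close to the one shown above, and the scale factors are arbitrary.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical fixed score pattern for six tokens (an assumption, not model output).
scores = [1.2, 0.8, 0.5, 0.3, 0.0, -0.4]

top_by_scale = {}
for scale in (1.0, 4.0, 16.0):
    # Same pattern every time; only the overall magnitude changes.
    probs = softmax([scale * s for s in scores])
    top_by_scale[scale] = max(probs)
    print(f"scale {scale:>4}: top = {max(probs):.3f}")
```

The pattern of the scores never changes, yet the top probability climbs toward 1 as the scale grows, which is the collapse the slider demonstrates.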

The formula, in two lines

RMSNorm divides each token's hidden vector by its root-mean-square, then rescales it with a learned per-feature gain g:

RMS(x) = \sqrt{mean(x^{2}) + ε}
y_{i} = \frac{x_{i}}{RMS(x)} \cdot g_{i}

Work it once by hand with four numbers (ε is too small to matter, and g = 1 before training). Take x = [2, −1, 3, 0]. Square each entry: x² = [4, 1, 9, 0]. Mean: (4 + 1 + 9 + 0) / 4 = 3.5. So RMS = √3.5 ≈ 1.87, and dividing gives y ≈ [1.07, −0.53, 1.60, 0]. Big entries shrink, the zero stays zero, and the vector keeps its shape — only its size changes.
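The hand calculation can be checked with a short script. This is a sketch under stated assumptions: the function name `rms_norm` and the value `eps = 1e-6` are illustrative choices, not taken from Qwen3.5's configuration.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Divide x by sqrt(mean(x^2) + eps), then scale element-wise by gain."""
    if gain is None:
        gain = [1.0] * len(x)  # g = 1, as before training
    mean_sq = sum(v * v for v in x) / len(x)
    rms = math.sqrt(mean_sq + eps)
    return [v / rms * g for v, g in zip(x, gain)]

x = [2.0, -1.0, 3.0, 0.0]
y = rms_norm(x)
print([round(v, 2) for v in y])  # prints [1.07, -0.53, 1.6, 0.0]
```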

Here ε (epsilon) is a tiny constant so we never divide by zero, and the learned gain g is the only parameter the normalizer carries — one scalar per feature, length equal to hidden_dim. Divide by RMS first to make the vector's average squared magnitude land at 1. That is what anchors the vector's L2 norm — the L2 norm is just the vector's length, the square root of the sum of its squared entries — near √hidden_dim:

after RMSNorm (with g = 1) the mean of the squared entries ≈ 1;
so the 1024 entries' squares sum to ≈ 1024 (writing y for the normalized vector and Σ for “sum over all the entries”, Σy² ≈ 1024);
so the vector's length L2 = √(Σy²) ≈ √1024 = 32, which is where the “about 32” comes from.
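The three steps above can be verified numerically. The sketch below uses a hypothetical random 1024-dim vector with an arbitrary input scale, g = 1, and an assumed `eps` of 1e-6; none of these values come from the real model.

```python
import math
import random

hidden_dim = 1024
rng = random.Random(0)
# Hypothetical hidden vector; the input scale (std 5) is arbitrary.
x = [rng.gauss(0.0, 5.0) for _ in range(hidden_dim)]

eps = 1e-6
rms = math.sqrt(sum(v * v for v in x) / hidden_dim + eps)
y = [v / rms for v in x]  # g = 1, so no reweighting yet

mean_sq = sum(v * v for v in y) / hidden_dim  # step 1: about 1
sum_sq = sum(v * v for v in y)                # step 2: about 1024
l2 = math.sqrt(sum_sq)                        # step 3: about 32
print(mean_sq, sum_sq, l2)
```

Changing the input scale should leave all three printed values essentially unchanged, because the division by RMS removes it.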

Then let g reweight each feature however the model wants. Initially g = 1 for every feature, so before training RMSNorm just standardises magnitudes; during training the model learns which features matter more than others and bakes that into g.

Learned gain γ — the only trained parameter

y_{i} = \frac{x _{i}}{RMS ( x )} \cdot g_{i}

RMSNorm's only learned parameter is this per-feature gain, written g in the formula and γ in the chart below: one scalar per feature (length 1024, initialised to 1.0, no bias). Training learns which features to amplify and which to damp.

Per-feature gain, learned γ (sample of 48 features; reference line at γ = 1.0)
Legend: ↑ amplified (γ > 1), ↓ suppressed (γ < 1)
Amplified: 26 features
Suppressed: 22 features
Min γ: 0.39
Max γ: 1.84

With the learned gain, each feature is re-weighted independently — 26 amplified (γ > 1) and 22 suppressed (γ < 1) in this sample.

Illustrative — γ is a synthetic sample of 48 of the 1024 features (a fixed seeded draw centered on 1.0), not Qwen3.5-0.8B's actual learned gain. The shape — most features near 1, a few amplified or suppressed — mirrors real norms.
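To see what a non-uniform gain does to a single vector, here is a small sketch that reuses the four-number example. The gain values are hypothetical: they borrow the sample's max (1.84) and min (0.39) purely for illustration and are not real learned weights.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]

x = [2.0, -1.0, 3.0, 0.0]
unit_gain = [1.0, 1.0, 1.0, 1.0]        # state before training
learned_gain = [1.84, 1.0, 0.39, 1.0]   # hypothetical: amplify feature 0, damp feature 2

before = rms_norm(x, unit_gain)
after = rms_norm(x, learned_gain)
l2_before = math.sqrt(sum(v * v for v in before))
l2_after = math.sqrt(sum(v * v for v in after))
print(before, l2_before)
print(after, l2_after)
```

With g = 1 the output L2 is √4 = 2 for this four-entry vector. Once the gain departs from 1 the L2 shifts away from √dim, which is why the text says "near" rather than "exactly" √hidden_dim.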

Side-by-side with the older LayerNorm, the simplification is obvious:

# RMSNorm — what every modern LLM uses
y = x / sqrt(mean(x²) + ε) * g
                              ↑
                       single learned vector

# LayerNorm — what the 2017 transformer used
y = (x - mean(x)) / sqrt(var(x) + ε) * g + b
     ↑—————————                  ↑
   subtract mean first        plus learned bias

Two words in that LayerNorm line, since the demo's stat boxes use them too: var(x), the variance, is the average squared distance from the mean, and std (standard deviation) is its square root. Dropping the mean-subtraction and bias is the entire difference. Llama, Mistral, Qwen, DeepSeek: most open LLMs you'll meet use RMSNorm; the simpler formula is empirically close to a wash on quality and modestly faster to run.
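The two formulas can be compared directly on the same input. This is a plain-Python sketch with g = 1, b = 0 and an assumed `eps` of 1e-6; the function names are illustrative, not a library API.

```python
import math

def rms_norm(x, g, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * gi for v, gi in zip(x, g)]

def layer_norm(x, g, b, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # average squared distance from the mean
    std = math.sqrt(var + eps)
    return [(v - mean) / std * gi + bi for v, gi, bi in zip(x, g, b)]

x = [2.0, -1.0, 3.0, 0.0]  # mean is 1.0, so centering changes the result
g = [1.0] * 4
b = [0.0] * 4

y_rms = rms_norm(x, g)
y_ln = layer_norm(x, g, b)
mean_rms = sum(y_rms) / 4  # stays non-zero: RMSNorm does not center
mean_ln = sum(y_ln) / 4    # about 0: LayerNorm subtracts the mean first

# On an already zero-mean input the two agree when g = 1 and b = 0.
centered = [v - 1.0 for v in x]
same = all(abs(p - q) < 1e-9
           for p, q in zip(rms_norm(centered, g), layer_norm(centered, g, b)))
print(mean_rms, mean_ln, same)
```

The last check makes the relationship concrete: for a zero-mean vector the variance equals mean(x²), so LayerNorm without bias reduces to RMSNorm.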

Pre-norm vs. post-norm

The diagram below shows where the norm lives inside a layer. Most modern stacks use pre-norm: the residual highway stays un-normalized, so gradients flow through it as a clean identity path. Post-norm (the 2017 original) places the norm on the residual highway, which is fine for 6 layers but painful at 24.

Post-norm (2017 original)

Norm sits on the residual highway. Gradients have to fight through one extra norm at every layer. Trains fine for 6 layers; painful past 20.

Pre-norm (Qwen3.5, modern LLMs)

Norm only touches the sub-block input. The residual highway stays un-normalized — gradients flow through a clean identity path. Stable to 24+ layers.
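The two wirings differ only in where the norm call sits. The sketch below is a toy under loud assumptions: `make_sublayer` is a hypothetical stand-in for attention or the MLP (a fixed random linear map followed by tanh), each layer has a single sub-block instead of two, the gain is fixed at 1, and the width of 64 is arbitrary.

```python
import math
import random

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def l2(x):
    return math.sqrt(sum(v * v for v in x))

def make_sublayer(dim, rng):
    # Hypothetical stand-in for a token mixer or MLP, not a real sub-block.
    w = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
         for _ in range(dim)]
    def f(x):
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
    return f

def pre_norm_layer(x, f):
    # Norm touches only the sub-block input; the residual add is un-normalized.
    return [xi + fi for xi, fi in zip(x, f(rms_norm(x)))]

def post_norm_layer(x, f):
    # Norm sits on the residual highway: it is applied after the add.
    return rms_norm([xi + fi for xi, fi in zip(x, f(x))])

rng = random.Random(0)
dim, depth = 64, 24
layers = [make_sublayer(dim, rng) for _ in range(depth)]
x_pre = x_post = [rng.gauss(0.0, 1.0) for _ in range(dim)]
for f in layers:
    x_pre = pre_norm_layer(x_pre, f)
    x_post = post_norm_layer(x_post, f)

print("pre-norm residual L2:      ", l2(x_pre))
print("pre-norm sub-block input L2:", l2(rms_norm(x_pre)))
print("post-norm stream L2:       ", l2(x_post))
```

In the pre-norm wiring the residual is free to drift while whatever the sub-block reads sits near √dim; in the post-norm wiring the stream itself is renormalized at every layer.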

Why does post-norm get painful at depth? A norm doesn't just rescale the activations flowing forward — it also rescales the learning signal flowing backward during training. With post-norm that signal passes through a norm in every one of the 24 layers, and 24 small rescalings multiply up into a big one. Pre-norm's identity path skips all 24.
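The same argument can be written as per-layer Jacobians. The notation here is an assumption introduced for illustration: F_l is the sub-block of layer l, J_Norm is the Jacobian of the normalizer, and I is the identity matrix.

```latex
% Pre-norm: the identity term survives untouched in every layer.
x_{l+1} = x_l + F_l(\mathrm{Norm}(x_l))
\quad\Rightarrow\quad
\frac{\partial x_{l+1}}{\partial x_l}
  = I + \frac{\partial F_l(\mathrm{Norm}(x_l))}{\partial x_l}

% Post-norm: the whole sum, identity included, passes through the norm's Jacobian.
x_{l+1} = \mathrm{Norm}(x_l + F_l(x_l))
\quad\Rightarrow\quad
\frac{\partial x_{l+1}}{\partial x_l}
  = J_{\mathrm{Norm}}\big(x_l + F_l(x_l)\big)
    \left( I + \frac{\partial F_l(x_l)}{\partial x_l} \right)
```

Multiplying 24 pre-norm factors together still leaves a bare I term in the expansion, so part of the signal reaches the bottom layer unscaled. Multiplying 24 post-norm factors leaves no such term: every path picks up all 24 copies of J_Norm.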

What the demo shows you

The L2/tok across layers chart (top of the demo panel) is the punchline of this chapter — one curve drifting up, one curve held flat. The per-layer breakdown below it lets you spot-check a single layer's stats. And the scale-invariance playground at the bottom is a synthetic 256-dim vector you can stretch by 0.1×–10× to convince yourself the output magnitude really doesn't depend on the input scale.
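The scale-invariance claim is easy to check offline. This sketch mirrors the playground's setup with a synthetic 256-dim vector; the seed, the three scale factors, and `eps = 1e-6` are assumptions made for illustration.

```python
import math
import random

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(256)]  # synthetic 256-dim vector

outputs = {}
for scale in (0.1, 1.0, 10.0):
    stretched = [scale * v for v in base]
    outputs[scale] = rms_norm(stretched)
    in_l2 = math.sqrt(sum(v * v for v in stretched))
    out_l2 = math.sqrt(sum(v * v for v in outputs[scale]))
    print(f"scale {scale:>4}: input L2 = {in_l2:8.2f}, output L2 = {out_l2:.3f}")

# Largest element-wise gap between the 0.1x and 10x outputs.
max_diff = max(abs(a - b) for a, b in zip(outputs[0.1], outputs[10.0]))
print("max difference:", max_diff)
```

The input L2 spans two orders of magnitude while the output L2 stays near √256 = 16. The small residual difference comes from ε, which matters slightly more when the input is tiny.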

Engineering takeaways

RMSNorm collapses input magnitudes to roughly sqrt(hidden_dim) before the next sub-block reads them — about 32 for Qwen3.5-0.8B.
Pre-norm keeps the residual stream un-normalized so gradients flow through an identity path; that's why 24-layer stacks train at all.
Dropping LayerNorm's mean-centering and bias is essentially free quality-wise and slightly cheaper to compute, which is why most modern open LLMs do it.

Try this

After the auto-run, use the Layer selector to flip between layer 0 and layer 22. Compare the L2/tok numbers for 'Input RMSNorm input' vs 'Input RMSNorm output'. How does each value change with depth, and which one stays roughly constant?

Quick check

1. What does RMSNorm drop compared to the older LayerNorm?

The learned gain g.
The mean-centering step and the additive bias; RMSNorm only divides by RMS and applies a learned gain.
The division step entirely; it only rescales.

2. Why is pre-norm the standard choice for 20+ layer transformers?

Gradients flow through the un-normalized residual path as a clean identity, so deep stacks stay trainable.
Pre-norm uses fewer parameters than post-norm.
Pre-norm avoids needing a residual connection.

3. After RMSNorm and the learned gain, roughly what magnitude does a token's hidden vector L2 land near?

Roughly 1: RMSNorm normalises every vector to unit length.
Roughly sqrt(hidden_dim), scaled by the learned gain: about 32 for a 1024-dim hidden state.
Whatever the input magnitude was, unchanged.
