Explain how RoPE injects token order into attention via per-pair rotations, without learned position embeddings.

Self-attention is permutation-invariant — without a positional signal the model cannot tell 'cat sat on mat' from 'mat sat on cat'.

8 min


Glossary · 6 terms

RoPE: Rotary positional embedding. Rotates each (x_2i, x_2i+1) pair of Q and K by an angle m * theta_i before the dot product.
pair frequency (theta_i): Per-dimension rotation rate: theta_i = base^(-2i / rope_dims). Low i rotates fast, high i rotates slowly.
rope_theta (base): The base of the frequency exponent. Qwen3.5 uses 1e7, far higher than the original 1e4 — stretches the spectrum for long context.
partial rotary factor: Fraction of each head's dims that RoPE rotates. Qwen3.5 uses 0.25, so 64 of 256 head dims are rotated; the rest pass through.
relative-position property: The dot product of RoPE(q,m) and RoPE(k,n) depends only on m-n, so the model only has to learn relative offsets.
unitary rotation: A rotation preserves vector norms — RoPE changes where vectors point, never how big they are.

Positional encoding: how the model knows token order

Self-attention has a strange property: it is permutation-invariant (strictly speaking, permutation-equivariant). If you shuffle the input tokens, the attention output for each token shuffles right along with them, but the relationships the attention layer computes don't change. Without help, a transformer literally cannot tell "the cat sat on the mat" from "mat the on sat cat the". Something has to inject the order back in.

See it for yourself first. Below is a toy bag of three tokens with no position attached — reorder them and the attention weights never budge:

Interactive demo: Plain attention can't see order (reorder the bag; the weights don't move)

Current order: cat · sat · mat

How much cat attends to each token, softmax(content · content): cat 0.46, sat 0.36, mat 0.17

Shuffle the order and cat's weight on every token holds its exact value: the slots move, the numbers do not.

Plain attention only sees which tokens are present, not their order — swap the order and every weight is unchanged. That blindness is exactly the gap RoPE fills by rotating Q and K by position.

Illustrative — three fixed tokens with hand-picked 2-D content vectors and no positional signal. The weight from one token to another is softmax of the dot product of their content vectors, which depends only on the two tokens, never on their slots — so it is the same across every ordering. The real model adds many dimensions, learned Q/K projections, and the RoPE rotation that breaks exactly this symmetry.
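To check the same claim outside the widget, here is a minimal Python sketch. The three 2-D content vectors are made-up values (not the ones behind the demo above) and `attention_from` is a hypothetical helper. There are no learned projections and no positional signal, so each weight depends only on which two tokens are involved.

```python
import math
from itertools import permutations

# Hypothetical 2-D content vectors, chosen only for illustration.
CONTENT = {"cat": (1.0, 0.2), "sat": (0.6, 0.5), "mat": (-0.1, 0.9)}

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_from(query_token, order):
    """Weights from query_token to every token in `order`, keyed by token."""
    q = CONTENT[query_token]
    scores = [q[0] * CONTENT[t][0] + q[1] * CONTENT[t][1] for t in order]
    return dict(zip(order, softmax(scores)))

baseline = attention_from("cat", ("cat", "sat", "mat"))
for order in permutations(CONTENT):
    weights = attention_from("cat", order)
    # Each token keeps its weight no matter which slot it occupies.
    assert all(math.isclose(weights[t], baseline[t]) for t in CONTENT)
    print(order, {t: round(w, 2) for t, w in weights.items()})
```

All six orderings print the same three weights, only listed in a different order.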

Two earlier ideas

Learned positional embeddings. Early GPT, BERT, and friends gave every position 0..N-1 its own learned vector and added it to the token embedding. Simple, but the model can't extrapolate to positions it never saw during training, and you have to pick a max length up front.
Sinusoidal embeddings. The original 2017 transformer used fixed sinusoids of different frequencies, also added to the token embedding. Extrapolation is in principle possible, but in practice quality degrades past the training window.
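For contrast with what follows, this is roughly what the additive approach looks like. The sketch uses the sinusoidal formula from the original transformer (sine on even features, cosine on odd ones, base 10000). The 4-D token embedding and the helper name `sinusoidal_position` are made up for illustration.

```python
import math

def sinusoidal_position(pos, dim, base=10000.0):
    """Fixed sinusoidal vector for one position (sin on even, cos on odd features)."""
    vec = []
    for i in range(dim // 2):
        angle = pos / base ** (2 * i / dim)
        vec.extend((math.sin(angle), math.cos(angle)))
    return vec

token_embedding = [0.5, -0.3, 0.8, 0.1]  # hypothetical 4-D embedding
position_vector = sinusoidal_position(pos=3, dim=4)

# The positional signal is ADDED to the embedding before any attention happens.
with_position = [e + p for e, p in zip(token_embedding, position_vector)]
print(with_position)
```

Note that addition changes the embedding's magnitude as well as its direction, which is one of the things RoPE avoids.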

RoPE's idea: rotate, don't add

Rotary Positional Embedding (RoPE) leaves the embeddings alone and instead bakes the position directly into the attention math. The trick is to rotate the query and key vectors by an angle that depends on the token's position, before the dot product is computed.

Concretely, RoPE pairs up adjacent feature dimensions — (x_0, x_1), (x_2, x_3), … — and treats each pair as a point in a 2D plane. For pair index i and rotary dimension d (the number of features RoPE actually rotates — not necessarily the full head dimension; more on that just below), the frequency is:

$θ_{i} = \text{base}^{-2i/d}$

Reading that in words: as the pair index i grows, the exponent −2i/d gets more negative, so $θ_{i}$ gets smaller. Low-index pairs spin fast (they track fine, local position); high-index pairs spin slowly (they track coarse, long-range position).
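As a quick sanity check of that formula, the sketch below lists a few frequencies for a rotary dimension of 64 and a base of 1e7, the values the glossary gives for Qwen3.5. The function name `rope_frequencies` is just illustrative.

```python
import math

def rope_frequencies(rotary_dim, base):
    """Return theta_i = base^(-2i / rotary_dim) for every pair index i."""
    return [base ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]

thetas = rope_frequencies(rotary_dim=64, base=1e7)

for i in (0, 1, 16, 31):
    tokens_per_turn = 2 * math.pi / thetas[i]
    print(f"pair {i:2d}: theta = {thetas[i]:.3e} rad/token, "
          f"one full turn every {tokens_per_turn:,.0f} tokens")
```

Pair 0 comes out at exactly 1 rad/token, and each later pair is slower than the one before it.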

At token position m, the pair is rotated by an angle of $m \cdot θ_{i}$:

$$\begin{pmatrix} x_{2i}' \\ x_{2i+1}' \end{pmatrix} = \begin{pmatrix} \cos(m θ_{i}) & -\sin(m θ_{i}) \\ \sin(m θ_{i}) & \cos(m θ_{i}) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

A rotation matrix just spins a 2-D point (think of a clock hand) by an angle $θ$; the four trig entries above are simply the recipe for computing where the spun point lands. The clock-hand dials further down show exactly this spin. Angles here are measured in radians, a unit where one full turn is $2 π \approx 6.28$, so a rate of "1 rad/token" is about a sixth of a turn per step.

Run that matrix once by hand, on pair 0 — whose frequency works out to exactly θ₀ = base⁰ = 1 radian per token, so pair 0 spins 1 rad every step. Take the pair (1, 0) at position m = 2: it rotates by 2 rad, landing at (cos 2, sin 2) ≈ (−0.42, 0.91). Same length, new direction — and it is exactly the spin you will see on dial 0 (the high-frequency clock) further down.
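The hand calculation can be reproduced in a few lines. `rotate_pair` is a hypothetical helper that applies the 2-D rotation to one pair; it is a sketch of the arithmetic, not code from any particular model.

```python
import math

def rotate_pair(x_even, x_odd, position, theta):
    """Rotate the point (x_even, x_odd) by the angle position * theta."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (c * x_even - s * x_odd, s * x_even + c * x_odd)

# Pair 0 has theta = 1 rad/token; the pair (1, 0) sits at position m = 2.
x, y = rotate_pair(1.0, 0.0, position=2, theta=1.0)
print(round(x, 2), round(y, 2))  # -0.42 0.91
print(math.hypot(x, y))          # length stays 1 (up to float rounding)
```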

The pairs are drawn here as neighbours (x_0 with x_1) for clarity; how Qwen3.5 actually lays them out is an implementation detail you can skip on a first read (see Advanced below).

Advanced: how the pairs are actually laid out (rotate-half) · optional, for the curious

The pairs are drawn here as neighbours (x_0 with x_1) for clarity, but Qwen3.5 — like most Llama/HF-style models — uses the mathematically equivalent rotate-half layout: dimension i is paired with its counterpart i + d/2 in the second half of the rotated block. Same angles, same frequencies, same relative-position property; only which two dimensions share a rotation differs.
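A small sketch of why the two layouts are interchangeable. Both functions below are illustrative, written from the description above rather than taken from any library: one rotates adjacent pairs, the other pairs dimension i with i + d/2. The 4-D vector and the two frequencies are made-up values.

```python
import math

def rope_interleaved(x, pos, thetas):
    """Rotate adjacent pairs (x[2i], x[2i+1])."""
    out = list(x)
    for i, theta in enumerate(thetas):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = c * a - s * b, s * a + c * b
    return out

def rope_rotate_half(x, pos, thetas):
    """Rotate pairs (x[i], x[i + d/2]), the rotate-half layout."""
    half = len(thetas)
    out = list(x)
    for i, theta in enumerate(thetas):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[i], x[i + half]
        out[i], out[i + half] = c * a - s * b, s * a + c * b
    return out

def to_half_layout(x):
    """Reorder an interleaved vector so pair i sits at (i, i + d/2)."""
    return x[0::2] + x[1::2]

x = [0.3, -1.2, 0.7, 2.0]   # hypothetical 4-D vector, so d = 4 and two pairs
thetas = [1.0, 0.01]        # hypothetical frequencies for the two pairs

via_interleaved = to_half_layout(rope_interleaved(x, 5, thetas))
via_rotate_half = rope_rotate_half(to_half_layout(x), 5, thetas)
print(all(math.isclose(u, v) for u, v in zip(via_interleaved, via_rotate_half)))
```

The two results match element for element once the input is reordered. In a trained model that reordering is simply absorbed into the learned Q/K projection weights, so the dot products are unaffected.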

Rotation is unitary — it preserves vector norms — so RoPE doesn't change how big Q and K are, only where they point. The position information rides on the angle, not the magnitude.
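The norm claim can be checked for a single pair in one line. Writing $c = \cos(m θ_{i})$ and $s = \sin(m θ_{i})$ (shorthand introduced here), the cross terms cancel and $c^2 + s^2 = 1$:

$$
(c\,x_{2i} - s\,x_{2i+1})^2 + (s\,x_{2i} + c\,x_{2i+1})^2 = (c^2 + s^2)\,(x_{2i}^2 + x_{2i+1}^2) = x_{2i}^2 + x_{2i+1}^2
$$

Summing over all pairs gives the same squared norm for the whole rotated block.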

Wide range of frequencies

Low-index pairs (small i) get the highest frequencies and rotate fast — one full turn every few tokens. High-index pairs get vanishingly small frequencies and barely move even over tens of thousands of tokens. The intuition: high-frequency pairs encode fine local position ("am I 3 tokens from my neighbour?"), low-frequency pairs encode coarse global position ("am I in the early part of the document or the late part?").

Qwen3.5 uses a head dimension of 256, a partial rotary factor of 0.25 (so only the first 64 features per head are rotated — the rest pass through untouched), and an unusually large RoPE base of 1e7. The large base stretches the frequency spectrum so the lowest-frequency pairs barely rotate at all over the model's 262,144-token context window — a key ingredient for long-context extrapolation.
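A rough sketch of what those three numbers imply, using the values quoted above. For readability it rotates adjacent pairs rather than the rotate-half layout, and `apply_partial_rope` is a hypothetical helper, not the model's actual code.

```python
import math

HEAD_DIM = 256
PARTIAL_ROTARY_FACTOR = 0.25
ROPE_BASE = 1e7
ROTARY_DIM = int(HEAD_DIM * PARTIAL_ROTARY_FACTOR)  # 64 rotated features

def apply_partial_rope(head_vec, pos):
    """Rotate the first ROTARY_DIM features pair by pair; copy the rest unchanged."""
    out = list(head_vec)
    for i in range(ROTARY_DIM // 2):
        theta = ROPE_BASE ** (-2 * i / ROTARY_DIM)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = head_vec[2 * i], head_vec[2 * i + 1]
        out[2 * i], out[2 * i + 1] = c * a - s * b, s * a + c * b
    return out

vec = [1.0] * HEAD_DIM  # made-up head vector
rotated = apply_partial_rope(vec, pos=1000)
assert rotated[ROTARY_DIM:] == vec[ROTARY_DIM:]  # the last 192 dims pass through

# Total angle swept by the slowest pair across the whole context window.
slowest_theta = ROPE_BASE ** (-2 * (ROTARY_DIM // 2 - 1) / ROTARY_DIM)
print(slowest_theta * 262_144, "radians")
```

The printed angle is a few hundredths of a radian, so the slowest pair stays far from completing even one turn across the full window.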

Why rotate only a quarter of the head? Because the unrotated 192 dims carry no position at all — they do pure content matching ("does this key mean what my query is asking for?"). Position only needs to ride on part of the vector; the rest is free to care about meaning alone.

One more wrinkle specific to Qwen3.5's hybrid design: this RoPE rotation is applied only on the 6 full-attention layers (those rotate the first 64 of 256 head dims, as above). The other 18 layers are linear (GatedDeltaNet) and encode position implicitly through their recurrence plus a short causal convolution — no RoPE at all.

The relative-position property

Here is the part that makes RoPE click. If you take the dot product of RoPE(q, m) with RoPE(k, n), the answer depends only on the difference m - n, not on m and n separately. The model never has to learn what "position 1,427" means; it only has to learn how attention should behave at relative offsets, and that behaviour carries over unchanged to any absolute position. This is what makes extrapolation to longer sequences possible in principle, although offsets far larger than any seen in training are still unfamiliar to the model.
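For a single pair this follows in two steps. Write $R(α)$ for the 2-D rotation by angle $α$, and let $q_i$ and $k_i$ be the i-th pair of the query and key (notation introduced here for the sketch). The transpose of a rotation is the rotation by the opposite angle, and rotations compose by adding angles, so:

$$
\langle R(m θ_{i})\, q_i,\; R(n θ_{i})\, k_i \rangle = q_i^{\top} R(m θ_{i})^{\top} R(n θ_{i})\, k_i = q_i^{\top} R\big((n - m)\, θ_{i}\big)\, k_i
$$

The full score is the sum of this expression over all pairs, so it too depends on m and n only through their difference.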

The third panel on the right makes this concrete. Holding the query position fixed at m = 50 and varying the key position n, the rotated dot product peaks sharply at n - m = 0 and falls off (with ripples) either side as you move further apart. That fall-off is exactly the inductive bias that makes attention "want" nearby tokens more than distant ones, with no per-position parameters at all.
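A simplified way to reproduce the shape of that curve, under loud assumptions: query and key share the same content, every pair is the unit vector (1, 0), and the frequencies use a rotary dimension of 64 with base 1e7. Each pair then contributes cos(offset · θ_i) to the score. This is a sketch, not the data behind the panel.

```python
import math

def positional_score(offset, rotary_dim=64, base=1e7):
    """Sum of cos(offset * theta_i) over all pairs (identical q and k content)."""
    return sum(
        math.cos(offset * base ** (-2 * i / rotary_dim))
        for i in range(rotary_dim // 2)
    )

m = 50  # fixed query position
for n in (50, 49, 45, 40, 30, 0):
    print(f"n = {n:2d}, offset = {m - n:2d}, score = {positional_score(m - n):.2f}")
```

The score is largest at offset 0, where all 32 pairs line up, and is strictly lower at every other offset in this range because the fast pairs fall out of phase.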

You can feel the offset-only dependence directly. Shift both positions together and the score never budges:

Interactive demo: q·k depends only on the offset m − n (shift both positions; the score stays put)

query position m = 6 · key position n = 2 · offset Δ = 4

relative rotation Δ·ω = 2.00 rad

q·k = cos(Δ·ω) = cos(2.00) = -0.42 (pulling apart)

Shift both together (Δ unchanged) and watch the arrows sweep while q·k holds.

The lesson: the score q·k = cos(Δ·ω) depends only on the offset m − n — shift both positions together and it doesn't change.

Both arrows start from the same base vector; RoPE turns the query by m·ω and the key by n·ω, so the query's relative rotation is exactly Δ·ω = (m − n)·ω and q·k = cos((m − n)·ω). Increase m and n together and the whole picture rotates, but the gap — and the score — stays frozen. Move just one and the gap opens or closes: align them (Δ → 0) and q·k → 1.

Illustrative — this uses ONE representative frequency (ω = 0.5 rad/token) and simplifies the content vectors to a shared base so we isolate the positional part. As positions grow the relative rotation Δ·ω can exceed a full turn, but because cosine repeats every turn the score still depends only on the offset Δ. The real model combines this with the actual query and key content across many frequency pairs at many different speeds (see the frequency spectrum below).
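The same single-frequency setup is easy to verify numerically. The sketch below uses the demo's ω = 0.5 rad/token and a shared base vector of (1, 0); `rotate` and `score` are hypothetical helper names.

```python
import math

OMEGA = 0.5  # rad/token, the single representative frequency from the demo

def rotate(vec, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])

def score(m, n, base_vec=(1.0, 0.0)):
    """Dot product of the base vector rotated to position m and to position n."""
    q = rotate(base_vec, m * OMEGA)
    k = rotate(base_vec, n * OMEGA)
    return q[0] * k[0] + q[1] * k[1]

print(round(score(6, 2), 2))      # -0.42, offset 4
print(round(score(106, 102), 2))  # -0.42, same offset after shifting both by 100
print(round(score(6, 6), 2))      # 1.0, offset 0
```

Shifting both positions by the same amount leaves the score unchanged, while closing the gap to zero brings it to 1.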

This chapter visualises RoPE's math directly — no model inference needed. Want to see how this plays out on real attention scores? Open Self-attention (Chapter 4) and look at how the score map favours the diagonal.

Engineering takeaways

RoPE encodes position as rotation, not addition — magnitudes stay constant and only angles carry the position signal.
Low-index pairs are high-frequency (fine local position); high-index pairs are low-frequency (coarse global position).
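The dot product after RoPE depends only on m - n, so nothing the model learns is tied to an absolute position. That is what makes extrapolation to longer sequences possible in principle, though offsets far beyond the training range are still unfamiliar to it.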

Try this

Drag the 'Token position m' slider from 0 up to 100. Compare the high-frequency rotation panel (pair 0) with the low-frequency one (pair 28). Which one completes a full turn first, and by how much has the other moved when m=100?

Quick check

1. What problem does RoPE solve that pure self-attention cannot?

Self-attention output magnitudes are too large.
Self-attention is permutation-invariant: it has no notion of token order until something injects it.
Self-attention cannot represent negative values.

2. Which dimension pairs end up encoding 'coarse global position' under RoPE?

High-index pairs: they have the lowest frequencies and rotate slowly enough to remain distinguishable over long contexts.
Low-index pairs, because they have the highest frequencies.
It's random: RoPE assigns roles per training run.

3. Why is it called the relative-position property?

The dot product of RoPE'd Q and K depends only on the offset (m - n), not on the absolute positions m and n.
RoPE positions are stored relative to a chosen anchor token.
The rotations always relate two adjacent tokens.