The memory wall

Go deeper · Chapter 4, Self-attention — how big softmax(QKᵀ) actually gets.

Look again at Q · Kᵀ: it is a [seq_len, seq_len] matrix — one score for every pair of tokens. For a 4,096-token prompt that is ~16.8 million scores per head, per layer. With 8 query heads that is ~134 million numbers for a single attention layer — about 268 MB in bf16 (2 bytes each). The model never needs all of that at once, yet the naïve recipe builds the whole thing and sets it aside.
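The arithmetic above can be checked in a few lines. This is a minimal sketch: the function name `score_matrix_size` is made up for illustration, and the defaults (8 query heads, 2 bytes per bf16 score) are the figures quoted in the text.

```python
def score_matrix_size(seq_len, num_heads=8, bytes_per_score=2):
    """Return (scores per head, scores per layer, bytes per layer).

    Hypothetical helper: the defaults mirror the numbers in the text
    (8 query heads, bf16 = 2 bytes per score).
    """
    scores_per_head = seq_len * seq_len          # one score per token pair
    scores_per_layer = scores_per_head * num_heads
    return scores_per_head, scores_per_layer, scores_per_layer * bytes_per_score


per_head, per_layer, layer_bytes = score_matrix_size(4096)
print(f"{per_head:,} scores per head")            # 16,777,216
print(f"{per_layer:,} scores per layer")          # 134,217,728
print(f"{layer_bytes / 1e6:.1f} MB per layer")    # 268.4
```

Because the count grows with the square of `seq_len`, doubling the prompt length quadruples every one of these numbers.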

First, two kinds of GPU memory

To see why that hurts, you need one fact about how a GPU is built. It has two very different places to keep numbers:

SRAM — a sliver of memory right next to the compute units, where the actual math happens. It is blazing fast but tiny (a few megabytes). Think of the small desk you work at.
HBM — the GPU's main memory, where weights and activations live. It is large (gigabytes) but comparatively far away and slow to reach. Think of a warehouse down the hall.

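Anything too big for the desk has to be stored in the warehouse and carried back whenever you need it. The N×N score matrix (N being the sequence length) is far too big for SRAM: at N = 4,096 it takes about 33.6 MB per head in bf16, against a few megabytes of SRAM. So it lives in HBM, and every step that touches it pays for the trip.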

The cost is bandwidth, not storage

Bandwidth is how fast you can move data between HBM and the compute units, not how much fits. The naïve attention recipe is brutal on it: it writes the full N×N matrix out to HBM, reads it back to run softmax, writes the normalized scores back, then reads them again to weight V. That is two writes and two reads, so the matrix crosses the slow memory bus about four times, and the arithmetic in between is comparatively trivial. Naïve attention spends most of its time shuttling that matrix, not computing, so we call it memory-bandwidth-bound.
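To put a rough number on that traffic, the sketch below counts only the bytes of the N×N score matrix that cross the bus in the naïve recipe. It deliberately ignores the much smaller Q, K, V and output tensors, and the helper name `naive_score_traffic_bytes` is hypothetical; bf16 (2 bytes) and 8 heads are the figures used in the text.

```python
# The four times the N x N matrix crosses the HBM bus in the naive recipe.
CROSSINGS = (
    "write raw scores Q.K^T to HBM",
    "read scores back to run softmax",
    "write normalized scores to HBM",
    "read normalized scores to weight V",
)


def naive_score_traffic_bytes(seq_len, bytes_per_score=2):
    """Bytes of score-matrix traffic per head (hypothetical helper).

    Counts only the N x N matrix; Q, K, V and the output are ignored
    because they grow linearly in seq_len, not quadratically.
    """
    matrix_bytes = seq_len * seq_len * bytes_per_score
    return matrix_bytes * len(CROSSINGS)


for n in (2048, 4096, 8192):
    per_head = naive_score_traffic_bytes(n)
    per_layer = per_head * 8  # 8 query heads, as in the text
    print(f"N={n}: {per_head / 1e6:.1f} MB per head, {per_layer / 1e6:.1f} MB per layer")
```

At N = 4,096 this comes to about 134 MB per head, or roughly 1.07 GB for the 8 heads of one layer, and each doubling of N multiplies the traffic by four.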

[Interactive figure: "The memory wall: why attention is bandwidth-bound". Q·Kᵀ is computed in fast SRAM, but the full N×N score matrix is far too big to keep there. A slider sets the sequence length N; at N = 4,096 tokens the N×N matrix is 33.6 MB per head, and counters track the bytes moved over the bus per head and across all 8 heads in the layer.]

One caveat for honesty: this only bites during prefill — digesting a long prompt in one pass — and only on the 6 full-attention layers. Single-token decode (one new query row, the part you watch in the chapter demo) and the 18 GatedDeltaNet layers never build an N×N matrix at all.
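The difference between prefill and decode shows up directly in the shape of the score matrix. The following sketch compares the two for one head; `score_matrix_bytes` is an illustrative name, and the 4,096-token context and bf16 size are carried over from the earlier example.

```python
def score_matrix_bytes(num_queries, num_keys, bytes_per_score=2):
    """Bytes for one head's [num_queries, num_keys] score matrix.

    Hypothetical helper; bf16 (2 bytes per score) as in the text.
    """
    return num_queries * num_keys * bytes_per_score


context_len = 4096

# Prefill: every token in the prompt is a query, so the scores are N x N.
prefill_bytes = score_matrix_bytes(context_len, context_len)

# Decode: one new query row attends over the existing context, so 1 x N.
decode_bytes = score_matrix_bytes(1, context_len)

print(f"prefill [{context_len}, {context_len}]: {prefill_bytes / 1e6:.1f} MB per head")
print(f"decode  [1, {context_len}]: {decode_bytes / 1e3:.1f} KB per head")
print(f"ratio: {prefill_bytes // decode_bytes}x")
```

A single decode row is thousands of times smaller than the prefill matrix, which is why the quadratic cost only bites when the whole prompt is processed in one pass.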

That is exactly the problem the next sub-chapter sidesteps: computing the same answer without ever writing the N×N matrix to HBM.