The roofline

Go deeper · Chapter 4, Self-attention — putting an actual number on “memory-bound.”

The memory wall kept leaning on one word — memory-bound — to explain why attention, and single-token decode in general, spends its time waiting on memory instead of computing. That was a story. The roofline turns it into a number you can plot, and that number tells you exactly when the story stops being true.

Two ceilings, and which one you hit first

A GPU can run out of road for two completely independent reasons, and whichever you reach first becomes your ceiling. One is compute: how many math operations per second the Tensor Cores can do (an H100 SXM in dense fp16 or bf16 tops out near 989 TFLOPS). The other is memory bandwidth: how many bytes per second you can stream out of HBM (the same H100: 3.35 TB/s). One caps the math; the other caps the bytes moved. A faster engine doesn't help if you're starved on fuel delivery.

Which ceiling binds you is a property of the workload, not the chip: its arithmetic intensity — the FLOPs of work done for every byte moved, measured in ops:byte.

\text{intensity} = \frac{\text{work (FLOPs)}}{\text{memory traffic (bytes)}}, \qquad \text{ridge} = \frac{\text{peak compute}}{\text{bandwidth}} = \frac{989\ \text{TFLOP/s}}{3.35\ \text{TB/s}} \approx 295\ \text{ops:byte}

That ridge at ~295 ops:byte is the break-even point. Do fewer than 295 operations per byte and you finish the math long before the next bytes arrive — you are memory-bound, idling on the left of the ridge. Do more and the bytes arrive faster than you can crunch them — you are compute-bound, on the right. The roofline plots achievable performance against intensity: a diagonal bandwidth ceiling that climbs until it meets the flat compute ceiling exactly at the ridge.
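The roofline described above can be sketched in a few lines of Python. This is a minimal illustration, not a profiler: the two peak figures are the H100 numbers quoted earlier, and the function names (`ridge`, `attainable_flops`, `verdict`) are made up for this example.

```python
# Peak figures quoted in the text for an H100 (assumed here as constants).
PEAK_FLOPS = 989e12        # compute ceiling, FLOP/s
PEAK_BANDWIDTH = 3.35e12   # memory ceiling, bytes/s


def ridge(peak_flops=PEAK_FLOPS, bandwidth=PEAK_BANDWIDTH):
    """Break-even arithmetic intensity in ops:byte."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=PEAK_BANDWIDTH):
    """Roofline: the lower of the diagonal bandwidth ceiling and the flat compute ceiling."""
    return min(peak_flops, bandwidth * intensity)


def verdict(intensity, peak_flops=PEAK_FLOPS, bandwidth=PEAK_BANDWIDTH):
    """Which ceiling binds a workload with the given intensity."""
    if intensity < ridge(peak_flops, bandwidth):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    print(f"ridge: {ridge():.1f} ops:byte")
    for intensity in (1, 10, 100, 1000):
        tflops = attainable_flops(intensity) / 1e12
        print(f"{intensity:>5} ops:byte -> {tflops:7.2f} TFLOP/s, {verdict(intensity)}")
```

Left of the ridge the result grows linearly with intensity (the diagonal); right of it the result is clamped at the peak (the flat roof).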

Where prefill and decode land

The two halves of inference sit on opposite sides. Prefill digests the whole prompt at once, so every weight it reads is reused across hundreds of token positions in parallel — a mountain of matmul per byte. It sits comfortably right of the ridge, compute-bound. Decode at batch 1 is the opposite extreme: to produce one token it streams the entire model's weights through the chip exactly once, doing almost no math per byte. It falls off the far-left cliff, deeply memory-bound — the same idle-waiting-on-memory the rest of this course keeps running into.

Drag the batch slider and watch the decode dot march. Batching N requests lets all N reuse each weight read, so intensity climbs roughly ×N and the dot slides right up the diagonal toward the ridge — flipping from memory-bound to compute-bound only once there's finally enough math per byte to keep the FLOPs busy:

[Interactive figure: "The roofline, putting a number on 'memory-bound'". A slider sets the decode batch size (tokens computed per one weight read); readouts show the current intensity in ops:byte and a verdict. At batch size 1 the readout is 1.2 ops:byte, memory-bound. Each token in the batch reuses the same weight read, so intensity climbs ≈ ×batch and the decode dot marches right toward the ridge.]

Worked example: a batch-1 decode step does about 2 FLOPs of math per parameter (one multiply-add) but must read every parameter from memory (2 bytes in bf16) just to emit one token: ~2 FLOPs ÷ 2 bytes ≈ 1 op:byte, in line with the roughly 1.2 ops:byte the slider readout shows at batch 1, and ≪ 295, pinned to the memory cliff. Both the FLOPs and the bytes scale with the parameter count, so the ratio barely depends on model size, which is why every batch-1 LLM decode is memory-bound.
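To make the worked example concrete, here is a small sketch that takes the 2 FLOPs and 2 bytes per parameter quoted above at face value (a ratio of 1 op:byte) and turns them into lower-bound times for one decode step. The parameter counts are hypothetical round numbers, `decode_step` is an illustrative name, and the model ignores KV-cache and activation traffic.

```python
# H100 figures quoted in the text (assumed here as constants).
H100_PEAK_FLOPS = 989e12   # FLOP/s
H100_BANDWIDTH = 3.35e12   # bytes/s


def decode_step(num_params, flops_per_param=2.0, bytes_per_param=2.0):
    """Idealized batch-1 decode step: every parameter is read once and used once."""
    flops = num_params * flops_per_param
    bytes_moved = num_params * bytes_per_param
    return {
        "intensity": flops / bytes_moved,                # ops:byte
        "memory_time_s": bytes_moved / H100_BANDWIDTH,   # time to stream the weights
        "compute_time_s": flops / H100_PEAK_FLOPS,       # time to do the math at peak
    }


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to show the ratio does not move.
    for label, num_params in (("7e9 params", 7e9), ("70e9 params", 70e9)):
        step = decode_step(num_params)
        print(
            f"{label}: intensity {step['intensity']:.1f} ops:byte, "
            f"memory {step['memory_time_s'] * 1e3:.2f} ms, "
            f"compute {step['compute_time_s'] * 1e3:.3f} ms"
        )
```

For the hypothetical 7e9-parameter case, streaming the weights takes about 4.2 ms while the math would take about 0.014 ms at peak, so the step is bounded by memory by a factor equal to the ridge. Scaling the model up scales both times together and leaves the intensity unchanged.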

This in-browser demo is pinned at batch 1 on the far-left memory cliff — one user, plain autoregressive decode. A faster GPU's extra FLOPs can't help you here, because you're bandwidth-pinned, not compute-pinned. Production serving batches hundreds of users together to climb the diagonal up to the ridge — that's how a GPU's compute actually gets used.
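How many users is "hundreds"? Under the idealized model used in this section (each weight read is shared by every token in the batch, and nothing else is read from memory), the crossover batch size can be estimated directly. This is a rough sketch: the function names are illustrative, and real serving also reads per-request KV cache and activations, which lowers intensity and pushes the crossover higher.

```python
import math

# H100 figures quoted in the text (assumed here as constants).
PEAK_FLOPS = 989e12        # FLOP/s
PEAK_BANDWIDTH = 3.35e12   # bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # ~295 ops:byte


def batched_intensity(batch, flops_per_param=2.0, bytes_per_param=2.0):
    """Intensity when `batch` tokens share one read of each parameter."""
    return batch * flops_per_param / bytes_per_param


def min_batch_for_compute_bound(flops_per_param=2.0, bytes_per_param=2.0):
    """Smallest batch whose idealized intensity reaches the ridge."""
    base = batched_intensity(1, flops_per_param, bytes_per_param)
    return math.ceil(RIDGE / base)


if __name__ == "__main__":
    for batch in (1, 8, 64, 256, 512):
        intensity = batched_intensity(batch)
        side = "memory-bound" if intensity < RIDGE else "compute-bound"
        print(f"batch {batch:>3}: {intensity:6.1f} ops:byte -> {side}")
    print("crossover batch (idealized):", min_batch_for_compute_bound())
```

With 2 FLOPs and 2 bytes per parameter the idealized crossover sits just under 300 concurrent tokens, which matches the "hundreds of users" framing above.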

So “memory-bound” was never a fixed property of the model — it's a property of how you run it. One user waiting on one token sits on the cliff; a server batching hundreds of users climbs the diagonal until compute finally becomes the thing worth optimizing. That climb is the whole subject of the batching sub-chapter later on — the roofline is the map it reads.