Explain why the KV cache exists, how it grows with context, and what Qwen3.5's hybrid attention buys you in memory.
Without caching K and V, each decoded token would redo all of attention from scratch — making generation quadratic in context length.
Glossary · 7 terms
- KV cache
- Per-layer stored K and V tensors for every prompt and generated token, reused across decode steps so attention stays roughly linear.
- prefill
- First inference phase: the whole prompt is processed in one matmul-heavy pass that writes K and V for every prompt token.
- decode
- Per-token generation phase: only the new token's Q/K/V is computed, and Q attends over the cached K/V history.
- GatedDeltaNet
- Qwen3.5's linear-attention variant. Stores a fixed-size recurrent state per layer instead of a token-by-token KV cache.
- hybrid attention
- Mixing full softmax-attention layers and linear-attention layers in the same stack. Qwen3.5 uses 6 full + 18 linear.
- linear state
- Per-layer tensor of shape [16, 128, 128] (linear value-heads × value-dim × key-dim) that compresses the entire history; size is independent of sequence length.
- recurrent state (RNN-style)
- A fixed-size memory the model overwrites and rolls forward at every step — like a running total or moving average — instead of keeping every past item. Linear-attention layers use one so their memory stays constant no matter how long the context grows.
KV cache & hybrid attention
Every chapter so far has described what happens for one forward pass. But generation is a sequence of forward passes — one per output token — and naively each pass would redo all the attention work for every token in the history. That would make decode quadratic in the context length, which is obviously hopeless once you have a 32k-token document in the prompt. The fix is a piece of engineering, not a new architecture: cache the parts that don't change between steps. That cache is what dominates inference memory and shapes most of the architectural decisions you see in modern LLMs.
Prefill versus decode
Inference has two phases. Prefill takes the whole prompt at once and computes K, V and the hidden states for every token in one big matmul-heavy pass. Decode then generates one token at a time: feed the previous output back in, compute K/V/Q for just that single new token, and read the K/V of all earlier tokens out of a cache instead of recomputing them. Without the cache, each decode step would be as expensive as prefill — with it, decode cost is roughly linear in the cache size.
Why caching K and V works
At decode step t, every layer needs the attention output for the new token: softmax(q_t · K^T / √d) · V — the same attention formula from Chapter 4. The Q is brand new (it comes from the just-computed hidden state). The K and V for the new token are also new. But the K and V for tokens 0..t-1 were already computed in earlier steps — and they don't depend on token t. So we save them once and read them back forever.
Shape-wise, the cache stores, per layer, two tensors of shape [num_kv_heads, seq_len, head_dim]. As seq_len grows, the cache grows linearly. That's the catch: cache memory is proportional to context length, and for long contexts it dwarfs the weights.
Prefill processes all 5 prompt tokens together and writes their K/V into the cache.
The KV cache is why decode stays affordable. With it (green), each step computes Q/K/V for just the single new token instead of re-projecting the whole prefix — so the positions it (re)processes stay flat at 1, and across n tokens that is linear rather than quadratic. The new token still attends over every cached key, so that attention read does grow with context — the cache removes the prefix re-projection, not the growing attention. Without it (red), every step re-runs the entire prefix from scratch, so the positions processed climb each step and the total work is quadratic in the sequence length. The running totals above make the gap concrete: after the three decode steps here, cache ON has processed 5 + 1 + 1 + 1 = 8 positions, cache OFF 5 + 6 + 7 + 8 = 26 — and the gap widens with every extra token.
Scripted to show the loop's shape and the cache-vs-recompute cost — not live output from the model.
The arithmetic for Qwen3.5-0.8B
Per full-attention layer, per token, the cache costs 2 · num_kv_heads · head_dim · 2 bytes = 2 · 2 · 256 · 2 = 2048 bytes. At a 32k-token context, one full-attention layer holds 64.0 MB. If every one of the 24 layers were full-attention, the whole-model cache would be 1.50 GB — for the smallest Qwen3.5 variant. Bigger Qwen3.5 sizes scale from there.
This is why KV cache memory, not weights, is the headline number for modern serving. It's also why a generation of architectural decisions — multi-query attention, grouped-query attention, sliding windows, and linear-attention variants — have all been driven by cutting the cache without giving up too much quality.
Qwen3.5's answer: hybrid attention
Qwen3.5 interleaves two different kinds of attention layer. Six of its 24 layers (one in every four, at indices 3, 7, 11, 15, 19, 23) are conventional full-attention layers — the ones from Chapter 4, with the linear-in-seq_len KV cache. The other 18 layers use a linear attention variant called GatedDeltaNet — "linear" because its cost grows in proportion to the sequence length rather than with its square, the way full softmax attention does. Instead of caching K and V per token, a GatedDeltaNet layer compresses the entire history into a fixed-size recurrent state of shape [16, 128, 128] (linear value-heads × value-dim × key-dim — distinct from the 8 full-attention heads of head-dim 256). The state is rolled forward at each step the way older recurrent networks (RNNs) kept a running summary of everything seen so far — independent of how many tokens have streamed by.
With 16 linear heads at head_dim = 128, each linear layer's state is a constant 512.0 KB regardless of context. So total cache for Qwen3.5-0.8B is 6 × full + 18 × constant — the chart on the right plots that against a hypothetical Qwen with full attention on every layer.
This is also what makes the model's full 262.1k-token window affordable. Imagine pasting a 200-page novel — call it 262.1k tokens — into the prompt. If all 24 layers kept a full token-by-token KV cache the way the 6 attention layers do, the cache would track the amber GQA-every-layer curve — on the order of ~4× the hybrid’s few-gigabyte footprint at that length (the cache still grows linearly with context either way; the hybrid just keeps the slope down). Instead, the 18 linear layers each hold a flat 512.0 KB at any length, so the window can grow roughly 8× past the old 32.8k mark without the cache exploding. Only the 6 full-attention layers still pay the per-token price.
Here is that contrast at the token level: feed the same stream through both kinds of memory and watch one append a box per token while the other updates a single slot in place.
Tokens fed: 0 of 7. Full attention memory: 0 stored entries (grows one per token). Linear running state: 1 fixed slot (constant), current value S = 0.00.
That single running slot is a recurrent neural network's hidden state: a fixed-size memory rolled forward each step (S ← decay·S + u), like a running total or moving average. Full attention instead keeps every past token, so its memory grows without bound. That is the memory-vs-recall trade-off — a high decay remembers further back but still blurs the past together, so it can never pull out one exact old token the way the full cache can.
Illustrative — a scalar cartoon. GatedDeltaNet's real state is a matrix per head — shape [16, 128, 128] (value-heads × value-dim × key-dim) for the 0.8B model — and its real update is a gated delta rule, S ← g·S + β·(v − (g·S)·k)·kᵀ: it decays the old state by g and writes a β-weighted correction toward the new value, not a plain outer product. But the fixed-size-vs-growing contrast is exactly right.
The trade-off
Linear attention is approximate. It cannot perfectly recall an arbitrary single token from way back in the history the way full softmax attention can — its compressed state mixes everything together. What it does well is the bulk of language processing: pattern continuation, syntactic flow, local style. Interleaving with the six full-attention layers brings the exact-recall capability back where it's needed. Concretely: linear layers run the long- context flow, full layers handle "look back to that one specific token" jobs.
You have probably felt the human side of this: the model loses the thread in a long chat — you mention a name early on, and forty turns later it has slipped. Two different things can cause that. First, you may have scrolled past the 262.1k-token ceiling, in which case the early turn has fallen out of the window entirely and is simply gone. Second — the more interesting case — the turn is still in-window, but the detail only ever lived in the 18 linear layers' blurred recurrent state, exactly the smeared distribution the panel below makes visible, with the 6 full-attention layers as the occasional safety net that can still pin it down. Neither failure is absolute: a name repeated often enough survives the blur, and even past the ceiling a summary the model wrote earlier can carry the fact forward.
Softmax attention scores the needle against every cached key directly, so it can put almost all the mass on the one exact token it was asked to recall.
A fixed-size recurrent state compresses the whole history into one constant-size tensor; here that mixing leaves the needle only barely ahead of look-alike distractors, rather than pinned down.
Full-attention layers keep direct, per-token access to every cached key, so they are well suited to pulling out one exact token on demand. A fixed-state recurrent layer (GatedDeltaNet) compresses the whole history into a constant-size tensor, so under recall pressure it is less reliable at pinning down an arbitrary token — as in this illustrative case, where the needle barely leads its look-alikes. Neither is absolute (attention can still mis-attend, and a recurrent state isn't always blurred), but that reliability gap is why Qwen3.5 keeps 6 full-attention layers for the "find that one specific token" jobs its 18-layer linear majority handles less precisely.
Illustrative distributions — hand-picked to show the exact-vs-blurred shape, not live output from the model.
Why this is the future
As context lengths push past 1M tokens, quadratic-memory attention becomes untenable — both as cache and as compute. Hybrid attention is one of the techniques the field is exploring. Pure state-space models like Mamba — models that replace attention entirely with a rolling state like the GDN slot above — take it further still. The trend is clear: cache is the bottleneck, and architectures will keep getting reshaped around it.
People sometimes picture it as a hashmap from token text or position to (k, v) tuples. It isn't. The cache is a pre-allocated tensor of shape [layers, 2, batch, kv_heads, max_seq, head_dim] (the 2 is the key/value pair — one slab for K, one for V) that each attention layer writes into and reads from, with a position counter telling it how much of the tensor is live. The "key" is the position index, not a hash. It's a buffer, not a hashmap.
Prefill vs decode, in time
Memory is one axis; wall-clock time is the other. The chart below contrasts prefill (one parallel matmul over the whole prompt) with decode (many small sequential matmuls, one per generated token).
Why is the per-token gap so huge — roughly 0.7 ms against 200 ms, about 293×? A decode step has to read every weight in the model out of memory and gets just one token for the trip: it's memory-bound, with nothing to parallelize over. Prefill reads the same weights once and applies them to all 1,024 prompt tokens at the same time, so the cost of the trip is split 1,024 ways.
The hybrid layout adds one more wrinkle here. The linear layers' recurrent state must update strictly one token at a time, in order — step t's state can't be computed without step t-1's — while a full-attention layer has no such chain and could score every position at once. So the constant memory comes partly at the cost of some of attention's decode-time parallelism. (This is a decode-only concern: at training time linear attention has parallel-scan forms that recover the lost parallelism.)
Prefill is one parallel matmul; decode is N small sequential matmuls. That's why your first token comes fast and the rest stream. Per token: ~0.7 ms in prefill vs ~200 ms in decode — roughly 293× apart.
How the cache grows with context
And here, plotted in isolation, is the cache-vs-context curve for each layout. The formula is unpacked underneath: every factor in 2 · L · kv_heads · d · seq · bf16 is a knob someone is working on right now.
bytes_per_full_layer = 2 · kv_heads · d · seq · bf16
= 2 · 2 · 256 · seq · 2
bytes_per_linear_layer = constant (512.0 KB)
total = 6 · bytes_per_full_layer + 18 · bytes_per_linear_layerhead_dim.The right-hand widget plots the cache curves for MHA, GQA-only, and Qwen3.5's hybrid layout; drag the context slider to read off memory at any length. The layer strip below it shows the exact interleave — click any layer to see what its cache looks like at the current context length.
- KV cache memory scales linearly with context, and at long context it dwarfs the model weights themselves.
- Qwen3.5's 6 full + 18 linear layout collapses the cache curve because linear layers contribute a constant, not a per-token chunk.
- Full-attention layers handle 'recall this specific token' jobs; linear layers handle the bulk of language flow under a tight memory budget.
Drag the context-length slider from 1k to 131k while watching the three KV cache curves. At 131k tokens, roughly how many times smaller is the hybrid 'actual' curve than the MHA hypothetical curve, and which curve overlaps the hybrid line at short contexts?