Prefix caching

Go deeper · Chapter 12, KV cache — reuse one prompt's KV across requests, not just within a single generation.

The KV cache you met in this chapter is reused within one generation: each token of a single reply reuses the keys and values of the tokens before it. Prefix caching takes the same idea one level up — reuse that KV across separate requests that happen to begin with the same tokens.

The mechanism is exact. Two requests that share a leading prefix share the keys and values for those tokens, so the prefix is computed once and every later request skips prefill over it. That is a cache hit: time to first token (TTFT) shrinks to roughly the cost of prefilling only the new tokens. But the reuse ends at the first token that differs. The model is autoregressive, so one different token changes the hidden state, and therefore the KV, for everything after it, even text that looks identical further down. Past the first miss, the cache is worthless:

[Interactive figure: Prefix cache, reusing one prompt’s KV across requests. Legend: Match = KV reused (prefill skipped); Miss = first differing token; Discarded = recomputed. Scenario shown: reuse 2 / 4 tokens. Tokens are illustrative of the concept, not this model’s real tokenization.]

The two requests share the leading prefix "Weather in". Its KV cache is computed once, so the second request skips prefill on those tokens, a cache hit that cuts its TTFT. The shared prefix ends at the first difference (SF vs. NYC); because the model is autoregressive, every token after the miss has different KV too and is recomputed.
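The first-difference rule can be sketched in a few lines of Python. This is a minimal illustration, not a serving implementation: the function name `shared_prefix_len` is made up for this example, the tokens are word-level stand-ins rather than real tokenizer output, and the fourth token (`today`) is an assumption added to show identical-looking text after the miss.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[str], incoming: Sequence[str]) -> int:
    """Count the leading tokens on which two requests agree."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break  # first miss: nothing after this point can be reused
        n += 1
    return n


# Illustrative word-level tokens (hypothetical, not a real tokenization).
first = ["Weather", "in", "SF", "today"]
second = ["Weather", "in", "NYC", "today"]

hit = shared_prefix_len(first, second)
reused = second[:hit]      # KV taken from the cache, prefill skipped
recomputed = second[hit:]  # the miss and everything after it

print(f"reuse: {hit} / {len(second)} tokens")
print("reused:", reused)
print("recomputed:", recomputed)
```

Note that `today` matches in both requests but still lands in `recomputed`: it sits after the miss, so its KV depends on a different preceding token.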

That first-difference rule is also a design lesson: put novel tokens as late as you can. A long fixed system prompt followed by the user's question reuses the whole system prompt every time; the same content with a per-request id pasted at the front reuses nothing. The big wins are exactly the high-reuse shapes — long system prompts for agents and chatbots, RAG with shared retrieved documents, shared code context, and multi-turn chat (which re-sends the whole conversation so far on every turn).
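To make the ordering lesson concrete, here is a small sketch that compares two layouts of the same content. Everything in it is assumed for illustration: whitespace splitting stands in for a tokenizer, and the system prompt, request IDs, questions, and helper names (`id_first`, `id_last`) are hypothetical.

```python
def shared_prefix_len(a, b):
    """Count the leading tokens on which two token lists agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Whitespace splitting is a stand-in for a real tokenizer (assumption).
SYSTEM = "You are a helpful assistant . Answer briefly .".split()


def id_first(request_id, question):
    # Per-request ID pasted at the front: the very first token differs.
    return [request_id] + SYSTEM + question.split()


def id_last(request_id, question):
    # Fixed system prompt first, novel tokens as late as possible.
    return SYSTEM + question.split() + [request_id]


q1, q2 = "What is a KV cache ?", "Why is prefill slow ?"

front = shared_prefix_len(id_first("req-001", q1), id_first("req-002", q2))
back = shared_prefix_len(id_last("req-001", q1), id_last("req-002", q2))

print("ID first, reusable tokens:", front)
print("ID last, reusable tokens:", back, "of", len(SYSTEM), "system tokens")
```

With the ID in front, the two requests diverge at token zero and share nothing; with the ID at the end, the whole system prompt is reusable.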

And the honest footnote for this demo: as a single user at batch size 1, you have nothing to share a prefix with, so every request re-prefills the whole prompt cold. A production API reuses one fixed system prompt across millions of requests, which is why pay-per-token APIs often bill cache-hit tokens at a lower rate and feel near-instant on the boilerplate, only “thinking” on the tokens that are actually new.
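The cold-versus-warm contrast can be sketched with a toy cache that only counts tokens. This is an assumption-laden illustration: the `PrefixCache` class is hypothetical, it stores a trie of token sequences instead of real KV tensors, and it never evicts anything, whereas a real server has to bound its memory. The 1000-token system prompt is a made-up size.

```python
class PrefixCache:
    """Toy prefix cache: a trie over token sequences, no real KV tensors."""

    def __init__(self):
        self.root = {}

    def prefill(self, tokens):
        """Return (hit, computed): tokens served from cache vs. prefilled now."""
        node, hit = self.root, 0
        for tok in tokens:
            if tok not in node:
                break  # first miss ends the reusable prefix
            node = node[tok]
            hit += 1
        # Everything from the first miss onward is computed and stored.
        for tok in tokens[hit:]:
            node[tok] = {}
            node = node[tok]
        return hit, len(tokens) - hit


# Hypothetical 1000-token system prompt shared by every request.
SYSTEM = [f"s{i}" for i in range(1000)]
requests = [
    SYSTEM + ["q1", "a"],
    SYSTEM + ["q2", "b"],
    SYSTEM + ["q3"],
]

cache = PrefixCache()
results = [cache.prefill(r) for r in requests]
for i, (hit, computed) in enumerate(results, start=1):
    print(f"request {i}: cache hit {hit} tokens, prefilled {computed}")
```

The first request is the single-user case: the cache is empty, so the whole prompt is prefilled cold. Every later request pays only for its new tokens.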