Latency, throughput & batching

Go deeper · Chapter 12, KV cache — how “fast” is actually measured, and how a server turns one memory-bound decode into throughput for many.

The KV cache is why decode is fast: each new token reuses the keys and values already computed for earlier tokens instead of redoing the whole prompt. But “fast” is two different numbers depending on what you mean, and a real server has to serve not one user but hundreds at once. This sub-chapter is the serving layer: first how speed is measured, then how batching turns the memory-bound decode you watch in the demo into throughput.

Two clocks — and why the average lies

When output is streamed token-by-token, two clocks matter and they measure different things. TTFT (time to first token) is the pause before the first word appears; it is set by the compute-bound prefill digesting your whole prompt in one pass. TPOT (time per output token, also called ITL, inter-token latency) is the steady streaming speed after that, set by the memory-bound decode. The two trade off against each other and are tuned separately. TPOT converts directly to a per-user speed: an ITL of 10 ms is 1000 / 10 = 100 tokens/sec for that one user. (For a non-streamed call, such as an agent making a tool call and waiting for the whole JSON, neither clock is what you feel; you measure total response time instead.)
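As a rough sketch of how the two clocks could be computed from a streamed response, assume you have recorded the time the request was sent and the arrival time of each token. The function name `stream_metrics` and the timeline below are hypothetical, chosen only to reproduce the 10 ms example.

```python
def stream_metrics(request_sent_s, token_times_s):
    """Return (ttft_ms, tpot_ms, tokens_per_sec) for one streamed response."""
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to measure inter-token latency")
    # TTFT: gap between sending the request and seeing the first token.
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000.0
    # TPOT / ITL: mean gap between consecutive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    tpot_ms = sum(gaps) / len(gaps) * 1000.0
    # Per-user streaming speed is simply the inverse of TPOT.
    return ttft_ms, tpot_ms, 1000.0 / tpot_ms


# Hypothetical timeline: request at t = 0 s, first token at 0.25 s,
# then one token every 10 ms.
times = [0.25 + 0.010 * i for i in range(50)]
ttft, tpot, tps = stream_metrics(0.0, times)
print(f"TTFT {ttft:.0f} ms, TPOT {tpot:.1f} ms, {tps:.0f} tok/s")
```

Note that TTFT is deliberately excluded from the TPOT average: mixing the prefill pause into the inter-token gaps would blur the two clocks the text keeps separate.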

The trap is reporting either one as an average. Inference latency is right-skewed: most requests cluster near the typical case, with a long slow tail of unlucky ones. The mean is dragged rightward by that tail, so it describes neither the typical request nor the slow ones, and engineers report percentiles instead. P50 is the median (half of requests are slower), P90 is the latency that one request in ten exceeds, and P99 is the latency that one in a hundred exceeds. Reliability work targets P90/P99, because the tail is what users actually remember. Watch both clocks, and watch the average lie:

The two clocks of inference speed

Streamed output — two metrics

ITL 10 ms = 100 tokens/sec per user

Two clocks: TTFT is the pause before the first word (the compute-bound prefill); after that the words stream at a steady ITL / TPOT (the memory-bound decode). TPS is ambiguous — “perceived TPS” is a per-user feel, “total TPS” is the whole service's throughput. For a non-streamed call (an agent tool call) you measure total response time instead.

Latency is right-skewed — the average hides the tail

P50 = 1 in 2 slower · P90 = 1 in 10 · P95 = 1 in 20 · P99 = 1 in 100

Latency is right-skewed: most requests cluster near the mode, with a long slow tail. So the mean sits to the right of the median — the average is dragged up by the tail and hides it. Engineers report percentiles: P50 (1 in 2 slower), P90 (1 in 10), P95 (1 in 20), P99 (1 in 100). Reliability work targets P90 / P99.
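To see the mean drift away from the median, here is a small sketch on synthetic data. The log-normal shape, the 200 ms median, and the helper `percentile` (a nearest-rank definition) are assumptions for illustration, not measurements of any real service.

```python
import math
import random
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic right-skewed latencies in ms (assumed log-normal, median near 200 ms).
latencies = sorted(rng.lognormvariate(math.log(200.0), 0.6) for _ in range(10_000))

mean_ms = statistics.fmean(latencies)
p50, p90, p99 = (percentile(latencies, p) for p in (50, 90, 99))
print(f"mean {mean_ms:.0f} ms | P50 {p50:.0f} | P90 {p90:.0f} | P99 {p99:.0f}")
```

On a sample like this the mean lands above P50 but far below P99, which is why a single average says little about either the typical request or the tail.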

Inference-time vs end-to-end

This in-browser demo measures the rawest TTFT and TPOT: no network, no queue, batch 1, one user. Production reports P95 / P99 end-to-end (on-GPU time plus network plus queue), because under load the slow tail is what users remember. When inference is fast but end-to-end is slow, fix the infrastructure, not the model. Your demo is the floor; production lives in the tail.

Illustrative round numbers — teaching the shapes, not this model’s measured latency.

Batching — one weight-read, many tokens

Now the lever. Recall the roofline: a single decode step streams all of the model's weights from memory exactly once, and at batch 1 that entire sweep produces a single token, the far-left memory cliff. The fix is almost embarrassingly direct. Put N requests on that same weight-read and the GPU emits N tokens for one memory sweep. Total throughput climbs with the batch while the weight traffic stays flat (only each request's own KV-cache reads grow with N). Batching is the cure for the memory wall, and the way a decode workload climbs the roofline toward the ridge.
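The argument above can be written as a back-of-the-envelope roofline. This is a simplified sketch, not a performance model: the symbols are assumptions introduced here, and it ignores KV-cache reads and attention cost, which both grow with the batch.

```latex
% Assumed symbols (not from the text):
%   W     = total weight bytes read per decode step
%   \beta = memory bandwidth (bytes/s)
%   F     = FLOPs needed per generated token
%   \pi   = peak compute (FLOP/s)
%   N     = batch size (requests sharing one weight sweep)
\begin{align*}
t_{\text{step}}(N) &\approx \max\!\left(\frac{W}{\beta},\ \frac{N F}{\pi}\right) \\[4pt]
\text{throughput}(N) &= \frac{N}{t_{\text{step}}(N)}
  = \begin{cases}
      N\,\beta / W & N \le N^{*} \quad \text{(memory-bound: grows linearly with } N\text{)} \\
      \pi / F      & N > N^{*} \quad \text{(compute-bound: flat)}
    \end{cases} \\[4pt]
N^{*} &= \frac{W\,\pi}{\beta\,F}
\end{align*}
```

Below the ridge batch N* the step time is fixed by the weight sweep, so every extra request is nearly free; above it, compute sets the step time and throughput stops growing.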

Servers differ in when they launch a batch. Static batching waits until the batch is full, so the first arrival pays for the last. Dynamic batching launches when it's full or a timeout fires, whichever comes first. Continuous (a.k.a. in-flight) batching — what vLLM and SGLang do — swaps new requests into free slots at the token level, so each one starts almost immediately and the queue stays tiny. Toggle the three and watch the request lanes re-time:

Batching — three schedules, one trade-off

Schedule: continuous (in-flight) · vLLM / SGLang

Continuous batching is also called in-flight batching (TensorRT-LLM's term). New requests slot into free seats at the token level, so each one starts almost immediately and the queue stays tiny. This is what vLLM and SGLang do.

Legend: queue · running

Mean wait + run: 30 ms
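The three schedules can be compared with a toy discrete-time simulation. This is a sketch under stated assumptions, not how vLLM or SGLang are implemented: every decode step costs one tick regardless of how many slots are occupied, the names `Request` and `simulate` and the workload are hypothetical, static batching flushes a partial batch once no more arrivals are expected, and only queue wait is measured (not wait plus run time).

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int     # tick at which the request arrives
    steps: int       # decode steps (output tokens) it needs
    start: int = -1  # tick at which it was admitted to a batch


def simulate(requests, policy, slots=4, timeout=2):
    """Mean queue wait in ticks under 'static', 'dynamic' or 'continuous' batching."""
    reqs = [Request(r.arrival, r.steps) for r in requests]  # fresh copies per run
    pending = sorted(reqs, key=lambda r: r.arrival)
    queue, running = [], []  # running holds [request, steps_left] pairs
    tick = 0
    while pending or queue or running:
        while pending and pending[0].arrival <= tick:
            queue.append(pending.pop(0))
        if policy == "continuous":
            admit = slots - len(running)   # refill free slots at every step
        elif running or not queue:
            admit = 0                      # batch still in flight, or nothing queued
        else:
            full = len(queue) >= slots
            timed_out = policy == "dynamic" and tick - queue[0].arrival >= timeout
            drained = not pending          # assumption: flush once arrivals stop
            admit = slots if (full or timed_out or drained) else 0
        for req in queue[:admit]:
            req.start = tick
            running.append([req, req.steps])
        queue = queue[admit:]
        for pair in running:               # one decode step: one token per request
            pair[1] -= 1
        running = [pair for pair in running if pair[1] > 0]
        tick += 1
    return sum(r.start - r.arrival for r in reqs) / len(reqs)


# Hypothetical light workload: one arrival every 3 ticks, mixed output lengths.
workload = [Request(3 * i, s) for i, s in enumerate([4, 2, 5, 3, 4, 2, 5, 3])]
waits = {p: simulate(workload, p) for p in ("static", "dynamic", "continuous")}
for policy, wait in waits.items():
    print(f"{policy:>10}: mean queue wait {wait:.2f} ticks")
```

On this light workload the ordering matches the description above: static waits longest because early arrivals sit until the batch fills, dynamic is capped by the timeout, and continuous admits each request on arrival. Under heavier load a short timeout can launch many small batches and queue behind them, so the ordering between static and dynamic is workload-dependent.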

Why batching cures the memory wall

One decode step streams all the model's weights from memory once. Put N requests on that same weight-read and the GPU emits N tokens for one memory sweep: total throughput climbs while the weight traffic stays flat. That is the cure for the memory-bound decode covered earlier in this course.

The knob: latency ↔ throughput

batch size: 1

total throughput: 56 tok/s (↑ rises, then saturates)

per-user latency: 18 ms / token (↑ rises with the batch)

🔒 you are here: batch = 1 · single user · zero batching

Your in-tab copy vs a shared API

This in-browser demo runs as a single user at batch = 1 — the worst case for the memory-bound decode, because that one expensive weight-read carries exactly one token. A shared production API weaves dozens of strangers into one continuous batch, amortizing the same weight-read across all of them. That is why a hosted model feels fast under load while your private copy stays bandwidth-bottlenecked. Same model, opposite economics.

Illustrative timings on a synthetic millisecond axis and a toy throughput/latency model — they teach the shape of each schedule, not a real benchmark.

The catch is that batching is a knob, not a free win. Slide the batch up and two readouts move in opposite directions: total throughput rises and then saturates (you leave the memory-bound regime and hit a compute ceiling), while each individual user's latency rises (more lane-mates to share every step with). Picking a batch size is choosing a point on that latency-versus-throughput curve — fast for one user, or cheap per token across many.

This is exactly why the in-browser tour feels different from a hosted API: you are a single user at batch 1, the worst case for a memory-bound decode, because that one expensive weight-read carries exactly one token. A shared production server weaves dozens of strangers into one continuous batch and amortizes the same read across all of them. Same model, opposite economics.
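The latency-versus-throughput knob can be sketched with a toy cost model. Everything here is an assumption for illustration: a decode step is modeled as a fixed weight-sweep cost plus a small per-request cost, and the two constants are picked only so that batch 1 lands near the illustrative 18 ms / token and 56 tok/s readout. It is not a benchmark of any model or GPU.

```python
def decode_step_ms(batch, weight_read_ms=17.5, per_request_ms=0.5):
    """Toy cost of one decode step: a shared weight sweep plus a per-request cost."""
    return weight_read_ms + batch * per_request_ms


def sweep(batches):
    """Return (batch, total tokens/sec, per-user ms/token) for each batch size."""
    rows = []
    for b in batches:
        step_ms = decode_step_ms(b)            # every user waits one full step per token
        total_tps = 1000.0 * b / step_ms       # the step emits b tokens in step_ms
        rows.append((b, total_tps, step_ms))
    return rows


for batch, total_tps, latency_ms in sweep([1, 2, 4, 8, 16, 32, 64, 128]):
    print(f"batch {batch:>3}: {total_tps:7.0f} tok/s total, {latency_ms:5.1f} ms/token per user")
```

In this model total throughput approaches a ceiling of 1000 / per_request_ms tokens/sec as the shared sweep is amortized away, while per-user latency keeps growing linearly with the batch, which is the trade-off the batch-size knob exposes.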