Quantization

Go deeper · Chapter 12, KV cache — store each weight in fewer bits, and the same memory-bound decode runs lighter and faster.

Everything in this chapter pointed at one bottleneck: decode is memory-bound. Every token re-reads the entire model from memory, and at batch 1 that whole sweep buys a single token. The roofline put a number on it — you are pinned to the far-left memory cliff. Quantization is the most direct lever on that cliff, and it leaves the architecture untouched: same shapes, same forward pass — it just stores each weight in fewer bits, an approximate copy on a coarser grid. Fewer bytes per weight means less to stream, and the memory-bound decode runs lighter.

Fewer bits, smaller model, faster decode

A weight in bf16 — what this demo ships — takes 2 bytes. Drop to fp8 or int8 and it's 1 byte; int4 is half a byte. The footprint scales straight down with the bits, and because decode is memory-bound, so does the time to stream those weights. But it is not a clean 2× per level: overhead — dequantizing back to a compute format, imperfect bandwidth use — means the book quotes a more honest 30–50% speedup per precision level down. Step through the formats and watch all three bars — size, speed, quality — move together:

Quantization — fewer bits per weight, smaller & faster

precision:

🔒 you are here · the demo ships bf16

bf16 — ~lossless — it is the training precision

book fragility order (least → most): weights < activations < KV cache < attention

Mechanism — quantization snaps continuous weights onto a discrete grid through a scale factor (plus an optional zero-point): real ≈ scale × (q − zero_point). Weights use a symmetric grid, activations an asymmetric one; granularity can be per-tensor, per-channel, or per-block. Weight-only methods GPTQ (Hessian-based) and AWQ (named for activations, but it only picks the salient weight channels to keep precise) quantize the weights alone; SmoothQuant and LLM.int8() are W8A8 (weights and activations) — SmoothQuant migrates the hard dynamic range from activations into weights, LLM.int8() keeps the rare outlier dimensions in fp16.

Honesty for this demo: the in-browser model ships in bf16 — no quantization at all. Datacenter serving commonly runs fp8 / int8, while more aggressive integer and sub-4-bit quant (int4 and below) is mostly a local-inference practice (GGUF / Ollama / llama.cpp). So bf16 here is a defensible, honestly-labelled choice — but a quantized copy of this same model would be ≈2× (int8) to ≈4× (int4) smaller and a meaningful chunk faster to decode.

How it works — and what it costs

The trick is a scale factor: a block of weights is mapped onto a small grid of integer levels, and one shared float rescales them back at compute time (real ≈ scale × (q − zero_point) — where zero_point just shifts the grid so real 0 still lands on an exact code; more on this in the number-formats sub-chapter). Fewer bits means a coarser grid, so the cost is quality. The book ranks what tolerates it, least to most fragile: weights < activations < KV cache < attention. Weight-only methods like GPTQ and AWQ quantize just the weights and are often near-lossless down to int4; pushing to weights-and-activations (W8A8, e.g. SmoothQuant, LLM.int8()) is harder because activations carry wilder outliers. The rule of thumb: fp8 / MXFP8 is the sweet spot — almost no perceptible loss — while integer formats lack dynamic range, so for quality-sensitive work you stick with floating point and treat sub-4-bit as the cliff.

The KV cache can be quantized too

This isn't only about weights. The bf16 KV cache you met in this chapter is itself a stream of bytes the decode reads back every step — so it's a quantization target as well. It's moderately sensitive (the book's order again): its errors compound token to token, since each cached key/value is reused for the entire rest of the sequence. fp8 / MXFP8 is the safe choice there; aggressive integer KV quant is where quality starts to slip.

Honest footnote for this demo: the in-browser model runs in bf16 with no quantization at all — the simplest, highest-fidelity choice. The book notes integer / sub-4-bit quant is mostly a local-inference practice (the GGUF files you'd download for Ollama or llama.cpp), distinct from datacenter serving. A quantized copy of this same model would be ≈2× (int8) to ≈4× (int4) smaller and a real chunk faster to decode — the lever this whole chapter has been pointing at.