Number formats & precision

Go deeper · Chapter 12, KV cache — open the hood on quantization: how a weight is stored as bits, what every format can and cannot represent, and how a real number gets snapped onto a coarse grid.

The previous sub-chapter used quantization as a lever: fewer bits per weight, less to stream, a faster memory-bound decode. This one opens the hood. What is a weight, down at the bits? Why does bf16 hold numbers as large as 10³⁸ while fp16 overflows just above 65,504? And what exactly happens when you “store a weight in fewer bits”? It comes down to two ideas — how a float splits its bits, and how a grid stands in for a continuous number.

A weight is just bits

A floating-point number is sign × mantissa × 2^exponent, and the bits split into those three fields. The split is the whole game: the exponent field buys dynamic range (how big and how small a value can get), and the mantissa field buys precision (how finely values are spaced). An integer has no exponent at all — it is a uniform grid, every step the same size, with no dynamic range of its own. That one contrast is the foundation of everything below:

Number line — integers are a uniform grid, floats are log-spaced
[Figure: four number lines, each running from 0 to that format's own maximum (axis: value / max). int8: uniform, 256 levels. int4: uniform, 16 levels. fp8: log-spaced, max 448. fp16: log-spaced, max 65504.]

Integers spread their levels evenly — the same step everywhere, and with no exponent there is no dynamic range. Floats spend bits on an exponent, so their levels cluster near zero (fine precision where the values actually are) and spread out toward the maximum (coarse, but a huge range). Same idea, opposite grid.

int8 = 256 levels · int4 = 16 levels · fp8 E4M3 max 448 · fp16 max 65504

Integers spread their levels evenly across the range. Floats cluster them near zero — where most weights actually live — and let them spread out toward the maximum. Same count of levels, opposite layout, and only the float carries any range. Spend a bit on the exponent and you can reach 10³⁸; spend it on the mantissa and you resolve a finer difference. You cannot have both for free.
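To make the sign / exponent / mantissa split concrete, here is a minimal Python sketch that pulls the three fields out of an fp16 value and rebuilds the number from them. It leans on the standard library's struct module, whose "e" format code is IEEE half precision. The helper names fp16_fields and fp16_value are made up for this illustration, and inf/NaN codes are deliberately ignored.

```python
import struct

FP16_MAN_BITS, FP16_BIAS = 10, 15

def fp16_fields(x):
    """Return (sign, exponent_field, mantissa_field) of x rounded to fp16."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15
    exponent = (bits >> FP16_MAN_BITS) & 0x1F
    mantissa = bits & 0x3FF
    return sign, exponent, mantissa

def fp16_value(sign, exponent, mantissa):
    """Rebuild the real value from the three fields (inf/NaN not handled)."""
    frac = mantissa / (1 << FP16_MAN_BITS)
    if exponent == 0:
        # Subnormal: the implicit leading 1 becomes 0.
        magnitude = frac * 2.0 ** (1 - FP16_BIAS)
    else:
        # Normal: implicit leading 1, exponent stored with a bias of 15.
        magnitude = (1 + frac) * 2.0 ** (exponent - FP16_BIAS)
    return -magnitude if sign else magnitude

for x in (1.0, -2.5, 0.1):
    s, e, m = fp16_fields(x)
    print(f"{x:>5}: sign={s} exponent={e:2d} mantissa={m:4d} -> {fp16_value(s, e, m)}")
```

The value 0.1 does not come back exactly: it lands on the nearest fp16 tick, which is the "snapping onto a grid" that the rest of this page is about.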

The same 16 bits, two ways

The cleanest illustration is the two 16-bit formats. They have the same bit budget and split it oppositely — and that single choice decides what each is good for:

The 16-bit fork — same bits, opposite trade
[Figure: bit layout of the two 16-bit formats. The demo ships bf16.]
bf16 · sign 1 · exponent 8 · mantissa 7 · exponent bias 127 · spends its bits on range (8-bit exponent) · range ≈ 1e38 (max ≈ 3.39e38) · ~2–3 decimal digits · matches fp32 range
fp16 · sign 1 · exponent 5 · mantissa 10 · exponent bias 15 · spends its bits on precision (10-bit mantissa) · max 65504 (smallest normal 2^-14 ≈ 6.1e-5) · ~3.31 decimal digits · needs loss scaling
exponent → range · mantissa → precision

Same 16 bits, opposite trade. bf16 keeps fp32’s full 8-bit exponent, so it has the same enormous dynamic range as fp32 (~1e38) — just coarser steps between values. That is why it won for training: gradients rarely overflow or underflow, so it typically needs no loss scaling. fp16 spends those bits on the mantissa instead (finer precision), but its 5-bit exponent gives a tiny range — so training in fp16 needs loss scaling to keep small gradients from underflowing to zero.

fp16 keeps a big mantissa (fine precision) but a small 5-bit exponent, so it tops out at 65,504 and underflows easily — training in fp16 needs loss scaling to keep gradients from vanishing. bf16 keeps fp32's full 8-bit exponent, so it matches fp32's enormous range (just with coarser steps) and rarely needs loss scaling. That is why bf16 became the default training precision — and it is exactly what this demo ships. You are here.
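The range difference is easy to see by pushing one large and one tiny value through both formats. The sketch below is an illustration under stated assumptions: to_fp16 and to_bf16 are hypothetical helpers, bf16 is emulated by rounding the fp32 bit pattern to its top 16 bits (inf/NaN inputs are not handled), and the loss scale of 2^16 is an arbitrary example value, not a recommendation.

```python
import struct

def to_fp16(x):
    """Round x to fp16 and back; report overflow as infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range.
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x):
    """Round x to bf16 by keeping the top 16 bits of its fp32 pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

big, tiny = 70000.0, 1e-8
loss_scale = 2.0 ** 16                    # illustrative value only

print("big   fp16:", to_fp16(big), " bf16:", to_bf16(big))
print("tiny  fp16:", to_fp16(tiny), " bf16:", to_bf16(tiny))
print("tiny * loss_scale in fp16:", to_fp16(tiny * loss_scale))
```

The large value overflows fp16 but survives in bf16 with a coarser step, and the tiny "gradient" flushes to zero in fp16 until it is multiplied by the loss scale first.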

The format zoo

Below 16 bits the field split becomes a real design decision, so each bit-width ships in more than one flavor. Here is the whole ladder:

The format zoo — one common bit scale
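[Figure: bit-field bars on one common scale. 1 cell = 1 bit · all bars share one scale (fp32 = full width) · legend: sign, exponent, mantissa, uniform integer grid. The demo ships bf16.]
fp32 · sign 1 · exponent 8 · mantissa 23
tf32 · sign 1 · exponent 8 · mantissa 10
fp16 · sign 1 · exponent 5 · mantissa 10
bf16 · sign 1 · exponent 8 · mantissa 7
fp8 E4M3 · sign 1 · exponent 4 · mantissa 3
fp8 E5M2 · sign 1 · exponent 5 · mantissa 2
int8 · uniform integer grid · 256 levels
fp6 E3M2 · sign 1 · exponent 3 · mantissa 2
fp6 E2M3 · sign 1 · exponent 2 · mantissa 3
fp4 E2M1 · sign 1 · exponent 2 · mantissa 1
int4 · uniform integer grid · 16 levels

Selected row: bf16 (16 total bits)
field split: sign 1 · exponent 8 · mantissa 7
max value: ≈ 3.39e38
dynamic range: 8-bit exponent
levels / precision: ~2-3 decimal digits
typical use: dominant training precision (you are here)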

Notice fp8 comes in two: E4M3 (more mantissa, max 448 — for weights and activations) and E5M2 (more exponent, max 57,344 — for gradients). It is the same fork as the 16-bit pair, one tier down — and fp8 training uses both halves at once:

The fp8 fork — eight bits, two opposite trades
[Figure: bit layout of the two fp8 formats. fp8 ships both: E4M3 → forward (weights/activations) · E5M2 → backward (gradients).]
E4M3 · sign 1 · exponent 4 · mantissa 3 · exponent bias 7 · spends its bits on precision (3-bit mantissa → finer) · max 448 (OCP) · smallest normal 2^-6 ≈ 0.0156 · no inf · 1 NaN
E5M2 · sign 1 · exponent 5 · mantissa 2 · exponent bias 15 · spends its bits on range (5-bit exponent) · max 57,344 · smallest normal 2^-14 ≈ 6.1e-5 · 2-bit mantissa → coarse · IEEE-like (inf + NaN)
exponent → range · mantissa → precision

Same 8 bits, opposite trade — and fp8 uses both. E4M3 keeps a 3-bit mantissa (finer precision) and reaches 448, so it carries the forward pass: weights and activations. E5M2 trades a mantissa bit for a 5th exponent bit, stretching the range to 57,344 (and keeping IEEE infinities/NaN) to hold the wide dynamic range of gradients in the backward pass. E4M3’s 448 max is the OCP/ML convention — a naive IEEE 8-bit float stops at 240; E4M3 reclaims the infinity and most NaN bit-patterns to extend it. fp8 lands on Hopper (H100) and Blackwell Tensor Cores; this browser demo still ships bf16.
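The 448 versus 240 distinction can be checked by brute force: decode every positive 8-bit code and take the largest one that is not reserved. This is a sketch, not a reference implementation. The function names and the two convention labels ("ieee", "ocp_e4m3") are invented here, and the reserved-code rules are the ones described in the text above: IEEE-style reserves the whole top exponent, OCP E4M3 reserves only the all-ones pattern.

```python
def decode_minifloat(code, exp_bits, man_bits, bias):
    """Decode one code to a real value, ignoring any inf/NaN convention."""
    sign = -1.0 if (code >> (exp_bits + man_bits)) & 1 else 1.0
    e = (code >> man_bits) & ((1 << exp_bits) - 1)
    frac = (code & ((1 << man_bits) - 1)) / (1 << man_bits)
    if e == 0:
        return sign * frac * 2.0 ** (1 - bias)       # subnormal
    return sign * (1 + frac) * 2.0 ** (e - bias)     # normal

def max_finite(exp_bits, man_bits, bias, convention):
    """Largest finite value once the reserved codes are skipped."""
    top_e = (1 << exp_bits) - 1
    top_m = (1 << man_bits) - 1
    best = 0.0
    for code in range(1 << (exp_bits + man_bits)):   # sign bit clear
        e, m = code >> man_bits, code & top_m
        if convention == "ieee" and e == top_e:
            continue                                 # top exponent = inf/NaN
        if convention == "ocp_e4m3" and e == top_e and m == top_m:
            continue                                 # single NaN pattern
        best = max(best, decode_minifloat(code, exp_bits, man_bits, bias))
    return best

print("E4M3 (OCP):       ", max_finite(4, 3, 7, "ocp_e4m3"))
print("E4M3 (IEEE-style):", max_finite(4, 3, 7, "ieee"))
print("E5M2 (IEEE-style):", max_finite(5, 2, 15, "ieee"))
```

Reclaiming the top exponent is what moves E4M3 from 240 to 448; E5M2 keeps the IEEE convention and tops out at 57,344.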

Plot every format on the range-versus-precision plane and the reason two exist per width is obvious: at a fixed bit-width you have to pick a corner.

Range vs precision — every format is one point
[Figure: scatter plot, one point per format. x-axis: dynamic range (approximate decades, log scale) · y-axis: precision (decimal digits) · up and to the right is better.]
Floating point: fp32, tf32, bf16, fp16, fp8 E4M3, fp8 E5M2, fp6 E3M2, fp6 E2M3, fp4 E2M1 · same bit-width, opposite corner
Integers (uniform): int8, int4 · uniform grid · range comes only from an external scale
Selected point: bf16 · range ≈ 78 decades · precision ≈ 2.4 digits

Up-and-to-the-right is better — but it costs bits. At a fixed bit-width you must pick a corner: more exponent buys dynamic range (fp8 E5M2, fp6 E3M2), more mantissa buys precision (fp8 E4M3, fp6 E2M3) — the same total bits, opposite trade. Integers sit off this map entirely: a uniform grid has no inherent dynamic range, so its reach depends wholly on an external scale. Range numbers are approximate decades.

A footgun for fact-checkers: fp8 E4M3's max of 448 is the ML (OCP) convention — a naive IEEE-style 8-bit float would stop at 240. E4M3 reclaims the infinity and most NaN bit-patterns to extend its range, and the tiny fp6/fp4 formats drop infinities and NaNs entirely to win back a few precious code points. Subnormals (gradual underflow, where the implicit leading 1 becomes 0) buy a handful of extra tiny magnitudes at reduced precision — the smallest numbers each format can still name.
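The "decades" and "digits" coordinates can be approximated from the field widths alone. The sketch below applies only to IEEE-style layouts (fp32, fp16, bf16, fp8 E5M2), where the top exponent is reserved. It assumes that range is counted from the smallest subnormal to the largest normal, which is a guess at how the plot counts decades, and ieee_like_stats is a hypothetical helper name.

```python
import math

def ieee_like_stats(exp_bits, man_bits):
    """Range and precision of an IEEE-style float from its field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    return {
        "max": max_normal,
        "min_normal": min_normal,
        "min_subnormal": min_subnormal,
        # Decades spanned from the smallest subnormal up to the maximum.
        "decades": math.log10(max_normal / min_subnormal),
        # The implicit leading 1 adds one bit of significand.
        "digits": (man_bits + 1) * math.log10(2),
    }

for name, (e, m) in {"fp32": (8, 23), "bf16": (8, 7), "fp16": (5, 10)}.items():
    st = ieee_like_stats(e, m)
    print(f"{name}: max={st['max']:.4g} decades={st['decades']:.1f} digits={st['digits']:.2f}")
```

For bf16 this lands close to the ≈ 78 decades and ≈ 2.4 digits quoted above, and for fp16 it reproduces the 65504 maximum and roughly 3.31 digits.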

Snapping reals onto a grid

Quantizing a weight is just rounding it to the nearest tick of a grid. The grid spacing is the scale factor s; whatever is left over is the quantization error, which is at most half a step for any value that falls inside the grid's range:

Quantization round-trip — snap a real value to the nearest grid tick
Worked example: bits B = 3, so 7 grid levels on [−1, +1], with real x = 0.420
qmax = 2^(B−1) − 1 = 3
step s = 1 / qmax = 0.333
q = round(x / s) = 1
x̂ = s · q = 0.333
error = |x − x̂| = 0.087 (error ≤ s / 2 = 0.167)

The error is ~uniform on ±s/2, so its variance is s²/12.
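The s²/12 figure follows from treating the rounding error e as uniformly distributed over one step, an approximation that is reasonable when the step is small compared with the spread of the values. A short derivation under that assumption:

```latex
e \sim \mathcal{U}\!\left(-\tfrac{s}{2},\, \tfrac{s}{2}\right), \qquad
\mathbb{E}[e] = 0, \qquad
\operatorname{Var}(e)
  = \int_{-s/2}^{s/2} e^{2}\,\frac{1}{s}\,de
  = \frac{1}{s}\left[\frac{e^{3}}{3}\right]_{-s/2}^{s/2}
  = \frac{1}{s}\cdot\frac{s^{3}}{12}
  = \frac{s^{2}}{12}.
```

For a step of s = 0.333 that is a variance of about 0.0093, or an RMS error of about 0.096.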

Every quantized weight is the nearest grid point. The leftover — |x − x̂| — is the quantization error, and it can never exceed s / 2 (half a step). So more bits means a finer grid and less error — but more storage per weight. Here the grid is symmetric with a zero-point of 0; the next diagram adds a zero-point for non-centered data.
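The round-trip is short enough to write out. This is a minimal sketch of symmetric round-to-nearest with a zero-point of 0; the function names and the max_abs parameter (the largest magnitude the grid covers, 1.0 here) are illustrative choices.

```python
def quantize_symmetric(x, bits, max_abs=1.0):
    """Snap x to the nearest tick of a symmetric grid; return (code, step)."""
    qmax = 2 ** (bits - 1) - 1          # 3 when bits=3: codes -3..3, 7 levels
    s = max_abs / qmax                  # grid step (the scale factor)
    q = round(x / s)                    # Python rounds exact ties to even
    q = max(-qmax, min(qmax, q))        # clip values that fall off the grid
    return q, s

def dequantize_symmetric(q, s):
    """Read the code back as a real value."""
    return s * q

x = 0.42
q, s = quantize_symmetric(x, bits=3)
x_hat = dequantize_symmetric(q, s)
print(f"q={q} s={s:.3f} x_hat={x_hat:.3f} error={abs(x - x_hat):.3f} bound={s / 2:.3f}")
```

With x = 0.42 and 3 bits this gives q = 1, x̂ ≈ 0.333, and an error of about 0.087, inside the s / 2 bound. The bound only holds for values within ±max_abs: anything larger is clipped to the end of the grid, and its error can be arbitrarily big.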

That is the entire mechanic: q = round(x / s) picks the nearest level, x̂ = s · q reads it back, and the error never exceeds s / 2. A coarser grid (fewer bits) means a bigger step and more error. But weights are not always centered on zero — and that is what the zero-point handles:

Symmetric vs asymmetric — what the zero-point really is
[Figure: a real-value axis mapped onto integer codes. Case shown: symmetric · zero-centered weights on [−1, 1] mapped to the int8 grid [−127, 127]. Real 0 (kept exact) lands on code 0, so the zero-point (the grid anchor) is z = 0 and there is no shift.]
symmetric (z = 0)
q = round(x / s)
x̂ = s · q
s = m / 127 = 1 / 127 ≈ 0.00787
asymmetric (affine)
q = round(x / s) + z
x̂ = s · (q − z)
s = (r_max − r_min) / (q_max − q_min) = 5 / 255 ≈ 0.01961
z = round(q_min − r_min / s) = round(1 / s) = 51

The zero-point is literally where the integer grid is anchored. Zero-centered weights map real 0 onto code 0 with no shift (symmetric, z = 0) — which also avoids an extra cross-term in the matmul, so weight-only quantization is the easy first step. Skewed activations (e.g. post-ReLU/GELU, mostly ≥ 0) need a nonzero zero-point (asymmetric / affine) so real 0 still lands exactly on a code — important for padding and ReLU. That asymmetry is why quantizing both weights and activations (W8A8) is the hard part.

Weights are roughly zero-mean, so they get a symmetric grid (zero-point 0). Activations are skewed and one-sided (think post-ReLU, all ≥ 0), so they get an asymmetric grid — the zero-point shifts the whole lattice so that real 0 still lands exactly on an integer code (q = round(x / s) + z). This is the deeper reason weight-only quantization is the easy first step and quantizing activations (W8A8) is the hard part: activations carry wild, input-dependent outliers that a single scale cannot hold.
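The affine formulas can be exercised directly. The sketch below assumes a real range of [−1, 4] on an unsigned 8-bit grid [0, 255]; that range is inferred from the 5 / 255 and z = 51 values in the figure rather than stated by the text, and the function names are hypothetical.

```python
def affine_params(r_min, r_max, q_min=0, q_max=255):
    """Scale and zero-point that map [r_min, r_max] onto [q_min, q_max]."""
    s = (r_max - r_min) / (q_max - q_min)
    z = round(q_min - r_min / s)
    return s, z

def quantize_affine(x, s, z, q_min=0, q_max=255):
    """q = round(x / s) + z, clipped to the integer grid."""
    q = round(x / s) + z
    return max(q_min, min(q_max, q))

def dequantize_affine(q, s, z):
    """x_hat = s * (q - z)."""
    return s * (q - z)

s, z = affine_params(r_min=-1.0, r_max=4.0)
print(f"s={s:.5f} z={z}")
for x in (0.0, -1.0, 4.0, 2.5):
    q = quantize_affine(x, s, z)
    print(f"x={x:5.2f} -> code {q:3d} -> x_hat={dequantize_affine(q, s, z):.4f}")
```

Real 0 maps to code 51 and reads back as exactly 0.0, which is the property the zero-point exists to guarantee.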

One scale for how many weights?

A single scale for an entire tensor is cheap to store but loose — one outlier weight stretches the grid for everyone. Give each output channel, or each block of 32–128 weights, its own scale, and every grid fits its own values more tightly. That is the granularity ladder:

Granularity — one scale for how many weights?
[Figure: weight matrix, 8 rows × 16 cols = 128 weights · one tint = one shared scale.]
Selected: per-tensor (coarsest, widest range per scale)
scales stored: 1 (one scale over the whole tensor)
effective bits/param: ≈ 4.00 (int4 + one 16-bit scale ÷ tensor)
rounding error: ordinal · not measured

The rule: finer granularity = lower rounding error, higher bit overhead. Each scale is fit to a smaller set of values, so its step size is tighter and rounding error drops — but you store more scales, and the scales cost bits too. That is why per-group-64 (4 + 16/64 = 4.25) and the MX per-block-32 (4 + 8/32 = 4.25) land at the same effective width by different routes: a fatter scale over a bigger block, or a leaner 8-bit E8M0 scale over a smaller one.

The MX (Microscaling) standard takes this to a fixed 32-element block with a shared 8-bit E8M0 power-of-two scale, so MXFP4 is 4-bit fp4 E2M1 elements plus one tiny scale every 32 weights: a hardware-friendly fixed granularity.

Finer granularity means lower error — but the scales cost bits too, so a “4-bit” weight with a shared scale every 64 values is really about 4.25 bits. The Microscaling (MX) standard freezes this at a 32-element block, which is the bridge to the smallest formats.
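Both sides of that trade fit in a few lines. The sketch below uses made-up data (128 Gaussian weights plus one planted outlier) and a plain symmetric round-to-nearest int4 grid; it keeps the scales in full precision rather than actually storing them in 16 bits, so treat the error numbers as qualitative. The function names are illustrative.

```python
import random

def effective_bits(elem_bits, scale_bits, block_size):
    """Storage per weight once the shared scale is spread over its block."""
    return elem_bits + scale_bits / block_size

def rtn_mse(weights, bits, group_size):
    """Mean squared round-to-nearest error, one symmetric scale per group."""
    qmax = 2 ** (bits - 1) - 1
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        s = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            q = max(-qmax, min(qmax, round(w / s)))
            total += (w - s * q) ** 2
    return total / len(weights)

random.seed(0)                                    # hypothetical data
weights = [random.gauss(0.0, 0.02) for _ in range(128)]
weights[5] = 1.0                                  # one outlier stretches its grid

for group_size in (128, 64, 32):
    print(f"group={group_size:3d} bits/weight={effective_bits(4, 16, group_size):.3f} "
          f"mse={rtn_mse(weights, 4, group_size):.2e}")

print("MX-style block of 32 with an 8-bit scale:", effective_bits(4, 8, 32))
```

With one scale over all 128 weights the outlier forces a step so wide that the small weights collapse toward zero. Smaller groups confine that damage to the group containing the outlier, at the cost of more scale bits per weight.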

Block formats borrow the range back

On its own, a 4-bit float can name only a handful of magnitudes: {0, 0.5, 1, 1.5, 2, 3, 4, 6}. That is useless until a block of them shares one scale that supplies the magnitude. This is why fp4 and fp6 are only ever block formats, and the same microscaling recipe also wraps fp8. The course names three such formats:

Block formats: a shared scale per block (the MX family)
[Figure: one MX block and its shared scale. Format shown: MXFP4.]
elements: 32 × fp4 E2M1 (1 sign · 2 exponent · 1 mantissa)
per-block scale: E8M0, shared by the whole block · 8 exponent · 0 mantissa → a pure power of two · 2^-127 … 2^127 · 0xFF = NaN
value = element × 2^scale (one shared scale · v = X · Pᵢ)
effective storage: 4.25 bits/elem · 4 + 8/32 = 4.25 · one 8-bit scale spread over 32 fp4 elements

Why a 4-bit element can never stand alone: fp4 E2M1 has just 1 sign · 2 exponent · 1 mantissa bit, so on its own it has almost no dynamic range — the code carries only a relative value. The shared scale supplies the magnitude, so the block as a whole spans a wide range while every element stays 4 bits. That is what microscaling (MX) means.
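A rough sketch of the idea, not the OCP reference algorithm: decode the 16 fp4 E2M1 codes to confirm the eight magnitudes, then quantize a 32-element block with one shared power-of-two scale. The rule used to pick the shared exponent (largest element's exponent minus fp4's top exponent of 2) and the saturate-to-6 behaviour are assumptions of this sketch, ties are broken by list order rather than round-to-even, and all names are illustrative.

```python
import math

FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(code):
    """Decode a 4-bit code: 1 sign, 2 exponent (bias 1), 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    e = (code >> 1) & 0b11
    m = code & 0b1
    if e == 0:
        return sign * m * 0.5                     # subnormal: 0 or 0.5
    return sign * (1 + m / 2) * 2.0 ** (e - 1)

def mx_quantize_block(block):
    """Return (shared_exponent, fp4-valued elements) for one block."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Put the largest magnitude in fp4's top binade [4, 8); above 6 saturates.
    shared_exp = math.floor(math.log2(max_abs)) - 2
    shared_exp = max(-127, min(127, shared_exp))  # E8M0 range
    scale = 2.0 ** shared_exp
    elements = []
    for v in block:
        target = abs(v) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda g: abs(g - target))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements

def mx_dequantize_block(shared_exp, elements):
    """value = element * 2^shared_exp."""
    return [e * 2.0 ** shared_exp for e in elements]

block = [0.01 * i for i in range(-16, 16)]        # 32 hypothetical weights
shared_exp, elements = mx_quantize_block(block)
restored = mx_dequantize_block(shared_exp, elements)
print("shared exponent:", shared_exp)
print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
```

Every stored element is one of the eight fp4 magnitudes with a sign; the shared exponent is what places the block at the right magnitude.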

One recipe, three widths. MXFP8 wraps 32 fp8 values in one power-of-two E8M0 scale (≈ 8.25 bits/elem, near-lossless). MXFP4 uses the same 32-element E8M0 block on fp4 (≈ 4.25), where the shared scale is what makes 4-bit usable at all. NVFP4 tightens the block to 16 elements, swaps in a richer float E4M3 block scale, and adds a per-tensor FP32 scale on top (≈ 4.50). Smaller blocks and a float scale fit local data more tightly — less rounding error — for a touch more overhead. The error difference is real but qualitative; the course gives no measured number, so none is drawn here.

The OCP MX family applies one rule at three widths, always with a power-of-two E8M0 scale (8 exponent bits, no mantissa, spanning 2⁻¹²⁷ … 2¹²⁷) shared across a 32-element block. MXFP8 wraps fp8 at ≈ 8.25 bits/elem — fp8 already has range, so this is the near-lossless end. MXFP4 applies the same block to fp4 at ≈ 4.25 bits/elem, where the shared scale is what makes 4-bit usable at all. NVIDIA's NVFP4 tightens the block to 16 elements, swaps in a richer fp8 E4M3 block scale (a real float, with a mantissa, finer than a power of two), and adds a second per-tensor fp32 scale (≈ 4.50 bits/elem) — two levels of scaling to claw back both range and accuracy.

The clever 4-bit: NF4

If pretrained weights are roughly bell-shaped, why space the levels evenly at all? NF4 does not. It places its 16 levels as the quantiles of a normal distribution — dense where the weights actually sit, sparse out in the tails:

NF4 — put the levels where the weights actually are
[Figure: bell curve of pretrained weights ≈ Normal(0, 1) over the range −1 … +1, with two rows of level ticks.]
NF4 · 16 levels at normal quantiles · exact 0 · dense near 0, sparse in the tails · smallest gap ≈ 0.08 (around 0) · largest gap ≈ 0.30 (near ±1)
int4 · 16 evenly-spaced levels · same step everywhere

int4 spreads its 16 levels uniformly, so it spends as many codes on the rare tails as on the common centre. NF4 instead places its 16 levels at the quantiles of a normal — dense near the peak, sparse in the tails — which is information-theoretically optimal for normal-shaped data. NF4 is a fixed 16-entry lookup table (not a computed grid), includes an exact 0, and is asymmetric (7 negative + 0 + 8 positive = 16).

These are unit-range shapes: real magnitude is restored by a per-block scale (block size 64). QLoRA additionally double-quantizes those scales. Used by QLoRA (arXiv:2305.14314).

Uniform int4 spends as many of its 16 codes on the rare tails as on the crowded center; NF4 follows the data, so more of its resolution lands where the weights are. It is information-theoretically optimal for normally-distributed weights, includes an exact 0, and is really a 16-entry lookup table rather than a computed grid — the 4-bit base that QLoRA finetunes on top of.
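Because NF4 is a lookup table, quantizing to it is a nearest-neighbour search rather than a division and a round. In the sketch below the 16 level values are rounded to four decimals and reproduced from memory of the published QLoRA / bitsandbytes table, so treat them as illustrative rather than authoritative. The block here is tiny for readability, and the function names are made up.

```python
# Approximate NF4 levels (rounded; check against the QLoRA reference table).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def nf4_quantize_block(block):
    """Return (absmax scale, list of 4-bit codes) for one block of weights."""
    absmax = max(abs(w) for w in block) or 1.0    # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
        for w in block
    ]
    return absmax, codes

def nf4_dequantize_block(absmax, codes):
    """Look each code up in the table and restore the magnitude."""
    return [NF4_LEVELS[c] * absmax for c in codes]

gaps = [b - a for a, b in zip(NF4_LEVELS, NF4_LEVELS[1:])]
print(f"smallest gap {min(gaps):.2f}, largest gap {max(gaps):.2f}")

block = [0.0, 0.5, -1.0, 0.25]                    # hypothetical weights
absmax, codes = nf4_quantize_block(block)
print(codes, nf4_dequantize_block(absmax, codes))
```

The gaps come out near 0.08 around zero and 0.30 at the edge, and a weight of exactly 0 maps to the exact-zero entry of the table.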

Picking a recipe

Those pieces — a format, a scale, a granularity — combine into named methods. They sort onto two axes: how much you quantize (weights only, or weights and activations), and how much retraining you are willing to do (PTQ, just calibrate; or QAT, retrain through fake-quant):

The quantization-method map · what & how-much-retrain
[Figure: two-axis map. Horizontal: weight-only vs weights + activations (W8A8). Vertical: how much retraining, from PTQ (calibrate only) up to QAT (retrain). Methods plotted: RTN, GPTQ, AWQ, GGUF, QLoRA / NF4, SmoothQuant, LLM.int8(). Positions are categorical (which side / how much retrain), not a measured score.]

Selected: AWQ · quantizes: weight-only · bits: 4-bit · paper: arXiv 2306.00978

Naming trap: "Activation-aware" but WEIGHT-ONLY. Activation magnitude only PICKS the ~1% salient weight channels to protect via per-channel scaling.

Most local-inference quant (GPTQ / AWQ / GGUF / NF4) is weight-only PTQ: it shrinks memory and speeds the memory-bound decode without touching activations. W8A8 (SmoothQuant, LLM.int8()) also quantizes activations for a faster integer matmul, but must tame the outliers. QAT (retrain through fake-quant) is rare for open weights because it needs a full training loop. Two traps to remember: AWQ is weight-only despite the "Activation-aware" name, and a GGUF Q4_K is ≈ 4.5 bpw, not 4 (other Q4* variants differ).

For open weights you almost always do PTQ — QAT needs a full training loop. Most local-inference quant (GPTQ, AWQ, GGUF, NF4) is weight-only: it shrinks memory and speeds the memory-bound decode without touching activations. Two traps worth naming: AWQ is weight-only despite the “Activation-aware” name (activations only pick which weight channels to protect), and a GGUF Q4_K is about 4.5 bits per weight, not 4, once you count its stored block scales (other Q4* variants land elsewhere).

The one you actually download: GGUF k-quants

On Hugging Face, Ollama, or LM Studio you do not pick “int4” — you pick a GGUF Q-type. These are the llama.cpp file formats, and the “_K” ones are k-quants: a super-block of 256 weights with a two-level scale — one or two super-block floats (a second, dmin, for types with an asymmetric zero-point) plus a small quantized scale per sub-block of 16 or 32 weights, depending on the exact type. That is the same block-format trick you just saw in MX, which is exactly why their cost is a fractional bits-per-weight, never a round number. Each rung below is the pure-type bits/weight, read straight off the C struct in ggml-common.h:

The GGUF Q-type ladder — pure bits per weight
[Figure: bar chart, bar length = bits/weight (pure type), axis 1 to 8 bpw. Markers: 1-bit floor · 4-bit sweet spot. bf16 = 16 bpw is what the tab ships (off the chart, ~2× the widest rung).]
IQ1_S ≈ 1.56
IQ2_XXS ≈ 2.06
Q2_K 2.625
IQ3_XXS ≈ 3.06
Q3_K 3.4375
IQ4_XS ≈ 4.25
Q4_K 4.5
Q5_K 5.5
Q6_K 6.5625
Q8_0 8.5
bf16 16

Selected: Q4_K · 4.5 bits/weight (pure type) · struct: 144 B ÷ 256 w × 8 = 4.5
The default pick. 4-bit k-quant: the knee of the curve where quality loss becomes hard to notice.
you download: Q4_K_M ≈ 4.9 bpw on an 8B: half of attn_v & ffn_down bumped to Q6_K
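The fractional bits-per-weight numbers are just struct size divided by weights per block. In the sketch below only the 144-byte Q4_K figure comes from the text. The other byte counts are back-computed from the bits/weight values on the ladder, and the Q4_K field breakdown (two fp16 super-block scales, 12 bytes of packed sub-block scales, 128 bytes of 4-bit quants) is an assumption about the layout that should be checked against ggml-common.h.

```python
def bits_per_weight(block_bytes, weights_per_block):
    """Pure-type storage cost: struct bytes * 8 / weights covered."""
    return block_bytes * 8 / weights_per_block

# Assumed Q4_K layout: d (fp16) + dmin (fp16) + packed scales + 4-bit quants.
Q4_K_BYTES = 2 + 2 + 12 + 256 // 2

# (bytes per block, weights per block); sizes other than Q4_K are assumptions.
BLOCKS = {
    "Q2_K": (84, 256),
    "Q3_K": (110, 256),
    "Q4_K": (Q4_K_BYTES, 256),
    "Q5_K": (176, 256),
    "Q6_K": (210, 256),
    "Q8_0": (34, 32),
}

for name, (nbytes, nweights) in BLOCKS.items():
    print(f"{name}: {nbytes} B / {nweights} w = {bits_per_weight(nbytes, nweights)} bpw")
```

The half-bit above 4 in Q4_K is entirely scale overhead: 16 of the 144 bytes hold scales, not weights.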

Two honest traps. First, the names everyone actually downloads — Q4_K_M, Q3_K_M, Q5_K_M — are not pure types but mixes: the _M keeps the most sensitive tensors higher — always attn_v and ffn_down, plus attn_o in Q3_K_M — so the file lands a bit above the pure number (an 8B Q4_K_M is ≈ 4.9 bpw, not 4.5). Second, the IQ rungs (I-quants) are not rounded grids at all — each weight is an index into a fixed codebook, chosen with the help of an importance matrix: run a little calibration text through the model, see which weights move the output most, and spend the bits there. Below ~3 bits that imatrix is what makes the file usable at all.

Generalize that “keep the sensitive tensors high” idea to every layer and you get dynamic quantization — the heart of Unsloth's Dynamic 2.0 GGUFs. Same size budget, spent where calibration says it matters:

Spend the bits where they matter — dynamic / imatrix quant
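[Figure: bits per tensor (taller = more bits), relative and schematic, alongside an error-vs-original comparison of uniform and dynamic quantization.]
embed 6 · q 2 · k 2 · v 6 · o 2 · gate 2 · up 2 · down 6 · out 6 (6-bit = kept high, sensitive)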

Same rough size budget, spent differently. The sensitive tensors — attn_v, ffn_down, embed, output — are kept at 6-bit; the rest drop to 2-bit. attn_v & ffn_down are the tensors every Q*_K_M mix already bumps — Dynamic 2.0 just makes it per-layer and calibration-driven.

Unsloth's Dynamic 2.0 picks the quant type per layer, per model from a > 1.5M-token calibration set, and grades the result by KL-divergence to the original rather than perplexity (perplexity can cancel its own errors out — see “Accuracy is Not All You Need,” arXiv 2407.09141). Pushed to the floor, the same idea squeezed a 671B DeepSeek-R1 from ~720 GB down to 131 GB at a ~1.58-bit IQ1 average — keeping the few dense and attention layers at 4–6 bit while the MoE experts drop to ~1.5. One caveat worth keeping straight: that “1.58-bit” is post-training dynamic quantization, not the same thing as BitNet b1.58 (arXiv 2402.17764), which trains ternary weights {−1, 0, 1} from scratch — they just share the number log₂3 ≈ 1.58.
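The averages quoted above can be sanity-checked from file size and parameter count alone. This sketch assumes decimal gigabytes (1 GB = 10^9 bytes) and ignores any non-weight data in the file; the function name is illustrative.

```python
import math

def average_bits_per_weight(file_bytes, n_params):
    """Whole-file average: total bits divided by parameter count."""
    return file_bytes * 8 / n_params

GB = 1e9                 # assumption: decimal gigabytes
N_PARAMS = 671e9         # DeepSeek-R1 parameter count from the text

original = average_bits_per_weight(720 * GB, N_PARAMS)
dynamic = average_bits_per_weight(131 * GB, N_PARAMS)
ternary = math.log2(3)   # information content of one {-1, 0, 1} weight

print(f"original : {original:.2f} bits/weight")
print(f"dynamic  : {dynamic:.2f} bits/weight")
print(f"log2(3)  : {ternary:.3f} bits/weight")
```

The dynamic file averages a little under 1.6 bits per weight, close to log₂3, but it gets there by mixing layers at very different widths rather than by storing ternary weights.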

Where the hardware draws the line

Each format exists because the previous one ran out of range or precision for some workload — and the silicon followed. NVIDIA Turing (2018) first added int8/int4 Tensor Cores for inference; Ampere (A100) added tf32/bf16/fp64 Tensor Core support and structured sparsity; Hopper (H100) added native fp8 (E4M3/E5M2); Blackwell (B200) added fp4 (NVFP4/MXFP4), with Tensor-Core throughput roughly doubling each time the precision halves. Apple-Silicon GPUs — and consumer cards before NVIDIA’s Blackwell generation (the RTX 50-series) — lack native sub-8-bit float units, so low-bit inference there leans on int4/int8 integer math — which is exactly why aggressive integer quant is a local-inference story, not a datacenter one.

Honest footnote, same as the teaser: this browser tab ships bf16 for both the weights and the KV cache — no quantization at all, the highest-fidelity choice. Everything on this page is the menu of what you could trade that fidelity for: a smaller, faster copy of the same model, bought with a coarser grid. You are here, at the top of the ladder, looking down.