Local & edge inference

Go deeper · Chapter 16, The whole model — why run a small model on your own device at all, and what you trade for it.

The chapter just placed this model near the bottom of the size ladder: 852,985,920 parameters, far below the frontier systems most people mean by “a large language model.” That raises a fair question — if a hosted API can serve a model 100× bigger, why run a small one on your own device at all? The answer is that “small” and “on your device” buy things a datacenter cannot, and the choice is a genuine trade with wins on both sides.

The trade, made concrete

Running locally (on the edge, on the device the user is already holding) wins on four things the book calls out: latency (no network round-trip, saving tens to hundreds of milliseconds), independence (no server, no outage, works offline), privacy (the data never leaves the device), and cost (it's free for the developer, because the user's own hardware does the work). It pays for those in capability and speed: edge hardware has a fraction of the speed and power of a datacenter GPU, it throttles thermally, the hardware × software matrix is fragmented, and inference drains the battery. Neither side wins every row; that is the trade:

dimension	Cloud datacenter	Local / edge (you are here)
latency	network round-trip	on-device · zero round-trip ✓ wins
privacy	data leaves your device	data never leaves the device ✓ wins
cost	per-token API · expensive GPUs	free for the developer · $0 ✓ wins
model size / capability	frontier · 100B+ to trillions ✓ wins	small ≤~2B · quantized 100B+
availability	online-only	works offline ✓ wins
hardware	datacenter H100 ✓ wins	your VRAM / unified memory

Why edge wins (the book’s four)

zero network latency — saving tens or even hundreds of milliseconds
independence — no internet, server, or downtime to depend on
privacy — the end user’s data never leaves their device
cost — edge inference is free for the developer

What edge pays for (the book’s four)

hardware — a fraction of the speed and power of datacenter GPUs
thermal constraints — phones and laptops throttle under sustained load
fragmented support — endless hardware × software combinations
battery life — inference drains laptop and phone batteries

Where can it run?

0.8B → phone

Phones struggle past one or two billion parameters — this is the phone sweet spot.

🔒 you are here: 0.8B bf16 · this browser · WebGPU

Capacity vs speed (book p.90)

device	memory	bandwidth	approx. price	trade-off
NVIDIA RTX 5090	32 GB	1,792 GB/s	~$5,000	less memory, faster
Apple M3 Ultra	512 GB unified	819 GB/s	~$10,000	far more capacity, slower

Apple’s unified memory buys far more capacity at lower bandwidth; the 5090 buys less memory but more than twice the bandwidth.

The synthesis: it’s both

The future of inference isn’t local or cloud; it’s both working together: small models and quick queries on end-user devices, more demanding workloads on datacenter GPUs. Browser libraries like WebLLM and cross-platform standards like WebGPU are bringing this into the mainstream, and this course demo is exactly that.

You are here — this tab is edge inference

This browser tab runs Qwen3.5-0.8B in bf16, batch 1, on WebGPU. It buys you privacy (your text never leaves the tab), zero network latency, and $0 cost — and pays for it in capability (a 0.8B model, not a frontier one) and speed (your GPU, not an H100). That is the trade, made concrete.

Where each size can run

Model size sets the floor. Phones struggle past a billion or two parameters, and at 0.8B this model sits comfortably under that line. A laptop with enough memory runs an 8B model in bf16 (about 16 GB of weights). Getting a 100B-plus model onto a high-end desktop takes the lever from an earlier sub-chapter: quantization, the bridge that fits a big model into small memory (via Ollama / llama.cpp). A Mixture-of-Experts model helps too, by activating only a slice of its parameters per token. Past that, a frontier model in the hundreds of billions to trillions of parameters simply lives in the cloud. At the other end, low-end laptops and Chromebooks run no meaningful local model at all.
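The size ladder can be checked with back-of-envelope arithmetic. The sketch below is an illustration, not the course's own code: it counts weight storage only (parameters times bytes per parameter) and ignores the KV cache, activations, and runtime overhead. The `int8` and `int4` entries assume idealized 1 and 0.5 bytes per parameter; real quantized formats add some overhead for scales, so treat the results as lower bounds.

```python
# Weight-only memory estimate: parameters x bytes per parameter.
# Assumption: KV cache, activations, and runtime overhead are ignored.

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}  # idealized sizes


def weight_gb(params: int, dtype: str) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def fits(params: int, dtype: str, memory_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weight_gb(params, dtype) <= memory_gb


if __name__ == "__main__":
    models = {
        "this model (0.8B)": 852_985_920,
        "8B": 8_000_000_000,
        "100B": 100_000_000_000,
    }
    for name, n_params in models.items():
        for dtype in BYTES_PER_PARAM:
            print(f"{name:>18} {dtype:>5}: {weight_gb(n_params, dtype):8.2f} GB")
```

Under these assumptions this model's bf16 weights come to roughly 1.7 GB, an 8B model in bf16 to 16 GB, and a 100B model to 200 GB in bf16 or about 50 GB at 4 bits, which is why quantization is what moves the 100B tier from "cloud only" to "high-end desktop".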

And the hardware itself encodes a trade. A gaming GPU and a workstation can cost in the same ballpark and optimize for opposite things — capacity versus speed — which is the second table in the widget above: a 32 GB card with huge bandwidth against a 512 GB unified-memory machine that holds far bigger models but streams them slower.
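One way to see why bandwidth matters is a rough ceiling on batch-1 decoding speed. The sketch below assumes each generated token has to read every weight once from memory, so tokens per second cannot exceed bandwidth divided by weight size. That assumption is mine, not the book's; it ignores compute, the KV cache, and the fact that not all unified memory is usable by the GPU, and it is a ceiling rather than a benchmark. The device figures are the ones from the table.

```python
from typing import Optional

# Figures copied from the capacity-vs-speed table.
DEVICES = {
    "NVIDIA RTX 5090": {"memory_gb": 32.0, "bandwidth_gb_s": 1792.0},
    "Apple M3 Ultra": {"memory_gb": 512.0, "bandwidth_gb_s": 819.0},
}


def ceiling(device: str, weights_gb: float) -> Optional[float]:
    """Upper bound on batch-1 tokens/s, or None if the weights do not fit.

    Assumption: every token streams all weights once, so the bound is
    memory bandwidth divided by weight size.
    """
    spec = DEVICES[device]
    if weights_gb > spec["memory_gb"]:
        return None
    return spec["bandwidth_gb_s"] / weights_gb


if __name__ == "__main__":
    for label, gb in [("8B bf16", 16.0), ("100B bf16", 200.0)]:
        for device in DEVICES:
            bound = ceiling(device, gb)
            shown = "does not fit" if bound is None else f"<= {bound:.1f} tok/s"
            print(f"{label:>10} on {device}: {shown}")
```

With these inputs, a 16 GB model is bounded at 112 tokens/s on the 5090 and about 51 on the M3 Ultra, while a 200 GB model does not fit on the 5090 at all and is bounded at about 4 tokens/s on the M3 Ultra: capacity on one side, speed on the other.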

You are here. This browser tab is edge inference: Qwen3.5-0.8B, bf16, batch 1, on WebGPU — no server, no API key, no upload. It buys you privacy, zero network latency, and $0 cost, and pays for it in capability (a 0.8B model, not a frontier one) and speed (your GPU, not an H100). The future of inference is not local or cloud but both — small models and quick queries on-device, heavier workloads in the datacenter — and this whole course has been the small-model, on-device half, running live in front of you.
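The "both" split can be pictured as a small routing policy. The sketch below is purely hypothetical: the `Request` type, the `needs_frontier_quality` flag, and the `MAX_LOCAL_PROMPT_CHARS` threshold are invented for illustration and do not come from the course or from any library. It only shows the shape of the decision: stay on the device when offline or when the query is quick, and go to the datacenter for heavier work.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    needs_frontier_quality: bool = False  # hypothetical flag set by the app


# Hypothetical cutoff for a "quick query"; a real app would tune this.
MAX_LOCAL_PROMPT_CHARS = 2_000


def route(req: Request, online: bool) -> str:
    """Return "local" or "cloud" for a request."""
    if not online:
        return "local"  # offline: the device is the only option
    if req.needs_frontier_quality:
        return "cloud"  # capability is the row the datacenter wins
    if len(req.prompt) > MAX_LOCAL_PROMPT_CHARS:
        return "cloud"  # heavier workload goes to datacenter GPUs
    return "local"  # quick query: keep it private, free, and low-latency
```

A privacy-sensitive app might invert the default and require explicit user consent before anything is routed to the cloud; the point is only that the choice can be made per request rather than once for the whole product.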