Local & edge inference
Go deeper · Chapter 16, The whole model — why run a small model on your own device at all, and what you trade for it.
The chapter just placed this model near the bottom of the size ladder: 852,985,920 parameters, far below the frontier systems most people mean by “a large language model.” That raises a fair question — if a hosted API can serve a model 100× bigger, why run a small one on your own device at all? The answer is that “small” and “on your device” buy things a datacenter cannot, and the choice is a genuine trade with wins on both sides.
The trade, made concrete
Running locally, on the edge device the user is already holding, wins on four things the book calls out: latency (no network round-trip, saving tens to hundreds of milliseconds), independence (no server, no outage, works offline), privacy (the data never leaves the device), and cost (it's free for the developer, because the user's own hardware does the work). It pays for those in capability and speed: edge hardware has a fraction of the speed and power of a datacenter GPU, it throttles thermally under sustained load, the hardware × software matrix is fragmented, and inference drains the battery. Neither side wins every row; that is the trade:
Where local wins:
- zero network latency: saving tens or even hundreds of milliseconds
- independence: no internet, server, or downtime to depend on
- privacy: the end user’s data never leaves their device
- cost: edge inference is free for the developer
Where local pays:
- hardware: a fraction of the speed and power of datacenter GPUs
- thermal constraints: phones and laptops throttle under sustained load
- fragmented support: endless hardware × software combinations
- battery life: inference drains laptop and phone batteries
Phones struggle past one or two billion parameters, so a model below that size sits in the phone sweet spot.
| device | memory | bandwidth | approx. price | trade-off |
|---|---|---|---|---|
| NVIDIA RTX 5090 | 32 GB | 1,792 GB/s | ~$5,000 | less memory, faster |
| Apple M3 Ultra | 512 GB unified | 819 GB/s | ~$10,000 | far more capacity, slower |
Apple’s unified memory buys far more capacity at a slower speed; the 5090 buys less memory but more bandwidth.
The future of inference isn’t local or cloud; it’s both working together: small models and quick queries on end-user devices, more demanding workloads on datacenter GPUs. Browser libraries like WebLLM, along with cross-platform standards, are bringing this into the mainstream, and this course demo is exactly that.
This browser tab runs Qwen3.5-0.8B in bf16, batch 1, on WebGPU. It buys you privacy (your text never leaves the tab), zero network latency, and $0 cost — and pays for it in capability (a 0.8B model, not a frontier one) and speed (your GPU, not an H100). That is the trade, made concrete.
Where each size can run
Model size sets the floor. Phones struggle past a billion or two parameters, which is exactly where this model sits comfortably. A laptop runs an 8B model in bf16. Getting a 100B-plus model onto a high-end desktop takes the lever from an earlier sub-chapter: quantization is the bridge that fits a big model into small memory (via Ollama / llama.cpp). A Mixture-of-Experts helps too by activating only a slice of its parameters per token, which cuts the work per token, though all the experts still have to fit in memory. Past that, a frontier model in the hundreds of billions to trillions of parameters simply lives in the cloud; low-end laptops and Chromebooks run no meaningful local model at all.
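The floor can be estimated with one multiplication: parameters times bytes per parameter. The sketch below is a rough illustration only. The tier memory budgets and the 120B and 1T example sizes are hypothetical values chosen for the example, and the estimate counts weights alone, ignoring the KV cache, activations, and whatever else the device is running.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Counts weights only (no KV cache, activations, or OS overhead).

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

# HYPOTHETICAL memory budgets for weights, in GB (10**9 bytes).
# These are illustrative assumptions, not measured device limits.
TIERS = [
    ("phone", 4),
    ("laptop", 24),
    ("high-end desktop", 96),
]


def weight_gb(params: int, dtype: str = "bf16") -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def smallest_tier(params: int, dtype: str = "bf16") -> str:
    """Return the first tier whose budget holds the weights, else 'cloud'."""
    need = weight_gb(params, dtype)
    for name, budget_gb in TIERS:
        if need <= budget_gb:
            return name
    return "cloud"


EXAMPLES = [
    ("this model", 852_985_920, "bf16"),
    ("8B", 8_000_000_000, "bf16"),
    ("120B", 120_000_000_000, "bf16"),       # hypothetical size
    ("120B", 120_000_000_000, "int4"),       # same model, quantized
    ("1T", 1_000_000_000_000, "int4"),       # hypothetical size
]

for label, params, dtype in EXAMPLES:
    gb = weight_gb(params, dtype)
    print(f"{label:>10} {dtype:>5}: {gb:8.1f} GB -> {smallest_tier(params, dtype)}")
```

Under these assumed budgets, the same 120B model moves from "cloud" to "high-end desktop" once its weights drop from 2 bytes to half a byte each, which is the sense in which quantization acts as the bridge.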
And the hardware itself encodes a trade. A gaming GPU and a workstation can cost in the same ballpark and optimize for opposite things, speed versus capacity, which is the second table in the widget above: a 32 GB card with huge bandwidth against a 512 GB unified-memory machine that holds far bigger models but streams them slower.
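A common rule of thumb makes the capacity-versus-speed trade measurable: at batch 1, generating each token has to stream every weight from memory once, so bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below applies that assumption to the two devices in the table. It is an upper bound, not a benchmark: it ignores the KV cache, compute time, and memory the system reserves for itself, and the 8B and 70B model sizes are hypothetical examples.

```python
# Back-of-envelope decode ceiling, ASSUMING batch-1 generation is limited by
# memory bandwidth: each new token streams all the weights once.
# Ignores KV cache, activations, compute time, and OS memory overhead.

DEVICES = {
    # name: (memory in GB, bandwidth in GB/s), figures from the table above
    "NVIDIA RTX 5090": (32, 1792),
    "Apple M3 Ultra": (512, 819),
}


def weight_bytes(params: int, bytes_per_param: float) -> float:
    """Size of the weights in bytes."""
    return params * bytes_per_param


def fits(params: int, bytes_per_param: float, memory_gb: float) -> bool:
    """True if the weights alone fit in the device's memory."""
    return weight_bytes(params, bytes_per_param) <= memory_gb * 1e9


def decode_ceiling(params: int, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens per second: bandwidth / bytes read per token."""
    return bandwidth_gb_s * 1e9 / weight_bytes(params, bytes_per_param)


# Hypothetical example models: (label, parameters, bytes per parameter).
MODELS = [("8B bf16", 8_000_000_000, 2.0), ("70B bf16", 70_000_000_000, 2.0)]

for model_name, params, bpp in MODELS:
    for device, (mem_gb, bw) in DEVICES.items():
        if fits(params, bpp, mem_gb):
            ceiling = decode_ceiling(params, bpp, bw)
            print(f"{model_name} on {device}: at most ~{ceiling:.0f} tok/s")
        else:
            print(f"{model_name} on {device}: weights do not fit")
```

Under this estimate the 32 GB card has the higher ceiling on any model it can hold, while the 512 GB machine is the only one of the two that can hold the larger model at all.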
You are here. This browser tab is edge inference: Qwen3.5-0.8B, bf16, batch 1, on WebGPU — no server, no API key, no upload. It buys you privacy, zero network latency, and $0 cost, and pays for it in capability (a 0.8B model, not a frontier one) and speed (your GPU, not an H100). The future of inference is not local or cloud but both — small models and quick queries on-device, heavier workloads in the datacenter — and this whole course has been the small-model, on-device half, running live in front of you.
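One way to picture "both working together" is a small routing rule that decides, per request, where to run it. The sketch below is purely illustrative: the `Request` fields, the `route` function, and the token threshold are hypothetical names and values invented for this example, not part of WebLLM or any other library.

```python
from dataclasses import dataclass

# HYPOTHETICAL threshold: prompts longer than this are treated as too heavy
# for the on-device model. The value is an assumption for illustration.
LOCAL_MAX_PROMPT_TOKENS = 2048


@dataclass
class Request:
    prompt_tokens: int            # size of the prompt
    needs_frontier_quality: bool  # caller wants a large hosted model
    online: bool                  # is a network connection available?


def route(req: Request) -> str:
    """Pick 'local' or 'cloud' for a request (illustrative policy only)."""
    if not req.online:
        # Independence: with no network, the device is the only option.
        return "local"
    if req.needs_frontier_quality or req.prompt_tokens > LOCAL_MAX_PROMPT_TOKENS:
        # Demanding workloads go to datacenter GPUs.
        return "cloud"
    # Small, quick queries stay on the device.
    return "local"


print(route(Request(prompt_tokens=120, needs_frontier_quality=False, online=True)))
print(route(Request(prompt_tokens=9000, needs_frontier_quality=False, online=True)))
print(route(Request(prompt_tokens=9000, needs_frontier_quality=True, online=False)))
```

A real policy would likely weigh more signals, such as battery level or whether the text is sensitive, but the shape is the same: default to the device and escalate only when the request needs more than it can give.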