Local & edge inference

Go deeper · Chapter 16, The whole model — why run a small model on your own device at all, and what you trade for it.

The chapter just placed this model near the bottom of the size ladder: 852,985,920 parameters, far below the frontier systems most people mean by “a large language model.” That raises a fair question — if a hosted API can serve a model 100× bigger, why run a small one on your own device at all? The answer is that “small” and “on your device” buy things a datacenter cannot, and the choice is a genuine trade with wins on both sides.

The trade, made concrete

Running locally — on the edge, on the device the user is already holding — wins on four things the book calls out: latency (no network round-trip, saving tens to hundreds of milliseconds), independence (no server, no outage, works offline), privacy (the data never leaves the device), and cost (it's free for the developer — the user's own hardware does the work). It pays for those in capability and speed: edge hardware is a fraction of a datacenter GPU, it throttles thermally, the hardware×software matrix is fragmented, and it drains the battery. Neither side wins every row — that is the trade:

axis | Cloud datacenter | Local / edge (you are here) | wins
--- | --- | --- | ---
latency | network round-trip | on-device · zero round-trip | edge
privacy | data leaves your device | data never leaves the device | edge
cost | per-token API · expensive GPUs | free for the developer · $0 | edge
model size / capability | frontier · 100B+ to trillions | small ≤~2B · quantized 100B+ | cloud
availability | online-only | works offline | edge
hardware | datacenter H100 | your VRAM / unified memory | cloud

Why edge wins (the book’s four)

  • zero network latency: saving tens or even hundreds of milliseconds
  • independence: no internet, server, or downtime to depend on
  • privacy: the end user’s data never leaves their device
  • cost: edge inference is free for the developer

What edge pays for (the book’s four)

  • hardware: a fraction of the speed and power of datacenter GPUs
  • thermal constraints: phones and laptops throttle under sustained load
  • fragmented support: endless hardware × software combinations
  • battery life: inference drains laptop and phone batteries

Where can it run?

Model-size steps: 0.8B (you are here), 3B, 8B, 70B, 400B. For each size, the smallest tier that can host it lights up; at 0.8B that tier is the phone.

Phones struggle past one or two billion parameters — this is the phone sweet spot.

You are here: 0.8B bf16 · this browser · WebGPU

Capacity vs speed (book p.90)

device | memory | bandwidth | approx. price | trade
--- | --- | --- | --- | ---
NVIDIA RTX 5090 | 32 GB | 1,792 GB/s | ~$5,000 | less memory, faster
Apple M3 Ultra | 512 GB unified | 819 GB/s | ~$10,000 | far more capacity, slower

Apple’s unified memory buys far more capacity at a slower speed; the 5090 buys less memory but more bandwidth.

The synthesis: it’s both

The future of inference isn’t local or cloud; it’s both working together: small models and quick queries on end-user devices, more demanding workloads on datacenter GPUs. Browser libraries like WebLLM, along with cross-platform standards, are bringing this into the mainstream, and this course demo is exactly that.
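
One way to picture "both working together" is a small router that decides, per request, where to run. The sketch below is purely illustrative: the `Request` fields, the `route` function, and the `LOCAL_MAX_PROMPT_TOKENS` threshold are hypothetical names and values invented for this example, not part of the book, WebLLM, or this demo.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int                     # size of the incoming prompt
    needs_frontier_quality: bool = False   # caller asks for a frontier-class answer
    online: bool = True                    # is a network connection available?


# Hypothetical cut-off for what the small on-device model handles comfortably.
LOCAL_MAX_PROMPT_TOKENS = 2_000


def route(req: Request) -> str:
    """Return "local" or "cloud" for a single request."""
    if not req.online:
        return "local"  # offline: the device is the only option
    if req.needs_frontier_quality or req.prompt_tokens > LOCAL_MAX_PROMPT_TOKENS:
        return "cloud"  # heavier workloads go to datacenter GPUs
    return "local"      # small, quick queries stay on the device


if __name__ == "__main__":
    print(route(Request(prompt_tokens=50)))
    print(route(Request(prompt_tokens=5_000)))
    print(route(Request(prompt_tokens=5_000, online=False)))
```

A real policy would likely also weigh battery level, thermal state, and privacy requirements, which are the same rows as the trade table.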

You are here — this tab is edge inference

This browser tab runs Qwen3.5-0.8B in bf16, batch 1, on WebGPU. It buys you privacy (your text never leaves the tab), zero network latency, and $0 cost — and pays for it in capability (a 0.8B model, not a frontier one) and speed (your GPU, not an H100). That is the trade, made concrete.

Comparison of cloud-datacenter versus local/edge inference across six axes (latency, privacy, cost, model size, availability, hardware), with a stepped model-size control from 0.8B to 400B that lights the smallest tier able to host each size — phone, your laptop in bf16, your laptop quantized, or cloud-only — and a locked "you are here: 0.8B bf16 in this browser via WebGPU" marker at the small end.

Where each size can run

Model size sets the floor. Phones struggle past a billion or two parameters, which is exactly where this model sits comfortably. A laptop runs an 8B in bf16. Getting a 100B-plus model onto a high-end desktop takes the lever from an earlier sub-chapter: quantization, the bridge that fits a big model into small memory (via Ollama / llama.cpp). A Mixture-of-Experts helps too, by activating only a slice of its parameters per token. Past that, a frontier model in the hundreds of billions to trillions of parameters simply lives in the cloud; at the other extreme, low-end laptops and Chromebooks run no meaningful local model at all.
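
The tiers follow from simple arithmetic on weight storage. The sketch below estimates the footprint of each model size at a few precisions. It assumes weights only (KV cache, activations, and runtime overhead are ignored), treats "4-bit" as a nominal 0.5 bytes per parameter (real quantization formats add some overhead), and the helper name `weight_gb` is made up for this example.

```python
# Rough weight-memory footprint per model size and precision.
# Assumption: weights only; KV cache, activations, and overhead are ignored.

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "4-bit": 0.5}  # nominal values


def weight_gb(params: float, precision: str) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


# This course's model, using the parameter count quoted in the chapter.
THIS_MODEL = 852_985_920

# Nominal parameter counts for the model-size steps.
SIZES = {"0.8B": THIS_MODEL, "3B": 3e9, "8B": 8e9, "70B": 70e9, "400B": 400e9}

if __name__ == "__main__":
    for name, params in SIZES.items():
        cells = ", ".join(
            f"{prec}: {weight_gb(params, prec):7.1f} GB" for prec in BYTES_PER_PARAM
        )
        print(f"{name:>5}  {cells}")
```

Under these assumptions the 0.8B model needs about 1.7 GB in bf16 and an 8B needs 16 GB, while a 100B model drops from 200 GB in bf16 to roughly 50 GB at 4 bits per weight, which is what brings it within reach of a high-memory desktop.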

And the hardware itself encodes a trade. A gaming GPU and a workstation can cost in the same ballpark and optimize for opposite things — capacity versus speed — which is the second table in the widget above: a 32 GB card with huge bandwidth against a 512 GB unified-memory machine that holds far bigger models but streams them slower.
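
The capacity-versus-speed trade can be put in rough numbers using the memory and bandwidth figures quoted for the two devices. This is a back-of-envelope sketch, not a benchmark: it assumes batch-1 decoding is limited by memory bandwidth, that every weight is read once per generated token, and that KV cache and other overhead are negligible. None of these assumptions come from the book, and the function names are invented for illustration.

```python
# Back-of-envelope: capacity vs speed for the two devices in the table.
# Assumptions: bandwidth-bound batch-1 decode, all weights read once per token,
# KV cache and overhead ignored. Real throughput will be lower.

DEVICES = {
    "NVIDIA RTX 5090": {"memory_gb": 32, "bandwidth_gb_s": 1792},
    "Apple M3 Ultra": {"memory_gb": 512, "bandwidth_gb_s": 819},
}


def max_dense_params_b(memory_gb: float, bytes_per_param: float) -> float:
    """Largest dense model, in billions of parameters, whose weights fit in memory."""
    return memory_gb / bytes_per_param


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights_8b_bf16_gb = 16.0  # 8e9 params * 2 bytes
    for name, d in DEVICES.items():
        cap = max_dense_params_b(d["memory_gb"], bytes_per_param=2.0)
        ceil = decode_ceiling_tok_s(d["bandwidth_gb_s"], weights_8b_bf16_gb)
        print(f"{name}: fits up to ~{cap:.0f}B in bf16, "
              f"8B bf16 ceiling ~{ceil:.0f} tok/s")
```

Under these assumptions the 32 GB card holds at most a 16B dense model in bf16 but has a ceiling near 112 tokens/s on an 8B, while the 512 GB machine holds up to about 256B with a ceiling near 51 tokens/s on the same 8B. These are upper bounds that show the shape of the trade, not measured speeds.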

You are here. This browser tab is edge inference: Qwen3.5-0.8B, bf16, batch 1, on WebGPU — no server, no API key, no upload. It buys you privacy, zero network latency, and $0 cost, and pays for it in capability (a 0.8B model, not a frontier one) and speed (your GPU, not an H100). The future of inference is not local or cloud but both — small models and quick queries on-device, heavier workloads in the datacenter — and this whole course has been the small-model, on-device half, running live in front of you.