Trace a token's path through all of Qwen3.5-0.8B in one diagram and explain how the hybrid linear/full-attention stack fits together — then name the core limits that fall straight out of next-token prediction: hallucination, no lookup database, a finite context, and stochastic output.
Every prior chapter zooms into a single component — it's easy to lose the forest for the trees. This chapter is the map that puts every piece back in its place, then steps back to the honest limits of the whole thing.
Glossary · 12 terms
- residual stream: The un-normalized hidden vector that runs the full depth of the model. Every block reads it, computes a small contribution, and adds the result back.
- pre-norm block: A sub-block laid out as x + sublayer(rmsnorm(x)). It normalizes the input, transforms it, then adds the result back onto the un-normalized residual highway, which keeps deep stacks trainable.
- hybrid attention: Mixing two kinds of sequence mixer in one stack. Qwen3.5 alternates linear-attention layers with full softmax-attention layers on a fixed schedule.
- Gated DeltaNet (linear attention): A recurrent O(n) sequence mixer that compresses history into a fixed-size state instead of a growing KV cache. Used in 18 of the 24 layers.
- grouped-query attention: Full softmax attention where several query heads share one K/V head. Qwen3.5 runs 8 query heads over 2 KV heads, shrinking the cache 4×.
- weight tying: Reusing the embedding matrix as the LM head, so the same table maps token ids to vectors at the bottom and scores vectors into logits at the top.
- hallucination: A fluent, confident output that isn't true; the result of sampling a plausible token when no true one is available.
- grounding: Tying a model output to a verifiable source (e.g. retrieved documents); a base model has none by default.
- context window: The maximum number of tokens the model can attend to at once (262,144 for Qwen3.5-0.8B).
- parametric memory: Knowledge stored implicitly in the model's weights during training, not in any queryable database.
- knowledge cutoff: The point in time after which the model knows nothing, because its training data ends there.
- stochastic: Random by default; sampling means the same prompt can produce different outputs run to run.
The whole model on one page
One sentence: Qwen3.5-0.8B is a stack of 24 near-identical pre-norm residual blocks bookended by an embedding table at the bottom and a tied LM head at the top — and almost everything you read in the earlier chapters lives inside one of those blocks. The poster below is the map: it lays the whole text decoder out as a central spine, with the dense SwiGLU feed-forward network expanded on the left and the two sequence mixers — linear (Gated DeltaNet) and full (Gated GQA) — expanded on the right. Hover, click, or tab any block to read what it does in the panel underneath.
One honest caveat on “whole model”: the spine traces the text decoder — the path the in-browser demo actually runs. The full Qwen3.5-0.8B checkpoint is multimodal, so the poster also sketches the optional vision encoder (greyed at the lower left); it is never exercised in this text-only demo.
This is all of Qwen3.5-0.8B in one picture. Your text enters at the bottom as token ids, becomes vectors (embeddings), then flows upward through 24 repeating layers — each one mixes information across the sequence (attention) and then transforms every token (the SwiGLU feed-forward) — before a final normalization and the LM head turn the top vector into one score per vocabulary token. Click, hover, or tab any block to read what it does. Most layers use a cheap "linear" mixer and a few use full attention — use the toggle above to compare them.
One thing the colors can hide: every box in that poster is a slab of frozen weights, fixed at the end of training. The only thing that actually flows up through all of them, layer by layer, is your text — and at this point it isn't words or meaning anymore, just a [your tokens × 1024] array of plain real numbers. The whole forward pass is that one tensor being reshaped, added to, and reshaped again on its way to the top. The boxes are the program; the tensor is the only thing moving.
Same block, 24 times
A token enters as an integer id, gets looked up in the embedding matrix, and becomes a 1024-dim vector. That vector is the start of the residual stream — the un-normalized highway that runs the full depth of the model. Each of the 24 layers does the same two things to it: a sequence-mixing sub-block (attention of some kind) and a per-token feed-forward sub-block, each wrapped in the now-standard pre-norm + residual pattern.
Crucially, the feed-forward half never changes: every layer carries a dense SwiGLU MLP (1024 → 3584 → 1024), with no mixture-of-experts routing anywhere in this checkpoint. The norm is RMSNorm, applied to the sub-block input only, so the residual highway itself stays un-normalized and gradients reach all the way down. After the last layer, a final RMSNorm and the LM head turn the hidden state into one logit per vocabulary token.
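To make the pre-norm + residual pattern concrete, here is a minimal pure-Python sketch of one sub-block wrapping a SwiGLU feed-forward. It is an illustration under stated assumptions, not the checkpoint's code: the toy widths (4 → 6 → 4 instead of 1024 → 3584 → 1024), the random weights, the epsilon value, and every function name are made up for the example.

```python
import math
import random

def rmsnorm(x, weight, eps=1e-6):
    """Rescale x so its root-mean-square is about 1, then apply a per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def matvec(matrix, x):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def silu(v):
    return v / (1.0 + math.exp(-v))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU: down(silu(gate(x)) * up(x)). Three matrices, hence 3 x (d x hidden) weights."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def pre_norm_block(x, norm_weight, sublayer):
    """x + sublayer(rmsnorm(x)): only the sublayer input is normalized, never x itself."""
    update = sublayer(rmsnorm(x, norm_weight))
    return [a + b for a, b in zip(x, update)]

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumption); the text gives 1024 and 3584 for the real model.
D_MODEL, HIDDEN = 4, 6
rng = random.Random(0)
w_gate = random_matrix(HIDDEN, D_MODEL, rng)
w_up = random_matrix(HIDDEN, D_MODEL, rng)
w_down = random_matrix(D_MODEL, HIDDEN, rng)
norm_weight = [1.0] * D_MODEL

x = [0.5, -1.0, 2.0, 0.25]  # one token's residual-stream vector
y = pre_norm_block(x, norm_weight, lambda h: swiglu_ffn(h, w_gate, w_up, w_down))

# Same arithmetic at the real widths: three 1024 x 3584 matrices per layer.
ffn_params_real = 3 * 1024 * 3584
print(len(y), ffn_params_real)
```

The point to notice is in `pre_norm_block`: the normalized copy feeds the sublayer, while the original `x` passes through untouched and only receives an additive update.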
It's fair to ask why repeat one block 24 times at all. The common reading is that depth buys composition: an early layer can settle something small and local — say, that a pronoun points back at a name a few tokens ago — and a later layer can lean on that resolved fact to decide something larger, like which clause the sentence is really about. Each block only adds a small nudge to the residual stream, so stacking lets the nudges build into the kind of multi-step shaping a single block could never do alone. That's the intuition, and it's a useful one — but it should be held loosely: this picture comes mostly from studying dense, attention-only models, and how cleanly it carries over to this hybrid stack, where 18 of 24 layers are linear-recurrent rather than full attention, is itself not well established.
Two kinds of sequence mixer
The one thing that does vary layer to layer is the sequence mixer. Qwen3.5 is a hybrid: it runs a fixed [linear, linear, linear, full] schedule and repeats it six times, so 18 of the 24 layers use linear attention (Gated DeltaNet) and the remaining 6 use full grouped-query attention. The full-attention layers are global — every position can look back at every other — but their KV cache grows with the sequence. The linear layers are recurrent and cheap: they fold history into a fixed-size state, so they cost the same memory at token 10 and token 100,000.
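The schedule described above is simple enough to write down directly. A small sketch, assuming the pattern starts with a linear layer and using zero-based layer indices (both are choices made for this example):

```python
# Pattern and repeat count as described in the text.
PATTERN = ["linear", "linear", "linear", "full"]
REPEATS = 6

schedule = PATTERN * REPEATS

counts = {kind: schedule.count(kind) for kind in set(schedule)}
full_layer_indices = [i for i, kind in enumerate(schedule) if kind == "full"]

print(len(schedule))                      # 24
print(counts["linear"], counts["full"])   # 18 6
print(full_layer_indices)                 # [3, 7, 11, 15, 19, 23]
```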
The 3:1 ratio is the bargain. Full attention is the only mixer that can reliably reach back and recall one specific earlier token, so the model keeps a sprinkling of it; the linear layers carry the bulk of the language flow at a fraction of the cache. The aim is most of the quality of an all-attention model with a cache that grows roughly a quarter as fast as context lengthens, because only 6 of the 24 layers keep one.
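A back-of-the-envelope sketch of what that bargain buys in cache size. Several things here are assumptions: the head dimension is inferred from the k-projection size in the parameter table rather than read from a config, the cache is assumed to store 16-bit values, and the fixed-size state and conv cache of the linear layers are left out because they do not grow with context.

```python
D_MODEL = 1024
N_KV_HEADS = 2
N_QUERY_HEADS = 8
K_PROJ_PARAMS = 524_288  # k projection size from the parameter table

# Inferred, not read from a config: k_proj maps 1024 -> n_kv_heads * head_dim.
HEAD_DIM = K_PROJ_PARAMS // (D_MODEL * N_KV_HEADS)
BYTES_PER_VALUE = 2  # assumption: a 16-bit cache

def kv_cache_bytes(n_tokens, n_full_layers, n_kv_heads, head_dim=HEAD_DIM):
    """Bytes needed to cache K and V for every token in every full-attention layer."""
    per_token_per_layer = 2 * n_kv_heads * head_dim  # one K and one V vector per KV head
    return n_tokens * n_full_layers * per_token_per_layer * BYTES_PER_VALUE

for n_tokens in (1_000, 100_000, 262_144):
    hybrid = kv_cache_bytes(n_tokens, 6, N_KV_HEADS)       # 6 full layers, 2 KV heads
    all_full = kv_cache_bytes(n_tokens, 24, N_KV_HEADS)    # hypothetical: every layer full
    no_gqa = kv_cache_bytes(n_tokens, 6, N_QUERY_HEADS)    # hypothetical: one KV head per query head
    print(f"{n_tokens:>7,d} tokens: {hybrid / 2**20:8.1f} MiB "
          f"(x{all_full // hybrid} if all layers were full, x{no_gqa // hybrid} without GQA)")
```

Under these assumptions the two savings are independent factors of four: one from running full attention on only a quarter of the layers, one from sharing 2 KV heads across 8 query heads.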
One more way to see the same machine: where do the ~0.8 billion parameters actually live? Count them block by block and the FFN stack turns out to be the biggest single block in the checkpoint, bigger than both sequence mixers combined:
| block | the arithmetic | params | share |
|---|---|---|---|
| SwiGLU FFNs × 24 | 3 × (1,024 × 3,584) = 11,010,048 per layer × 24 | 264.2M | 31.0% |
| Embedding (tied LM head) | 248,320 × 1,024 — counted once: the LM head reuses it | 254.3M | 29.8% |
| Gated DeltaNet mixers × 18 | qkv 6,291,456 + z 2,097,152 + out 2,097,152 + conv/β/decay 57,504 = 10,543,264 per layer × 18 | 189.8M | 22.2% |
| Full attention (GQA) × 6 | q 4,194,304 (gate doubles it) + k 524,288 + v 524,288 + o 2,097,152 + norms 512 = 7,340,544 per layer × 6 | 44.0M | 5.2% |
| RMSNorms | 24 layers × 2 × 1,024 + final 1,024 | 0.05M | 0.0% |
| Vision encoder (off the text path) | 12-layer ViT — in the checkpoint, never run in this text-only demo | 100.6M | 11.8% |
| total | 264.2 + 254.3 + 189.8 + 44.0 + 0.05 + 100.6 | 853.0M | 100% |
Counts read straight from the checkpoint's tensor shapes — they sum to exactly 852,985,920. Two things to notice: weight tying means the 248,320 × 1,024 embedding table is bought once but used twice (token → vector at the bottom, vector → logits at the top), and the dense SwiGLU FFNs are the single biggest block — bigger than both sequence mixers combined (264.2M vs 233.8M for GDN + attention).
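The table's arithmetic can be re-run in a few lines. This sketch only uses numbers printed in the table; the one derived value is the vision encoder's exact count, which is computed here as the stated total minus the text-path blocks (an inference, since the table only gives it rounded to 100.6M).

```python
D = 1024  # residual-stream width

per_layer = {
    "ffn": 3 * (D * 3584),
    "gdn": 6_291_456 + 2_097_152 + 2_097_152 + 57_504,
    "gqa": 4_194_304 + 524_288 + 524_288 + 2_097_152 + 512,
}

text_path = {
    "ffn x24": per_layer["ffn"] * 24,
    "embedding": 248_320 * D,          # counted once: the LM head reuses it
    "gdn x18": per_layer["gdn"] * 18,
    "gqa x6": per_layer["gqa"] * 6,
    "rmsnorm": 24 * 2 * D + D,
}

TOTAL = 852_985_920                    # exact total stated in the text
text_total = sum(text_path.values())
vision = TOTAL - text_total            # derived by subtraction, not read from the checkpoint

for name, count in {**text_path, "vision": vision}.items():
    print(f"{name:10s} {count:>12,d} {100 * count / TOTAL:5.1f}%")

mixers = text_path["gdn x18"] + text_path["gqa x6"]
print(text_path["ffn x24"] > mixers)   # True
```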
Advanced: inside a Gated DeltaNet layer · optional, for the curious
This is what the teal Gated DeltaNet boxes in the poster above mean, unpacked. You can skip it — the 3:1 bargain above is the whole story for using the model. But here is what actually happens inside one of the 18 linear layers:
- Projections. Linear maps split the input into query/key, a value plus a z-gate, and a raw beta/decay pair (b, a) — a sigmoid turns b into β and a decay formula turns a into g, the knobs the recurrence below turns.
- A short conv. A causal depthwise convolution (kernel 4, so each channel only looks at the last 4 tokens) followed by a SiLU activation adds a little local mixing before the recurrence.
- The gated delta-rule recurrence is the heart of it. It carries one fixed-size state and walks the sequence token by token: β controls how much each new token overwrites that state, and g (decay) controls how fast old information is forgotten. That makes the layer O(n) in sequence length: its cost grows in step with the token count, whereas full attention's cost grows with the square of it (double the tokens, roughly four times the work). The state's size never grows.
- Memory. Instead of a KV cache that grows with every token, the layer carries only a tiny conv cache (the last few tokens) plus that fixed-size recurrent state — so its memory is constant regardless of context length.
- Finish. A z-gated RMSNorm and an out-projection return the result to the 1024-dim residual stream.
That constant-memory recurrence is exactly what lets 18 of the 24 layers run without a growing cache, leaving only the 6 full-attention layers to pay for one.
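The recurrence step in the list above can be sketched for a single head in plain Python. This is a simplified reading of the gated delta rule, not the model's implementation: it assumes one head, a decay expressed as exp(g) with g ≤ 0, and a key of unit length, and it skips the projections, the short conv, the output gate, and the chunked kernels a real implementation would use. The function and variable names are made up for the example.

```python
import math

def gated_delta_step(state, q, k, v, beta, g):
    """Process one token. `state` is a d_k x d_v matrix whose size never changes.

    beta: write strength in [0, 1]; g: log-decay (<= 0), so exp(g) shrinks old content.
    """
    d_k, d_v = len(k), len(v)
    decay = math.exp(g)
    # 1. Forget: scale the whole state down.
    state = [[decay * s for s in row] for row in state]
    # 2. Read what the state currently stores under key k.
    predicted = [sum(state[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    # 3. Delta rule: write only the correction, scaled by beta.
    delta = [beta * (v[j] - predicted[j]) for j in range(d_v)]
    state = [[state[i][j] + k[i] * delta[j] for j in range(d_v)] for i in range(d_k)]
    # 4. Read out with the query.
    out = [sum(state[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]
    return state, out

state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]  # unit-length key, so a full write is recalled exactly

# Write a value, then overwrite it under the same key (beta=1, no decay).
state, out1 = gated_delta_step(state, q=key, k=key, v=[3.0, 4.0], beta=1.0, g=0.0)
state, out2 = gated_delta_step(state, q=key, k=key, v=[5.0, 6.0], beta=1.0, g=0.0)
print(out1)  # [3.0, 4.0]
print(out2)  # [5.0, 6.0]

# No write (beta=0) with decay 0.5: the stored value fades instead of being replaced.
state, out3 = gated_delta_step(state, q=key, k=key, v=[9.0, 9.0], beta=0.0, g=math.log(0.5))
print(out3)
```

With β = 1 the second write fully replaces the first under the same key, which is the "overwrite" knob; with β = 0 and a negative g nothing new is stored and the old content simply shrinks, which is the "forget" knob. In both cases the state stays a 2 × 2 matrix.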
Before we name the limits, one more thing to keep in scale: this is a small model. Its 852,985,920 parameters put it near the bottom of the size ladder, far below the systems most people mean by “a large language model.” That doesn't make it a toy — but it does mean some of the limits below are partly about size, and would soften (never vanish) on a much bigger model.
Only one number here is exact: the 852,985,920 parameters of the model running in your browser. The other rungs are rough orders of magnitude — the log axis means each tick to the right is a roughly tenfold jump. The honest takeaway is the gap, not a headline figure: the largest systems are on the order of a hundred to a thousand times bigger than this one. That distance is worth keeping in mind as we turn, next, to what even a well-built model of this size genuinely cannot do.
What the model isn't
You've now seen the whole machine, end to end — every block, and where it sits. It's worth being equally precise about what it is not doing, because nearly every "the LLM is broken" surprise comes from expecting a capability the architecture never had. None of the limits below are bolted-on flaws — they fall straight out of next-token prediction.
It samples the plausible, not the true
The forward pass produces a probability for every next token, and we sample one. Nowhere in that pipeline is a fact checked against a source. When the true continuation was common in the training data, plausible and true coincide and the model looks like it "knows." When there's no true answer — or the model simply never saw it — it still emits a fluent, confident token. That's a hallucination:
Illustrative distribution — hand-picked to show the shape, not live output from the model.
In the widget's 'Well-known fact' case, the fact appears constantly in the training data, so the most-probable token is also the true one. But the model never looked anything up; truth and plausibility just happen to coincide here.
Notice it isn't lying (there's no intent) and it isn't malfunctioning — it's doing exactly what it was built to do: produce a likely-looking next token. The flatter the distribution, the less sure it is, but it commits to an answer regardless unless training taught it to hedge.
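The same contrast can be put in numbers. In this sketch the logits and the candidate labels are invented for illustration (they are not model output); it compares a peaked distribution with a flat one and shows that picking the top token yields an answer either way.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: low means peaked, high means flat."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

candidates = ["tok_a", "tok_b", "tok_c", "tok_d"]  # hypothetical next-token candidates

# Hypothetical logits: one clear winner vs. four near-equal guesses.
well_known = softmax([9.0, 2.0, 1.5, 1.0])
no_answer = softmax([2.0, 1.9, 1.8, 1.7])

for name, probs in [("well-known fact", well_known), ("no real answer", no_answer)]:
    top = max(range(len(probs)), key=probs.__getitem__)
    print(f"{name:16s} top={candidates[top]} p={probs[top]:.3f} "
          f"entropy={entropy_bits(probs):.2f} bits")
```

In both cases a token comes out. The flat case has a much lower top probability and close to the 2-bit maximum entropy for four options, but nothing in the pipeline turns that uncertainty into a refusal.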
There is no lookup database
A base model's "knowledge" is parametric: baked into its weights during training, not stored in a table it queries at inference time. So it can't reliably cite a source it didn't effectively memorize, and it has a knowledge cutoff — anything that happened after training simply isn't in the weights. (Retrieval and tools can bolt a real database on top, but that's an addition; the model you've studied has none.)
Its memory is the context window — and it's finite
The model can only attend to tokens inside the window, and the window is a hard ceiling. Drag the slider past 262,144 and watch the oldest blocks drop out of the frame: to make room for each new token, one old token has to go, and anything dropped is gone unless it's re-supplied in a later prompt. There is no long-term store; the window is the model's entire working memory.
The model's only working memory is the tokens currently inside its context window. Past that hard limit the oldest tokens have to be dropped to make room, and once dropped they're gone. And between two separate conversations it remembers nothing at all — every call starts blank, apart from whatever text you re-supply in the prompt.
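A toy version of that drop-the-oldest behavior, using a window of 4 tokens instead of 262,144. Treat the policy as an assumption for illustration: how overflow is handled is up to the code around the model, and some setups reject over-long input rather than truncating it.

```python
from collections import deque

WINDOW = 4  # toy size; the text gives 262,144 for the real model

window = deque(maxlen=WINDOW)  # a full deque discards its oldest item on append
dropped = []

for token in ["the", "cat", "sat", "on", "the", "mat"]:
    if len(window) == window.maxlen:
        dropped.append(window[0])  # record what is about to fall out
    window.append(token)

print(list(window))  # ['sat', 'on', 'the', 'mat']
print(dropped)       # ['the', 'cat']
```

Nothing in `window` remembers that "the cat" was ever there; getting it back means supplying it again.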
Same input, different output
Because we sample, generation is stochastic by default: with temperature above zero, the same prompt can produce different answers on different runs. That's a feature for creative writing and a footgun for anything you need to reproduce — decode greedily (temperature 0) when you want determinism, as the Sampling chapter showed.
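A minimal sketch of the two decoding modes, with made-up logits and a hypothetical `sample_next` helper. Treating temperature 0 as "take the argmax" is a common convention assumed here, since dividing by zero is not defined.

```python
import math
import random

def sample_next(logits, temperature, rng):
    """Return a token index: argmax at temperature 0, otherwise a softmax sample."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]  # unnormalized softmax weights
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.9, 1.8, 0.5]  # hypothetical scores for four candidate tokens
rng = random.Random(0)

greedy = {sample_next(logits, 0, rng) for _ in range(200)}
sampled = {sample_next(logits, 1.0, rng) for _ in range(200)}

print(greedy)        # {0}
print(len(sampled))  # more than one distinct token is expected
```

Greedy decoding returns the same index on every call, while sampling at temperature 1 spreads its picks across the near-tied candidates.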
None of this is a bug
These aren't defects waiting for a patch — they're the honest shape of a system that predicts one plausible token at a time. Knowing them is exactly what separates using an LLM well from being blindsided by it. That's the real payoff of having traced the whole forward pass: the model stops being magic, and starts being a tool whose strengths and edges you can actually reason about.
- Qwen3.5-0.8B is 24 pre-norm residual blocks; the only thing that changes layer to layer is the sequence mixer — linear (Gated DeltaNet) on 18 layers, full attention on 6.
- Full attention is global but its KV cache grows with sequence length; linear attention is cheap and recurrent with a fixed-size state. The 3:1 schedule keeps most of the quality at a fraction of the memory.
- The embedding matrix and the LM head are the same weights (tying) — one table both maps tokens in and scores them out. Every layer also carries a dense SwiGLU FFN — there is no mixture-of-experts anywhere.
- Hallucination isn't a malfunction: the model samples a plausible token, and plausible only equals true when the fact was well-represented in training.
- The model has no database and no memory between calls — its knowledge lives in fixed weights (with a cutoff), and its working memory is only the current context window.
- Sampling makes generation stochastic by default; identical prompts can diverge unless you decode greedily.
Two things to try on this page. (1) In the poster, flip the GDN ↔ GQA toggle and read the two right-hand panels — name two things the full-attention layer (GQA) has that the linear layer (Gated DeltaNet) doesn't. (2) In the hallucination widget below, switch to 'No real answer' and compare the top bar's height to the 'Well-known fact' case — does the flatter distribution stop the model from answering?