Chapter 1 · What is an LLM?

State, in one sentence, what an LLM computes — and trace how that single computation, run in a loop, produces text.

Every later chapter zooms into one component. Without the big picture first, the machinery has no purpose to hang on.

6 min
Glossary · 10 terms
  • language model: A function that scores how likely each possible next token is, given the tokens so far.
  • token: The unit an LLM reads and writes (a sub-word chunk, not a character or a whole word).
  • logits: The raw, unbounded scores the model outputs, one per vocabulary entry, before softmax turns them into probabilities.
  • softmax: Exponentiate every logit and divide each result by the sum of all the exponentials: e^(z_i) / Σ_j e^(z_j). Turns a vector of raw scores into a probability distribution that sums to 1.
  • forward pass: One evaluation of the model on a token sequence, producing the logits for the next position.
  • weights: The model's 852,985,920 stored numbers, set once during training and then frozen. The same for every request; they ARE the model.
  • activations: The intermediate vectors computed fresh for your specific prompt during a forward pass, and discarded when the pass is done.
  • autoregressive: Generating one token at a time, feeding each output back in as input for the next step.
  • sampling: Picking the next token at random, weighted by the softmax probabilities. "Greedy" decoding always takes the single top-scoring token instead.
  • generation loop: forward pass → sample one token → append it → repeat, until a stop token or a length limit.

What is an LLM?

Strip away the mystique and a large language model is a single function. You hand it a list of tokens — the text so far — and it returns a score for every token in its vocabulary: a guess at what comes next. That is the entire computation. Everything else in this course is either how that function is built, or how it's used.

One call: tokens in, a score for every next token

Concretely, the input is just a list of integers (one id per token, produced by the tokenizer) and the output is a large array of floating-point numbers, one score per vocabulary token:

tokens: number[]          // a list of integers, one id per token,
                          // e.g. [760, 7993, 7338, 383, 279] = "The cat sat on the"
   │
   ▼  one forward pass
logits: Float32Array      // 248,320 floats, one raw score per vocabulary token

Those raw scores are called logits. Run them through a softmax and you have a probability distribution over "what's next": for Qwen3.5-0.8B, a distribution over all 248,320 tokens it knows. The model never outputs a token directly; it outputs a score for every token it could output, and a separate rule picks one.
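To make the shape of that one call concrete, here is a minimal TypeScript sketch of the interface. The `LanguageModel` type and `toyModel` are hypothetical names for illustration; `toyModel` is a stand-in that ignores what the prompt means and is not how the real model computes anything. Only the input and output shapes match the description above.

```typescript
// Vocabulary size quoted in the text for Qwen3.5-0.8B.
const VOCAB_SIZE = 248320;

// The whole interface: token ids in, one raw score per vocabulary entry out.
type LanguageModel = (tokens: number[]) => Float32Array;

// Hypothetical stand-in, NOT the real model: it gives one arbitrary
// token id a high score and leaves every other score at zero.
const toyModel: LanguageModel = (tokens) => {
  const scores = new Float32Array(VOCAB_SIZE); // starts as all zeros
  const favourite = (tokens.length * 7919) % VOCAB_SIZE; // arbitrary choice
  scores[favourite] = 5.0;
  return scores;
};

const promptIds = [760, 7993, 7338, 383, 279]; // "The cat sat on the"
const toyLogitsOut = toyModel(promptIds);
console.log(toyLogitsOut.length); // 248320
```

Whatever happens inside the function, the contract stays the same: the output always has one entry per vocabulary token, no matter how long the input list is.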

That "run them through a softmax" step is small enough to do by hand — here it is on a toy five-token vocabulary, so you can see exactly how raw scores become probabilities before we scale it up to all 248,320:

Softmax on a 5-token toy vocabulary (slider at ×1.00)

The real model emits 248,320 logits — one per vocabulary token. The transform that turns them into probabilities is the same no matter how many there are, so here it is on just five. Drag the slider to scale the raw logits: a bigger factor sharpens the distribution onto the top token, a smaller one flattens it.

token    z_i (logit)    exp(z_i)    ÷ Σ         p_i
"cat"     2.40          11.023      ÷ 19.064    0.578
"dog"     1.60           4.953      ÷ 19.064    0.260
"ran"     0.70           2.014      ÷ 19.064    0.106
"the"    -0.30           0.741      ÷ 19.064    0.039
"."      -1.10           0.333      ÷ 19.064    0.017
Σ                       19.064                  1.000

Probabilities (the last column, drawn as bars): ·cat 0.578, ·dog 0.260, ·ran 0.106, ·the 0.039, . 0.017

Illustrative five-token vocabulary with hand-picked logits — not live output from the model.

The Sharpen ↔ flatten slider scales the raw logits before the softmax: drag it up and the model commits harder to its single top guess (the distribution spikes), drag it down and probability spreads more evenly across the options. That is exactly the knob temperature tunes when you generate text — here it acts as an inverse temperature (higher = sharper).
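The toy softmax and its sharpen/flatten factor can be sketched in a few lines of TypeScript. This is an illustrative version, not the course's own widget code: the `scale` parameter plays the role of the slider, and subtracting the largest logit before exponentiating is a common numerical-stability step that is assumed here rather than taken from the text (it does not change the result).

```typescript
// Softmax with an optional scale factor applied to the logits first.
// scale = 1 is plain softmax; scale > 1 sharpens, scale < 1 flattens.
function softmax(logits: number[], scale = 1): number[] {
  const scaled = logits.map((z) => z * scale);
  // Subtracting the maximum leaves the probabilities unchanged
  // but keeps Math.exp from overflowing on large logits.
  const maxLogit = Math.max(...scaled);
  const exps = scaled.map((z) => Math.exp(z - maxLogit));
  const sum = exps.reduce((acc, e) => acc + e, 0);
  return exps.map((e) => e / sum);
}

// The five hand-picked logits from the toy example: "cat", "dog", "ran", "the", "."
const toyLogits = [2.4, 1.6, 0.7, -0.3, -1.1];

const probs = softmax(toyLogits);
console.log(probs.map((p) => p.toFixed(3))); // [ '0.578', '0.260', '0.106', '0.039', '0.017' ]

const sharper = softmax(toyLogits, 2); // slider dragged up
const flatter = softmax(toyLogits, 0.5); // slider dragged down
console.log(sharper[0] > probs[0], flatter[0] < probs[0]); // true true
```

Because the factor multiplies the logits, it behaves as an inverse temperature: dividing the logits by a temperature T is the same as calling `softmax(logits, 1 / T)`.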

Two kinds of numbers: weights vs activations

Before going further, separate the numbers involved into two buckets — it makes everything later click. The first bucket is the weights: the 852,985,920 stored numbers that are Qwen3.5-0.8B. They were set once, during training, and have been frozen ever since — at inference they are read-only, and they are bit-for-bit identical whether it's your prompt or anyone else's. The second bucket is the activations: the intermediate vectors the forward pass computes for your specific tokens as they flow through the layers. They're born when your request arrives and thrown away the moment the pass finishes. (One sliver survives across loop steps — the KV cache — but that's a within-conversation optimization, not memory of you.)

Weights vs activations

Weights (the brains): ❄ frozen
  • 852,985,920 numbers
  • learned once, during training
  • read-only at inference, identical for every request

Activations (your prompt's data): ↻ fresh each request
  • your prompt → embeddings → layer 1 → ⋯ layers 2–23 ⋯ → layer 24 → logits
  • discarded after the forward pass

Legend: weights are fixed and shared by everyone; activations are computed fresh for your prompt.

Schematic, to show the split — not the model's live state.

The split is also a map of the course: training (the training chapter) is the process that changes the left bucket, while inference — everything you'll do in this course — only ever fills and empties the right one. And it explains a fact people find surprising: the model is the same artifact for every user, and it remembers nothing between calls, because nothing your prompt computes is ever written back into the weights.
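The same split can be shown in code. The sketch below is a toy under stated assumptions: `weights` is a made-up 2×3 matrix standing in for the 852,985,920 real numbers, and `forward` is a hypothetical one-layer pass. What it demonstrates is only the bookkeeping: the weights are read and never written, while the activations are created per call and dropped afterwards.

```typescript
// Toy "model": a single 2x3 weight matrix, frozen after "training".
// The values are made up for illustration.
const weights: ReadonlyArray<ReadonlyArray<number>> = Object.freeze([
  Object.freeze([0.5, -1.0, 0.25]),
  Object.freeze([1.5, 0.0, -0.75]),
]);

// One toy forward pass: reads the weights, never writes to them.
function forward(input: number[]): number[] {
  // `activations` is created fresh on every call and is
  // unreachable (garbage) once the caller is done with it.
  const activations = weights.map((row) =>
    row.reduce((acc, w, i) => acc + w * input[i], 0),
  );
  return activations;
}

const weightsBefore = JSON.stringify(weights);
const outA = forward([1, 2, 3]); // one user's "prompt"
const outB = forward([4, 0, -1]); // another user's "prompt"
const weightsAfter = JSON.stringify(weights);

console.log(outA, outB); // [ -0.75, -0.75 ] [ 1.75, 6.75 ]
console.log(weightsBefore === weightsAfter); // true: nothing was written back
```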

Where do those numbers come from? Fitting, not coding

So where did 852,985,920 specific numbers come from? Not from anyone writing them. Classic software is rules a programmer spells out by hand: if the email contains "free money", mark it as spam. Every behavior is a line of code someone reasoned out and typed. A language model is built the opposite way. Nobody could possibly hand-write a rule for "what word comes next in any sentence" — so instead you start with the numbers at random, show the model mountains of real text, and repeatedly nudge those numbers so its guesses inch closer to what actually came next. The behavior is never spelled out; it is fit to examples. (That fitting process — the training chapter — is the same idea for every number in the model, no matter which kind of layer it lives in.)

The smallest version of this you can hold in your head is a straight line. The line y = a·x + b has exactly two knobs: the slope a and the intercept b. Give it some scattered example points and "training" just means turning those two dials until the line sits as close to the points as it can:

Two knobs, fitted to examples

y = a·x + b — slope a and intercept b are the only two knobs

Training nudges those two numbers until the line sits as close to the dots as it can.

That is the entire trick, scaled up almost beyond belief. A line has 2 knobs. Qwen3.5-0.8B has 852,985,920 — about 426 million times more (852,985,920 ÷ 2 = 426,492,960). Same move, unimaginably more dials: instead of a slope and an intercept bending one line, you are fitting hundreds of millions of numbers so that one enormous function lands close to "the next token humans actually wrote," across nearly everything people have written down.
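The two-knob version is small enough to run. The sketch below fits y = a·x + b to a handful of made-up points by repeatedly nudging a and b in the direction that shrinks the squared error. Gradient descent on mean squared error is one common way to do the nudging and is an assumption of this sketch; the step count and learning rate are arbitrary choices, and the knobs start at zero rather than at random so the run is repeatable.

```typescript
// Fit y = a*x + b to example points by repeatedly nudging a and b.
function fitLine(xs: number[], ys: number[], steps = 5000, learningRate = 0.01) {
  let a = 0; // slope
  let b = 0; // intercept
  const n = xs.length;
  for (let step = 0; step < steps; step++) {
    let gradA = 0;
    let gradB = 0;
    for (let i = 0; i < n; i++) {
      const error = a * xs[i] + b - ys[i]; // how far the guess is from the example
      gradA += (2 / n) * error * xs[i]; // d(mean squared error) / da
      gradB += (2 / n) * error; // d(mean squared error) / db
    }
    a -= learningRate * gradA; // nudge each knob against its gradient
    b -= learningRate * gradB;
  }
  return { a, b };
}

// Made-up example points lying exactly on y = 2x + 1.
const exampleXs = [0, 1, 2, 3, 4];
const exampleYs = [1, 3, 5, 7, 9];
const fitted = fitLine(exampleXs, exampleYs);
console.log(fitted.a.toFixed(2), fitted.b.toFixed(2)); // 2.00 1.00
```

Nobody typed in the slope 2 or the intercept 1; the loop recovered them from the examples alone.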

The generation loop

A single forward pass gives you one token's worth of prediction. To write a sentence, you call the function in a loop — sample one token, append it, and run again on the slightly longer list. ("Sample" just means: pick one token at random, weighted by its probability — the likeliest token usually wins but doesn't always. "Greedy" mode skips the dice and always takes the top one.)
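Both picking rules fit in a few lines. The following is a minimal sketch, not the course's implementation: `greedy` and `sample` are illustrative names, the probabilities are the toy values from the five-token softmax example, and the `random` parameter is an assumption added so the dice roll can be fixed for a repeatable demonstration.

```typescript
// Greedy decoding: always take the index of the highest probability.
function greedy(probs: number[]): number {
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  return best;
}

// Sampling: pick index i with probability proportional to probs[i].
// `random` is injectable so the behaviour can be made repeatable.
function sample(probs: number[], random: () => number = Math.random): number {
  const total = probs.reduce((acc, p) => acc + p, 0);
  let threshold = random() * total; // scaling by total tolerates rounding drift
  for (let i = 0; i < probs.length; i++) {
    threshold -= probs[i];
    if (threshold < 0) return i;
  }
  return probs.length - 1; // fallback for floating-point edge cases
}

const nextTokenProbs = [0.578, 0.26, 0.106, 0.039, 0.017]; // toy values
console.log(greedy(nextTokenProbs)); // 0
console.log(sample(nextTokenProbs, () => 0.7)); // 1
```

With a fixed roll of 0.7 the sampler lands in the second token's slice of the distribution, which is how a less likely token sometimes gets chosen; called without the second argument, `sample` uses `Math.random` and the result varies from run to run.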

The generation loop

Token strip: The·cat·sat·on·the

tokens = tokenize(prompt)
while (!done) {
  logits = model(tokens)   // one forward pass
  next = sample(logits)    // pick one token
  tokens.push(next)        // append, then repeat
}

Forward pass → a score for every token
token     score
·floor    0.420
·mat      0.190
·rug      0.120
·couch    0.080
·bed      0.050

Illustrative numbers, scripted to show the loop's shape — not live output from the model.

An LLM is a function from a list of tokens to a score for every token in its vocabulary. To write more than one token it runs in a loop: forward pass → sample one token → append it → run again on the longer list. That single repeated step is all "generating text" is.

This one-token-at-a-time process is what autoregressive means: each new token is produced by re-running the whole model on a sequence that is one token longer than last time. (Re-running everything at every step sounds wasteful; the KV-cache chapter explains how it's made cheap.)
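Here is the loop as a runnable sketch. Everything model-like in it is a hypothetical stand-in: `nextScores` is a hand-written lookup keyed on the last token only (a real model looks at the whole sequence), tokens are plain strings instead of integer ids for readability, and the pick is greedy so the output is repeatable. The loop structure, including both stop conditions, is the part that mirrors the description above.

```typescript
// Hypothetical stand-in for the model: a lookup from the last token
// to scores for the next one. Values are illustrative, not model output.
const STOP = "<stop>";
const nextScores: Record<string, Record<string, number>> = {
  the: { floor: 0.42, mat: 0.19, rug: 0.12 },
  floor: { [STOP]: 0.9, the: 0.1 },
};

function model(tokens: string[]): Record<string, number> {
  const last = tokens[tokens.length - 1];
  return nextScores[last] ?? { [STOP]: 1 };
}

// Greedy pick: the token with the highest score.
function pickTop(scores: Record<string, number>): string {
  return Object.entries(scores).reduce((best, cur) => (cur[1] > best[1] ? cur : best))[0];
}

function generate(promptTokens: string[], maxNewTokens = 10): string[] {
  const tokens = [...promptTokens];
  for (let i = 0; i < maxNewTokens; i++) { // length limit
    const scores = model(tokens); // one forward pass
    const next = pickTop(scores); // pick one token
    if (next === STOP) break; // stop token ends the loop
    tokens.push(next); // append, then repeat on the longer list
  }
  return tokens;
}

console.log(generate(["The", "cat", "sat", "on", "the"]).join(" "));
// The cat sat on the floor
```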

The whole pipeline, in one breath

Within one turn of that loop, a token's journey is: text → tokens → vectors (embeddings) → a deep stack of attention + MLP blocks → a final score for every token (the LM head) → sample one → append, and loop. (That "deep stack" is hybrid: 6 of the 24 layers use the attention you'll meet in the attention chapter; the other 18 use a cheaper shortcut, covered in the KV-cache chapter.) Every remaining chapter opens up one of those boxes; the architecture chapter puts them back on one page.

Better yet, here is that whole journey as something you can step through — one concrete token, from raw text to the next word, with the shape of the data shown at every step:

Follow one token's journey
"The cat sat on the" → ?
Stage 1 of 6: Text → tokens. Data now: [760, 7993, 7338, 383, 279] · 5 token ids.

The model can't read raw text. The tokenizer first chops your text into known chunks called tokens — here, 5 of them — and looks up each one's id (a plain integer). From here on, the model only ever sees these numbers.

token    id
The      760
cat      7993
sat      7338
on       383
the      279

Note: the leading space is part of the token.

Scrub the rail, or press Play to watch the forward pass run.

Illustrative — a schematic of the real forward pass; the numbers and heatmaps are stand-ins for the real 1024-dim vectors and 248,320 logits, not live model output.
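Stage 1 can be mimicked with a five-entry vocabulary. The sketch below uses the ids shown in the walkthrough, with the leading space stored as part of each token. The greedy longest-match lookup in `tokenize` is an illustrative simplification assumed for this example; it is not the algorithm the real tokenizer uses, and `toyVocab` is a hypothetical name.

```typescript
// Toy vocabulary using the five ids shown in the walkthrough.
// Note the leading spaces: they belong to the tokens.
const toyVocab = new Map<string, number>([
  ["The", 760],
  [" cat", 7993],
  [" sat", 7338],
  [" on", 383],
  [" the", 279],
]);

// Greedy longest-match lookup: an illustrative simplification,
// not the real tokenizer's algorithm.
function tokenize(text: string): number[] {
  const ids: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    let matched = false;
    // Try the longest remaining substring first, then shorter ones.
    for (let end = text.length; end > pos; end--) {
      const id = toyVocab.get(text.slice(pos, end));
      if (id !== undefined) {
        ids.push(id);
        pos = end;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`No token for text at position ${pos}`);
  }
  return ids;
}

console.log(tokenize("The cat sat on the")); // [ 760, 7993, 7338, 383, 279 ]
```

From this point on, the rest of the pipeline only ever sees the integer list, never the original string.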

Why "predict the next token" is enough

It sounds almost too simple to be interesting. The catch is the standard it's held to: to predict the next token well, across all of human text — essays, code, dialogue, translation, arithmetic — the model has to internalize grammar, facts, and patterns of reasoning, because those are what make the next token predictable. Capability is a side effect of getting very good at one narrow objective.

One honest caveat up front: what we've described is a base model — pure autocomplete. The jump to a ChatGPT-style assistant that follows your instructions is a separate training story, covered later in "From base model to assistant."

A second honest caveat: how coherent that next-token guessing feels depends heavily on size. More knobs to fit means more of those grammar, fact, and reasoning patterns can be captured, so bigger models tend to stay on-topic for longer. The one running in this course is deliberately small: 0.85B parameters, chosen so it can run entirely inside your browser tab. It is not a shrunken or out-of-date version of anything: it is Qwen3.5-0.8B, released in 2026, whose 24 layers mix 6 full-attention layers with 18 cheaper GatedDeltaNet layers. But at this size, with no instruction tuning, expect it to wander: it may start a sentence well, then drift somewhere strange. For a sense of the ladder it sits on: this model is ~0.85B parameters, a small open model might be ~1.5B, and the original GPT-3 was ~175B, roughly 200× larger than the model running here. You can watch our 0.85B model wander, and steer how wildly it does, with the temperature controls in the sampling chapter.
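The ladder's ratios are quick to check. This snippet only redoes the arithmetic with the figures quoted above; the 1.5B and 175B values are the rounded sizes from the text, not exact parameter counts.

```typescript
const thisModelParams = 852985920; // Qwen3.5-0.8B, exact count quoted in this chapter
const smallOpenModelParams = 1.5e9; // "a small open model", rounded
const gpt3Params = 175e9; // original GPT-3, rounded

// How many times larger each rung is than the model running here.
const smallRatio = smallOpenModelParams / thisModelParams;
const gpt3Ratio = gpt3Params / thisModelParams;

console.log(smallRatio.toFixed(1)); // 1.8
console.log(gpt3Ratio.toFixed(1)); // 205.2
```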

Engineering takeaways
  • An LLM is one function: a list of tokens in, a score for every possible next token out.
  • Text is generated one token at a time — the same forward pass runs in a loop, each pass seeing one more token.
  • "Predict the next token" is the whole job at inference time; capability is what that objective forces the model to learn.
Try this

In the generation-loop animation, watch the pseudocode. Which line is highlighted at the exact moment a new token joins the strip on the left?

Quick check
1. Mechanically, what does an LLM compute in a single forward pass?
2. How does an LLM produce a whole paragraph rather than just one token?
3. What are "logits"?