Chapter 15 · From base model to assistant

Explain how a base model becomes a helpful assistant — instruction tuning, preference tuning, and the chat template — without changing the architecture.

Everything so far describes a base model: pure autocomplete. But the LLM you actually use answers questions and follows instructions. Something bridges the two.

Glossary · 7 terms
pretraining
Next-token prediction on trillions of tokens of generic text — where the model gets its knowledge.
instruction tuning (SFT)
Supervised fine-tuning on curated (instruction, response) pairs; teaches the model to answer in the assistant role.
RLHF
Reinforcement Learning from Human Feedback — train the model against a reward model built from human preference comparisons.
DPO
Direct Preference Optimization — optimize directly on preferred-vs-rejected response pairs, skipping a separate reward model.
chat template
The formatting convention that turns a multi-turn conversation into the single token string the model reads.
special token
A reserved token (e.g. <|im_start|>, <|im_end|>) that marks structure — turn boundaries, roles, end-of-turn — rather than literal text.
tool calling
A convention in which the chat template lists available tools in a <tools>…</tools> system block; the model then emits a structured <tool_call><function=NAME><parameter=NAME>VALUE</parameter></function></tool_call> instead of prose, and the app runs the call and feeds the result back. The live chat exposes this as the "tools" pill.

From base model to assistant

Here is a surprise that trips up almost everyone: the model the Training chapter described, a pure next-token predictor, would not behave like ChatGPT. Type "What is 2 + 2?" into a raw base model and it might continue with another question, or a list of homework problems, or anything that plausibly follows that string somewhere on the web. It doesn't answer, because nothing ever taught it that a question should be followed by a helpful reply. Two post-training stages and one formatting convention (the chat template) fix that, and none of them touches the architecture. Here is that gap, side by side:

Same prompt, before and after instruction tuning
Base model (autocomplete): continues the text, does not answer it
What is 2 + 2?
After SFT (assistant): treats the question as something to answer
What is 2 + 2?

Both lanes run the same architecture on the same prompt. The base model just predicts what plausibly follows that string on the web — and on the web, one homework question is usually followed by more homework. Instruction tuning shifts what counts as “plausible”: after seeing curated (question, answer) pairs, the likeliest continuation of a question is its answer.

Scripted illustration — the real base model isn't loaded here (the checkpoint this site ships is already instruction-tuned).

Base model → assistant, in three stages
Pretraining (base model)
Data: Trillions of tokens of raw web text, books, code
Learns: Language and world knowledge
Plain next-token cross-entropy. Almost all the GPU-hours live here.
Instruction tuning (SFT)
Data: ~10K–1M curated (instruction, response) pairs
Learns: To follow instructions in the assistant role
Same next-token loss; only the dataset changes.
Preference tuning (RLHF / DPO)
Data: Human comparisons of candidate responses
Learns: To be helpful, harmless, and honest
A new objective: optimize a preference/reward, not plain CE.

Only the first stage needs the internet-scale corpus. The two post-training stages are comparatively tiny — they don't teach the model new facts so much as shape how it uses what pretraining already gave it.

Instruction tuning (SFT)

The first fix is the simplest. Take the pretrained model and keep training it — same forward pass, same next-token cross-entropy — but on a small, curated dataset of (instruction, response) pairs written or vetted by humans. After a few thousand to a few million examples, the model has learned the shape of the task: when the text so far is a user instruction, the most likely continuation is a helpful response in the assistant's voice. That's supervised fine-tuning, or SFT. It is tiny next to pretraining — it adds almost no new knowledge; it teaches the model how to use what it already knows.

One detail of the loss: the user's tokens get no loss — they're masked out, exactly like the unscored last position in the training chapter — and only the assistant's reply tokens are scored. Otherwise the model would be trained to imitate users too, instead of learning to answer them.
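The masking described above can be sketched in a few lines. This is a minimal illustration, not the training code behind this chapter: the per-token probabilities are made up, and `masked_next_token_loss` is a hypothetical helper name.

```python
import math

def masked_next_token_loss(token_logprobs, is_assistant):
    """Average negative log-likelihood over assistant tokens only.

    token_logprobs[i] is the log-probability the model gave to the correct
    token at position i. is_assistant[i] says whether that token belongs to
    the assistant's reply (scored) or to the user's text (masked out).
    """
    scored = [-lp for lp, keep in zip(token_logprobs, is_assistant) if keep]
    if not scored:
        raise ValueError("no assistant tokens to score")
    return sum(scored) / len(scored)

# Hypothetical probabilities of the correct token at five positions:
# three user tokens followed by two assistant tokens.
probs = [0.10, 0.20, 0.05, 0.50, 0.25]
mask = [False, False, False, True, True]
logprobs = [math.log(p) for p in probs]

sft_loss = masked_next_token_loss(logprobs, mask)
unmasked_loss = masked_next_token_loss(logprobs, [True] * len(probs))
```

With the mask on, only the last two positions contribute, so the model is never pushed to predict the user's words. `unmasked_loss` is what the same batch would cost if every position were scored, as in pretraining.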

Pretraining vs SFT — data scale
Pretraining (next-token prediction): ≈ 3T tokens
Instruction tuning (SFT): ≈ 1M examples

SFT bar floored to remain visible on the linear scale — its true width here is ≈ 0.00003% of the pretraining bar.

≈ 3T tokens vs ≈ 1M examples: roughly 3 × 10^6 apart in raw count, or between six and seven orders of magnitude.

Pretraining digests trillions of tokens to build the model's knowledge; SFT needs only thousands to millions of curated examples to reshape behavior into a helpful assistant. On the linear scale the SFT data all but vanishes, and that's the point. Switch to log to see the gap as the multiple-orders-of-magnitude story it really is. That size gap is why post-training is comparatively cheap and fast.

Illustrative orders of magnitude — real corpora vary widely. Tokens (pretraining) and examples (SFT) are different units, so the ratio reads as scale, not a literal apples-to-apples comparison.
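The numbers in the chart can be checked with a few lines of arithmetic. The two figures below are the chart's illustrative values, not measurements of any particular corpus.

```python
import math

pretraining_tokens = 3e12   # about 3T tokens (illustrative)
sft_examples = 1e6          # about 1M examples (illustrative)

# How many times larger the pretraining count is, in raw count.
ratio = pretraining_tokens / sft_examples

# Width of the SFT bar as a percentage of the pretraining bar.
sft_bar_percent = 100 * sft_examples / pretraining_tokens

# The same gap expressed as orders of magnitude (what the log scale shows).
orders_of_magnitude = math.log10(ratio)
```

Here `ratio` comes out at three million, `sft_bar_percent` at about 0.00003, and `orders_of_magnitude` at about 6.5, which is why the SFT bar is invisible on a linear axis.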

Preference tuning (RLHF / DPO)

SFT makes the model follow instructions; preference tuning makes it follow them well: more helpful, more honest, less likely to produce harmful or evasive answers. Humans compare candidate responses ("A is better than B"), and the model is trained to prefer the responses people preferred.

One quick definition first: reinforcement learning (RL) means trying outputs, scoring each one, and making high-scoring behavior more likely. It learns from trial and error instead of from labeled targets. RLHF applies RL with a separate reward model: a second model trained on the human comparisons to predict how much people would like a given response. The main model is then optimized to score well on it. DPO optimizes the preference directly, skipping the reward model. DPO isn't full RL and it isn't next-token cross-entropy either: it's a preference (classification-style) loss over (preferred, rejected) pairs that raises the model's relative likelihood of the preferred response over the rejected one. So this stage departs from the plain next-token loss you've seen, but in the DPO case it's still a supervised loss on a fixed dataset of pairs, not the trial-and-error rollouts of classic RL.
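For readers who want to see the shape of that preference loss, here is a minimal sketch of the DPO objective for a single pair. It follows the published DPO formulation, which compares the model being trained against a frozen reference copy (typically the SFT model) and scales the difference by a strength parameter `beta`; neither detail is spelled out in this chapter, and the log-probability values are invented for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a whole response,
    under the model being trained (logp_*) or the frozen reference (ref_*).
    """
    # How much more the model favors each response than the reference does.
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written with log1p for readability.
    return math.log1p(math.exp(-margin))

# Before any update the model equals the reference, so the margin is zero.
start = dpo_loss(-12.0, -12.0, -12.0, -12.0)      # ln 2, about 0.693

# After training, the chosen response is more likely and the rejected one less.
improved = dpo_loss(-10.0, -14.0, -12.0, -12.0)
```

The loss falls as the chosen response gains likelihood relative to the rejected one, which is the "raise the relative likelihood" behavior described above. No rollouts are involved: every quantity comes from scoring a fixed pair.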

Pick which response a human prefers and apply the update a few times to see what "training on a preference" does. For scale: typical preference datasets are ~10K–1M comparison pairs — the same order of magnitude as SFT, nothing like pretraining's trillions of tokens.

Preference tuning: pick a winner, nudge
A human prefers one response; the update raises P(chosen)
Prompt: Explain recursion to a 5-year-old.
Response A (preferred by the human):
It's like a tiny robot that, to do a big job, makes a smaller copy of itself to do a smaller piece, until the piece is so small it just does it.
Response B:
Recursion is when a function is recursive. It's a standard concept; you'll understand it eventually.
Starting state, after 0 updates: P(A) = 50% (chosen), P(B) = 50% (rejected). Method: RLHF.
Click "apply preference update" to nudge the chosen response (A) up and the rejected one (B) down.

How the update is computed
prompt + pair (A vs B, human picked A) → reward model (scores responses) → model update (RL step)

RLHF trains a separate reward model from the human comparisons, then uses reinforcement learning to optimize the model against that reward.
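The first half of that pipeline, fitting a reward model to comparisons, can be sketched with a toy linear scorer. This assumes a pairwise logistic (Bradley-Terry-style) loss, a common choice that the chapter does not specify, and the two-number feature vectors are hypothetical stand-ins for what a real reward model would read from the full response text.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def reward(weights, features):
    """Toy linear reward: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so chosen responses score higher than rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            # Probability the current reward model assigns to the human's choice.
            p = sigmoid(reward(weights, chosen) - reward(weights, rejected))
            # Gradient descent on -log(p): move weights toward the chosen features.
            for i in range(n_features):
                weights[i] += lr * (1 - p) * (chosen[i] - rejected[i])
    return weights

# Hypothetical features: [uses a concrete analogy, is dismissive]
response_a = [1.0, 0.0]
response_b = [0.0, 1.0]

weights = train_reward_model([(response_a, response_b)], n_features=2)
score_a = reward(weights, response_a)
score_b = reward(weights, response_b)
```

After training, `score_a` exceeds `score_b`, so the reward model has absorbed the human's preference. In RLHF the main model would then be optimized with RL to produce responses this scorer rates highly; that second step is not shown here.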

Illustrative: this is a cartoon. Real preference datasets hold thousands to roughly a million pairs, and the “nudge” is one gradient step on a preference loss (DPO) or a reward-model-guided RL step (RLHF); the probabilities here are hand-set, not from a model.
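The widget's nudge can be imitated with two numbers. The sketch below is a toy under stated assumptions: each response gets a single logit, probabilities come from a two-way softmax, and each update is one gradient step on -log P(A). The function name and learning rate are hypothetical, and nothing here runs a real model.

```python
import math

def apply_preference_updates(n_updates, lr=0.5):
    """Return P(A) before any update and after each of n_updates steps."""
    logit_a, logit_b = 0.0, 0.0   # equal logits, so P(A) = P(B) = 50%
    history = []
    for _ in range(n_updates + 1):
        p_a = math.exp(logit_a) / (math.exp(logit_a) + math.exp(logit_b))
        history.append(p_a)
        # Gradient step on -log P(A): raise the chosen logit, lower the other.
        step = lr * (1 - p_a)
        logit_a += step
        logit_b -= step
    return history

history = apply_preference_updates(3)
```

`history[0]` is the 50% starting point, and each later entry is larger than the one before it. The steps shrink as P(A) approaches 1, because the gradient carries a factor of (1 - P(A)).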

Chat templates: how turns are wired

Through all of this, the model still only ever sees a flat string of tokens — it has no native idea of "messages" or "roles." The chat template is the convention that flattens a multi-turn conversation into that string, using special tokens to mark where each turn starts, who's speaking, and where a turn ends.

The chat template
Reasoning (the “think” pill)
<|im_start|>system You are a helpful assistant. Be concise.<|im_end|>
<|im_start|>user What is 2 + 2?<|im_end|>
<|im_start|>assistant 4<|im_end|>
<|im_start|>assistant <think> ▮ the model reasons here, then closes </think> and answers

The "assistant" is the same next-token predictor from every other chapter — it's just fed a string wrapped in role markers. The special tokens <|im_start|> / <|im_end|> tell it whose turn it is and where a turn stops. Instruction tuning is what teaches it to continue a trailing <|im_start|>assistant with a helpful answer instead of, say, inventing a third user question. The real Qwen3.5 template always injects a <think> reasoning block right after that marker — open (<think>\n) when the live chat's think pill is on, or pre-closed (<think>\n\n</think>\n\n) when it's off — which is the sub-toggle above. (A couple of rarer markers, like the tool-calling block below, are still elided for readability; the app fills those in for you.)
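A simplified renderer makes the flattening concrete. This is a sketch, not the real Qwen template: `render_chat` is a hypothetical helper, the newline after each role name and after each <|im_end|> is an assumption borrowed from ChatML-style templates (the widget shows turns on one line), and tool blocks are left out. The two <think> prefixes are the ones quoted above.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def render_chat(messages, thinking=True):
    """Flatten a list of {'role', 'content'} dicts into one prompt string."""
    parts = [
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n"
        for m in messages
    ]
    # Open reasoning block when thinking is on, pre-closed block when off.
    think_prefix = "<think>\n" if thinking else "<think>\n\n</think>\n\n"
    # Trailing assistant marker: generation starts right after this.
    parts.append(f"{IM_START}assistant\n{think_prefix}")
    return "".join(parts)

conversation = [
    {"role": "system", "content": "You are a helpful assistant. Be concise."},
    {"role": "user", "content": "What is 2 + 2?"},
]

prompt_thinking = render_chat(conversation, thinking=True)
prompt_direct = render_chat(conversation, thinking=False)
```

Both prompts are ordinary strings. The only difference between them is the text after the final assistant marker, which is how a UI toggle can change the model's behavior without changing the model.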

It's worth noticing how much of the "chatbot" feeling comes from this framing rather than from training. A base model only autocompletes. But if you hand even a base model a scene-setting system prompt — something like "the following is a conversation between a curious user and a knowledgeable, helpful assistant" — and then start the user's turn, the most plausible continuation is already an assistant-style reply, simply because that's what the surrounding text implies. Early GPT-3 demos in 2020 worked exactly this way: GPT-3 was a base model, and people coaxed assistant-like behavior out of it purely by writing a vivid prelude and a well-shaped prompt; the supervised and preference tuning that made it reliable came later. The Qwen system block you see in the widget above is terse — just a short instruction like "You are a helpful assistant" — so it does only part of that framing work, and SFT plus preference tuning does the rest. Framing gets you the first plausible reply; post-training is what makes the assistant show up every time instead of only when the scene is set just right.

When you chat with the model in this app, it wraps your conversation in this same format for you, appends a trailing <|im_start|>assistant, and lets the model generate the answer — stopping the moment it emits <|im_end|>. SFT is what taught the model to emit that end-of-turn token instead of inventing a fake next user message.
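The stopping rule is simple enough to sketch as a loop. The "model" below is a stand-in that replays a scripted list of tokens, so the example runs without any weights; `generate_reply` and `make_scripted_model` are hypothetical names, and a real app would call the network where `next_token` is called.

```python
def generate_reply(next_token, prompt, stop_token="<|im_end|>", max_tokens=50):
    """Append tokens one at a time until the end-of-turn token appears."""
    reply = []
    context = prompt
    for _ in range(max_tokens):
        token = next_token(context)
        if token == stop_token:
            break                     # the turn is over: stop generating
        reply.append(token)
        context += token              # the new token becomes part of the input
    return "".join(reply)

def make_scripted_model(tokens):
    """Stand-in for a real model: returns the scripted tokens in order."""
    remaining = iter(tokens)
    return lambda context: next(remaining)

# The script keeps going past the end-of-turn token, as a base model might.
model = make_scripted_model(["4", "<|im_end|>", "<|im_start|>", "user"])
prompt = "<|im_start|>user\nWhat is 2 + 2?<|im_end|>\n<|im_start|>assistant\n"
answer = generate_reply(model, prompt)
```

`answer` is just `"4"`: the loop exits on <|im_end|> and never reads the tokens that would have started a fake user turn. The `max_tokens` cap is a safety net for the case where the stop token never arrives.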

The same template threads in two more things you'll see as pills in the live chat. The think pill controls a reasoning block: right after the assistant marker the template injects either <think>\n (thinking on — the model reasons, then closes </think> before its answer) or a pre-closed <think>\n\n</think>\n\n (thinking off). The tools pill adds a <tools>…</tools> system block describing functions the model may call; instead of prose it then emits a structured call — <tool_call><function=get_weather><parameter=city>Paris</parameter></function></tool_call> — which the app executes and feeds back as another turn. Both are pure formatting conventions: the network is unchanged; SFT taught it to honor them.
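On the app side, honoring a tool call means pulling the function name and parameters out of the generated text. The sketch below parses the exact format quoted above with regular expressions; `parse_tool_call` is a hypothetical helper, and a production parser would need to handle malformed or multiple calls.

```python
import re

TOOL_CALL = re.compile(
    r"<tool_call><function=([^>]+)>(.*?)</function></tool_call>", re.DOTALL
)
PARAMETER = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)

def parse_tool_call(text):
    """Return {'name', 'arguments'} for the first tool call, or None."""
    match = TOOL_CALL.search(text)
    if match is None:
        return None                   # plain prose: nothing to execute
    name, body = match.group(1), match.group(2)
    arguments = {key: value.strip() for key, value in PARAMETER.findall(body)}
    return {"name": name, "arguments": arguments}

generated = (
    "<tool_call><function=get_weather>"
    "<parameter=city>Paris</parameter>"
    "</function></tool_call>"
)
call = parse_tool_call(generated)
```

`call` is `{"name": "get_weather", "arguments": {"city": "Paris"}}`. The app would look up the named function, run it with those arguments, and append the result to the conversation as another turn before asking the model to continue.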

It is still the same model

The thing to hold onto: none of this changed the network. Same 24 layers, same attention and MLP blocks, same forward pass producing logits for the next token. Post-training only nudged the weights so the function's outputs are shaped the way we want, and wrapped the input in a chat format. The "assistant" you talk to is the base model from every other chapter — wearing a chat template, with its weights gently steered toward being helpful.

Engineering takeaways
  • A base model only continues text; instruction tuning is what makes it answer instructions in an assistant role.
  • Post-training (SFT, then preference tuning) reuses the same model and mostly the same loss — it shapes behavior, it does not add new architecture.
  • A chat conversation is just a formatted string: special tokens mark whose turn it is; the model generates after a trailing assistant marker and stops at the end-of-turn token.
Try this

Toggle the chat-template widget to 'Raw text the model sees'. Which special token appears right before the model starts generating, and which one tells it to stop?

Quick check
1. What does a base (pretrained-only) model do if you type a question into it?
2. How does instruction tuning (SFT) differ from pretraining?
3. In the chat template, what is the role of special tokens like <|im_start|> and <|im_end|>?