Tool calling & constrained decoding

Go deeper · Chapter 11, Sampling — chapter 15 showed you the tool_call format; here is how a production system can make that format unbreakable, using nothing but this chapter's own sampling machinery plus one extra mask.

Letting a model call external functions goes back to ReAct (Yao et al., Oct 2022): interleave free-text reasoning with discrete actions — a lookup, a search — in the same generation stream, so the model can plan, act, observe, and re-plan. OpenAI productized the pattern in June 2023 as function calling (models gpt-4-0613 / gpt-3.5-turbo-0613): the model was fine-tuned to emit a JSON blob describing a function call instead of prose. Simon Willison, writing the same week, called it exactly what it was — “an implementation of the ReAct pattern, with models that have been fine-tuned to execute it.” And that is the catch: fine-tuning teaches a model to usually emit well-formed JSON, with no hard guarantee it always will.

Masking, not hoping

The fix doesn't touch training at all. It happens at the exact moment this chapter has spent the whole time on: picking the next token. Ordinarily you take the raw logits, maybe apply temperature, maybe keep only the top-k or top-p mass, then sample or argmax. Constrained (grammar-based) decoding adds one more step before that: track which tokens are still syntactically legal given the schema and everything generated so far, and set the logit of every illegal one to −∞ before the usual sampling runs. The Outlines paper (Willard & Louf, 2023) formalizes this by compiling a regex or JSON-schema/CFG into a finite-state machine and indexing the vocabulary against FSM states; llama.cpp's GBNF grammars do the equivalent job in C++. Walk through it step by step using the exact tag format chapter 15 introduced: <tool_call><function=get_weather><parameter=city>Paris</parameter></function></tool_call>.

Masking the grammar, one token at a time

Completion so far:
<tool_call><function=get_weather><parameter=city>

Raw probabilities (unconstrained):
Paris (0.52): plain text; any content without a raw `<` is a valid string value
I (0.09): also valid; the grammar checks syntax, not meaning
<function= (0.19): a raw `<` here would start what looks like a new tag
{ (0.12): this tag format never uses JSON braces
</parameter> (0.08): closes the value before any content; city can’t be empty

Grammar-masked, then renormalized:
Paris (0.85, picked): plain text; any content without a raw `<` is a valid string value
I (0.15): also valid; the grammar checks syntax, not meaning

Mass kept: 0.61 (2 of 5 candidates survive)

This is the same machinery as the rest of this chapter: build a keep/reject mask over the vocabulary, set every rejected logit to −∞, renormalize, then run argmax or top-p over what survives. The only thing that changes is what decides the mask — a probability cutoff for top-p, grammar validity here.
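The mask-then-renormalize step can be sketched in a few lines of plain Python. This is a minimal illustration, not this repository's sampler: `masked_softmax` and the `keep` list are hypothetical names, and the numbers are the illustrative probabilities from the walkthrough above rather than a live model run.

```python
import math

def masked_softmax(logits, keep):
    """Set rejected logits to -inf, then softmax over what survives."""
    masked = [x if k else -math.inf for x, k in zip(logits, keep)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("mask rejected every token")
    # exp(-inf) is exactly 0.0, so masked tokens get zero probability.
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative candidates and raw probabilities from the walkthrough.
tokens = ["Paris", "I", "<function=", "{", "</parameter>"]
raw_probs = [0.52, 0.09, 0.19, 0.12, 0.08]
logits = [math.log(p) for p in raw_probs]

# Hypothetical grammar verdicts: mid-value, only plain-text tokens are legal.
keep = [True, True, False, False, False]

probs = masked_softmax(logits, keep)
mass_kept = sum(p for p, k in zip(raw_probs, keep) if k)

# Greedy pick over the survivors; top-k or top-p could run here instead.
picked = tokens[max(range(len(tokens)), key=probs.__getitem__)]
```

With these inputs the two survivors renormalize to 0.52 / 0.61 ≈ 0.85 and 0.09 / 0.61 ≈ 0.15, matching the figures in the walkthrough. Swapping the `keep` list for a top-p cutoff would reuse `masked_softmax` unchanged, which is the point of the paragraph above.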

Full completion, step by step:
<tool_call><function=get_weather><parameter=city>Paris</parameter></function></tool_call>

Illustrative probabilities for a hypothetical grammar over this exact tag format — the same idea as Outlines’ regex/CFG-to-FSM compilation or llama.cpp’s GBNF grammars — not a live run of the model, and not what this repo’s own generation loop does (see below).

Notice the mask isn't always the same shape. Mid-value, almost anything is legal (a value is just text) — the mask barely narrows the field. Right after get_weather's one required parameter closes, exactly one token is legal, </function> — the mask alone decides the next token before temperature or top-p even get a turn. That is the general lesson: grammar masking guarantees the syntax is unbreakable; it says nothing about whether the content — the city name, the reasoning behind it — is any good. Those two are genuinely separate problems.
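To see how the legal set changes shape from state to state, here is a toy finite-state machine for this one call shape. It is a sketch under simplifying assumptions: whole tags are treated as single tokens, `TEXT` stands in for "any token without a raw `<`", and the state names are invented for illustration. Real systems such as Outlines or GBNF work over the model's actual subword vocabulary.

```python
# Stand-ins for plain-text tokens that are legal inside a parameter value.
TEXT = ("Paris", "I")

# Transition table: state -> {legal token -> next state}.
# Hypothetical and hard-coded to get_weather with one required parameter.
FSM = {
    "start":        {"<tool_call>": "call_open"},
    "call_open":    {"<function=get_weather>": "func_open"},
    "func_open":    {"<parameter=city>": "value_empty"},
    # The value must contain at least one text token before it can close.
    "value_empty":  {t: "value" for t in TEXT},
    "value":        {**{t: "value" for t in TEXT}, "</parameter>": "param_closed"},
    "param_closed": {"</function>": "func_closed"},
    "func_closed":  {"</tool_call>": "done"},
    "done":         {},
}

def allowed(state):
    """Tokens the grammar permits in this state; everything else is masked."""
    return set(FSM[state])

def advance(state, token):
    """Move to the next state, refusing tokens the mask would have removed."""
    if token not in FSM[state]:
        raise ValueError(f"{token!r} is illegal in state {state!r}")
    return FSM[state][token]

def state_after(tokens):
    """Replay a token sequence from the start state."""
    state = "start"
    for token in tokens:
        state = advance(state, token)
    return state

prefix = ["<tool_call>", "<function=get_weather>", "<parameter=city>"]
mid_value = allowed(state_after(prefix))
after_close = allowed(state_after(prefix + ["Paris", "</parameter>"]))
```

Here `mid_value` contains every text stand-in, while `after_close` contains only `</function>`. When the allowed set has a single member, the sampler has nothing left to decide, so temperature and top-p have no effect at that step.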

What masking buys you, measured

OpenAI's own numbers make the case cleanly. When they shipped Structured Outputs (strict: true) in August 2024, they described it as moving from “fine-tune and hope” to “a deterministic, engineering-based approach called constrained sampling.” On their complex JSON-schema evaluation: gpt-4-0613 (2023-era function calling, fine-tuning only) scored under 40%; a newer model, gpt-4o-2024-08-06, with fine-tuning alone reached 93%; adding constrained decoding (Structured Outputs) on top reached 100%. Masking doesn't make the model smarter about what to call — it makes the shape of the call unbreakable.

Does your in-browser Qwen do this?

No. This repository's sampler (crates/mlx-core/src/sampling.rs) implements exactly temperature, top-k, top-p, and min-p — the knobs from earlier in this chapter — and nothing that masks tokens for grammar or schema validity: no logit_bias, no allowed-token mask, no compiled grammar anywhere in the pipeline (confirmed by reading the sampler source and grepping this project's browser/WASM driving code for grammar-shaped keywords — none exist). Instead, tool calls are recovered after the fact: the model generates completely freely, and crates/mlx-core/src/tools/mod.rs — its own doc comment says so directly — “uses simple string-based parsers instead of regex” to scan the finished text for <tool_call> tags once generation is done. The clearest proof there is no hard constraint: the parser's own result type, ToolCallResult, has real failure states — invalid_json, missing_name, parse_error — that only exist because nothing stopped the model from producing malformed output in the first place. A true grammar mask would make those states unreachable by construction; here, they are live code paths. So the tag format chapter 15 showed you is exactly as trustworthy as the model's own learned habit of formatting it correctly — the same category as OpenAI's original 2023 function calling, not the constrained-decoding category that came after it.
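For contrast with masking, the after-the-fact approach looks roughly like the sketch below. This is not the repository's Rust parser: it is a hypothetical Python analogue that borrows the `missing_name` and `parse_error` state names quoted above, adds an invented `no_call` state, and handles only the tag format from chapter 15 using plain string search.

```python
def parse_tool_call(text):
    """Scan finished text for one <tool_call> block using plain string search."""
    start = text.find("<tool_call>")
    if start == -1:
        return {"status": "no_call"}  # hypothetical state: model answered in prose
    end = text.find("</tool_call>", start)
    if end == -1:
        return {"status": "parse_error", "detail": "unclosed <tool_call>"}
    body = text[start + len("<tool_call>"):end]

    # Function name sits between "<function=" and the next ">".
    f_open = body.find("<function=")
    if f_open == -1:
        return {"status": "missing_name"}
    f_close = body.find(">", f_open)
    name = body[f_open + len("<function="):f_close].strip() if f_close != -1 else ""
    if not name:
        return {"status": "missing_name"}

    # Collect every <parameter=key>value</parameter> pair after the name.
    params = {}
    pos = f_close + 1
    while True:
        p_open = body.find("<parameter=", pos)
        if p_open == -1:
            break
        p_gt = body.find(">", p_open)
        p_end = body.find("</parameter>", p_gt) if p_gt != -1 else -1
        if p_end == -1:
            return {"status": "parse_error", "detail": "unclosed <parameter>"}
        params[body[p_open + len("<parameter="):p_gt]] = body[p_gt + 1:p_end]
        pos = p_end + len("</parameter>")
    return {"status": "ok", "name": name, "params": params}

good = parse_tool_call(
    "<tool_call><function=get_weather><parameter=city>Paris</parameter>"
    "</function></tool_call>"
)
truncated = parse_tool_call("<tool_call><function=get_weather><parameter=city>Paris")
nameless = parse_tool_call("<tool_call><parameter=city>Paris</parameter></tool_call>")
```

Every early `return` with a non-`ok` status is a branch that exists only because generation was unconstrained. Under a grammar mask like the one sketched earlier in this section, the truncated and nameless inputs could not have been produced, so those branches would be dead code.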