GGUF Bundles Models Well, But Tool Grammars and Think Tokens Are Missing

via Hacker News

What's in a GGUF, besides the weights – and what's still missing?

GGUF, the single-file model format used by llama.cpp, consolidates everything that safetensors and Ollama spread across multiple files: weights, tokenizer data, chat templates, special tokens, and, increasingly, sampler configuration. Chat templates are full Jinja2 programs (Gemma 4's runs to roughly 250 lines), which forces every inference engine to embed a Jinja interpreter: Hugging Face transformers uses the Python original, llama.cpp wrote its own, and NobodyWho uses minijinja. A recent addition lets sampler chains and their ordering live inside the GGUF itself via general.sampling.sequence, eliminating the copy-paste-from-README ritual users have long endured.
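
To make the single-file claim concrete, here is a minimal sketch, in plain Python, of walking the GGUF metadata section to see what a given file actually bundles. It follows the published GGUF layout (magic, version, tensor count, then length-prefixed key-value pairs) and assumes a v2+ file with 64-bit counts; for real work, the gguf package maintained alongside llama.cpp is the proper tool. tokenizer.chat_template is the standard key for the Jinja template, and general.sampling.sequence is the recent sampler-chain addition mentioned above.

```python
import struct
import sys

# Scalar metadata value types -> (struct format, size in bytes), per the GGUF spec.
SCALARS = {
    0: ("<B", 1), 1: ("<b", 1), 2: ("<H", 2), 3: ("<h", 2),
    4: ("<I", 4), 5: ("<i", 4), 6: ("<f", 4), 7: ("<?", 1),
    10: ("<Q", 8), 11: ("<q", 8), 12: ("<d", 8),
}
STRING, ARRAY = 8, 9

def read_str(f):
    # Strings are a uint64 length followed by UTF-8 bytes, no terminator.
    (n,) = struct.unpack("<Q", f.read(8))
    return f.read(n).decode("utf-8", errors="replace")

def read_value(f, vtype):
    if vtype == STRING:
        return read_str(f)
    if vtype == ARRAY:
        # Arrays are an element type, a count, then the elements back to back.
        etype, count = struct.unpack("<IQ", f.read(12))
        return [read_value(f, etype) for _ in range(count)]
    fmt, size = SCALARS[vtype]
    return struct.unpack(fmt, f.read(size))[0]

def read_metadata(path):
    with open(path, "rb") as f:
        assert f.read(4) == b"GGUF", "not a GGUF file"
        _version, _n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        meta = {}
        for _ in range(n_kv):
            key = read_str(f)
            (vtype,) = struct.unpack("<I", f.read(4))
            meta[key] = read_value(f, vtype)
        return meta

if __name__ == "__main__":
    meta = read_metadata(sys.argv[1])
    for key in ("tokenizer.chat_template", "general.sampling.sequence"):
        print(f"{key}: {'present' if key in meta else 'absent'}")
```

Whatever string comes back under tokenizer.chat_template is exactly what gets handed to Jinja2, llama.cpp's interpreter, or minijinja, which is why every engine ends up embedding one.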

Three gaps still force model-specific code into every engine. Tool-call formats vary wildly among Qwen3, Qwen3.5, and Gemma 4, and the file carries no standardized grammar from which a parser could be derived; NobodyWho goes further, generating per-tool constraining grammars for type-safe calls (an approach sketched below), which hints at the need for a meta-grammar format. Think-token markers, already present upstream on Hugging Face, get dropped during GGUF conversion, leaving engines unable to cleanly separate reasoning from final output. Projection models for multimodal input are likewise absent from the standard.
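
The summary doesn't show NobodyWho's actual generator, but the idea is easy to illustrate. Below is a hypothetical sketch that emits a llama.cpp GBNF grammar constraining decoding to one well-formed JSON call for a single tool; the {"name": ..., "arguments": {...}} shape and the three type rules are assumptions for the example, not any model's real tool-call format.

```python
# GBNF rules for the JSON argument types this sketch supports
# (a minimal assumed subset; real generators cover full JSON Schema).
GBNF_TYPES = {
    "string": 'string ::= "\\"" [^"]* "\\""',
    "number": 'number ::= "-"? [0-9]+ ("." [0-9]+)?',
    "boolean": 'boolean ::= "true" | "false"',
}

def gbnf_literal(text: str) -> str:
    # Quote a chunk of required literal output for GBNF.
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def tool_grammar(name: str, params: dict) -> str:
    # Build a root rule that accepts exactly one call to `name`,
    # alternating literal JSON scaffolding with type-rule references.
    tokens = [gbnf_literal('{"name": "%s", "arguments": {' % name)]
    for i, (pname, ptype) in enumerate(params.items()):
        sep = ", " if i else ""
        tokens.append(gbnf_literal('%s"%s": ' % (sep, pname)))
        tokens.append(ptype)  # reference to one of the type rules above
    tokens.append(gbnf_literal("}}"))
    rules = ["root ::= " + " ".join(tokens)]
    rules += sorted(GBNF_TYPES[t] for t in set(params.values()))
    return "\n".join(rules)

print(tool_grammar("get_weather", {"city": "string", "days": "number"}))
```

Fed to llama.cpp via its --grammar flag, this makes ill-typed calls unsampleable; a standardized meta-grammar inside the GGUF would let any engine derive the same thing without hand-written per-model code.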

The takeaway: GGUF has quietly absorbed most of the metadata an inference engine needs to stay model-agnostic, but tool-calling grammars and think tokens remain the obvious next additions — and the think-token fix is essentially free if someone updates the conversion pipeline.
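
The think-token half of that fix is easy to picture: once the marker pair survives conversion, ideally as metadata keys an engine can read, splitting reasoning from the answer is a few lines. The <think>/</think> pair below is what several recent reasoning models emit and stands in for whatever the GGUF would declare; the hard-coded defaults are exactly the per-model guesswork the article wants to eliminate.

```python
def split_reasoning(text: str, open_tag: str = "<think>",
                    close_tag: str = "</think>") -> tuple[str, str]:
    # Ideally open_tag/close_tag would come from GGUF metadata; the
    # defaults here are assumptions, i.e. per-model guesswork.
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        return "", text.strip()  # no reasoning block: everything is answer
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
print(reasoning)  # -> 2 + 2 = 4
print(answer)     # -> The answer is 4.
```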

Continue reading at Hacker News →

This is an AI-generated summary. Read the original for the full story.