Ollama has recently added support for structured outputs. But how exactly do LLMs generate structured outputs?
I have used prompting tricks such as “answer with JSON only” or showing examples in the prompt, but they are not very reliable. Ollama, on the other hand, can reliably produce structured outputs: either valid JSON, or JSON that conforms to a user-defined schema. Granted, an unexpected EOF error sometimes happens, but mostly with more complex schemas and small models (e.g. Llama 3.2 3B).
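The simpler of the two modes, plain JSON, only requires passing format="json" to the chat call. A minimal sketch (the model name here is just an example; any local model works):

from ollama import chat

response = chat(
    messages=[{"role": "user", "content": "List three colors as JSON"}],
    model="llama3.2",  # example model name; substitute any local model
    format="json",  # constrain the output to valid JSON, no particular schema
)
print(response.message.content)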
For the schema mode, it’s possible to define schemas using Pydantic:
from ollama import chat
from pydantic import BaseModel


class Joke(BaseModel):
    id: int
    setup: str
    punchline: str
    category: str | None = None  # optional field
    tags: list[str] | None  # required field, but may be null


response = chat(
    messages=[
        {
            "role": "user",
            "content": "Tell me a funny joke",
        }
    ],
    model="deepseek-r1",
    format=Joke.model_json_schema(),
)

joke = Joke.model_validate_json(response.message.content)
print(joke.model_dump_json(indent=2))
Output:
{
  "id": 1,
  "setup": "Why did the chicken cross the road?",
  "punchline": "To get to the other side!",
  "category": null,
  "tags": ["chicken", "road", "joke"]
}
Grammars
Ollama uses llama.cpp under the hood to run LLMs. Structured outputs, such as valid JSON, are made possible by constrained decoding. This works by modifying how the next token is selected: tokens that would violate the grammar rules are masked out, so the model can only choose tokens that keep the output valid.
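The masking step is easy to sketch in a few lines. This is not llama.cpp’s actual implementation, just an illustration: disallowed tokens get a log-probability of minus infinity, so they can never be picked.

import math

def constrained_sample(logits: dict[str, float], is_allowed) -> str:
    """Pick the best token among those the grammar still allows."""
    masked = {
        token: (score if is_allowed(token) else -math.inf)
        for token, score in logits.items()
    }
    # Greedy pick for simplicity; real decoders sample from the softmax.
    return max(masked, key=masked.get)

# Toy state: inside a JSON object, only '"' or '}' keep the output valid.
logits = {'"': 1.2, "}": 0.7, " hello": 2.5}
print(constrained_sample(logits, lambda t: t in {'"', "}"}))  # never " hello"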
We can use the llama_cpp Python library to convert the JSON schema above to a grammar.
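A sketch of that conversion step; json_schema_to_gbnf is an assumption about the library’s exports, so check the import path against your llama-cpp-python version:

import json

# Assumed helper; a port of llama.cpp's JSON-schema-to-grammar converter.
from llama_cpp.llama_grammar import json_schema_to_gbnf

print(json_schema_to_gbnf(json.dumps(Joke.model_json_schema())))

This prints the grammar: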
category ::= string | null
category-kv ::= "\"category\"" space ":" space category
char ::= [^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
id-kv ::= "\"id\"" space ":" space integer
integer ::= ("-"? integral-part) space
integral-part ::= [0-9] | [1-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9])?)?)?)?)?)?)?)?)?)?)?)?)?)?)?
null ::= "null" space
punchline-kv ::= "\"punchline\"" space ":" space string
root ::= "{" space id-kv "," space setup-kv "," space punchline-kv "," space tags-kv ( "," space ( category-kv ) )? "}" space
setup-kv ::= "\"setup\"" space ":" space string
space ::= " "?
string ::= "\"" char* "\"" space
tags ::= tags-0 | null
tags-0 ::= "[" space (string ("," space string)*)? "]" space
tags-kv ::= "\"tags\"" space ":" space tags
However, we are not limited to JSON outputs. For instance, to rate movies we can write a grammar that only accepts valid ratings between 0.0 and 5.0 as outputs:
from llama_cpp.llama import Llama, LlamaGrammar

# Ratings are a single decimal between 0.0 and 5.0.
grammar = LlamaGrammar.from_string(
    """
    root ::= "5.0" | leading "." trailing
    leading ::= [0-4]
    trailing ::= [0-9]
    """
)

llm = Llama.from_pretrained(
    repo_id="MaziyarPanahi/Llama-3.2-3B-Instruct-GGUF",
    filename="Llama-3.2-3B-Instruct.Q8_0.gguf",
)

response = llm("Rate the movie Dune: Part Two (2024)", grammar=grammar, max_tokens=-1)
print(response["choices"][0]["text"])
Output:
3.5
This bridges an important gap in LLM usage: reliably generating output in a structured format. The unstructured alternative, parsing free-form strings, is error-prone. Not only that, but there are claims that structured outputs with constrained decoding outperform unstructured outputs in some tasks.