Large Language Models (LLMs) are known for "hallucinating"—creating outputs that sound plausible but are factually wrong or nonsensical (a more technical definition can be found here). While this can be harmless in creative tasks like storytelling, it becomes a serious issue in structured tasks like automating API calls or database queries.
For example, a travel assistant might book a flight to "Vancouver, Washington" instead of "Vancouver, Canada," or a healthcare tool might format medication dosages incorrectly. These errors can disrupt workflows, erode trust, and harm user experience.
Although hallucinations occur in all LLM-based applications, most research focuses on detecting them in free-text generation. This blog explores how we can detect hallucinations in structured outputs, specifically in function calls, using uncertainty measures like entropy and variance of entropy (VarEntropy). These simple yet powerful metrics can identify errors early, sometimes as soon as the first token.
For our open source project, we built the fastest and most efficient function calling model. Function calling is a capability that lets LLMs interact with your code and external systems in a structured way. Instead of just generating text responses, LLMs can understand when to call specific functions and provide the necessary parameters to execute real-world actions.
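To make this concrete, here is a minimal sketch of what a tool definition and the expected structured output might look like, using the get_weather example that appears later in this post; the JSON-schema style shown is one common convention, not necessarily the exact format our model uses.

```python
# Hypothetical tool definition handed to the model (JSON-schema style convention).
tools = [
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    }
]

# For a query like "What's the weather in Seattle?", the model is expected to emit
# a structured call rather than free text:
expected_call = {"name": "get_weather", "arguments": {"location": "Seattle"}}
```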
In free-text generation, detecting hallucinations is notoriously hard. For example, if an LLM answers “Napoleon died in 1821” (correct) versus “Napoleon died in 1822” (incorrect), traditional methods struggle to flag the error without external knowledge. The problem worsens when valid answers are phrased differently (“The capital of France is Paris” vs. “Paris”), artificially inflating uncertainty metrics.
Structured outputs, like function calls, flip this dynamic. Consider Hermes and Qwen’s function-calling syntax:
<tool_call>
{"name": "get_weather", "arguments": {"location": "Seattle"}}
</tool_call>
Here, the schema is fixed: tool names, JSON brackets, and parameter keys follow strict patterns. Hallucinations manifest as deviations from these patterns—a misspelled tool name (get_weathar), an invalid parameter ("tempature"), or a tool call generated when none is needed. Critically, the model’s uncertainty at specific token positions becomes a reliable proxy for errors.
Entropy measures uncertainty. In LLMs, it tells us how "sure" the model is about the next token. When the model is confident (e.g., it knows the next token should be <tool_call>), entropy is low. When it’s unsure (e.g., hesitating between get_weather or get_forecast), entropy is high. High entropy means the model is guessing, which could lead to errors.
VarEntropy goes a step further. Instead of measuring uncertainty at a single token, it tracks how uncertainty changes across a sequence of tokens. If the model’s confidence is steady (e.g., generating a well-structured tool call), VarEntropy is low. If confidence fluctuates (e.g., the model is sure about some tokens but unsure about others), VarEntropy is high. This inconsistency often signals a problem.
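As a minimal sketch, assuming you can read the model’s per-token probability distributions (for example, by applying a softmax to its logits), both metrics can be computed as follows; here VarEntropy is taken as the variance of the per-token entropies across a generated span, matching the description above.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    p = probs[probs > 0]                  # drop zero entries to avoid log(0)
    return float(-np.sum(p * np.log(p)))

def varentropy(per_token_probs: list) -> float:
    """Variance of the per-token entropies across a generated span."""
    entropies = np.array([token_entropy(p) for p in per_token_probs])
    return float(np.var(entropies))

# Toy distributions over a 4-token vocabulary:
confident = np.array([0.97, 0.01, 0.01, 0.01])   # the model is sure -> low entropy
unsure = np.array([0.4, 0.3, 0.2, 0.1])          # the model is guessing -> higher entropy
print(token_entropy(confident), token_entropy(unsure))
print(varentropy([confident, confident, unsure]))  # fluctuating confidence -> higher VarEntropy
```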
What do high/low entropy values indicate?
By looking at both together, you can spot when the model might be “hallucinating”:
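Roughly speaking: low entropy with low VarEntropy suggests the model is confidently following the expected structure, high entropy suggests it is guessing, and high VarEntropy suggests its confidence is swinging from token to token. A rough decision sketch, with placeholder thresholds that you would calibrate on your own traces rather than treat as recommended values:

```python
MEAN_ENTROPY_THRESHOLD = 0.5   # nats; illustrative placeholder
VARENTROPY_THRESHOLD = 0.1     # illustrative placeholder

def read_span(entropies: list) -> str:
    """Classify a generated span from its per-token entropies."""
    mean_h = sum(entropies) / len(entropies)
    var_h = sum((h - mean_h) ** 2 for h in entropies) / len(entropies)
    if var_h >= VARENTROPY_THRESHOLD:
        return "unstable"    # confidence fluctuates: inspect for hallucination
    if mean_h >= MEAN_ENTROPY_THRESHOLD:
        return "guessing"    # uniformly high uncertainty
    return "confident"       # steady, low uncertainty: likely following the template
```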
At first glance, relying on token-level entropy (often called "naive entropy") seems flawed. After all, LLMs can express the same idea in countless ways—why trust a metric that treats “Paris” and “City of Light” as entirely different answers even though they carry the same meaning? (A more in-depth discussion of semantic measures can be found here.)
But in structured outputs, valid responses are syntactically constrained by the user’s query and the required format, drastically limiting variability. For instance, in the Hermes function-calling format shown above, every tool call opens with <tool_call>, contains a single JSON object with "name" and "arguments" keys, and closes with </tool_call>.
This rigidity means the model isn’t “choosing” how to express an answer—it’s following a template. When hallucinations occur, the model strays from the template, and its token-level uncertainty spikes. For example, a misspelled tool name like get_weathar splits probability mass across several near-miss tokens, producing a visible spike in entropy at that position.
Essentially, structural constraints turn hallucinations into anomalies—statistical outliers in the model’s confidence landscape.
Let’s see how entropy and VarEntropy catch common function-calling errors:
Case 1: Phantom Tool Calls
Query: “What is a cake?”
Valid response: a plain-text answer defining a cake.
Hallucinated response:
<tool_call>
{"name": "find_recipe", "arguments": {"dish": "cake"}}
</tool_call>
Detection: Even if find_recipe is a valid tool, the model’s entropy at the <tool_call> token reveals uncertainty. If the query doesn’t warrant a tool, this token’s entropy will spike compared to queries that explicitly request an API call.
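A minimal sketch of this check, assuming your inference stack exposes per-token probabilities (for example, logprobs returned alongside the generation); the threshold is illustrative and would be calibrated on queries that you know do and do not require a tool.

```python
import numpy as np

PHANTOM_CALL_THRESHOLD = 1.0  # nats; illustrative placeholder

def suspicious_tool_call_opening(opening_token_probs: np.ndarray) -> bool:
    """Flag a tool call whose opening <tool_call> token was emitted with high uncertainty."""
    p = opening_token_probs[opening_token_probs > 0]
    entropy = -float(np.sum(p * np.log(p)))
    return entropy > PHANTOM_CALL_THRESHOLD
```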
Case 2: Misguided Parameters
Query: “Get insurance information”
Hallucinated response:
<tool_call>
{"name": "claim_info", "arguments": {"id": "12345"}}
</tool_call>
Detection: The model hallucinates the parameter "id": "12345", where the token probabilities are skewed by fine-tuning artifacts, dataset bias, or generation settings. The token “1” shows elevated entropy, as the model struggles to reconcile the query’s context with its flawed output.
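A sketch of how this kind of error can be localized, assuming the decoder returns the generated tokens and their probability distributions aligned one-to-one: scan the per-token entropies across the arguments and surface the most uncertain position.

```python
import numpy as np

def most_uncertain_token(tokens: list, per_token_probs: list) -> tuple:
    """Return the generated token with the highest entropy and its entropy value."""
    entropies = []
    for probs in per_token_probs:
        p = probs[probs > 0]
        entropies.append(float(-np.sum(p * np.log(p))))
    idx = int(np.argmax(entropies))
    return tokens[idx], entropies[idx]   # e.g. the "1" in "12345" standing out
```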
Case 3: Schema Breakage
Query: “Schedule a demo for tomorrow.”
Hallucinated response:
<tool_call>
{"name": "create_event", "arguments": {"date": 2023-10-12}}
</tool_call>
Detection: The model’s entropy spikes around the unquoted date value, reflecting both syntactic uncertainty and uncertainty in its internal knowledge. VarEntropy across the JSON block further highlights the instability.
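A sketch of a span-level check for this case, again with an illustrative threshold to be calibrated on known-good tool calls: compute VarEntropy over the tokens between <tool_call> and </tool_call> and flag the block if the variance is high.

```python
import numpy as np

JSON_VARENTROPY_THRESHOLD = 0.5  # illustrative placeholder

def json_block_is_unstable(per_token_probs: list) -> bool:
    """Flag a tool-call JSON block whose per-token entropies fluctuate too much."""
    entropies = []
    for probs in per_token_probs:
        p = probs[probs > 0]
        entropies.append(float(-np.sum(p * np.log(p))))
    return float(np.var(entropies)) > JSON_VARENTROPY_THRESHOLD
```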
The beauty of this approach lies in its simplicity. Unlike free-text tasks requiring semantic normalization or external validators, structured outputs let us localize hallucinations to specific tokens. By monitoring entropy at critical points—the first <tool_call> token and the parameter values—we transform a vague, global problem into a series of solvable, local checks.
Hallucination detection isn’t one-size-fits-all. In free text, we need sophisticated methods to account for semantic flexibility. But in structured domains like function calling, the constraints that make outputs predictable also make hallucinations measurable. By leveraging entropy and VarEntropy, developers can build guardrails that catch errors early, ensuring reliability—one token at a time. As LLMs increasingly automate APIs, databases, and workflows, these metrics offer a practical path to better performance and trust.
In this first part, we introduced function calling, entropy, and VarEntropy, and explained what high and low values mean intuitively. In Part 2, we’ll dive deeper into the practical side of computing these metrics and turning them into guardrails.
By the end of the series, you’ll have a clear understanding of how to implement these techniques in your own projects. Stay tuned for actionable insights and reliable hallucination detection!