Detecting Hallucinations in LLM Function Calling with Entropy
Co Tran
Machine Learning Engineer
March 14, 2025

Large Language Models (LLMs) are known for "hallucinating": producing outputs that sound plausible but are factually wrong or nonsensical (a more technical definition can be found here). While this can be harmless in creative tasks like storytelling, it becomes a serious issue in structured tasks like automating API calls or database queries.

For example, a travel assistant might book a flight to "Vancouver, Washington" instead of "Vancouver, Canada," or a healthcare tool might format medication dosages incorrectly. These errors can disrupt workflows, erode trust, and harm user experience.

Although hallucinations occur in all LLM-based applications, most research focuses on detecting them in free-text generation. This blog explores how we can detect hallucinations in structured outputs, specifically in function calls, using uncertainty measures like entropy and variance of entropy (VarEntropy). These simple yet powerful metrics can identify errors early, sometimes as soon as the first token.

Function calling: a structured response

For our open source project, we built the fastest and most efficient function-calling model. Function calling is a capability that lets LLMs interact with your code and external systems in a structured way. Instead of just generating text responses, LLMs can understand when to call specific functions and provide the necessary parameters to execute real-world actions.

In free-text generation, detecting hallucinations is notoriously hard. For example, if an LLM answers “Napoleon died in 1821” (correct) versus “Napoleon died in 1822” (incorrect), traditional methods struggle to flag the error without external knowledge. The problem worsens when valid answers are phrased differently (“The capital of France is Paris” vs. “Paris”), artificially inflating uncertainty metrics.

Structured outputs, like function calls, flip this dynamic. Consider the Hermes and Qwen function-calling syntax:

<tool_call>
{"name": "get_weather", "arguments": {"location": "Seattle"}}
</tool_call>

Here, the schema is fixed: tool names, JSON brackets, and parameter keys follow strict patterns. Hallucinations manifest as deviations from these patterns—a misspelled tool name (get_weathar), an invalid parameter ("tempature"), or a tool call generated when none is needed. Critically, the model’s uncertainty at specific token positions becomes a reliable proxy for errors.

Uncertainty measures

Entropy measures uncertainty. In LLMs, it tells us how "sure" the model is about the next token. When the model is confident (e.g., it knows the next token should be <tool_call>), entropy is low. When it’s unsure (e.g., hesitating between get_weather or get_forecast), entropy is high. High entropy means the model is guessing, which could lead to errors.

VarEntropy goes a step further. Instead of measuring uncertainty at a single token, it tracks how uncertainty varies across the sequence of generated tokens. If the model’s confidence is steady (e.g., while generating a well-structured tool call), VarEntropy is low. If confidence fluctuates (e.g., the model is sure about some tokens but unsure about others), VarEntropy is high. This inconsistency often signals a problem.
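To make this concrete, here is a minimal sketch of how both metrics can be computed, assuming you have access to the raw next-token logits at every generation step. The function and variable names are illustrative, not part of any particular library.

import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    z = logits - logits.max()                       # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

def varentropy(step_entropies: list[float]) -> float:
    """Variance of the per-token entropies across the generated span."""
    return float(np.var(step_entropies))

# step_logits would hold one vocab-sized array per generated token:
# entropies = [token_entropy(logits) for logits in step_logits]
# print(np.mean(entropies), varentropy(entropies))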

What do high/low entropy values indicate?

  • Entropy indicates how confident the model is about each token it generates.
    • Low Entropy = The model feels sure about what it’s producing.
    • High Entropy = The model is uncertain or “guessing” more.
  • VarEntropy indicates how stable that confidence is across different tokens.
    • Low VarEntropy = The model’s confidence level stays about the same throughout the output.
    • High VarEntropy = The model’s confidence swings significantly at different points.

By looking at both together, you can spot when the model might be “hallucinating” (a small heuristic sketch follows the list below):

  1. Low Entropy + Low VarEntropy
    • The model is consistently confident.
    • Suggests a high likelihood of correctness.
  2. Low Entropy + High VarEntropy
    • The model is generally confident but has spikes of uncertainty.
    • Often indicates a small error or brief confusion in an otherwise correct output.
  3. High Entropy + Low VarEntropy
    • The model is consistently uncertain about every token.
    • Often means the model has many plausible choices, which is common in creative writing.
  4. High Entropy + High VarEntropy
    • The model is wildly fluctuating in its confidence.
    • Often leads to chaotic or incoherent outputs, suggesting the model is hallucinating.
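A small heuristic that encodes these four regimes might look like the sketch below. The thresholds are placeholder values that need to be calibrated per model, tokenizer, and task (we return to this in Part 2).

import numpy as np

def classify_regime(step_entropies: list[float],
                    ent_threshold: float = 2.0,
                    varent_threshold: float = 0.5) -> str:
    """Map mean entropy and VarEntropy onto the four regimes described above.

    The thresholds are illustrative placeholders, not calibrated values.
    """
    high_ent = float(np.mean(step_entropies)) > ent_threshold
    high_var = float(np.var(step_entropies)) > varent_threshold
    if not high_ent and not high_var:
        return "consistently confident: likely correct"
    if not high_ent and high_var:
        return "confident with uncertainty spikes: inspect the spiking tokens"
    if high_ent and not high_var:
        return "uniformly uncertain: many plausible choices (e.g. open-ended text)"
    return "unstable confidence: likely hallucination"

In a function-calling pipeline, the second and fourth regimes are the ones worth routing to closer inspection.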

A Naive Approach That Works: Entropy/VarEntropy in Structured Formats

At first glance, relying on token-level entropy (often called "naive entropy") seems flawed. After all, LLMs can express the same idea in countless ways, so why trust a metric that treats “Paris” and “City of Light” as entirely different when they carry the same semantic meaning? (A more in-depth discussion of semantic measures can be found here.)

But in structured outputs, valid responses are syntactically constrained by the user’s query and the required format, which drastically limits variability. For instance, in the Hermes function-calling format:

  • The <tool_call> token must appear if a tool is invoked.
  • JSON keys like "name" and "arguments" are fixed.
  • Parameter values, while flexible, often map to specific types (e.g., dates, IDs).

This rigidity means the model isn’t “choosing” how to express an answer—it’s following a template. When hallucinations occur, the model strays from the template, and its token-level uncertainty spikes. For example:

  1. A query like “Hello, what can you do?” shouldn’t trigger a tool call. If the model hesitates between generating <tool_call> and a direct answer, high entropy at the first token flags a potential hallucination.
  2. A valid tool name like get_weather has low entropy, while a hallucinated get_weathar forces the model to guess unfamiliar tokens, raising entropy.

Essentially, structural constraints turn hallucinations into anomalies—statistical outliers in the model’s confidence landscape.
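Before walking through concrete failure cases, it helps to see where these numbers come from. Below is a hedged sketch using Hugging Face transformers; the model name is a placeholder, the prompt formatting depends on your chat template, and greedy decoding is used only to keep the example simple.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-function-calling-model"   # placeholder: any causal LM you serve
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The prompt would contain your system message, tool schemas, and the user query,
# rendered with whatever chat template your model expects.
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,                 # greedy, to keep the example simple
        return_dict_in_generate=True,
        output_scores=True,              # one score tensor per generated token
    )

# Convert each step's scores into an entropy value.
step_entropies = []
for scores in out.scores:                # tuple of (batch, vocab) tensors
    probs = torch.softmax(scores[0], dim=-1)
    step_entropies.append(float(-(probs * torch.log(probs + 1e-12)).sum()))

# Pair each generated token with its entropy for inspection.
gen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]
gen_tokens = tokenizer.convert_ids_to_tokens(gen_ids.tolist())
decoded_output = tokenizer.decode(gen_ids, skip_special_tokens=False)

The same per-step arrays can be fed into the entropy and VarEntropy helpers sketched earlier, giving a per-token confidence trace for every generated tool call.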

From Theory to Practice: Entropy in Action

Let’s see how entropy and VarEntropy catch common function-calling errors:

Case 1: Phantom Tool Calls

Query: “What is a cake?”

Valid response: A text answer defining what a cake is.

Hallucinated response:

<tool_call>  
{"name": "find_recipe", "arguments": {"dish": "cake"}}  
</tool_call>

Detection: Even if find_recipe is a valid tool, the model’s entropy at the <tool_call> token reveals uncertainty. If the query doesn’t warrant a tool, this token’s entropy will spike compared to queries that explicitly request an API call.
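A minimal sketch of this first-token check, reusing the decoded output and per-token entropies from the earlier snippets; the <tool_call> marker and the threshold value are assumptions you would calibrate for your own model and traffic.

def flags_phantom_tool_call(decoded_output: str,
                            step_entropies: list[float],
                            first_token_threshold: float = 1.5) -> bool:
    """Flag a tool call whose opening token was generated under high uncertainty.

    A warranted tool call usually opens with a near-deterministic <tool_call>
    token; hesitation at that position suggests the call may be a phantom one.
    The threshold is a placeholder to calibrate on held-out examples.
    """
    opens_tool_call = decoded_output.lstrip().startswith("<tool_call>")
    return opens_tool_call and bool(step_entropies) and step_entropies[0] > first_token_threshold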

Case 2: Misguided Parameters

Query: “Get insurance information”

Hallucinated response:

<tool_call>  
{"name": "claim_info", "arguments": {"id": "12345"}}  
</tool_call>  

Detection: The model hallucinates the parameter "id": "12345", a value the query never supplied; the token probabilities are skewed by fine-tuning artifacts, dataset bias, or generation settings. The token “1” shows elevated entropy, as the model struggles to reconcile the query’s context with its fabricated output.
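One lightweight way to localize this kind of error is to surface the highest-entropy tokens in the generated span. A sketch follows (illustrative names, not a library API), again assuming the token list and entropy list from the earlier snippets.

def top_uncertain_tokens(gen_tokens: list[str],
                         step_entropies: list[float],
                         k: int = 5) -> list[tuple[int, str, float]]:
    """Return the k generated tokens with the highest entropy, with positions.

    For Case 2, an elevated entry at the first digit of "12345" points straight
    at the fabricated parameter value, without any external validation.
    """
    ranked = sorted(enumerate(zip(gen_tokens, step_entropies)),
                    key=lambda item: item[1][1], reverse=True)
    return [(pos, tok, ent) for pos, (tok, ent) in ranked[:k]]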

Case 3: Schema Breakage

Query: “Schedule a demo for tomorrow.”

Hallucinated response:

<tool_call> 
{"name": "create_event", "arguments": {"date": 2023-10-12}}
</tool_call> 

Detection: The model’s entropy spikes around the unquoted date value, reflecting both syntactic uncertainty and uncertainty in the model’s internal knowledge. VarEntropy across the JSON block further highlights the instability.
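A sketch of pairing a simple syntactic check with the same uncertainty statistics; it assumes the generation contains a single <tool_call> block and that the entropy list lines up one-to-one with the generated tokens.

import json
import re
import numpy as np

def check_tool_call_payload(decoded_output: str,
                            step_entropies: list[float]) -> dict:
    """Combine a JSON validity check with uncertainty statistics for the span."""
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", decoded_output, re.DOTALL)
    payload = match.group(1) if match else decoded_output
    report = {
        "mean_entropy": float(np.mean(step_entropies)),
        "varentropy": float(np.var(step_entropies)),
        "json_error": None,
    }
    try:
        json.loads(payload)
    except json.JSONDecodeError as err:
        # The unquoted date in Case 3 fails here; the parse error and the
        # entropy spike point at the same region of the output.
        report["json_error"] = f"{err.msg} at position {err.pos}"
    return report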

Why This Matters: Precision Over Generalization

The beauty of this approach lies in its simplicity. Unlike free-text tasks, which require semantic normalization or external validators, structured outputs let us localize hallucinations to specific tokens. By monitoring entropy at critical points (the first <tool_call> token and the parameter values), we transform a vague, global problem into a series of solvable, local checks.
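Putting the pieces together, a guardrail might look like the sketch below. It reuses the helper functions from the earlier snippets and is meant as a starting point rather than a production policy.

def accept_tool_call(decoded_output: str,
                     step_entropies: list[float]) -> bool:
    """Accept the call only if every local check passes; otherwise route it to
    a retry, a re-rank, or human review. Relies on the helpers sketched above."""
    if flags_phantom_tool_call(decoded_output, step_entropies):
        return False
    if check_tool_call_payload(decoded_output, step_entropies)["json_error"]:
        return False
    return classify_regime(step_entropies).startswith("consistently confident")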

Conclusion: Structure as a Superpower

Hallucination detection isn’t one-size-fits-all. In free text, we need sophisticated methods to account for semantic flexibility. But in structured domains like function calling, the constraints that make outputs predictable also make hallucinations measurable. By leveraging entropy and VarEntropy, developers can build guardrails that catch errors early, ensuring reliability—one token at a time. As LLMs increasingly automate APIs, databases, and workflows, these metrics offer a practical path to better performance and trust.

What’s Next?

In this first part, we introduced function calling, entropy, and VarEntropy, and explained what high/low values mean intuitively. In Part 2, we’ll dive deeper into the practical side:

  1. Diving Deeper into the Calculations: How entropy and VarEntropy are computed in practice.
  2. Setting Thresholds: How to decide what’s "high" or "low" for entropy and VarEntropy, and how domain-specific contexts influence these decisions.
  3. Choosing Thresholds: Our approach to balancing precision and recall, with tips for dynamic adjustments based on task complexity.

By the end of our series, you’ll have a clear understanding of how to implement these techniques in your projects. Stay tuned for actionable insights and reliable hallucination detection!

References

  1. https://lilianweng.github.io/posts/2024-07-07-hallucination/
  2. https://link.springer.com/article/10.1007/s10676-024-09775-5
  3. Seq log probs
  4. https://arxiv.org/pdf/2302.09664