Ask a large language model how confident it is and it will give you an answer. The answer will sound reasonable. It will often be wrong.
This isn't a bug that will be patched in the next model release. It's structural. LLMs are trained to produce fluent, coherent text — and expressing calibrated uncertainty is not the same skill as producing fluent text. On average, across many outputs, the calibration isn't terrible. On any individual output — the one you're about to act on — it can be wildly off. The model that says "I'm fairly confident" about a hallucinated contract clause is behaving exactly as it was trained to behave. It just has no reliable access to the ground truth of its own knowledge gaps.
If you want to know whether to trust a specific output, you cannot ask the model. You have to measure.
There are two principled ways to do this. They suit different deployment contexts, but they produce the same thing at the end: a single number, p, the estimated probability that this specific output is wrong, which feeds directly into a decision rule.
Method 1: Ensemble Spread
Run the same task through N model variants simultaneously. Measure how much they disagree.
The intuition is straightforward: if you ask five knowledgeable people the same question and they all independently give the same answer, that answer is probably reliable. If they give five different answers, something about the question is genuinely hard or the information is sparse. The spread of opinions is itself a signal about confidence — one that doesn't rely on any individual's self-assessment.
For AI systems, the variants can be different random seeds, different temperatures, different prompt phrasings, or fine-tuned checkpoints trained on different data splits. What matters is that they're meaningfully independent. When the variants agree tightly, p is low. When they diverge, p is high — and the right response is to escalate rather than act.
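As a rough illustration, here is a minimal sketch of turning variant outputs into a disagreement score. It assumes the outputs are short, directly comparable answers (such as an extracted field), and it uses the share of variants that disagree with the majority as a raw p. The function name `ensemble_disagreement` and the example answers are hypothetical, and a raw score like this still needs calibrating against labelled data before it can be read as a probability.

```python
from collections import Counter

def ensemble_disagreement(outputs):
    """Return (majority_answer, p), where p is the share of variants
    that disagree with the most common answer."""
    if not outputs:
        raise ValueError("need at least one variant output")
    counts = Counter(outputs)
    majority, votes = counts.most_common(1)[0]
    p = 1.0 - votes / len(outputs)
    return majority, p

# Hypothetical example: five variants extract the payment terms from a contract.
answers = ["NET 30", "NET 30", "NET 30", "NET 45", "NET 30"]
majority, p = ensemble_disagreement(answers)
print(majority, round(p, 2))
```

For free-text outputs, exact string matching is too strict. You would need some notion of semantic equivalence before counting votes, which this sketch leaves out.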
When to use it
Ensemble spread works best when you own your compute. Running N variants costs N times the inference, but on your own hardware the marginal cost of the extra passes is low: you're paying per hour for machines that are running regardless, not per token. As long as there is spare capacity and the passes can run in parallel, N runs cost little more than one.
Calibration check: the rank histogram
To verify that your ensemble is actually calibrated, construct a rank histogram: for each labelled case, sort the ensemble's predictions and record the rank at which the ground-truth outcome falls among them. A well-calibrated ensemble produces a flat histogram, because the truth is equally likely to land at any rank.
- U-shaped histogram: outcomes cluster at the extremes, at or beyond the edges of the ensemble's range. The ensemble is overconfident: its spread is too narrow for the certainty it expresses. Increase the diversity among variants.
- Flat histogram: calibrated. Trust the spread.
- Dome-shaped histogram: outcomes cluster in the middle. The ensemble is underconfident — hedging more than the data warrants. Your p values are being unnecessarily inflated.
Run this check periodically on labelled validation data. Calibration drifts as the world changes.
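The rank histogram check can be sketched as follows. This assumes each ensemble member produces a numeric prediction and that ties with the ground truth are rare enough to ignore. The function name `rank_histogram` and the synthetic data are illustrative only, not taken from any particular library.

```python
import random

def rank_histogram(ensemble_preds, truths):
    """Count, for each case, how many ensemble members fall below the truth.

    ensemble_preds: list of cases, each a list of N member predictions.
    truths: list of ground-truth values, one per case.
    Returns a list of N + 1 counts, one per possible rank.
    """
    n_members = len(ensemble_preds[0])
    counts = [0] * (n_members + 1)
    for members, truth in zip(ensemble_preds, truths):
        rank = sum(1 for m in members if m < truth)
        counts[rank] += 1
    return counts

# Synthetic demonstration (assumed data, for illustration only).
rng = random.Random(0)
n_cases, n_members = 5000, 4
truths = [rng.gauss(0, 1) for _ in range(n_cases)]

# Members drawn from the same distribution as the truth: roughly flat.
calibrated = [[rng.gauss(0, 1) for _ in range(n_members)] for _ in range(n_cases)]
# Members far too tightly clustered: truth often falls outside, giving a U shape.
overconfident = [[rng.gauss(0, 0.3) for _ in range(n_members)] for _ in range(n_cases)]

flat = rank_histogram(calibrated, truths)
u_shaped = rank_histogram(overconfident, truths)
print("calibrated:   ", flat)
print("overconfident:", u_shaped)
```

In practice you would replace the synthetic lists with your labelled validation set and inspect the counts, or plot them, each time you rerun the check.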
Method 2: The Interaction Model
The second method is designed for a different cost structure: you're paying per API call, running on a hosted model you don't control, and N inference passes would be expensive. You get one shot at the output. How do you measure uncertainty without running it again?
The answer is a lightweight, task-specific model that learns what normal looks like in your process — and notices when something doesn't.
This is not a "world model" in the grand AI sense. It is deliberately much smaller and much narrower. It learns the expected behaviour of a specific process: the typical sequence of states in an invoice extraction, the normal pattern of transitions in an incident triage workflow, the usual structure of a support conversation at resolution. It doesn't need to understand the world. It just needs to know what this process usually looks like.
At inference time, the Interaction Model predicts the expected next state. When the actual output arrives, it measures the gap between prediction and reality. A large gap, meaning high surprise, tells you the underlying model did something the process does not usually do. That surprise is your uncertainty signal, computed with a single additional lightweight pass rather than N full inference runs.
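One very simple way to realise this idea, offered here as an assumption rather than as the definitive design, is a first-order transition model over discrete process states: count how often each state follows each other state in past runs, then score a new transition by its negative log probability. The class name `InteractionModel`, the smoothing parameter, and the invoice states below are all hypothetical.

```python
import math
from collections import defaultdict

class InteractionModel:
    """Toy first-order transition model over discrete process states."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # add-k smoothing so unseen transitions get non-zero probability
        self.counts = defaultdict(lambda: defaultdict(int))
        self.states = set()

    def fit(self, trajectories):
        """Learn transition counts from past runs (lists of state names)."""
        for traj in trajectories:
            self.states.update(traj)
            for current, nxt in zip(traj, traj[1:]):
                self.counts[current][nxt] += 1
        return self

    def surprise(self, current, actual_next):
        """Return -log P(actual_next | current); larger means more unexpected."""
        vocab = len(self.states | {actual_next})
        row = self.counts.get(current, {})
        total = sum(row.values())
        prob = (row.get(actual_next, 0) + self.smoothing) / (total + self.smoothing * vocab)
        return -math.log(prob)

# Hypothetical invoice-processing history.
history = [
    ["received", "parsed", "validated", "posted"],
    ["received", "parsed", "validated", "posted"],
    ["received", "parsed", "flagged", "reviewed"],
]
model = InteractionModel().fit(history)
usual = model.surprise("parsed", "validated")  # a transition seen often
odd = model.surprise("parsed", "posted")       # a transition never seen
print(round(usual, 3), round(odd, 3))
```

A real deployment would likely use richer state features and a learned predictor, but the shape is the same: predict, observe, score the gap.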
When to use it
The Interaction Model is the right choice when API costs make ensembles prohibitive, or when you're running a multi-step agent workflow where uncertainty needs to be tracked step-by-step across a trajectory rather than measured once on a single output.
Calibration check: the surprise trajectory
Track the surprise score across a sequence of interactions over time.
- Falling curve: surprise decreases as the model accumulates context. Healthy — the system is converging. Trust the low-surprise outputs.
- Stuck high: surprise remains elevated. The task has drifted outside the Interaction Model's training distribution, or the underlying model has changed. Do not act autonomously. Escalate and recalibrate.
- Sudden spike mid-sequence: something changed — a new entity, a topic shift, an unexpected input format. Treat as a local uncertainty event and route accordingly.
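The three patterns above can be told apart with a few simple heuristics. The sketch below is one possible approach. The thresholds `high` and `spike_ratio` are arbitrary assumptions that would need tuning on your own surprise scores, and the function name `classify_trajectory` is hypothetical.

```python
def classify_trajectory(scores, high=2.0, spike_ratio=3.0):
    """Label a sequence of surprise scores.

    Returns 'spike', 'stuck_high', 'falling', or 'unclear'.
    high and spike_ratio are assumed thresholds, not recommended values.
    """
    if len(scores) < 3:
        return "unclear"
    # Spike: an interior point far above both of its neighbours.
    for prev, cur, nxt in zip(scores, scores[1:], scores[2:]):
        if cur > spike_ratio * max(prev, nxt, 1e-9):
            return "spike"
    # Stuck high: surprise never drops below the 'high' threshold.
    if min(scores) >= high:
        return "stuck_high"
    # Falling: the second half averages lower than the first half.
    half = len(scores) // 2
    first = sum(scores[:half]) / half
    second = sum(scores[half:]) / (len(scores) - half)
    if second < first:
        return "falling"
    return "unclear"

print(classify_trajectory([3.0, 2.2, 1.5, 0.9, 0.6]))
print(classify_trajectory([3.1, 2.9, 3.0, 3.2]))
print(classify_trajectory([0.5, 0.4, 4.0, 0.5, 0.4]))
```

The label can then drive routing: act on low-surprise steps of a falling curve, escalate on a stuck-high run, and handle a spike as a local event.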
One Number, One Rule
Both methods are doing the same thing in different deployment contexts. Ensemble spread measures disagreement across parallel runs. The Interaction Model measures surprise in a single sequential run. Both translate into a calibrated p: the probability that this specific output is wrong, and therefore not safe to act on autonomously.
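Once you have p, the rest is economics. Escalating costs you C, the cost of caution, every time. Acting on a wrong output costs you L, the potential loss, with probability p, so the expected loss of acting is p × L. Escalate when p × L exceeds C, that is, when p exceeds the threshold C/L. If it doesn't, act.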
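As a small worked sketch of that rule, assume p is the probability that the output is wrong, C is what it costs to escalate (for example, a human review), and L is what a wrong autonomous action would cost. The function name `decide` and the figures are hypothetical.

```python
def decide(p, caution_cost, potential_loss):
    """Return 'escalate' if p exceeds C/L, otherwise 'act'."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if potential_loss <= 0:
        raise ValueError("potential_loss must be positive")
    threshold = caution_cost / potential_loss
    return "escalate" if p > threshold else "act"

# Hypothetical figures: a review costs 5, a wrong payment costs 200,
# so the threshold is 5 / 200 = 0.025.
print(decide(0.20, caution_cost=5, potential_loss=200))
print(decide(0.01, caution_cost=5, potential_loss=200))
```

Note how low the threshold becomes when the potential loss dwarfs the cost of caution: for high-stakes actions, even modest uncertainty is enough to escalate.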
The deeper shift this represents is worth naming. Confidence is not a property of the model — something baked in, something you either trust globally or don't. Confidence is something you measure, from the outside, on every output, before you act. That shift is what makes the difference between an AI system you can rely on and one that will eventually surprise you at the worst possible moment.