
Finance AI: why LLMs still hallucinate in production

OpenAI formally proved in 2025 that LLM hallucinations are mathematically inevitable. Here's what that means for building finance AI that CFOs will sign.


Ricardo Argüello


CEO & Founder

Business Strategy · 7 min read

Octopus AI’s CEO Lena Levin published a blunt thesis this week: LLMs can’t do math, and in finance that’s not a minor bug. Her framing is that in marketing, a small error in copy is forgivable, but if your finance AI is off by two dollars, nobody will ever trust the system again. The phrase “cook the books” exists for a reason.

She’s right about the outcome. But the explanation she uses is a 2023 framing, and two things happened in the last twelve months that change the reason, and therefore change how you have to build the solution.

The reason changed in September 2025

On September 4, 2025, a team of researchers at OpenAI and Georgia Tech (Adam Tauman Kalai, Edwin Zhang, Ofir Nachum, and Santosh Vempala) published a paper that made something uncomfortable formal: LLM hallucinations are mathematically inevitable. The cause isn’t training data or model size, and you can’t prompt your way out of it either.

Three factors, each with a formal proof:

  • Epistemic uncertainty: when a fact appears rarely in training data, the model learns to associate it with plausible answers, not the truth.
  • Representational capacity: some tasks exceed what the model architecture can represent internally, regardless of scale.
  • Computational intractability: even a superintelligent system cannot solve computationally hard problems in finite time.

The paper also flagged something perverse about how the industry evaluates models: nine out of ten major benchmarks penalize “I don’t know” answers and reward confident answers, even when those confident answers are wrong. In other words, the industry trained models to be more confident, not more honest.

And if that weren’t enough, in July 2025 Varin Sikka and Vishal Sikka (the latter a former CTO of SAP, former CEO of Infosys, and now founder of Vianai) published “Hallucination Stations,” a paper that applies classical complexity theory (the time-hierarchy theorem) to prove the same result from a different angle: beyond a certain computational complexity, an LLM cannot reliably execute or verify its own output. Vishal Sikka’s quote was blunt: “There is no way they can be reliable.”

When OpenAI and a former SAP CTO tell you, from two independent papers with formal math, that your model is going to hallucinate no matter what, the conversation stops being about better models. It starts being about architecture.

“But modern models just call Python”

The first objection you’ll hear (and it’s the first comment on Levin’s original post) is that frontier models don’t do arithmetic “in their head” anymore. They call Python, SQL, a calculator, a spreadsheet. They delegate math to deterministic systems and only use language to assemble context.

That objection is correct. And it’s exactly why the problem gets subtler, not simpler.

When a model calls a tool, somebody has to decide: which function to call, what arguments to pass, how to interpret the result, and how to map that result back to the business context. All of those decisions live in language, and language is still where LLMs hallucinate. If the model picks the wrong column from your chart of accounts, or passes last month’s exchange rate, or confuses accrual with cash, the calculator will add the numbers with perfect precision. They’ll just be the wrong numbers.

The problem didn’t disappear. It moved from arithmetic to orchestration. And orchestration is a more dangerous place for an error to hide, because it no longer looks like an obvious math mistake.
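To make that concrete, here is a minimal sketch (all account names, amounts, and FX rates are hypothetical). The deterministic tool multiplies flawlessly every time; what varies is which arguments the model chose, and that choice happens in language:

```python
# Hypothetical chart of accounts and FX table -- illustrative values only.
CHART_OF_ACCOUNTS = {
    "revenue_accrual": 120_000.00,  # what the CFO actually asked about
    "revenue_cash":     98_500.00,  # what a confused model might pick
}
FX_RATES = {"2025-08": 0.92, "2025-09": 0.94}  # EUR per USD, made up


def convert(amount_usd: float, rate: float) -> float:
    """Deterministic tool: always multiplies correctly."""
    return round(amount_usd * rate, 2)


# The model orchestrates; the tool just computes.
right = convert(CHART_OF_ACCOUNTS["revenue_accrual"], FX_RATES["2025-09"])
wrong = convert(CHART_OF_ACCOUNTS["revenue_cash"], FX_RATES["2025-08"])

# Both calls executed with perfect arithmetic; only one answers the question.
print(right)  # 112800.0
print(wrong)  # 90620.0
```

Nothing in the `convert` function can catch the second call. The error lives entirely in the argument selection, which is exactly the part the LLM still owns.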

Name the pattern: LLM Sandwich

Levin calls her answer “dual architecture”: a deterministic engine for math and logic, an LLM layer for context, language, and code generation. Gyde calls it the “LLM Sandwich”: deterministic layers before and after the LLM, with the language model as the filling.

Two names, same pattern:

  1. Deterministic input layer: validates permissions, normalizes context, enriches with verified data from the authoritative source (ERP, chart of accounts, official FX rates), decides what can and cannot be asked.
  2. LLM layer: does what LLMs are good at (generating structured text, mapping natural language to operations, producing code).
  3. Deterministic output layer: verifies calculations against the authoritative source, blocks any number that lacks traceability, signs each line item with its origin, and leaves an immutable audit trail.

It isn’t a prompt engineering trick. It’s architecture. And what makes it work isn’t a smarter model — it’s assuming the model will lie sometimes and building the system so that lie never makes it into the report.

It’s the same principle we apply when we talk about Kubernetes-style governance for AI agents touching payroll: untrusted code is allowed to work, but it runs inside deterministic primitives that validate every action before and after.

What the market already figured out

If this kind of skepticism sounds like paranoid engineers, look at what CFOs are saying. A Wakefield Research study reported by CFO Dive found that only 14% of CFOs completely trust AI to deliver accurate accounting data without human oversight. Two-thirds consider human oversight of AI agents in finance either extremely or very critical for ensuring accuracy.

That isn’t fear of change. It’s a correct reading of risk.

And they’re not alone. In February 2026, Darren Mowry, a VP at Google, warned that two popular AI business models (LLM wrappers and aggregators) “have their check engine light on.” His point wasn’t that AI is in crisis. It was that companies that just wrap a model in a nice interface and call it a product have no moat, and when the underlying model changes, they collapse. The moat is in the layers around the model, not the model.

That’s the same argument Levin makes in her post: “The moat in this space isn’t a pretty UI. It’s data plumbing, semantic mapping to your chart of accounts, and outputs a CFO can actually sign off on.”

Three independent voices (a finance AI CEO, a Google VP, a paper from OpenAI) saying the same thing from different angles in under six months. When that happens, it stops being opinion and starts being a pattern.

What we do at IQ Source

When we build AI systems for companies in Costa Rica and Latam that touch finance, payroll, regulatory compliance, or any data where an error costs money or trust, the architecture always wraps the model. The LLM is never the last word: it goes through layers that validate permissions and context before, and layers that verify results against authoritative sources after.

It isn’t an aesthetic preference. It’s the honest response to what OpenAI and the Sikkas just proved on paper. If hallucination is permanent, your architecture has to assume it from day one. That’s the difference between a system that works in the demo and one that survives in production, which is the same distinction we’ve been making about unsupervised code generation risks.

The counterintuitive part is this: assuming LLMs will hallucinate forever is not the pessimistic position. It’s the mature one. The naive optimist believes the next model will solve it. The serious engineer designs the system so that it doesn’t have to.

If you’re building AI on top of your ERP, your billing system, or your financial workflow and you’re still waiting for “the next model to be better at math,” it’s worth reviewing the architecture before your first two-dollar error reaches the board. If you want a second opinion on how your LLM is (or isn’t) being wrapped today, get in touch.


AI governance AI architecture finance AI LLM hallucinations CFO production LLM

Related Articles

Your AI Wants to Touch Payroll. Kubernetes Knows How.
Business Strategy · 7 min read

The engineer who built Azure Kubernetes Service is now Workday's CTO. It's not a hire — it's an architecture signal: container governance is the playbook for AI agents.

AI agents Kubernetes governance