Finance AI: why LLMs still hallucinate in production
Ricardo Argüello — April 10, 2026
CEO & Founder
General summary
Octopus AI's Lena Levin says LLMs can't do math, and in finance that's not a minor bug. She's right about the outcome, but the reason has changed: OpenAI's own researchers formally proved in September 2025 that hallucinations are mathematically inevitable. That single result rewrites how serious teams are now designing finance AI systems.
- A September 2025 paper by researchers at OpenAI and Georgia Tech proved LLM hallucinations are mathematically inevitable, not an engineering defect to be trained away
- Varin and Vishal Sikka's 'Hallucination Stations' paper used classical complexity theory and the time-hierarchy theorem to prove the same result from a different angle
- Only 14% of CFOs completely trust AI to deliver accurate accounting data without human oversight, per Wakefield Research cited by CFO Dive
- The winning architecture is what Levin calls 'dual architecture' and Gyde calls the 'LLM Sandwich': deterministic layers wrapping the LLM, with traceability and verification a CFO can actually sign off on
Picture hiring a brilliant financial analyst with one specific flaw: when they don't know the answer, they invent a confident number instead of saying so. You can't fire them because they're the only person who understands the context of the business. So you don't change the analyst. You change the process: every number leaves their desk only after passing through a verified spreadsheet and a two-signature review. That's exactly what serious finance AI is doing today, except the auditors are deterministic software layers instead of humans.
Octopus AI’s CEO Lena Levin published a blunt thesis this week: LLMs can’t do math, and in finance that’s not a minor bug. Her framing is that in marketing, a small error in copy is forgivable, but if your finance AI is off by two dollars, nobody will ever trust the system again. The phrase “cook the books” exists for a reason.
She’s right about the outcome. But her explanation is a 2023 framing. Two things happened in the last twelve months that change the reason, and with it how you have to build the solution.
The reason changed in September 2025
On September 4, 2025, a team of researchers at OpenAI and Georgia Tech (Adam Tauman Kalai, Edwin Zhang, Ofir Nachum, and Santosh Vempala) published a paper that made something uncomfortable formal: LLM hallucinations are mathematically inevitable. The cause isn’t training data or model size, and you can’t prompt your way out of it either.
Three factors, each with a formal proof:
- Epistemic uncertainty: when a fact appears rarely in training data, the model learns to associate it with plausible answers, not the truth.
- Representational capacity: some tasks exceed what the model architecture can represent internally, regardless of scale.
- Computational intractability: even a superintelligent system cannot solve computationally hard problems in finite time.
The paper also flagged something perverse about how the industry evaluates models: nine out of ten major benchmarks penalize “I don’t know” answers and reward confident answers, even when those confident answers are wrong. In other words, the industry trained models to be more confident, not more honest.
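The incentive is easy to check with arithmetic. Here is a minimal Python sketch of a hypothetical binary-accuracy benchmark; the scoring rule is an illustrative assumption, not any specific leaderboard's:

```python
# Hypothetical binary grading: 1 point for a correct answer, 0 for a wrong one,
# and 0 for "I don't know" (no partial credit for honesty).
def expected_score(p_correct: float, abstain: bool, idk_credit: float = 0.0) -> float:
    """Expected score on one question under binary grading."""
    return idk_credit if abstain else p_correct

# A model that is only 30% sure of a fact:
guess = expected_score(0.30, abstain=False)   # expected 0.30 points
honest = expected_score(0.30, abstain=True)   # 0.0 points

assert guess > honest  # any nonzero chance of being right beats abstaining
```

Under that rule, always answering maximizes expected score, which is exactly the guessing behavior the paper says the industry has been rewarding.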
And if that weren’t enough, in July 2025 Varin and Vishal Sikka (former CTO of SAP, former CEO of Infosys, now founder of Vianai) published “Hallucination Stations,” a paper that applies classical complexity theory (the time-hierarchy theorem) to prove the same result from a different angle: beyond a certain computational complexity, an LLM cannot reliably execute or verify its own output. Sikka’s quote was blunt: “There is no way they can be reliable.”
When OpenAI and a former SAP CTO tell you, from two independent papers with formal math, that your model is going to hallucinate no matter what, the conversation stops being about better models. It starts being about architecture.
“But modern models just call Python”
The first objection you’ll hear (and it’s the first comment on Levin’s original post) is that frontier models don’t do arithmetic “in their head” anymore. They call Python, SQL, a calculator, a spreadsheet. They delegate math to deterministic systems and only use language to assemble context.
That objection is correct. And it’s exactly why the problem gets subtler, not simpler.
When a model calls a tool, somebody has to decide: which function to call, what arguments to pass, how to interpret the result, and how to map that result back to the business context. All of those decisions live in language, and language is still where LLMs hallucinate. If the model picks the wrong column from your chart of accounts, or passes last month’s exchange rate, or confuses accrual with cash, the calculator will add the numbers with perfect precision. They’ll just be the wrong numbers.
The problem didn’t disappear. It moved from arithmetic to orchestration. And orchestration is a more dangerous place for an error to hide, because it no longer looks like an obvious math mistake.
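To make the orchestration risk concrete, here is a minimal Python sketch of a deterministic layer that checks and overrides LLM-supplied tool arguments before the calculator ever runs. All names here (ALLOWED_ACCOUNTS, official_fx_rate, the hard-coded rates) are illustrative assumptions, not a real API:

```python
from datetime import date

# Authoritative sources the deterministic layer trusts (illustrative stand-ins).
ALLOWED_ACCOUNTS = {"4000-revenue", "5000-cogs", "6100-payroll"}

def official_fx_rate(currency: str, on: date) -> float:
    """Stand-in for a lookup against the treasury's official rate table."""
    rates = {("USD", date(2026, 4, 10)): 1.0, ("CRC", date(2026, 4, 10)): 0.00195}
    return rates[(currency, on)]

def validate_tool_call(call: dict, as_of: date) -> dict:
    """Reject or correct any argument the model may have hallucinated."""
    if call["account"] not in ALLOWED_ACCOUNTS:
        raise ValueError(f"unknown account: {call['account']}")
    # Never trust an LLM-supplied rate: replace it with the authoritative one.
    call["fx_rate"] = official_fx_rate(call["currency"], as_of)
    return call

# The model proposed last month's exchange rate; the layer overrides it.
call = {"account": "6100-payroll", "currency": "CRC", "fx_rate": 0.002}
safe = validate_tool_call(call, as_of=date(2026, 4, 10))
assert safe["fx_rate"] == 0.00195
```

The point of the sketch is that the correction happens outside the model: the language layer can still pick the wrong argument, but a deterministic check stands between that choice and the calculation.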
Name the pattern: LLM Sandwich
Levin calls her answer “dual architecture”: a deterministic engine for math and logic, an LLM layer for context, language, and code generation. Gyde calls it the “LLM Sandwich”: deterministic layers before and after the LLM, with the language model as the filling.
Two names, same pattern:
- Deterministic input layer: validates permissions, normalizes context, enriches with verified data from the authoritative source (ERP, chart of accounts, official FX rates), decides what can and cannot be asked.
- LLM layer: does what LLMs are good at (generating structured text, mapping natural language to operations, producing code).
- Deterministic output layer: verifies calculations against the authoritative source, blocks any number that lacks traceability, signs each line item with its origin, and leaves an immutable audit trail.
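The three layers above can be sketched as a self-contained Python pipeline, assuming a toy ledger and a fake model in place of a real ERP and LLM (every name here is an illustrative assumption, not any vendor's API):

```python
# The authoritative source a real system would query (e.g. the ERP).
AUTHORITATIVE_LEDGER = {"4000-revenue": 120_000.0, "5000-cogs": 70_000.0}

def input_layer(user: str, question: str) -> str:
    """Deterministic pre-layer: validate permissions before the model sees anything."""
    if user != "cfo":
        raise PermissionError("user may not query the ledger")
    return question

def fake_llm(question: str) -> dict:
    """Stand-in for the probabilistic layer: maps language to a structured op,
    and, like a real model, sometimes returns a confidently wrong number."""
    return {"account": "4000-revenue", "claimed_total": 120_002.0}

def output_layer(draft: dict) -> dict:
    """Deterministic post-layer: recompute from the authoritative source and
    block or correct any figure that does not match, line by line."""
    account = draft["account"]
    if account not in AUTHORITATIVE_LEDGER:
        raise ValueError(f"untraceable account: {account}")
    verified = AUTHORITATIVE_LEDGER[account]
    # Every number in the report carries its origin for the audit trail.
    return {"account": account, "total": verified, "source": "ledger"}

question = input_layer("cfo", "What was Q1 revenue?")
report = output_layer(fake_llm(question))
assert report["total"] == 120_000.0  # the hallucinated two dollars never reach the report
```

Note that the model's wrong figure (off by exactly two dollars) is caught not because the model improved, but because the last word belongs to the deterministic layer.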
It isn’t a prompt engineering trick. It’s architecture. And what makes it work isn’t a smarter model — it’s assuming the model will lie sometimes and building the system so that lie never makes it into the report.
It’s the same principle we apply when we talk about Kubernetes-style governance for AI agents touching payroll: untrusted code is allowed to work, but it runs inside deterministic primitives that validate every action before and after.
What the market already figured out
If this kind of skepticism sounds like engineering paranoia, look at what CFOs are saying. A Wakefield Research study reported by CFO Dive found that only 14% of CFOs completely trust AI to deliver accurate accounting data without human oversight. Two-thirds consider human oversight of AI agents in finance either extremely or very critical for ensuring accuracy.
That isn’t fear of change. It’s a correct reading of risk.
And they’re not alone. In February 2026, Darren Mowry, a VP at Google, warned that two popular AI business models (LLM wrappers and aggregators) “have their check engine light on.” His point wasn’t that AI is in crisis. It was that companies that just wrap a model in a nice interface and call it a product have no moat, and when the underlying model changes, they collapse. The moat is in the layers around the model, not the model.
That’s the same argument Levin makes in her post: “The moat in this space isn’t a pretty UI. It’s data plumbing, semantic mapping to your chart of accounts, and outputs a CFO can actually sign off on.”
Three independent voices (a finance AI CEO, a Google VP, a paper from OpenAI) saying the same thing from different angles in under six months. When that happens, it stops being opinion and starts being a pattern.
What we do at IQ Source
When we build AI systems for companies in Costa Rica and Latam that touch finance, payroll, regulatory compliance, or any data where an error costs money or trust, the architecture always wraps the model. The LLM is never the last word: it goes through layers that validate permissions and context before, and layers that verify results against authoritative sources after.
It isn’t an aesthetic preference. It’s the honest response to what OpenAI and the Sikkas just proved on paper. If hallucination is permanent, your architecture has to assume it from day one. That’s the difference between a system that works in the demo and one that survives in production, which is the same distinction we’ve been making about unsupervised code generation risks.
The counterintuitive part is this: assuming LLMs will hallucinate forever is not the pessimistic position. It’s the mature one. The naive optimist believes the next model will solve it. The serious engineer designs the system so that it doesn’t have to.
If you’re building AI on top of your ERP, your billing system, or your financial workflow and you’re still waiting for “the next model to be better at math,” it’s worth reviewing the architecture before your first two-dollar error reaches the board. If you want a second opinion on how your LLM is (or isn’t) being wrapped today, get in touch.
Frequently Asked Questions

Why can't hallucinations be trained out of LLMs?
A study published by OpenAI and Georgia Tech researchers in September 2025 proved that LLM hallucinations arise from three mathematical factors: epistemic uncertainty, limited representational capacity, and computational intractability. None is solved by more data, better prompts, or even perfect training data. In finance that means the answer isn't waiting for better models; it's designing architecture that assumes hallucination is permanent.

What is the LLM Sandwich architecture?
The LLM Sandwich places the language model between two deterministic layers. The input layer validates permissions, context, and chart-of-accounts mapping. The output layer verifies the calculations, checks them against authoritative sources, and blocks any number that lacks traceability. It's the pattern companies like Octopus AI and Gyde use to produce financial reports a CFO can actually sign with confidence.

How much do CFOs trust AI with accounting data?
According to a Wakefield Research study reported by CFO Dive in 2025, only 14% of CFOs completely trust AI to deliver accurate accounting data without human oversight. Two-thirds consider human oversight of AI agents in finance extremely or very critical for ensuring accuracy. The consensus isn't rejection of AI; it's a demand for architecture with controls.

What separates an LLM wrapper from a production finance AI system?
An LLM wrapper is a chat interface over a base model with no verification layers: you ask and you trust the answer. A production finance AI system includes semantic mapping to the chart of accounts, deterministic validation of every calculation, line-level traceability, and controls a CFO can audit entry by entry before signing any report, even when the model uses external tools.