Your agent invoiced $0.00. The logs never saw it.

A sales agent wrote a flawless $0.00 invoice for a 14-seat deal. The pricing API was correct. Every log was clean. Only the full execution trace explained it.

Ricardo Argüello

CEO & Founder

Business Strategy 8 min read

Last week a sales agent drafted a $0.00 invoice for a 14-seat enterprise plan.

Not blank. Not a null. Not a system error. Correct line items, correct customer, correct dates, and a total of zero dollars written with the same confidence it would have used for any other figure. If the account exec had been moving fast and hit “approve,” that invoice would have shipped.

The story is real. A team posted it on the r/AI_Agents forum, and Fabio Marcello Salvadori amplified it on LinkedIn. It is worth telling in full, because the ending is not the one you expect.

It is also worth telling because the failure mode that actually stalls AI adoption inside companies is not the loud hallucination, the absurd output you can spot from a mile away. It is this one: a flawless, well-formatted output that passes every review except the only one that matters. Is this number real?

Printf debugging is dead for agents

The team did what anyone experienced would do. First guess: the pricing API returned a zero. It had not. The logs showed the correct number came back. The agent simply decided not to use it.

They checked the prompt. Unchanged, the same one that had run for three months. They ran the same input through staging and got the right invoice. Could not reproduce it. They filed it as a one-off model hiccup and moved on, until it happened twice more that same day.

When they finally pulled the full trace of a failing run, there it was: a step nobody had put there on purpose. After calling the pricing tool, the agent had run its own “validation” against a contract object the team had dropped into context weeks earlier, for an unrelated and long-forgotten feature. That object had a discount field that was always empty for those customers. The model read empty as a 100 percent discount and wrote $0.00 with full conviction.

Here is the part that should bother anyone shipping agents to production. None of the individual logs would have caught this. A classic printf would have shown the pricing tool returning the right number, and then the output mysteriously going to zero. The only reason they found the validation step is that it showed up as its own span in the trace, sitting between the tool call and the final synthesis.

I have been debugging systems since 1990. For thirty-six years, a bug lived in a line I could point to. I wrote it wrong, or someone else did, but it was there, in the code, waiting. The real change with agents is not that they fail more. It is that the bug now lives in a step nobody wrote. The model improvised it in the middle of your flow, and if you are not recording every move, that step is invisible.

That is why observability stopped being optional. Full execution traces, the kind tools like Langfuse produce, are not an upgrade you bolt on after the first scare. They are the condition for putting an agent anywhere near something that matters. Sentry’s observability team puts it in one line: dashboards show totals, traces show decisions. A dashboard tells you the invoice ran and returned 200. The trace tells you that, on the way, the model made a decision no one asked for.
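To make "its own span" concrete, here is a minimal sketch in plain Python. It is not Langfuse's API or any real tracing library. The `Trace` class, the span names, and the per-seat price are all hypothetical, and the agent run is simulated rather than driven by a model. The point is only that once every step is recorded as a named span, the step nobody wrote becomes something you can list.

```python
import time
from contextlib import contextmanager

class Trace:
    """Hypothetical in-memory trace: an ordered list of named spans."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        record = {"name": name, "attributes": attributes, "start": time.monotonic()}
        self.spans.append(record)  # spans are kept in the order they start
        try:
            yield record
        finally:
            record["end"] = time.monotonic()

# Steps the team designed on purpose (assumed names).
EXPECTED_STEPS = {"tool:pricing", "model:synthesis"}

def run_invoice_agent(trace):
    """Simulated run that reproduces the shape of the failure, not the real agent."""
    with trace.span("tool:pricing") as s:
        price = 14 * 50.0  # illustrative per-seat price, not from the story
        s["attributes"]["output"] = price
    # The improvised step: a "validation" against a stale contract object.
    with trace.span("model:validation", source="contract.discount") as s:
        total = 0.0  # empty discount read as 100 percent off
        s["attributes"]["output"] = total
    with trace.span("model:synthesis") as s:
        s["attributes"]["output"] = total
    return total

def unexpected_spans(trace):
    """Names of recorded spans that are not part of the designed flow."""
    return [s["name"] for s in trace.spans if s["name"] not in EXPECTED_STEPS]

trace = Trace()
run_invoice_agent(trace)
print(unexpected_spans(trace))  # → ['model:validation']
```

A log line at the pricing tool and another at the final output would both look healthy here. Only the ordered list of spans shows that something ran in between.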

A null nobody explained to the model

The root cause was not the model. It was the old object.

Someone left a contract in the agent’s context for a feature that no longer existed, and nobody took it out. It sat there, like the cables nobody dares unplug from the rack because they cannot tell where they go. A field set to null, with not one instruction on how to read it, in front of a model that by nature fills gaps with the smoothest interpretation it can find. Empty became free. Free became $0.00.

The context you hand an agent is not a bag where you toss everything just in case. It is attack surface. Every object you leave there is an assumption the model can grab at the worst possible moment, in the direction you least expect. Context hygiene, deciding what goes in, what comes out, and when it gets cleared, is an architecture decision, not an implementation detail. It is the same principle I wrote about when a nine-second failure turned out to be architecture, not the agent: the model is rarely the problem, the scaffolding is.
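One way to treat context as a decision rather than a bag is an explicit allowlist per path. The sketch below is an assumption about how that could look, not how the team in the story built it. The path name `invoicing`, the object keys, and `build_context` are hypothetical.

```python
# Hypothetical allowlist: which objects each agent path may see.
ALLOWED_CONTEXT = {
    "invoicing": {"customer", "pricing_result", "billing_period"},
}

def build_context(path, candidates):
    """Keep only allowlisted objects for this path and report what was dropped."""
    allowed = ALLOWED_CONTEXT[path]
    kept = {key: value for key, value in candidates.items() if key in allowed}
    dropped = sorted(set(candidates) - allowed)
    return kept, dropped

kept, dropped = build_context(
    "invoicing",
    {
        "customer": "Acme",
        "pricing_result": 700.0,
        "contract": {"discount": None},  # the stale object from an old feature
    },
)
print(dropped)  # → ['contract']
```

Returning the dropped keys matters as much as filtering them. It gives you a record of what someone tried to hand the model, which is where forgotten objects tend to surface.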

Now, the internet’s reaction to this story was as instructive as the story itself. Under the post, the comments filled up with capital-letter solutions and registered trademarks: compiled intent doctrines, arbiter kernels, sovereign provenance signed with cryptography, layers that sever the TCP socket if a byte does not match. A lot of armored-vault physics for a problem that was, at bottom, a forgotten object in a variable.

Salvadori himself proposes something more reasonable: that before a high-impact action, the agent declares which data and fields it will use, and a lightweight verifier blocks anything outside that contract. That is a sound idea. But it is worth remembering how the real case got fixed, because it was embarrassingly cheap. They pulled the contract object out of the invoicing path, added an eval that flags any invoice under a threshold for explicit review, and kept the trace layer. They shipped it in an afternoon, once they knew where to look.
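Salvadori's idea can be sketched in a few lines. What follows is my own reading of it under stated assumptions, not his implementation: the `ActionDeclaration` type, the `CONTRACTS` table, and the dotted field names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionDeclaration:
    """What the agent says it will do and which fields it will read (hypothetical shape)."""
    action: str
    fields: frozenset

# Hypothetical contract: the only fields each high-impact action may rely on.
CONTRACTS = {
    "create_invoice": frozenset({"pricing.unit_price", "pricing.seats", "customer.id"}),
}

def verify(declaration):
    """Return (ok, violations). Unknown actions have an empty contract, so they fail closed."""
    allowed = CONTRACTS.get(declaration.action, frozenset())
    violations = sorted(declaration.fields - allowed)
    return (not violations, violations)

ok, violations = verify(
    ActionDeclaration("create_invoice", frozenset({"pricing.unit_price", "contract.discount"}))
)
print(ok, violations)  # → False ['contract.discount']
```

The verifier is only as honest as the declaration. It catches a model that admits to reading `contract.discount`, so it complements the trace rather than replacing it.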

That is the lesson that gets lost in all the acronyms. The value was not in a new protocol. It was in seeing the problem, and in a ten-line eval that asked the one thing nobody had asked the system to ask: does a zero-dollar invoice make sense here?
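For scale, here is roughly what an eval like that can look like. The team's actual eval was not published, so this is a sketch under assumptions: the threshold value and the name `needs_review` are hypothetical, and a real threshold would depend on the plan and the customer.

```python
from decimal import Decimal

# Assumed floor: anything below it goes to explicit human review.
REVIEW_THRESHOLD = Decimal("1.00")

def needs_review(invoice_total):
    """Flag an invoice for explicit review when its total is under the threshold."""
    return Decimal(str(invoice_total)) < REVIEW_THRESHOLD

print(needs_review("0.00"))    # → True
print(needs_review("700.00"))  # → False
```

It does not know why the total is zero. It only refuses to let a zero pass without someone looking, which is exactly the question nobody had asked the system to ask.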

Human review is not training wheels

The only thing that stopped that invoice from going out was a person reviewing before approving.

And there is an idea worth saying plainly, because a whole crowd sells human review as a passing embarrassment, a patch you will rip out once the model “matures.” It is the other way around. Your company already requires several people to review a material money decision before it is final. Nobody in finance signs a big check just because the math “looked right.” So why should an agent run with less oversight than a human employee for that same decision?

This is not distrust of AI. It is the same discipline you would apply to any system with authority to move money. Anthropic, in its own guide to building agents, recommends that the agent pause for human approval at checkpoints, especially when the action has consequences. When the lab that makes Claude tells you to keep a human at the gate, it is worth not treating that gate as paperwork you will delete next quarter.
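In code, a gate like that is small. The sketch below is a generic pattern, not something taken from Anthropic's guide. `run_high_impact` and the callback shapes are hypothetical, and in a real system `approve` would wait on a person instead of a lambda.

```python
def run_high_impact(action, payload, approve):
    """Run action(payload) only if the approve(payload) callback says yes."""
    if not approve(payload):
        return {"status": "blocked", "result": None}
    return {"status": "executed", "result": action(payload)}

sent = []  # stands in for the system that actually sends invoices

outcome = run_high_impact(
    action=sent.append,
    payload={"customer": "Acme", "total": "0.00"},
    approve=lambda invoice: invoice["total"] != "0.00",  # placeholder for a human decision
)
print(outcome["status"], sent)  # → blocked []
```

The design choice is that the action is never called on a refusal. The gate sits in front of the side effect, not behind it.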

McKinsey calls the broader pattern the gen AI paradox: nearly eight in ten companies use gen AI, and just as many report no real bottom-line impact, with most use cases stuck in pilot. The reason is rarely that the model cannot do the task. It is that nobody trusts an output that clears every test except reality. The $0.00 invoice is that statistic in miniature. The proof of concept worked, right up to the day it nearly billed an enterprise customer for free.

What we do about this at IQ Source

When a company asks us to put an agent into a process that touches revenue, the first question is never “which model do we use?” It is “how will we see what it does when it gets something wrong?” Because it will get something wrong, and the day it does, the difference between a scare and a disaster is whether the trace was on.

AI Maestro is the discovery where that gets decided up front, not after the first incident. We map the real process, not the org-chart version, and pinpoint exactly which steps a model could slip its own logic into between yours. Out of that come four concrete things we leave installed before the agent touches a single customer: full execution traces so you can see every span, automated evals that flag amounts and results outside their range, a human gate on every high-impact action, and context hygiene so no stale object is left contaminating the model.

It is the flip side of not letting AI own the business’s source of truth: the correct number lives in a deterministic system, and the agent uses it, never reinvents it. And it goes hand in hand with asking, before anything else, whether the process even needs an agent. Plenty do not, and a deterministic flow would never have invented a 100 percent discount.

Next time someone on your team shows you an agent “ready for production,” do not ask whether it works in the demo. It almost always works in the demo. Ask one thing: when this agent does something nobody asked for, between two steps we do control, will we see it? If the answer is an uncomfortable silence, it is not ready yet. It is waiting for its own zero-dollar invoice.
