
Tokens per shipped feature: the new KPI for AI budgets

Peter Steinberger spent $1.3M on tokens in 30 days. Riaz Khan replied on LinkedIn with the KPI that actually measures enterprise AI: tokens-per-shipped-feature.


Ricardo Argüello


CEO & Founder

Business Strategy 9 min read

This week Peter Steinberger posted a screenshot that most of LinkedIn shared before reading carefully. The image, captured from his own CodexBar app, shows $1,305,088.81 in OpenAI tokens consumed in 30 days. Steinberger is the founder of OpenClaw, the startup OpenAI absorbed in early 2026, and that spend is part of his day-to-day coding with AI. Linas Beliūnas reposted the image with the viral framing: “$1.3M is roughly 7 senior engineers in the US, 16 in Lithuania, 26 in India for a full year.”

The translation works as clickbait. It does not work as economics.

Riaz Khan, a CTO with prior leadership at Thomson Reuters and AWS, said the part that matters later in the same thread, word for word: “Comparing token spend to engineer salaries is the wrong framing. The real question is what output velocity $1.3M in tokens generates versus $1.3M in fully-loaded engineering cost. The metric that matters is tokens-per-shipped-feature, not tokens-per-month.”

This post is the full version of Riaz’s observation, translated into the language a CFO can act on Monday morning: the right KPI for enterprise AI in 2026 is not the monthly token bill. It is tokens per shipped feature. And the gap between those two numbers is the operating discipline that separates teams scaling AI with margin from teams burning it with dignity.

The right question is not in the viral debate

Absolute token spend is the equivalent of asking “how many kilowatt-hours did the factory consume this month?” It is useful for paying the utility company. It is useless for deciding anything. What drives a decision is production per kilowatt-hour.

Linas Beliūnas, the same person who reposted Steinberger’s screenshot, published a far more serious essay on his Substack on May 8. It is billed as “the system for never hitting Claude’s limits” and reads as a clinical autopsy of the underlying problem. His diagnosis, word for word: “most users are burning the majority of their allocation on architecture mistakes, not actual work.”

The number he puts at the top of the essay: more than 80% of the budget goes into three concrete mistakes. Long conversations re-tokenizing thousands of words of history on every message. Broad file reads when only one function mattered. Defaulting to Opus when Sonnet would have handled the same problem with the same quality. The friction the CFO is paying for does not come from the model. It comes from how the team is calling the model.
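The first of those three mistakes is easy to put numbers on. The sketch below is a rough illustration, not a billing model: it assumes every turn adds a fixed number of tokens, that the whole history is resent as input on each turn, and that no prompt caching applies. The function names and the figures (100 turns, 200 tokens per turn, a 1,000-token summary file) are hypothetical.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens billed when every turn resends the full history.

    Turn k sends k turns' worth of tokens (the new one plus the k - 1
    earlier ones), so the total is tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


def short_session_tokens(turns: int, tokens_per_turn: int, summary_tokens: int) -> int:
    """Input tokens when each turn sends only a fixed summary plus the new turn."""
    return turns * (summary_tokens + tokens_per_turn)


# Hypothetical figures, chosen only to show the shape of the curve.
long_chat = cumulative_input_tokens(turns=100, tokens_per_turn=200)
short_sessions = short_session_tokens(turns=100, tokens_per_turn=200, summary_tokens=1000)

print(f"one long conversation: {long_chat:,} input tokens")
print(f"short sessions + summary: {short_sessions:,} input tokens")
```

Under these assumptions the long conversation grows quadratically with the number of turns while the short-session pattern grows linearly, which is why the gap widens the longer a thread runs.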

That shifts the conversation from the vendor to the customer. Anthropic’s, OpenAI’s, or Google’s published price is no longer the variable that moves the needle. The variable is internal discipline around how each call is assembled. When that discipline is missing, cost per shipped feature climbs even if the model gets cheaper every quarter.

What the PwC paper measured this week

On May 14, five PricewaterhouseCoopers researchers published “Is Grep All You Need? How Agent Harnesses Reshape Agentic Search”. The title sounds academic. The conclusion is budgetary: grep (the plain lexical-search tool that has been in Unix for fifty years) wrapped in a good agent harness matches or beats vector search on agentic search tasks. The paper measured this across six categories of the LongMemEval benchmark, using Claude Code, Codex, and Gemini CLI as the wrappers.

The line from the abstract that should reach the budget meeting: “overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.”

For a CFO staring at a platform-team proposal with a budget line for “vector database,” the translation is exact. If the proposal starts from the premise that agentic AI needs a vector store by default, it is building architecture the paper just measured as unnecessary for a large category of cases. That budget line is architecture that never ships. It is tokens-per-shipped-feature climbing silently because the team reached for the sophisticated tool before testing the simple one.

This does not mean vector databases are never useful. They earn their keep when corpora are large enough that lexical search breaks on coverage. What it means is: the reflex to buy them by default, without measuring grep first, is exactly the spend pattern your new KPI will flag.
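“Measuring grep first” can be as small as the sketch below. It is not the harness from the PwC paper, only a plain lexical baseline a team could run against its own corpus before approving a vector-store budget line. The function name, the in-memory corpus, and the file names are hypothetical.

```python
import re


def grep_search(documents: dict[str, str], pattern: str, max_hits: int = 5):
    """Return (document name, line number, line) for lines matching a regex.

    Matching is case-insensitive and stops after max_hits results.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for name, text in documents.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((name, number, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits


# Hypothetical two-file corpus standing in for a real repository.
corpus = {
    "billing.py": "def compute_invoice(tokens, price):\n    return tokens * price\n",
    "notes.md": "Invoice totals are reviewed every Monday.\nModel choice is logged per PR.\n",
}

hits = grep_search(corpus, r"invoice")
for name, number, line in hits:
    print(f"{name}:{number}: {line}")
```

If a baseline like this already answers the questions the agent needs answered, the vector store has to justify itself against it rather than being assumed.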

How to calculate tokens-per-shipped-feature

The formula is embarrassingly simple. Take the total token spend for the quarter and divide by the number of pull requests merged to production in that quarter, using merged pull requests as the proxy for shipped features. That is the baseline. The first time you calculate it, the number is going to come out higher than you expected, because it lumps exploratory spend (team training, spikes, discarded prototypes) with productive spend. That is information, not a defect.

The second step is splitting those two columns. Productive spend attaches to a merged pull request. Exploratory spend attaches to a learning session, a proof of concept, an engineer’s weekend reading the new Sonnet release notes. The two columns get different governance: exploratory has a quarterly ceiling treated as a training fund; productive has a per-pull-request gate.
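The baseline and the two-column split can be sketched in a few lines. This is a minimal illustration, assuming each spend entry can be tagged with the merged pull request it served, or left untagged when it was exploratory. The `SpendEntry` record, the PR identifiers, and the dollar amounts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpendEntry:
    usd: float
    pull_request: Optional[str]  # merged PR id if productive, None if exploratory


def split_spend(entries):
    """Return (productive, exploratory) spend in USD."""
    productive = sum(e.usd for e in entries if e.pull_request is not None)
    exploratory = sum(e.usd for e in entries if e.pull_request is None)
    return productive, exploratory


def cost_per_merged_pr(total_usd: float, merged_prs: int) -> float:
    """Spend divided by merged PRs, the proxy for shipped features."""
    if merged_prs <= 0:
        raise ValueError("no merged pull requests in the period")
    return total_usd / merged_prs


# Hypothetical ledger for one period.
ledger = [
    SpendEntry(1200.0, "PR-101"),
    SpendEntry(800.0, "PR-102"),
    SpendEntry(500.0, None),      # proof of concept, never merged
    SpendEntry(300.0, "PR-101"),
]

productive, exploratory = split_spend(ledger)
merged = {e.pull_request for e in ledger if e.pull_request is not None}

baseline = cost_per_merged_pr(productive + exploratory, len(merged))
productive_only = cost_per_merged_pr(productive, len(merged))
print(baseline, productive_only, exploratory)
```

The gap between `baseline` and `productive_only` is the exploratory column showing up in the ratio, which is the information the first calculation is meant to surface.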

A concrete example, rounded for clarity. A team of six engineers spending about $80K/month in tokens ships 24 features in the quarter. Tokens-per-shipped-feature: roughly $10K. Another team of six engineers, same model, spends about $40K/month and ships 6 features. Tokens-per-shipped-feature: roughly $20K. The second team’s invoice is half. The unit cost is double. If you only look at the monthly total, you are rewarding the wrong team and giving budget to the one most likely to burn it.
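The arithmetic behind that comparison, written out as a small sketch. The function name is hypothetical; the figures are the rounded ones from the example above, with a quarter taken as three months.

```python
def quarterly_cost_per_feature(monthly_spend_usd: float, features_shipped: int, months: int = 3) -> float:
    """Quarterly token spend divided by features shipped in the quarter."""
    if features_shipped <= 0:
        raise ValueError("no features shipped in the quarter")
    return monthly_spend_usd * months / features_shipped


teams = {
    "team A": (80_000, 24),  # (monthly spend in USD, features shipped)
    "team B": (40_000, 6),
}

unit_costs = {
    name: quarterly_cost_per_feature(spend, shipped)
    for name, (spend, shipped) in teams.items()
}

# Rank by unit cost, cheapest first, instead of by invoice size.
for name, cost in sorted(unit_costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:,.0f} per shipped feature")
```

Sorting by invoice would put team B first; sorting by unit cost puts team A first.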

I have been in computing for thirty-six years, since 1990, when I sat at fifteen years old in front of a Commodore 64 with 64 KB of memory that had to be defended byte by byte. I have watched the industry move from “the price of the resource” to “value produced per unit of resource” four times. In 1995, IT departments measured CPU-cycles-per-month and learned the hard way that what actually shifted the conversation was transactions-per-second. In 2008, the metric was $/GB-month for any cloud provider, and it had to migrate to $/business-event before anyone could tell what spend was productive. In 2026, the dashboard says tokens-per-month because that is what the invoice shows. The next version is tokens-per-shipped-feature. Each cycle takes about three quarters to flip. The team that flips first protects margin during those three quarters.

The discipline that keeps the KPI from becoming dashboard furniture

A KPI without operating discipline is a number on a screen nobody opens on the second Monday. The discipline operates in five concrete places when we run this inside a client engagement at Tech Partner. If the agent harness we covered last week is the wrap that gives a commodity model its edge, this KPI is the thermometer that says whether the wrap is tight.

Quarterly budget set in cost-per-feature, not in cost-per-month. The number is a ratio at quarter start, not an absolute ceiling. If the team ships twice as many features, the absolute bill can go up and the KPI goes down. The conversation moves from “we are over budget” to “we are under margin.”

Pull-request gate with delta-tokens reported. Every merged PR carries a field with the tokens consumed during its construction. At sprint close, the team sees the sprint’s tokens-per-shipped-feature without having to pull data manually. Without that visibility, the ratio is computed at quarter end and it is too late to correct.

Model-selection discipline. Sonnet by default for anything that does not require extended reasoning. Opus only when the problem demands it and the engineer justifies it in writing on the PR. The default matters because most engineers do not think about cost when picking a model; they think about “the most capable one available.” That single rule typically cuts token spend by 30–40% with no quality hit.

Short sessions with persistent knowledge in files. Each new session loads the needed context from a persistent file (CLAUDE.md or equivalent) rather than re-tokenizing a hundred-message conversation. This is exactly what Linas recommends in his May 8 essay. It is the difference between re-reading a whole book every time you want to quote a sentence and keeping an index you return to.

Clean split between exploratory and productive spend with separate budgets. Exploratory spend lives in a training fund with a quarterly ceiling. Productive spend lives in a cost-per-feature ratio. Mixed, the team ends up justifying experiments as production and the KPI stops measuring what it is supposed to measure.

The five are boring. That is the feature. The operating discipline that holds a KPI in place is not a hundred-page framework; it is five rules written on one page that the team applies every day.
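Two of the five rules, the pull-request gate and the model-selection default, lend themselves to a mechanical check at sprint close. The sketch below is one possible shape for it, not a description of any particular tool: the `MergedPR` record, its fields, the model labels, and the token counts are all hypothetical, and how the token count reaches the PR is left to the team's own tooling.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MODEL = "sonnet"  # assumed label for the team's default model


@dataclass
class MergedPR:
    number: int
    tokens_used: int                      # tokens consumed while building the PR
    model: str                            # model used for most of the work
    justification: Optional[str] = None   # required when model is not the default


def gate_violations(prs):
    """PR numbers that used a non-default model without a written justification."""
    return [pr.number for pr in prs if pr.model != DEFAULT_MODEL and not pr.justification]


def sprint_tokens_per_pr(prs) -> float:
    """Average tokens per merged PR for the sprint."""
    if not prs:
        raise ValueError("no merged pull requests in the sprint")
    return sum(pr.tokens_used for pr in prs) / len(prs)


# Hypothetical sprint with three merged PRs.
sprint = [
    MergedPR(1, 400_000, "sonnet"),
    MergedPR(2, 900_000, "opus", "multi-step refactor needing extended reasoning"),
    MergedPR(3, 500_000, "opus"),
]

average = sprint_tokens_per_pr(sprint)
violations = gate_violations(sprint)
print(f"tokens per merged PR this sprint: {average:,.0f}")
print(f"PRs missing a model justification: {violations}")
```

Run at every sprint close, a check like this gives the team the ratio while there is still time in the quarter to correct it.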

The four questions a CFO should bring to the next AI review

If you sit on the budget side rather than the commit side, these four questions are the only thing you need ready for the next review with your technical team.

What is our current tokens-per-shipped-feature, and who, by name, measures it every Monday? If the answer is “we don’t track that yet” or “the team has it in their heads,” the KPI does not exist. It exists as intuition.

What percentage of last month’s spend went into unnecessary re-tokenization, broad file reads, or default-to-Opus? If nobody has run the audit, the answer is somewhere between 40% and 80%, depending on how honest the audit is willing to be. Running it is one day of work. The number that comes out defines the next budget.

If the model vendor drops price 50% next quarter, what changes in our invoice and what does not? If everything changes, the team is sitting on top of vendor pricing rather than operating discipline. If only a small percentage moves, the team is standing on its own architecture. Both cases are legitimate. Knowing which one you are in is what a CFO is obligated to understand before signing the next renewal.

If I cancel vendor X tomorrow, which artifacts stay on my side and which leave with them? This matters because tokens-per-shipped-feature is only useful as a KPI if what you ship is portable. If your team built all the logic inside a single vendor’s playground, the ratio looks good on paper, but the cost of moving when the vendor changes terms is high.

Four questions. One sheet of paper. The team that answers them confidently this quarter pays for AI with margin next year. The team that does not pays the same bill and learns late why a competitor is doing it cheaper.
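For the second question, the output of the one-day audit can be as simple as a share-of-spend table. The sketch below assumes someone has already reviewed a sample of calls and tagged each one by hand; the tagging is the real work and is not automated here. The tag names and dollar amounts are hypothetical.

```python
from collections import defaultdict


def spend_share_by_tag(calls):
    """Return each audit tag's share of total spend, as a fraction of 1."""
    totals = defaultdict(float)
    for usd, tag in calls:
        totals[tag] += usd
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no spend recorded")
    return {tag: amount / grand_total for tag, amount in totals.items()}


# Hypothetical audited sample: (spend in USD, tag assigned by the reviewer).
calls = [
    (300.0, "retokenized_history"),
    (150.0, "broad_file_read"),
    (50.0, "default_to_opus"),
    (500.0, "necessary"),
]

shares = spend_share_by_tag(calls)
avoidable = 1.0 - shares["necessary"]

for tag, share in shares.items():
    print(f"{tag}: {share:.0%}")
print(f"avoidable share: {avoidable:.0%}")
```

The single figure at the bottom is the answer the question asks for; the per-tag breakdown says which of the three mistakes to fix first.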

