AI price per token lies. Measure cost per job.
Ricardo Argüello — June 21, 2026
CEO & Founder
General summary
A study from Stanford, Berkeley, CMU and Microsoft ran eight reasoning models across twelve tasks and compared the list price with the actual bill. In nearly a third of the matchups, the model that was cheaper per token cost more per job, up to 28 times more in the worst case. Gemini 3 Flash is listed 80% cheaper than GPT-5.4 and cost 38% more to run. The price per token is a marketing number. The bill is a behavior number, and they are rarely in the same order.
- You do not pay per question, you pay per token, and every model burns a different amount to solve the same thing. One spent 60,000 reasoning tokens on a problem another solved in 25.
- On an agent task, one model took 57 steps where another took 7. Cheaper per token, more expensive per finished job.
- The same model on the same query varied in cost by up to 9.7x between runs. That variance makes a flat fee impossible to quote without exposing your margin.
- Your real cost is not the list price, it is the list price times consumption, and consumption is variable, model-specific and partly random.
- The competence that matters is measuring cost per finished task and routing work to the right model. That, not picking the cheapest row in the table, is what IQ Source builds when it builds on models.
Imagine comparing two taxis by their per-kilometer rate. One charges half what the other does, so you pick it. What the rate does not tell you is that this taxi takes the long route, stops at every light, and sometimes circles the block, while the expensive one goes straight there. By the end of the trip, the cheap one cost you more. AI models work the same way: the price per token is the per-kilometer rate, but you pay for the whole trip, and every model drives differently.
AI-generated summary
The price per token is a marketing number. The bill is a behavior number. And they are rarely in the same order.
That is the thesis of this post, and it has a direct consequence for anyone building on AI or budgeting its spend: picking a model by the price on the table is picking by the wrong number. The one that is cheaper per token can cost you more per finished job, sometimes by a lot. The competence that matters is not finding the cheapest model in the list. It is measuring what each task actually costs and routing the work to the right model. That is what we build when we build on models, and the rest of this post explains why.
The number that lies, with data
Serge Herkül, who advises SaaS companies on pricing, laid it out with a case that stings: Gemini 3 Flash is listed 80% cheaper than GPT-5.4. Run across twelve real tasks, it costs 38% more.
It is not a fluke. Herkül cites a study from Stanford, Berkeley, CMU and Microsoft, titled “The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More,” that ran eight reasoning models across twelve tasks and compared the list price with the actual bill. In nearly a third of the matchups, the “cheaper” model cost more. In the worst case, 28 times more.
The details explain it. One model spent 60,000 reasoning tokens on a problem another solved in 25. On an agent task, one took 57 steps where another took 7. And the part that hurts most when you are trying to budget: the same model, on the same query, varied in cost by up to 9.7x between runs.
Strip the AI out of it and you are left with a pricing lesson as old as commerce: unit price is not total cost.
Why the cheap one runs expensive
The mechanics are simple once you see them. You do not pay per question. You pay per token. And every model burns a different number of tokens to reach the same answer.
A model with a low list price can be a model that overthinks. It reasons out loud for thousands of tokens before answering, or it spirals into extra steps when acting as an agent, or it rereads the same context again and again. Every one of those tokens costs money, even if each individual token is cheap. The expensive-per-token model sometimes cuts straight to the point, spends a fraction of the tokens, and ends up cheaper per job.
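The arithmetic can be sketched in a few lines. This is a minimal illustration, not data from the study: the model names, prices and token counts below are made up, chosen only so the ratios echo the 80% cheaper, 38% more expensive headline, and a single blended per-token price is assumed for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    """One finished job on one model. All figures are hypothetical."""
    name: str
    usd_per_million_tokens: float  # assumed single blended list price
    tokens_used: int               # input + reasoning + output tokens for the job

    def cost_per_job(self) -> float:
        # Cost per job = list price x consumption.
        return self.usd_per_million_tokens * self.tokens_used / 1_000_000


# Hypothetical models: one is 80% cheaper per token but burns 6.9x the tokens.
cheap = ModelRun("cheap-per-token", usd_per_million_tokens=2.0, tokens_used=69_000)
pricey = ModelRun("pricey-per-token", usd_per_million_tokens=10.0, tokens_used=10_000)

price_ratio = cheap.usd_per_million_tokens / pricey.usd_per_million_tokens
cost_ratio = cheap.cost_per_job() / pricey.cost_per_job()

# The cheap model stops being cheaper once it uses more than this many
# times the tokens of the pricey one.
break_even_token_multiple = 1 / price_ratio

print(f"list price ratio:  {price_ratio:.2f}")
print(f"cost per job ratio: {cost_ratio:.2f}")
print(f"break-even token multiple: {break_even_token_multiple:.1f}")
```

Under these assumptions, a model listed at one fifth of the price loses its advantage as soon as it consumes more than five times the tokens for the same job.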
Then there is the variance. A model that swings nearly tenfold in cost on the same query between two runs means you cannot even assume a stable average. Cost per task is not a point, it is a distribution, and the tail of that distribution is where the money goes.
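A small sketch shows why the average alone misleads. The ten run costs below are invented for illustration (they are not measurements from the study), and the nearest-rank percentile is just one simple convention among several.

```python
import math
import statistics

# Hypothetical cost in USD of ten repeated runs of the same query on one model.
run_costs = [0.031, 0.029, 0.035, 0.030, 0.041, 0.033, 0.028, 0.120, 0.036, 0.272]


def nearest_rank_percentile(values, pct):
    """Return the value at an integer percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


median_cost = statistics.median(run_costs)
mean_cost = statistics.mean(run_costs)
p90_cost = nearest_rank_percentile(run_costs, 90)
spread = max(run_costs) / min(run_costs)  # most expensive run vs cheapest run

print(f"median {median_cost:.3f}  mean {mean_cost:.3f}  p90 {p90_cost:.3f}")
print(f"max/min spread: {spread:.1f}x")
```

In this made-up sample, two expensive runs pull the mean to nearly double the median, which is the sense in which the tail is where the money goes.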
What this breaks in your business
If you build a product on LLMs, this hits you in two ways, and it pays to see both clearly.
The first is your cost to operate. Your cost of goods sold is not the list price. It is the list price times consumption, and consumption is variable, model-specific and partly random. If you modeled your margin on the number from the table, you modeled the wrong number. I wrote about this from another angle in the post "The hidden cost lever in enterprise AI: timing," where batching, caching and scheduling move the bill as much as the model does.
The second is how you charge. If you put a flat fee on top of a variable cost, your heaviest users go underwater without you noticing. You handed your margin to a random number generator. This connects straight to something I already argued: in AI, you are what you charge for. Charging for the outcome only works if you know what producing that outcome costs you, and this is the part almost nobody measures.
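To make the flat-fee problem concrete, here is a minimal sketch with invented numbers: the fee, the cost per job and the three usage profiles are all hypothetical, and cost per job is held constant even though in practice it also varies between runs.

```python
def monthly_margin(flat_fee_usd: float, jobs_per_month: int, cost_per_job_usd: float) -> float:
    """Margin on one user: what they pay minus what their jobs cost to run."""
    return flat_fee_usd - jobs_per_month * cost_per_job_usd


FLAT_FEE_USD = 50.0      # hypothetical monthly subscription price
COST_PER_JOB_USD = 0.10  # hypothetical measured cost per finished job

# Hypothetical usage profiles, in jobs per month.
jobs_by_user = {"light": 100, "typical": 300, "heavy": 900}

margins = {
    user: monthly_margin(FLAT_FEE_USD, jobs, COST_PER_JOB_USD)
    for user, jobs in jobs_by_user.items()
}

# Past this many jobs per month, a user costs more than they pay.
break_even_jobs = FLAT_FEE_USD / COST_PER_JOB_USD

for user, margin in margins.items():
    print(f"{user:8s} margin: {margin:7.2f} USD")
print(f"break-even at {break_even_jobs:.0f} jobs per month")
```

The break-even point only exists as a number if cost per job has been measured. Without it, there is no way to tell which users sit on the wrong side of the line.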
And no, capping the spend does not fix it. I covered that in "A $1,500 cap does not cure your AI bill": the cap treats the symptom. The cause is not knowing which task runs on which model at what real cost.
What IQ Source does about it
The way out is not to pick the cheapest model or the most expensive one. It is to stop choosing by the price table and start choosing by cost per finished task in your own workflow.
That demands a discipline almost nobody has set up. You have to run each candidate model over your real tasks, not over a generic benchmark, measure how many tokens and how many steps it consumes to completion, look at the tail of the distribution and not just the average, and route each type of work to the model that solves it cheapest end to end. Sometimes the expensive frontier model is the most economical for the hard task, and an efficient model is enough for the routine one. The only way to know is to measure it in your context.
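The measuring and routing steps described above could be sketched as follows. This is an illustrative outline under stated assumptions, not IQ Source's actual tooling: the task types, model names and costs are hypothetical, and "cost per finished task" is taken here to mean total spend (including failed runs) divided by the number of runs that completed.

```python
from collections import defaultdict

# Hypothetical measurements: (task_type, model, cost_usd, finished)
measurements = [
    ("invoice_extraction", "efficient-model", 0.02, True),
    ("invoice_extraction", "efficient-model", 0.03, True),
    ("invoice_extraction", "frontier-model", 0.08, True),
    ("invoice_extraction", "frontier-model", 0.07, True),
    ("contract_review", "efficient-model", 0.40, False),  # failed run still costs money
    ("contract_review", "efficient-model", 0.90, True),
    ("contract_review", "frontier-model", 0.50, True),
    ("contract_review", "frontier-model", 0.60, True),
]


def cost_per_finished_task(rows):
    """Total spend divided by completed runs, per (task_type, model) pair."""
    spend = defaultdict(float)
    finished = defaultdict(int)
    for task, model, cost, ok in rows:
        spend[(task, model)] += cost
        finished[(task, model)] += int(ok)
    # Pairs with no completed run are skipped: they never finish the job.
    return {key: spend[key] / finished[key] for key in spend if finished[key] > 0}


def build_routing_table(rows):
    """For each task type, pick the model with the lowest cost per finished task."""
    best = {}
    for (task, model), cost in cost_per_finished_task(rows).items():
        if task not in best or cost < best[task][1]:
            best[task] = (model, cost)
    return best


routing_table = build_routing_table(measurements)
for task, (model, cost) in sorted(routing_table.items()):
    print(f"{task}: route to {model} at {cost:.3f} USD per finished task")
```

In this invented data the efficient model wins the routine task and the frontier model wins the hard one, because the efficient model's failed attempt is charged against its single success. A fuller version would also compare the tail of each cost distribution, not only the average.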
At IQ Source, that is part of what we build when a company puts us to work standing up AI on top of their operation. We do not hand over “use this model.” We hand over a routing table built on your tasks, with cost per job measured, not estimated. It is the difference between buying by the label and buying by the bill.
The next time someone on your team proposes switching models “because it is cheaper,” ask one concrete question: cheaper per token, or cheaper per finished task? If the answer is “per token,” you still do not know what it will cost. You will find out on the bill, which is the only number you actually pay.
Frequently Asked Questions
Why can a model that is cheaper per token cost more per job?
Because you do not pay per question, you pay per token, and each model burns a different amount to solve the same thing. A model that is cheap per token can spend thousands of reasoning tokens or take dozens of steps where another finishes in a few. The list price measures the input, the bill measures the behavior, and they rarely match.
What is the price reversal phenomenon?
It is when the model that is cheaper per token turns out more expensive per finished job. A study from Stanford, Berkeley, CMU and Microsoft measured it across eight reasoning models and twelve tasks: in nearly a third of the matchups the lower-list-price model cost more to run, up to 28 times more in the worst case.
How should you choose a model on cost?
By measuring cost per finished task in your own workflow, not the per-token price on the table. You run each candidate model over your real tasks, measure how many tokens and steps it consumes to completion, and route each type of work to the model that solves it cheapest end to end. The list rate is only the starting point.
Why is a flat fee risky for a product built on LLMs?
Because token consumption is variable and partly random: the same model on the same query can vary in cost by up to 9.7x between runs. If you charge a flat fee on top of that, your heaviest users quietly go unprofitable, and you hand your margin to a random number generator instead of pricing the real cost.