Anthropic measured its own AI. Can you prove yours?
Ricardo Argüello — June 8, 2026
CEO & Founder
General summary
Anthropic published numbers about itself this week: Claude writes most of the code merged inside the company, north of 80% by this week's coverage, engineers ship around 8x more code per quarter than they did in 2021-2025, and the company frames it as a possible path to recursive self-improvement. Strip the science fiction and the important part stays: a company put a falsifiable output number on the table about itself. Meanwhile most of the industry is still flexing inputs: a customer burning 100 billion tokens a month, Copilot seat counts, $30M rounds with a Cursor license as the entire AI strategy. The new dividing line isn't who uses AI. It's who can prove it produced anything.
- Anthropic reports Claude now writes the majority of code merged at the company, above 80% by this week's coverage, and that its engineers ship roughly 8x more code per quarter than in 2021-2025.
- The radical part isn't the recursive self-improvement headline. It's that a company published a falsifiable output number about itself instead of bragging about how much it consumes.
- The rest of the industry flexes inputs: 100 billion tokens a month with no output named, Copilot seats counted as adoption, rounds raised with a Cursor license on the cover.
- Measuring output is harder than counting seats: you need a baseline and a definition of 'reached the customer.' That's why almost nobody answers the question, and why whoever does has a real edge.
- AI Maestro from IQ Source builds that baseline before accelerating: it maps the real processes, scores each one, and runs a Go/No-Go gate on outcomes, not on seats purchased.
Picture two factories with the same enormous electricity bill at month end. The first one brags about the size of the bill in a board meeting, as if spending a lot were the achievement. The second counts how many units came off the line with that same electricity. Only one of them knows whether it is making or losing money. Anthropic just made its units-per-electricity number public. Almost everyone else is still flexing the size of the bill.
AI-generated summary
This week Anthropic did something unusual for an AI company. It published numbers about itself.
Claude now writes the majority of the code merged to production inside the company, north of 80% by this week’s coverage. And its engineers, on average, ship roughly 8x more code per quarter than they did across 2021-2025. Anthropic frames this as a possible path to recursive self-improvement, AI accelerating the development of the next AI, and says it is happening faster than they expected.
The headline ran off with the science fiction, as headlines do. Some celebrated the apocalypse, others replied that Anthropic’s own staff must be depressed. And the company itself hit the brakes inside its own text: achieving recursive improvement alone, it wrote, does not by itself imply an immediate change in how industrial production or society is organized.
Strip all of that away and what’s left is the part that matters for your business. The radical thing isn’t the robot that improves itself. It’s that a company put an output number about itself on the table, one you can falsify, instead of flexing how much it consumes. That’s the thesis here: the new dividing line isn’t who uses AI. It’s who can prove it produced anything.
The number, not the robot, is the news
Set aside, for a second, whether AI will improve itself. That’s a dinner-table debate, and nobody on your board is going to act on it Monday morning.
The actionable thing is different. Anthropic said “8x more code per quarter” and “over 80% of merged code.” Those are output numbers. You can argue with them, audit them, even debunk them. Someone can ask “merged code or code that reached the customer?” or “8x measured how?”, and the question lands precisely because there is a number to argue against.
That’s the part almost nobody copied from Anthropic, and it’s the only part worth copying. Not the recursive self-improvement. The willingness to say a production number out loud, knowing someone will check it.
Because most of the companies that “adopted AI” this year don’t have a number like that. They have a feeling. They have an invoice. They have a dashboard full of tokens. What they don’t have is a single figure that proves AI got something to a customer faster or cheaper than last year.
Everyone else is flexing inputs
Mark Ajzenstadt, who runs a services company that embeds AI engineers inside product teams, put his finger on it this same week. His list is worth reading, because it’s the exact mirror image of what Anthropic did.
OpenAI’s CEO on stage flexing that a customer burns 100 billion tokens a month, with no mention of what it produced. Consulting firms billing millions for AI strategies written by people who never shipped a production agent. CTOs reporting “AI adoption” to their boards by counting Copilot seats, while nobody tracks what reaches production. Startups raising $30M rounds with “AI-native” in the deck and a Cursor license as the entire AI strategy.
Every item on that list is an input metric. Tokens consumed, seats bought, money spent, rounds raised. None of them says a word about what came out the other end.
The line Mark closes his thread with is the one that stuck with me: “I know our cost per merged PR.” One sentence, and it exposes the whole list above it. He doesn’t flex how many tokens he burns. He knows how many pull requests he closes for that spend. That’s the difference between knowing what you pay and knowing what you produce, and almost nobody on the other side of that list knows it.
Measuring output is hard. That’s why almost nobody does it.
There’s an honest reason so many people stop at input metrics: they’re easy. Counting Copilot seats is a spreadsheet. Adding up the token bill is something the vendor does for you. Neither one requires you to define what “done” means.
Measuring output does. To say “we produced 8x more” you need two things most teams don’t have: a clear definition of what counts as “reached the customer,” and an honest baseline from last year to compare against. Without those two, there’s no number, only an anecdote.
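To make those two requirements concrete, here is a minimal Python sketch. It is an illustration, not anyone's real tooling: the `Quarter` type, the `output_multiple` function, and the figures are all hypothetical. The point it shows is that a multiple only exists once both quarters are counted under the same definition of "reached the customer" and the baseline is real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quarter:
    """One quarter of output, counted under a single fixed definition (hypothetical type)."""
    label: str
    # Count only items that meet your agreed definition of "reached the customer".
    features_reaching_customers: int


def output_multiple(baseline: Quarter, current: Quarter) -> float:
    """Return current output as a multiple of the baseline quarter."""
    if baseline.features_reaching_customers <= 0:
        # No baseline means no number, only an anecdote.
        raise ValueError("no usable baseline: measure it before claiming a multiple")
    return current.features_reaching_customers / baseline.features_reaching_customers


# Hypothetical figures, for illustration only.
before_ai = Quarter("Q2 last year", features_reaching_customers=12)
with_ai = Quarter("Q2 this year", features_reaching_customers=30)

print(f"{output_multiple(before_ai, with_ai):.1f}x")  # prints 2.5x
```

If the baseline quarter was never measured, the function refuses to produce a multiple. That refusal is the honest answer most teams would have to give today.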
And here’s the trap that makes it slipperier: tests passing doesn’t mean something was worth doing. I wrote about that when Opus 4.8 shipped, about how a thousand agents can finish the wrong task with the whole test suite green. A green dashboard is an input metric wearing the costume of a result. It tells you the system ran, not that it produced something a customer needed.
That’s why neither panic nor euphoria helps. Both are ways of dodging the boring question. The boring question is: can you state, today, one honest number for what your AI got to a customer? If the answer is “let me check the token dashboard,” the answer is no.
What we do about it at IQ Source
When a company asks us to accelerate with AI, the first thing we ask for isn’t access to their tools. It’s their baseline. How much did you produce before AI, measured in something the business actually cares about? If it doesn’t exist, that’s the first job, before anyone touches the accelerator. Because accelerating with no baseline leaves you exactly where half the industry is: spending more, unable to prove anything changed.
AI Maestro is the discovery where that baseline gets built. Two months mapping the real processes of your operation, not the org-chart version, scoring each with an AI Opportunity Score, and ending in a Go/No-Go gate process by process. And the gate is decided on outcomes that reach the customer, not on seats bought. I covered the concrete metric we install in a separate piece, so this isn’t theory: cost per shipped feature, not tokens per month. That’s the figure that separates the team scaling with margin from the team burning with dignity.
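As a rough illustration of why that figure separates teams, here is a small Python sketch of cost per shipped feature. It is not the AI Maestro scoring or any IQ Source tooling. The function name, team names, and dollar amounts are hypothetical, chosen so that both teams show the same input metric.

```python
from typing import Optional


def cost_per_shipped_feature(ai_spend_usd: float, shipped_features: int) -> Optional[float]:
    """Divide AI spend by features that reached customers.

    Returns None when nothing shipped: in that case there is spend but no output
    number to report.
    """
    if shipped_features == 0:
        return None
    return ai_spend_usd / shipped_features


# Hypothetical teams with an identical AI bill for the quarter.
teams = {
    "team_a": {"spend": 40_000.0, "shipped": 20},
    "team_b": {"spend": 40_000.0, "shipped": 4},
}

report = {
    name: cost_per_shipped_feature(t["spend"], t["shipped"])
    for name, t in teams.items()
}

for name, cost in report.items():
    print(name, cost)
# team_a 2000.0
# team_b 10000.0
```

On the invoice the two teams are indistinguishable. Divided by what shipped, one is five times more expensive than the other, which is the gap an input metric cannot show.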
Anthropic measured itself this week and published the number. You don’t have to believe in recursive self-improvement to take the lesson. The lesson is simpler and more uncomfortable: next time someone at your company celebrates that AI now writes half the code, or that the marketing team uses five new tools, ask one thing before you clap. Show me the output number. If only the invoice shows up, you didn’t prove anything. You just spent with style.
Build the baseline that proves what AI produces
Frequently Asked Questions
Anthropic reported that Claude now writes the majority of the code merged to production at the company, above 80% by this week's coverage, and that its engineers ship roughly 8x more code per quarter than they did in 2021-2025. The company framed it as a possible path toward recursive self-improvement in AI development.
Because tokens, Copilot seats, and the size of the invoice measure what goes into the system, not what comes out. Two teams with the same AI spend can have opposite results. An input metric looks impressive in a board meeting and says nothing about whether anything reached a customer. It is adoption flexed, not output proven.
The one that divides real output by spend: cost per shipped feature or per merged pull request, not tokens per month. That metric separates the team scaling AI with margin from the team burning it with dignity. It requires a clear definition of 'reached the customer' and a baseline to measure the next quarter against.
AI Maestro from IQ Source builds the baseline before accelerating: across two months it maps the real processes of the operation, assigns each an AI Opportunity Score, and applies a Go/No-Go gate that is decided on outcomes reaching the customer, not on seats bought or tokens burned.