
Code is not cheap: AI productivity is a codebase property

Anthropic writes close to 100% of its code with AI, and Google reacted. Pocock and Huryn explain why: AI productivity is a codebase property, not a model property.


Ricardo Argüello


CEO & Founder

Software Development 12 min read

On Tuesday, April 21, Linas Beliūnas posted on LinkedIn that Google has assembled a strike team led by Sergey Brin to close the gap with Anthropic on coding. The number that triggered the move is what made boards uncomfortable this week: at Google around 50% of new code is written by AI. At Anthropic, close to 100%. Brin stepped in personally because the executive committee had no other explanation for the gap.

The headline reads as an Anthropic win. But it opens a more interesting question. If the model is the same Claude Opus 4.7 that anyone can buy through the API, why can Anthropic run at 100% while Google needs a strike team to reach 50%?

The answer is not that Anthropic has a secret Claude. It is that Anthropic has the right codebase for Claude to run at 100%. Most software teams do not.

The inconvenient witness from the side that teaches

Matt Pocock teaches a course called Claude Code for Real Engineers. He makes a living selling AI to developers. Last week he gave a conference talk where he said something you do not hear in vendor keynotes. He tried the specs-to-code workflow that has been everywhere on X: write a markdown specification, let the AI generate code, never look at the code, only edit the spec when something breaks. His report: each iteration produced worse code than the previous one. More bugs, less coherence, more drift. The AI compiler never converged.

The line he delivered on stage: “Code is not cheap. In fact, bad code is the most expensive it’s ever been.” The reasoning, the kind any CTO with five years in the trade should keep on the wall: when AI can ship ten times more code in the same time, a codebase that is hard to change absorbs that speed and converts it into garbage. Speed without architecture is compounding debt.

Pocock’s way out was not a more expensive model. It was three books that have been on the shelf for twenty years. A Philosophy of Software Design by John Ousterhout, which defines complexity as anything in the structure of a system that makes it hard to understand or modify. Domain-Driven Design by Eric Evans, which introduces the ubiquitous language shared by developers, domain experts, and now agents. The Pragmatic Programmer by Hunt and Thomas, with its chapter on software entropy: every change made in isolation, without thinking about the system, degrades the codebase a little more.

What Pocock describes is what every software engineering book of the last twenty years has called good practice. The difference is that before AI, those practices were preference. On a 15-engineer team, a poorly structured codebase cost planning meetings, long onboarding, and Friday-afternoon commits that broke production. The team still shipped. With a poorly structured codebase and an AI writing 80% of the code, the team stops shipping.

The bill from the side that measures

Pawel Huryn is a product manager with 130K subscribers to his product newsletter. On Saturday, April 25, he posted a piece that hit 261K views. The title: “Claude Code’s Limits Are Generous. The Problem Is Your Harness.”

The data point that opens the post is the one CFOs care about. A month ago Huryn was paying $750/month for Claude Code Max 20x plus extra usage. Today he pays $100. Same workflow. Same model. Same prompts. The difference: he cleaned up the harness.

The four levers Huryn names are specific and measurable.

The first one is cache discipline. Every time you change tools or model mid-session, Anthropic’s cache prefix invalidates, and the next turn gets billed as new tokens instead of as cache reads, which cost a tenth. A healthy hit rate sits at 90% on the five-minute cache, and the way you get there is by locking tools and model at session start.
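The tenth-of-the-price claim turns into simple arithmetic. A minimal sketch, assuming hypothetical per-token prices (the dollar rates below are placeholders for illustration, not Anthropic's price list):

```python
# Illustrative arithmetic only: both rates are assumed placeholders.
# Cache reads bill at roughly a tenth of fresh input, per the post.
FRESH_PER_MTOK = 3.00                  # hypothetical $/1M fresh input tokens
CACHED_PER_MTOK = FRESH_PER_MTOK / 10  # cache reads at a tenth

def turn_input_cost(prefix_tokens: int, hit_rate: float) -> float:
    """Expected input cost of one turn over a reusable cached prefix."""
    cached = prefix_tokens * hit_rate
    fresh = prefix_tokens - cached
    return (cached * CACHED_PER_MTOK + fresh * FRESH_PER_MTOK) / 1_000_000

# A 200K-token prefix: a 90% hit rate vs. a session that keeps
# invalidating the prefix by switching tools or models mid-stream.
healthy = turn_input_cost(200_000, 0.90)
churning = turn_input_cost(200_000, 0.0)
print(f"healthy ${healthy:.3f}/turn vs churning ${churning:.3f}/turn")
```

Locking tools and model at session start is what keeps the hit rate high; every mid-session switch pushes it toward zero and re-bills the whole prefix as fresh input.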

Context budget is the second one. If you let the session grow to the 1M window, every turn pays for 1M tokens, even if you only need 50K of them. Compacting at 50%, separating unrelated work with /clear, and routing mechanical bulk to subagents that do not need the parent’s reasoning all cut that line item.
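The arithmetic behind this lever, with an assumed flat input rate (the dollar figure is a placeholder; only the linear scaling matters):

```python
# Assumed flat rate; the point is that input cost scales linearly with
# context size, so an oversized window taxes every single turn.
INPUT_PER_MTOK = 3.00  # hypothetical $/1M input tokens

def session_input_cost(context_tokens: int, turns: int) -> float:
    """Each turn re-reads the whole context."""
    return turns * context_tokens * INPUT_PER_MTOK / 1_000_000

bloated = session_input_cost(1_000_000, 50)  # window allowed to hit 1M
compact = session_input_cost(50_000, 50)     # compacted to the needed 50K
print(bloated / compact)  # 20x, independent of the assumed rate
```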

The third lever is model routing. Opus by default costs 5x more than Sonnet and 25x more than Haiku, and most teams pay for Opus on tasks Sonnet would handle just fine. Route every subtask to the cheapest model that can do it well. Mechanical work goes to Haiku, scoped exploration to Sonnet, real planning with tradeoffs to Opus.
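A sketch of the routing idea. The cost ratios mirror the ones the post cites (Opus at 5x Sonnet, 25x Haiku); the task labels are hypothetical, not Claude Code's actual categories:

```python
# Relative cost per token, matching the 1 : 5 : 25 ratio in the post.
RELATIVE_COST = {"haiku": 1, "sonnet": 5, "opus": 25}

MECHANICAL = {"rename", "reformat", "apply-codemod"}           # hypothetical labels
SCOPED = {"explore-module", "summarize-diff", "scoped-fix"}

def route(task: str) -> str:
    """Cheapest model that can do the subtask well; Opus only for planning."""
    if task in MECHANICAL:
        return "haiku"
    if task in SCOPED:
        return "sonnet"
    return "opus"  # real planning with tradeoffs

# Paying Opus rates for a rename is a 25x overspend on that subtask.
assert RELATIVE_COST[route("rename")] == 1
assert RELATIVE_COST[route("plan-migration")] == 25
```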

Input format is the fourth, and the easiest to overlook. PDFs read as images cost five to ten times what the same content costs as pdftotext. Browsing with the accessibility tree instead of screenshots cuts ~90% of the tokens on web research. These two changes alone collapse the token count on document and web tasks.
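The PDF lever is back-of-envelope arithmetic too. The per-page token counts below are assumptions for illustration, not measured values:

```python
# Assumed per-page token counts, for illustration only.
IMAGE_TOKENS_PER_PAGE = 1_500  # a page ingested as an image
TEXT_TOKENS_PER_PAGE = 250     # the same page extracted via pdftotext

pages = 40
as_images = pages * IMAGE_TOKENS_PER_PAGE
as_text = pages * TEXT_TOKENS_PER_PAGE
print(as_images / as_text)  # 6x here, inside the 5-10x range the post cites
```

The same logic applies to browsing: the accessibility tree is a text rendering of the page, so it sits on the cheap side of the same ratio while screenshots sit on the expensive side.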

None of those four levers required a new model. None required paying more. What they required was treating the harness as a product, with its own measurement and improvement discipline. Which, again, is what a good engineer would call good practice.

The validation that closes the picture

On Sunday, April 26, Guillermo Rauch, CEO of Vercel, wrote that coding agents will be the foundation of all superintelligence. The reasoning: coding ability is indistinguishable from “proficiency with computers”. An agent that masters bash, filesystems, configuration, and program installation can also examine its own source, its state, and its instructions, and propose changes to itself. Lee Robinson, also from the Vercel orbit, added an observation that was not obvious a year ago: that an excellent coding agent would also be the path to a general agent for all knowledge work. If the coding agent is the foundation, the codebase that agent runs on is the foundation of the foundation. That layer is not commoditizing soon, because it is the only one that captures the unique context of your business.

The pattern I have been watching since 1990

I have been in this 36 years. I started in 1990, at 15, programming on a Commodore 64 and a Texas Instruments. Since then I have watched the same pattern at least five times: an infrastructure layer becomes cheap, everyone assumes that means it no longer matters, and five years later they discover the layer that sits on top of cheap is what decides who wins.

Four quick reminders before we move on. Around 1995, when Intel made processors affordable to most companies, the people who had not learned Big O or algorithm design discovered fast that cheap CPUs do not compensate for badly written code. Fiber and DSL arrived in the early 2000s at prices that would have looked like science fiction a decade earlier, and frontends that did not optimize payloads kept feeling slow no matter how much bandwidth was available. AWS rewrote the cost of compute starting in 2010, and many teams found out that cheap cloud plus a broken architecture equaled a six-figure invoice before the first quarter closed, not scalability. The SaaS wave of the second half of the decade compressed enterprise software from six-figure licenses to three-figure subscriptions, and badly designed workflows ended up paying for subscriptions that never moved the P&L.

Now the fifth cycle is here. AI tokens are dropping toward the $0.08-per-runtime-hour that Anthropic just set as the reference price, and tier-two models like GLM-5.1 deliver Opus-class performance at a tenth of the cost. The temptation is to repeat the mistake of the previous four cycles: mistaking the layer that just got cheap for a layer that no longer matters.

What gets cheap each cycle is not the problem. The problem is what sits on top. And what sits on top this time is the codebase. That is the moat of the next five years, and it is exactly the piece Anthropic already has and most teams do not.

The distinction that matters

There is a real difference between the codebase that gets redesigned every year inside a SaaS product and the codebase that is the product. For the first one, an imperfect architecture is technical debt that gets paid eventually; the team still bills subscriptions. For the second one, the codebase is the balance sheet. Every architectural decision is either an asset or a liability, depending on whether it rewards or punishes the agent that will live there for the next five years.

That second group is where this change lands hardest. Software companies whose product lives in the codebase from day one. Industrial automation. AR EdTech. Vertical fintech platforms in LatAm. Medical software. Specialized ERPs. Each one has a domain its team understands better than anyone else, and that is not worth re-learning. But most do not have anyone on the team who has spent the last six months thinking about how to design a codebase so an agent can maintain it without the per-turn cost defaulting to Opus.

This is exactly where last week’s fugazi conversation lands in operation. The one-shot is a demo. The real codebase, where paying customers run real transactions, is still work.

What we do at IQ Source about this

Tech Partner exists specifically for software companies whose product lives in the codebase from day one. It is not staff augmentation. It is not an agency that will put hands on anything. It is a partner that joins the client’s team with a specific thesis: that the codebase, not the model, is the moat for the next five years, and that designing it so it rewards agents instead of punishing them takes craft you buy by the hour.

What we measure in the first map is not complicated, but it takes discipline. Are the modules deep or shallow? A deep module, as Ousterhout described it, has a lot of functionality behind a simple interface. A shallow module has a little functionality behind a complicated interface. Shallow codebases are the ones that drive AI crazy, because to understand something simple it has to walk through twenty different functions. Is there a shared vocabulary between the team and the agents, or does each function call the same thing by three different names? Is the test boundary at the module interface, or scattered across internal details? Does anyone on the team own the harness, measure the cache hit rate, know when to invalidate the prefix, and show the cost-per-feature curve to leadership?
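Ousterhout's deep-versus-shallow distinction is easier to see in code. A minimal, hypothetical sketch (the slug example is ours, not from any of the books or posts cited):

```python
# Shallow: a little functionality spread across a wide interface.
# The caller must learn four calls and their required order.
class ShallowSlug:
    def __init__(self):
        self.parts = []
    def feed(self, text):
        self.parts.append(text)
    def lowercase(self):
        self.parts = [p.lower() for p in self.parts]
    def render(self):
        return "-".join(" ".join(self.parts).split())

# Deep: the same functionality behind one simple call. One signature
# to read, nothing to sequence, nothing to forget.
def slugify(text: str) -> str:
    """Lowercase a title and join its words with hyphens."""
    return "-".join(text.lower().split())
```

An agent working in the shallow version has to walk the whole call sequence to understand one operation; in the deep version it reads one signature and moves on.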

None of those four questions need a new model. All four need someone who has lived in real codebases with real agents long enough to have made the mistakes and corrected the pattern. That is what Tech Partner sells: time already invested, mistakes already made, pattern already stable.

AI Maestro, the other line, applies when the question is not “how do I design my codebase for agents” but “where in my operation does AI fit and where does it not.” The two lines complement each other but answer different questions and get hired at different points in the company’s cycle.

Five questions before signing the next AI productivity check

If the executive committee is about to approve the next AI productivity invoice, there is a short test worth running before the signature.

Start with the most concrete question: does the codebase reward agents or punish them? A week of measurement reveals it. If the AI ships code that gets worse each iteration (the pattern Pocock reported on stage), the codebase is punishing. If generated code integrates cleanly, it is rewarding.

Move to the language. Is the shared vocabulary between the team and the agents captured anywhere? And here the wiki nobody updates does not count. What counts is a living document with domain terms, canonical entity names, and the invariants that must not break. Without one, the AI invents three different versions of the same concept across three different files, and nobody notices until the bug reaches production.

The third checks the tests. Ideally they live at the module interface, not scattered across internal details. When tests sit at the interface, the agent can rewrite everything inside without breaking the contract. When they are scattered, any refactor breaks twenty tests for the wrong reason.
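A minimal sketch of the difference, with a hypothetical module of our own invention:

```python
# Hypothetical module: the public contract is one function.
def normalize_email(raw: str) -> str:
    """Public contract: trimmed, lowercased, one canonical form."""
    return _strip(_lower(raw))

# Internal details, free to be renamed, merged, or deleted in a refactor.
def _lower(s: str) -> str:
    return s.lower()

def _strip(s: str) -> str:
    return s.strip()

# Interface-level test: it survives any rewrite of the internals.
assert normalize_email("  Ada@Example.COM ") == "ada@example.com"
# The anti-pattern is asserting on _lower and _strip directly: then any
# internal rename breaks twenty tests for the wrong reason.
```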

Four and five travel together. Does anyone measure the cache hit rate? Does anyone own the harness, with a real name attached? If the answer to the first one is no, you are paying 10x per turn. If the answer to the second is “the whole team” or “the AI lead”, the operational answer is nobody, and the harness will degrade at the pace of the next Opus release.

If your team cannot answer the five with clarity, the next useful conversation is two hours long. We map one of your repos, answer the ones we can answer today, and write down what needs intervention. No quote attached. The address is the usual one: info@iqsource.ai.

Brin stepped in personally. Pocock stopped talking to the model and went back to reading Ousterhout. Huryn cleaned the harness and shrank the bill seven times. The three stories sound different and end in the same place. Code is not cheap. The team that designs the codebase for the next five years wins the next five years. The team that assumes the model will do the work pays the bill of the team that designed.
