Agents ship PRs while you sleep. Who reviews them?

Cognition paid $250M for Windsurf so Devin can open PRs while you sleep. Cursor bet the opposite. The real bottleneck is review capacity, not code.

Ricardo Argüello


CEO & Founder

Software Development · 7 min read

Aakash Gupta ended a thread yesterday with a single line that reads better as a market thesis than a take:

Whoever solves review at scale wins this entire market.

The setup: OpenAI offered $3B for Windsurf. Google paid $2.4B for the CEO and researchers. Cognition picked up what was left — the product, the IDE, 350 enterprise customers, $82M in ARR — for around $250M. Aakash called it the best deal in AI coding. He is right, but not because of the multiples.

He is right because of the conversion that deal just made possible.

Two bets going in opposite directions

Cognition already had Devin, the autonomous agent that runs on its own VM with a desktop, browser and full computer use. Devin went from $1M ARR in September 2024 to $73M ARR by June 2025. Strong technical adoption. One big product problem: an autonomous agent that lives outside your editor is an agent you forget to check on. Nobody leaves the IDE to monitor a separate tool.

Windsurf gave Devin the front door it was missing. Windsurf 2.0 turns the IDE into a Kanban board for agents. You plan locally, hand off tasks with one click, close your laptop, and wake up to opened PRs.

That is the point: it is the first coding product that explicitly tells the developer to stop watching.

Cursor went the other direction. Cursor 3.0 shipped this month with an agent-first interface for managing parallel AI fleets locally. Every agent needs your screen open. Cursor hit $2B ARR in February, doubling in three months. The product gets better the more attention you give it.

Cognition did the opposite: the product gets better the less attention you give it.

The revenue gap is still huge — $2B against roughly $155M combined for Cognition plus Windsurf. But the categories have already split. Cursor sells a faster coding experience. Cognition sells engineering capacity, measured in cloud hours.

Both bets collide with the same wall.

The wall is called 90%

Chamath Palihapitiya named it yesterday with the bluntness he is known for: 90% of the code inside a company is maintenance and migration of complicated, messy existing systems. Vibe coding handles the other 10% — greenfield, simple, starting from a blank file.

Chamath gets to say this with authority. He built 8090, a company dedicated specifically to the 90%. Fortune 500 companies paying them to migrate complex systems, rewrite old ones, maintain decisions nobody documented. Not theory. His actual business.

Nikunj Kothari put it in a picture. On April 14 he posted Slack’s notification system diagram: forty-plus decision nodes, channel muted, user in Do Not Disturb, @channel mentions suppressed, subscribed threads, global prefs, channel prefs, mobile push timing. His caption is short: every time he sees a tweet claiming “I can vibe code this in a weekend” he thinks of Slack’s notification flow. It is not a rhetorical example. It is a map. Nobody one-shots that map. Not even with a perfect model.

A reply to Chamath made the point sharper: the 10% is vibe-codeable because you get to define the interface. The 90% is hard because the interface was defined by a thousand decisions made before you arrived. Chesterton’s Fence applies to every legacy system. You cannot refactor what you do not understand.

Put those two observations together and the conclusion is obvious: the future where Devin opens PRs overnight lands directly on code nobody on the team fully understands. Which brings you back to the line Aakash left at the end of his thread.

The bottleneck moved

A month ago I wrote about PR-level code review. That post’s problem was quality: the author accepts a Copilot suggestion, the reviewer opens a diff nobody on the team wrote, and the explanation chain breaks. That problem is still live.

What just showed up is a different problem, one layer up.

If five Devin agents ship overnight, you wake up to PRs in five repos touching logic you have never read. It is not that the author cannot explain. It is that there is no author available when the stack lands. The bottleneck moves from writing to reviewing. And reviewing agentic code produced at volume is a skill most teams never built, because for thirty years the supply of PRs was capped by what humans could write.

That cap just got removed. Review demand did not.

That is why Aakash’s line is the market thesis. Whoever builds the infrastructure that makes a fleet of agents reviewable captures this cycle. Not whoever ships the fastest agent.

Three debts, one liability

I have been naming the same mechanic under different labels for two posts in a row:

  • Taste debt, April 13. Pulling the human out of the loop too early bills you in brand, in unexamined decisions, in customers who stop recognizing your voice.
  • Consumption debt, April 15, the Uber burned-AI-budget post. Agents running in loops without circuit breakers show up in the P&L directly.
  • Review capacity debt. This post. Agents produce more PRs than your team can review with judgment, and the invoice arrives in production incidents, high MTTR and affected customers.

Different sides of the same liability. All amortize the same way: human review wired into the flow, not bolted on top. Shared diagnosis: the mental model we inherited — SaaS per seat, reviews during business hours, annual estimates — was designed for a world where production was capped by humans. That world is gone.

What works for review at scale

Three things that scale, and one that does not.

Risk-weighted sampling. Reviewing every agent-opened PR at the same depth does not scale. Reviewing in bands does: any change touching a critical domain — billing, auth, customer data — goes through full human review; low-band changes like renames or dependency updates with solid tests behind them run on random sampling. Sampling does not catch every bug. It catches the ones worth stopping a merge for.
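As a minimal sketch, banded routing can be a few lines in CI. The critical-path prefixes and the sampling rate below are illustrative assumptions, not any real team’s policy:

```python
import random

# Risk bands are a policy choice; these critical-path prefixes and the
# sampling rate are illustrative assumptions for this sketch.
CRITICAL_PREFIXES = ("billing/", "auth/", "customers/")

def review_decision(changed_files, sample_rate=0.2, rng=random.random):
    """Route an agent-opened PR to 'full-review' or 'sample-pass'.

    Any file under a critical domain forces full human review; everything
    else lands in the low band and only gets pulled for review at random.
    """
    if any(f.startswith(CRITICAL_PREFIXES) for f in changed_files):
        return "full-review"
    return "full-review" if rng() < sample_rate else "sample-pass"
```

Wired into CI, the low band merges on green checks while a `sample_rate` slice still gets human eyes, so reviewer time concentrates where a bug is actually worth stopping a merge for.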

Hard invariants the agent cannot skip. Contract tests, property tests, mutation testing. They do not depend on a human reviewing on time. They depend on the agent being unable to land in production unless an objective condition passes. This is what Shopify is building out: agents open PRs, but the merge requires specific gates to pass before a human even opens the diff.
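As a sketch of what such a gate looks like, a property test checks an objective invariant over many random inputs and blocks the merge by itself, no human in the loop. The `refund` function and its cap invariant here are hypothetical stand-ins for whatever condition a real pipeline enforces:

```python
import random

def refund(charge_cents, requested_cents):
    # Hypothetical function under test: a refund is clamped to [0, charge].
    return min(charge_cents, max(0, requested_cents))

def check_refund_invariant(trials=1000, seed=42):
    """Property test: for any inputs, 0 <= refund <= charge.

    This is the kind of gate an agent cannot skip: it does not depend on
    a reviewer showing up on time, only on the condition holding.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        charge = rng.randint(0, 10_000)
        requested = rng.randint(-1_000, 20_000)
        r = refund(charge, requested)
        if not 0 <= r <= charge:
            return False
    return True
```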

Blast-radius limits per execution. Explicit cap: an agent cannot touch more than N files, more than M modules, or cross layer boundaries without human approval. The goal is not to block work. It is to contain blast radius when the agent is wrong. Five PRs each touching their own service is manageable. One PR refactoring five services in one shot is not reviewable by anyone with finite time.
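A minimal version of that cap can run as a pre-merge check over the diff’s file list. The limits and the path-to-module mapping below are illustrative assumptions; a real policy would also encode layer boundaries:

```python
# Illustrative limits; tune per repo and per agent trust level.
MAX_FILES = 20
MAX_MODULES = 2

def module_of(path):
    # Treat the top-level directory as the module/service boundary.
    return path.split("/", 1)[0]

def within_blast_radius(changed_files, max_files=MAX_FILES, max_modules=MAX_MODULES):
    """Fail the check when an agent's PR touches too many files or
    crosses more module boundaries than the policy allows."""
    if len(changed_files) > max_files:
        return False
    return len({module_of(f) for f in changed_files}) <= max_modules
```

Five PRs that each pass this check are the manageable case from the text; one PR spanning five services fails it before anyone has to open the diff.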

And the thing that does not work: hiring more reviewers. If your PR supply is agentic and exponential and your reviewer supply is human and linear, you lose the math race on day one.

What we do at IQ Source

When a company hires us to scale agents in production, the first thing we ask is not which agent they plan to use. It is who will review what the agent produces when it runs at night, at dawn, on Sundays.

If the answer is “the same tech lead who reviews human PRs during business hours today,” the deployment does not work.

What we install with Team OS is the layer Aakash described without naming it: a review model that scales with agentic output, not headcount. Risk-weighted sampling, hard invariants, blast-radius limits, and human gates tied to actual cost and risk instead of line counts.

Before you authorize the next agent in production, sketch on paper who reviews it when it opens a PR at 3 AM. If the sketch comes out blank, you already know the conversation we need to have.

