Agents ship PRs while you sleep. Who reviews them?
Ricardo Argüello — April 16, 2026
CEO & Founder
General summary
Aakash Gupta ended a thread yesterday with a single line that reads better as a market thesis than a take: “Whoever solves review at scale wins this entire market.” Cognition paid $250M for Windsurf so Devin can open PRs while you sleep. Cursor bet the opposite. Both collide with the same reality Chamath and Nikunj named the same week: 90% of enterprise code is complex, legacy and not reviewable by volume.
- Cognition bought Windsurf for roughly $250M after OpenAI offered $3B and Google paid $2.4B for the CEO and researchers — the move gave Devin the in-IDE front door it was missing
- Windsurf 2.0 is designed for you to close your laptop; Cursor 3.0 needs your screen open — two opposite bets on where the human belongs in the loop
- Chamath put it bluntly: 90% of code inside enterprises is maintenance and migration of complex existing systems, and vibe coding only serves the other 10%
- Nikunj posted Slack's notification flowchart as visual proof: nobody one-shots logic with forty decision nodes
- The bottleneck stopped being code generation. It's review — and agentic supply is exponential while human reviewer supply is linear
Picture hiring five interns who work only at night. They don't sleep. While you sleep, each one opens a pull request in a different repository, touches code you've never read, and has everything queued for merge by 8 AM. In a normal company you'd review a couple with coffee and approve during business hours. Now you have five a day. Then ten. Then twenty. The problem isn't that the interns write bad code. It's that your capacity to review with judgment doesn't grow at the same rate. That gap is review capacity debt.
AI-generated summary
Aakash Gupta ended a thread yesterday with a single line that reads better as a market thesis than a take:
Whoever solves review at scale wins this entire market.
The setup: OpenAI offered $3B for Windsurf. Google paid $2.4B for the CEO and researchers. Cognition picked up what was left — the product, the IDE, 350 enterprise customers, $82M in ARR — for around $250M. Aakash called it the best deal in AI coding. He is right, but not because of the multiples.
He is right because of the conversion that deal just made possible.
Two bets going in opposite directions
Cognition already had Devin, the autonomous agent that runs on its own VM with a desktop, browser and full computer use. Devin went from $1M ARR in September 2024 to $73M ARR by June 2025. Strong technical adoption. One big product problem: an autonomous agent that lives outside your editor is an agent you forget to check on. Nobody leaves the IDE to monitor a separate tool.
Windsurf gave Devin the front door it was missing. Windsurf 2.0 turns the IDE into a Kanban board for agents. You plan locally, hand off tasks with one click, close your laptop, and wake up to opened PRs.
That is the point: it is the first coding product that explicitly tells the developer to stop watching.
Cursor went the other direction. Cursor 3.0 shipped this month with an agent-first interface for managing parallel AI fleets locally. Every agent needs your screen open. Cursor hit $2B ARR in February, doubling in three months. The product gets better the more attention you give it.
Cognition did the opposite: the product gets better the less attention you give it.
The revenue gap is still huge — $2B against roughly $155M combined for Cognition plus Windsurf. But the categories have already split. Cursor sells a faster coding experience. Cognition sells engineering capacity, measured in cloud hours.
Both bets collide with the same wall.
The wall is called 90%
Chamath Palihapitiya named it yesterday with the bluntness he is known for: 90% of the code inside a company is maintenance and migration of complicated, messy existing systems. Vibe coding handles the other 10% — greenfield, simple, starting from a blank file.
Chamath gets to say this with authority. He built 8090, a company dedicated specifically to the 90%: Fortune 500 companies pay it to migrate complex systems, rewrite old ones, and maintain decisions nobody documented. Not theory. His actual business.
Nikunj Kothari put it in a picture. On April 14 he posted Slack’s notification system diagram: forty-plus decision nodes, channel muted, user in DnD, @channel mentions suppressed, subscribed threads, global prefs, channel prefs, mobile push timing. His caption is short: every time he sees a tweet claiming “I can vibe code this in a weekend” he thinks of Slack’s notification flow. It is not a rhetorical example. It is a map. Nobody one-shots that map. Not even with a perfect model.
A reply to Chamath made the point sharper: the 10% is vibe-codeable because you get to define the interface. The 90% is hard because the interface was defined by a thousand decisions made before you arrived. Chesterton’s Fence applies to every legacy system. You cannot refactor what you do not understand.
Put those two observations together and the conclusion is obvious: the future where Devin opens PRs overnight lands directly on code nobody on the team fully understands. Which brings you back to the line Aakash left at the end of his thread.
The bottleneck moved
A month ago I wrote about PR-level code review. That post’s problem was quality: the author accepts a Copilot suggestion, the reviewer opens a diff nobody on the team wrote, and the explanation chain breaks. That problem is still live.
What just showed up is a different problem, one layer up.
If five Devin agents ship overnight, you wake up to PRs in five repos touching logic you have never read. It is not that the author cannot explain. It is that there is no author available when the stack lands. The bottleneck moves from writing to reviewing. And reviewing agentic code produced at volume is a skill most teams never built, because for thirty years the supply of PRs was capped by what humans could write.
That cap just got removed. Review demand did not.
That is why Aakash’s line is the market thesis. Whoever builds the infrastructure that makes a fleet of agents reviewable captures this cycle. Not whoever ships the fastest agent.
Three debts, one liability
I have been naming the same mechanic under different labels for two posts in a row:
- Taste debt, April 13. Pulling the human out of the loop too early bills you in brand, in unexamined decisions, in customers who stop recognizing your voice.
- Consumption debt, April 15, the Uber burned-AI-budget post. Agents running in loops without circuit breakers show up in the P&L directly.
- Review capacity debt. This post. Agents produce more PRs than your team can review with judgment, and the invoice arrives in production incidents, high MTTR and affected customers.
Different sides of the same liability, and all amortize the same way: human review wired into the flow, not bolted on top. The shared diagnosis: the mental model we inherited — SaaS per seat, reviews during business hours, annual estimates — was designed for a world where production was capped by humans. That world is gone.
What works for review at scale
Three things that scale, and one that does not.
Risk-weighted sampling. Reviewing every agent-opened PR with the same weight is mathematically impossible. Reviewing in bands works: any change touching a critical domain — billing, auth, customer data — goes through full human review; low-band changes like renames or dependency updates with solid tests behind them run on random sampling. Sampling does not catch every bug. It catches the ones worth stopping a merge for.
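The banding above can be sketched as a small routing function. The path prefixes, the sampling rate, and the decision labels here are illustrative assumptions, not a recommended policy:

```python
# Sketch of risk-weighted sampling for agent-opened PRs.
# Prefixes and sample rate are placeholders for a team's real policy.
import random

CRITICAL_PREFIXES = ("billing/", "auth/", "customers/")  # always full human review
LOW_RISK_SAMPLE_RATE = 0.15  # randomly review ~15% of low-band PRs

def review_decision(changed_files: list[str], rng=random.random) -> str:
    """Route a PR to 'full_review', 'sampled_review', or 'auto_merge_gates_only'."""
    if any(f.startswith(CRITICAL_PREFIXES) for f in changed_files):
        return "full_review"
    if rng() < LOW_RISK_SAMPLE_RATE:
        return "sampled_review"
    return "auto_merge_gates_only"
```

The `rng` parameter exists only to make the sampling branch testable; in production it would stay at `random.random`.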
Hard invariants the agent cannot skip. Contract tests, property tests, mutation testing. They do not depend on a human reviewing on time. They depend on the agent being unable to land in production unless an objective condition passes. This is what Shopify is building out: agents open PRs, but the merge requires specific gates to pass before a human even opens the diff.
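A minimal sketch of what such a gate layer might look like. The gate names and the PR metadata fields are hypothetical — Shopify's actual gates are not public:

```python
# Hard merge invariants: objective checks an agent cannot skip.
# Gate names and metadata keys are illustrative, not any vendor's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Gate:
    name: str
    check: Callable[[dict], bool]  # takes PR metadata, returns pass/fail

GATES = [
    Gate("contract_tests_pass", lambda pr: pr.get("contract_tests") == "pass"),
    Gate("coverage_not_reduced", lambda pr: pr.get("coverage_delta", 0) >= 0),
    Gate("no_secrets_in_diff", lambda pr: not pr.get("secrets_found", False)),
]

def merge_allowed(pr: dict) -> tuple[bool, list[str]]:
    """A human only opens the diff after every gate passes."""
    failed = [g.name for g in GATES if not g.check(pr)]
    return (len(failed) == 0, failed)
```

The point of the structure is that the gates are data, not reviewer habits: the agent cannot negotiate with a list.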
Blast-radius limits per execution. Explicit cap: an agent cannot touch more than N files, more than M modules, or cross layer boundaries without human approval. The goal is not to block work. It is to contain blast radius when the agent is wrong. Five PRs each touching their own service is manageable. One PR refactoring five services in one shot is not reviewable by anyone with finite time.
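A blast-radius check can live as a few lines in the merge pipeline. The caps and the top-level-directory-equals-module assumption below are illustrative:

```python
# Sketch of a per-execution blast-radius limit for agent PRs.
# MAX_FILES / MAX_MODULES are placeholder caps, not recommendations.
MAX_FILES = 20
MAX_MODULES = 2

def module_of(path: str) -> str:
    # Assumes the top-level directory maps to a module/service boundary.
    return path.split("/", 1)[0]

def within_blast_radius(changed_files: list[str]) -> bool:
    """Reject PRs that spread across too many files or module boundaries."""
    if len(changed_files) > MAX_FILES:
        return False
    if len({module_of(f) for f in changed_files}) > MAX_MODULES:
        return False
    return True
```

An agent that trips the limit is asked to split the work, which is exactly the "five PRs, one service each" shape the paragraph above calls reviewable.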
And the thing that does not work: hiring more reviewers. If your PR supply is agentic and exponential and your reviewer supply is human and linear, you lose the math race on day one.
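The math race can be made concrete with a toy model: PR supply compounding weekly against reviewer capacity growing by a fixed headcount. Every number here is invented for illustration:

```python
# Toy model: compounding agentic PR supply vs. linear reviewer capacity.
# All parameters are illustrative defaults, not measured data.
def weeks_until_backlog(pr_growth=1.10, prs0=50, cap0=60, cap_add=2):
    """Return the week the unreviewed backlog exceeds a month of capacity."""
    prs, cap, backlog = prs0, cap0, 0
    for week in range(1, 200):
        backlog = max(0, backlog + prs - cap)
        if backlog > 4 * cap:  # more than ~a month of PRs waiting
            return week
        prs *= pr_growth   # agent supply compounds
        cap += cap_add     # hiring adds capacity linearly
    return None  # backlog stays bounded within the horizon
```

With these defaults the backlog looks fine for a quarter and then explodes around week 14; with flat PR supply (`pr_growth=1.0`) it never does. That is the shape of losing on day one without noticing until day ninety.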
What we do at IQ Source
When a company hires us to scale agents in production, the first thing we ask is not which agent they plan to use. It is who will review what the agent produces when it runs at night, at dawn, on Sundays.
If the answer is “the same tech lead who reviews human PRs during business hours today,” the deployment does not work.
What we install with Team OS is the layer Aakash described without naming it: a review model that scales with agentic output, not headcount. Risk-weighted sampling, hard invariants, blast-radius limits, and human gates tied to actual cost and risk instead of line counts.
Before you authorize the next agent in production, sketch on paper who reviews it when it opens a PR at 3 AM. If the sketch comes out blank, you already know the conversation we need to have.
Frequently Asked Questions
What is review at scale?
Review at scale is the process of auditing code generated by fleets of autonomous agents (like Devin inside Windsurf 2.0) that open PRs without direct human intervention. Unlike traditional code review, it assumes PR supply is exponential and reviewer supply is linear, so it uses risk-weighted sampling, hard CI/CD invariants and per-execution blast-radius limits instead of full manual review.
Why did Cognition buy Windsurf?
OpenAI offered $3 billion for Windsurf, but Google pulled the CEO and research team out for $2.4 billion. Cognition bought the remaining company — the IDE, 350 enterprise customers and $82 million in ARR — for roughly $250 million. The move was strategic: it gave their autonomous agent Devin the in-IDE front door it needed to show up inside the developer's daily flow.
How do Cursor 3.0 and Windsurf 2.0 differ?
Cursor 3.0 launched an agent-first interface for managing parallel AI fleets locally, where every agent requires the developer's active attention. Windsurf 2.0 turns the IDE into a Kanban board that hands tasks off to Devin in the cloud and is explicitly designed to let you close your laptop. They are opposite bets on where the human should live inside the generation loop.
What is review capacity debt?
Review capacity debt is the operational liability a company takes on when its agents produce more pull requests than the team can review with human judgment. It amortizes through risk-weighted sampling, hard invariants enforced in CI/CD, and human gates tied to actual cost and risk rather than line counts. Hiring more reviewers does not solve it because agentic supply grows exponentially while human supply grows linearly.
Related Articles
The enablement layer IS the runtime: Miller's framing breaks
Adam Miller says Goose is neutral plumbing and the enablement layer is where engineering belongs. Hannah Stulberg's DoorDash Team OS proves exactly the opposite.
npm's Worst Day: One Attack, One Leak, Zero Trust
Axios was hijacked to deploy a RAT. Claude Code's source leaked via source maps. Same registry, same day — two failure modes your team needs to understand.