
The agentic moat is not the model. It's seven files.

An April 30 paper from Fudan + Peking measures seven harness components. The system prompt is the only one that regresses below baseline when isolated.

Ricardo Argüello

CEO & Founder

Software Development · 7 min read

On April 29, Elvis Saravia posted a thread, now at 128K views, about “Agentic Harness Engineering”, a paper from Fudan + Peking + Shanghai Qiji Zhifeng uploaded to arXiv the day before. Every thread that picked it up quoted the same rising number: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% over ten loop iterations, with the base model held fixed. The evolved harness beats the human-designed Codex-CLI harness (71.9%).

There is another number. The one no thread quoted. It sits two pages later, in Table 3.

The seven files of the harness

The paper formalizes an idea that had been floating as folklore for two years. A coding agent is not just the model. It is the model plus seven editable components that live as files in a workspace: system prompt, tool descriptions, tool implementations, middleware, skills, sub-agent configuration, and long-term memory.
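The paper does not prescribe file names, but a hypothetical workspace layout makes the inventory concrete (the names below are illustrative, not the paper's):

```
harness/
├── system_prompt.md      # the universal discipline
├── tools/
│   ├── descriptions.yaml # what the model is told each tool does
│   └── impl/             # the tool implementations themselves
├── middleware/           # hooks around every tool call
├── skills/               # reusable task recipes
├── subagents.yaml        # sub-agent configuration
└── memory/               # long-term memory, accumulated across runs
```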

The contribution is not naming the seven. The contribution is treating each one as a versioned file with line-level diff, instant rollback, and a change manifest that predicts which tasks the edit should fix and which it puts at risk of regression. Each edit becomes a falsifiable contract: the next round of evaluation either confirms or reverts it.
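A minimal sketch of that contract in Python (the class name and fields are my assumption, not the paper's schema):

```python
from dataclasses import dataclass, field

@dataclass
class ChangeManifest:
    """One harness edit, expressed as a falsifiable contract."""
    component: str          # e.g. "middleware/retry.py"
    diff: str               # the line-level diff being applied
    predicted_fixes: set[str] = field(default_factory=set)  # task IDs it should repair
    predicted_risks: set[str] = field(default_factory=set)  # task IDs it puts at risk

def confirm_or_revert(m: ChangeManifest,
                      before: dict[str, bool],
                      after: dict[str, bool]) -> bool:
    """Keep the edit only if the next evaluation round confirms it:
    any task that passed before and fails now forces a rollback."""
    regressed = {t for t, ok in before.items() if ok and not after.get(t, False)}
    return not regressed
```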

The Table 3 nobody quoted

Table 3 does the boring but crucial thing: it isolates the effect of each component. It takes the baseline harness (NexAU₀, 69.7% pass@1) and adds one evolved component at a time:

  • + memory only: +5.6 points
  • + tool only: +3.3 points
  • + middleware only: +2.2 points
  • + system_prompt only: -2.3 points

The system prompt was the only component that regressed when isolated from the others. And the second surprise sits in the same table: on Hard tasks, memory alone beats the full evolved harness.

The authors give the cause in one line: “the system prompt encodes 79 lines of universal discipline whose executability depends on the other three.” Discipline without machinery is noise. The agent reads “verify before publishing” but has no middleware that enforces the verification, so the result is more turns spent re-checking work that should have been guarded automatically.
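What that machinery could look like, as a sketch: a pre-tool-call hook that refuses a publish until verification has actually run. This is my illustration of the idea, not code from the paper; the hook signature and the pytest call are assumptions:

```python
import subprocess

def verify_before_publish(tool_name: str, tool_args: dict) -> dict:
    """Pre-tool-call middleware: enforce 'verify before publishing'
    mechanically instead of asking the model to remember it."""
    if tool_name in {"publish", "git_push"}:
        # Run the project's test suite; block the call if it fails.
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode != 0:
            raise PermissionError(
                f"{tool_name} blocked: verification failed\n{result.stdout[-500:]}")
    return tool_args  # the call proceeds only past a green suite
```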

That sentence reads like a delayed description of what Pawel Huryn had already published on X on April 25. Huryn cut his Claude Code monthly bill from $750 to $100 with no model swap, by tuning four levers: cache hit rate, context budget, model routing, and input format. All four map onto middleware, tools, and tool descriptions. None map onto the prompt. What Huryn did intuitively, the paper formalizes with numbers.
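Two of those levers, sketched as middleware rather than prompt text (the model names and thresholds are placeholders, not Huryn's actual configuration):

```python
def route_request(messages: list[dict], context_budget: int = 30_000) -> dict:
    """Model routing plus a context budget as middleware. Keeping a stable
    prefix also improves the cache hit rate, since caches key on prefixes."""
    total = sum(len(m["content"]) for m in messages)
    if total > context_budget:
        # Trim the middle: stable head (cacheable), recent tail (relevant).
        messages = messages[:2] + messages[-8:]
    # Short, simple requests do not need the frontier model.
    model = "small-fast-model" if total < 4_000 else "frontier-model"
    return {"model": model, "messages": messages}
```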

Cross-model transfer is the empirical moat

Section 4.3 is the one that closes the loop with the code-is-not-cheap and runtime-commodity arguments. The authors take the harness evolved on GPT-5.4 high and, with no further training, evaluate it on five different base models. All five runs deliver positive gains, between +2.3 and +10.1 points, the largest on deepseek-v4-flash and qwen-3.6-plus.

Across model families. The seven harness files do not encode Claude-specific or GPT-specific tricks. They encode general patterns of how coding-agent work gets done. When Anthropic ships Claude Sonnet 4.7 next quarter and a team decides to swap, a well-designed harness survives the change. A system prompt carefully tuned to the previous model does not.
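Once the harness is files, the transfer test is mechanically trivial: run the identical directory against several base models. A sketch, where eval_harness is a stand-in for whatever benchmark runner a team already has:

```python
def transfer_check(harness_dir: str, models: list[str],
                   eval_harness) -> dict[str, float]:
    """Evaluate the same harness files on each base model, so any score
    difference is attributable to the model swap, not the harness."""
    return {m: eval_harness(harness_dir, base_model=m) for m in models}

# Hypothetical usage:
# scores = transfer_check("harness/", ["model-a", "model-b", "model-c"], my_runner)
```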

The asterisk: regression blindness

The paper is honest about what the loop does not do well. Section 4.4.2: the agent’s precision at predicting which tasks an edit will fix is 33.7% (5x the random baseline). Its precision at predicting which tasks it will break is 11.8% (~2x the random baseline). The loop is reasonably reliable at naming what it repairs. It is nearly blind at naming what it breaks.
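Both precisions are set arithmetic over the manifest sketched earlier: of the tasks an edit claimed it would fix (or break), how many actually flipped between runs. A sketch, reusing the hypothetical ChangeManifest fields:

```python
def prediction_precision(predicted: set[str],
                         before: dict[str, bool],
                         after: dict[str, bool],
                         flipped_to: bool) -> float:
    """Fraction of predicted task IDs whose pass status actually flipped
    to `flipped_to` (True = fixed, False = broken) between runs."""
    hits = {t for t in predicted
            if before.get(t, False) is not flipped_to
            and after.get(t, False) is flipped_to}
    return len(hits) / len(predicted) if predicted else 0.0

# fix_precision   = prediction_precision(m.predicted_fixes, before, after, True)
# break_precision = prediction_precision(m.predicted_risks, before, after, False)
```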

On April 30, Rohan Paul posted a summary of Microsoft’s DELEGATE-52 paper with the same symptom: even frontier models corrupt around 25% of document content when delegated long edit chains, because they cannot self-attribute regressions. That is why Howie Liu runs 30 Claude Code instances in parallel on HyperAgent with cross-instance PR review. Cross-PR review is not aesthetic. It is the only strategy that catches the regressions the autonomous loop cannot name.

The pattern I have been watching since 1990

I have been in this business for 36 years. I started in 1990, at 15, on a Commodore 64. I have seen the same cycle four times: assembly to compilers, hand-rolled SQL to ORMs and indexes, manual server tuning to declarative orchestration, custom UI primitives to framework integration. Each time, the layer that became cheap was not the problem; the problem was the layer growing on top. The teams that overinvested in the cheap layer lost the cycle. The teams that learned to own the layer above won.

Fifth cycle. The system prompt is the layer the industry is sharpening. The seven harness files are the layer growing above. A team investing this week in prompt templates is repeating the mistake the DBAs were making in 2003 when they kept memorizing Oracle hints.

Five questions before signing the next AI productivity check

If the executive committee is about to approve the next AI productivity invoice, a short test is worth running before anyone signs.

Start with inventory. Of the seven components, which actually exist in your repo as versioned code with file-level diff and rollback? Any piece that “lives in a Slack channel” is folklore, not engineering, and that distinction will matter the next time something regresses.

Then push on falsifiability. Each change should carry a written prediction of what it ought to fix and what it puts at risk, verified against the next run. Without that loop, what teams call prompt engineering is folklore with git on top.

Transferability is the third question, and the easiest to defer. Have you run the same wrapper on a base model other than the one it was tuned against? Until you do, you cannot tell whether your edits encode general agent experience or a peg to a single model whose successor ships next quarter.

The fourth is the one most teams skip in production. When a change breaks tasks that previously worked, what catches it before merge? Howie Liu’s HyperAgent setup answers with cross-instance PR review, not because it is elegant but because the autonomous loop’s regression-prediction precision is 11.8% and your team does not want to operate at that altitude.

And the closing question is about a name. Who owns these seven files, with a real name attached? “The whole team” and “the AI lead” both translate operationally to nobody, and ownerless wrappers degrade at the cadence of the next model release.

If your team cannot answer the five with clarity, the next useful conversation is two hours long. We map one of your real setups, mark which files are engineering and which are folklore, and write down what needs intervention. No sales quote attached. The address is the usual one: info@iqsource.ai.

What we do at IQ Source about this

AI Maestro exists so the seven-file audit happens before the wrapper becomes a load-bearing part of the business. Most executive committees discover during the exercise that four or five of the seven components do not exist as engineering artifacts in their company.

Tech Partner, the other line, applies to software companies whose product lives in the critical zone from day one. For that kind of company, the wrapper stops being an office tool and becomes part of the deliverable. The IQ Source brain I described yesterday is, read through the lens of the Fudan paper, an implementation of one of the seven files: long-term memory, the only layer the public Karpathy chorus was discussing all April, while the other six stayed off the radar.

Three years of prompt engineering optimized the wrong layer. The good news is that the paper open-sourced its code and the real levers are measurable. The bad news is that the executive committee that skips the audit this quarter wakes up to it the next, the way teams woke up late to the fact that the ORM layered over cheap queries was the moat.
