The agentic moat is not the model. It's seven files.
Ricardo Argüello — May 1, 2026
CEO & Founder
General summary
On April 29, Elvis Saravia posted about a paper from Jiahang Lin and colleagues at Fudan, Peking, and Shanghai Qiji Zhifeng on Agentic Harness Engineering. Every thread quoted the headline number: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations with the base model held fixed. Almost no thread quoted the buried number. When the researchers added evolved harness components to the baseline one at a time, the system prompt was the only one that regressed (-2.3pp). Memory, tools, and middleware all rose. Three years of industry prompt engineering optimized the lowest-impact component. The real lever was always in the other six files.
- arXiv 2604.25850v3 (Apr 30, Chinese authors at Fudan + Peking + Qiji Zhifeng) shows that with the base model held constant, an evolved harness beats Codex-CLI's human-designed harness by 5.1 points on Terminal-Bench 2
- Table 3 isolates each component's effect: + memory only +5.6pp, + tool only +3.3pp, + middleware only +2.2pp, + system_prompt only -2.3pp. The prompt was the only component that regressed when added in isolation
- The evolved harness transfers across model families with no further training: +5.1 to +10.1 points across deepseek-v4-flash, qwen-3.6-plus, gemini-3.1-flash-lite. That is the empirical moat that code-is-not-cheap pointed at in the abstract
- The loop carries an asterisk: fix-prediction precision 33.7% (5x random baseline), regression-prediction precision only 11.8% (~2x baseline). The loop is decent at predicting what it fixes, but blind to what it breaks
- I have been in this 36 years and watched the same pattern four times already. Assembly to compiler, hand-tuned SQL to ORM/index, server tuning to orchestration, custom UI primitives to framework integration. The fifth cycle is here: model tuning to harness ownership
Picture sharpening the most expensive knife in the kitchen for three years and then watching a cold measurement reveal that the knife is the only tool that was not doing the work. The pots, the oven, the mise en place, the team's shared vocabulary: all of those moved the dish. The knife in isolation actually subtracted. That is what Table 3 of the Fudan + Peking paper says about the system prompt when it is separated from the rest of the harness. Three years of prompt engineering threads sharpened the knife. The cooking happened in the other six tools.
AI-generated summary
On April 29, Elvis Saravia posted a thread, now at 128K views, about “Agentic Harness Engineering”, a paper from Fudan + Peking + Shanghai Qiji Zhifeng uploaded to arXiv the day before. Every thread that picked it up quoted the climbing number: pass@1 on Terminal-Bench 2 rises from 69.7% to 77.0% in ten loop iterations, with the base model held fixed. The evolved harness beats the human-designed Codex-CLI harness (71.9%).
There is another number. The one no thread quoted. It sits two pages later, in Table 3.
The seven files of the harness
The paper formalizes an idea that had been floating as folklore for two years. A coding agent is not just the model. It is the model plus seven editable components that live as files in a workspace: system prompt, tool descriptions, tool implementations, middleware, skills, sub-agent configuration, and long-term memory.
The contribution is not naming the seven. The contribution is treating each one as a versioned file with line-level diff, instant rollback, and a change manifest that predicts which tasks the edit should fix and which it puts at risk of regression. Each edit becomes a falsifiable contract: the next round of evaluation either confirms or reverts it.
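The paper's released code defines the real schema; as a sketch of the shape only, with illustrative field names that are not lifted from the authors' code, one edit record could be as small as this:

```python
from dataclasses import dataclass

# Illustrative schema, not the paper's released code: one harness edit
# recorded as a falsifiable contract.
@dataclass
class ChangeManifest:
    component: str               # one of the seven files, e.g. "middleware"
    diff: str                    # line-level unified diff of the edit
    predicted_fixes: list[str]   # task IDs the edit should repair
    regression_risks: list[str]  # task IDs the edit puts at risk
    rationale: str = ""          # why the agent expects the prediction to hold
```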
The Table 3 nobody quoted
Table 3 does the boring but crucial thing: it isolates the effect of each component. It takes the baseline harness (NexAU₀, 69.7% pass@1) and adds one evolved component at a time:
- + memory only: +5.6 points
- + tool only: +3.3 points
- + middleware only: +2.2 points
- + system_prompt only: -2.3 points
The system prompt was the only component that regressed when isolated from the others. And the second surprise sits in the same table: on Hard tasks, memory alone beats the full evolved harness.
The authors give the cause in one line: “the system prompt encodes 79 lines of universal discipline whose executability depends on the other three.” Discipline without machinery is noise. The agent reads “verify before publishing” but has no middleware that enforces the verification, so the result is more turns spent re-checking work that should have been guarded automatically.
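To make the cause concrete, here is a minimal sketch of the missing machinery, with hypothetical tool names, showing what it looks like when "verify before publishing" is a gate in the tool-call path rather than a sentence the model may skim past:

```python
# Hypothetical tool names ("run_tests", "submit_patch"); the point is the
# shape: enforcement lives in middleware, not in the prompt's wording.
class VerifyBeforeSubmit:
    def __init__(self, tool_executor):
        self.execute = tool_executor   # underlying tool dispatcher
        self.verified = False

    def __call__(self, tool_name: str, **kwargs) -> dict:
        if tool_name == "run_tests":
            result = self.execute(tool_name, **kwargs)
            self.verified = result.get("exit_code") == 0
            return result
        if tool_name == "submit_patch" and not self.verified:
            # Refuse and explain, instead of hoping the model re-reads
            # a discipline line buried in the system prompt.
            return {"error": "submit blocked: run_tests has not passed"}
        return self.execute(tool_name, **kwargs)
```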
That sentence reads like a delayed description of what Pawel Huryn had already published on X on April 25. Huryn cut his Claude Code monthly bill from $750 to $100 with no model swap, by tuning four levers: cache hit rate, context budget, model routing, and input format. All four map onto middleware, tools, and tool descriptions. None map onto the prompt. What Huryn did intuitively, the paper formalizes with numbers.
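A rough sketch of where those four levers live in code; every name and threshold below is made up for illustration, none of it is from Huryn's thread:

```python
# All four levers sit in request-building code, none in the prompt's wording.
STATIC_PREAMBLE = "You are a coding agent. ..."  # lever 1: keep this byte-identical
                                                 # across calls so prefix caching hits

def build_request(task_text: str, history: list[str]) -> dict:
    context = history[-20:]                      # lever 2: hard context budget,
                                                 # not the whole transcript
    model = ("cheap-model" if len(task_text) < 2_000
             else "frontier-model")              # lever 3: route by task size
                                                 # (threshold is invented)
    prompt = "\n".join([STATIC_PREAMBLE, *context, task_text])  # lever 4: compact
                                                 # plain text over verbose payloads
    return {"model": model, "prompt": prompt}
```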
Cross-model transfer is the empirical moat
Section 4.3 is the one that closes the loop with code-is-not-cheap and runtime-commodity. The authors take the harness evolved on GPT-5.4 high and, with no further training, evaluate it on five different base models. All five runs deliver positive gains between +2.3 and +10.1 points, the largest on deepseek-v4-flash and qwen-3.6-plus.
Across model families. The seven harness files do not encode Claude-specific or GPT-specific tricks. They encode general patterns of how coding-agent work gets done. When Anthropic ships Claude Sonnet 4.7 next quarter and a team decides to swap, a well-designed harness survives the change. A system prompt carefully tuned to the previous model does not.
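The protocol fits in a few lines. A sketch, assuming an evaluate(model, harness) runner that returns pass@1; the paper's released code supplies the real one, and the names here are placeholders:

```python
# evaluate(model, harness) -> pass@1 is injected; the harness paths are
# placeholders for the frozen evolved files and the NexAU0 baseline.
def transfer_gains(evaluate, evolved, baseline, models: list[str]) -> dict[str, float]:
    """Same frozen harness files, different base models, no retraining."""
    return {m: evaluate(m, evolved) - evaluate(m, baseline) for m in models}

# e.g. transfer_gains(run_bench, "harness_v10/", "nexau0/",
#                     ["deepseek-v4-flash", "qwen-3.6-plus", "gemini-3.1-flash-lite"])
```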
The asterisk: regression blindness
The paper is honest about what the loop does not do well. Section 4.4.2: the agent’s precision predicting which tasks an edit will fix is 33.7% (5x random baseline). Its precision predicting which tasks it will break is 11.8% (~2x random baseline). The loop is reasonably reliable at naming what it repairs. It is blind at naming what it breaks.
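For anyone who wants those two percentages grounded, the computation Section 4.4.2 implies is just set intersection over the change manifests, shown here on toy data rather than the paper's:

```python
def precision(predicted: set[str], actual: set[str]) -> float:
    """Share of predicted task IDs that actually changed status."""
    return len(predicted & actual) / len(predicted) if predicted else 0.0

# Toy numbers, not the paper's data: the manifest predicted three fixes, one landed.
predicted_fixes = {"task-04", "task-11", "task-23"}
actually_fixed = {"task-11", "task-30"}
print(precision(predicted_fixes, actually_fixed))  # 0.33...; the paper reports 0.337 for fixes
```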
On April 30, Rohan Paul posted a summary of Microsoft’s DELEGATE-52 paper, which reports the same symptom: even frontier models corrupt around 25% of document content when delegated long edit chains, because they cannot self-attribute regressions. That is why Howie Liu runs 30 Claude Code instances in parallel on HyperAgent with cross-instance PR review. Cross-PR review is not aesthetic. It is the only strategy that catches the regressions the autonomous loop cannot name.
The pattern I have been watching since 1990
I have been in this 36 years. I started in 1990, at 15, on a Commodore 64. I have seen the same cycle four times: assembly to compiler, hand-rolled SQL to ORM and indexes, manual server tuning to declarative orchestration, custom UI primitives to framework integration. Each time, the layer that became cheap stopped being the problem worth solving; the real problem moved to the layer growing on top. The teams that overinvested in the cheap layer lost the cycle. The teams that learned to own the layer above won.
Fifth cycle. The system prompt is the layer the industry is sharpening. The seven harness files are the layer growing above. A team investing this week in prompt templates is repeating the mistake the DBAs were making in 2003 when they kept memorizing Oracle hints.
Five questions before signing the next AI productivity check
If the executive committee is about to approve the next AI productivity invoice, a short test is worth running before the signature.
Start with inventory. Of the seven components, which actually exist in your repo as versioned code with file-level diff and rollback? Any piece that “lives in a Slack channel” is under folklore, not under engineering, and that distinction will matter the next time something regresses.
Then push on falsifiability. Each change should carry a written prediction of what it ought to fix and what it puts at risk, verified against the next run. Without that loop, what teams call prompt engineering is folklore with git on top.
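Assuming each change ships with a manifest like the one sketched earlier in this piece, the verification gate is short; illustrative names again:

```python
# Compare an edit's written predictions against the next evaluation run,
# given the sets of task IDs that passed before and after the edit.
def review_edit(manifest, passed_before: set[str], passed_after: set[str]) -> str:
    broke = passed_before - passed_after
    if broke:
        return f"revert: broke {sorted(broke)}"
    confirmed = set(manifest.predicted_fixes) & (passed_after - passed_before)
    if not confirmed:
        return "revert: fixed nothing it promised"
    return f"keep: confirmed {sorted(confirmed)}"
```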
Transferability is the third question, and the easiest to defer. Have you run the same wrapper on a base model other than the one it was tuned against? Until you do, you cannot tell whether your edits encode general agent experience or a coupling to a single model whose successor ships next quarter.
The fourth is the one most teams skip in production. When a change breaks tasks that previously worked, what catches it before merge? Howie Liu’s HyperAgent setup answers with cross-instance PR review, not because it is elegant but because the autonomous loop’s regression-prediction precision is 11.8%, and no team wants to merge at those odds.
And the closing question is about a name. Who owns these seven files, with a real name attached? “The whole team” and “the AI lead” both translate operationally to nobody, and ownerless wrappers degrade at the cadence of the next model release.
If your team cannot answer the five with clarity, the next useful conversation is two hours long. We map one of your real setups, mark which files are under engineering and which are under folklore, and write down what needs intervention. No sales quote attached. The address is the usual one: info@iqsource.ai.
What we do at IQ Source about this
AI Maestro exists so the seven-file audit happens before the wrapper becomes a load-bearing part of the business. Most executive committees discover during the exercise that four or five of the seven components do not exist as engineering artifacts in their company.
Tech Partner, the other line, applies to software companies whose product lives in the critical zone from day one. For that kind of company, the wrapper stops being an office tool and becomes part of the deliverable. Read through the lens of the Fudan paper, the IQ Source brain I described yesterday is an implementation of one of the seven files: long-term memory, the only layer the public Karpathy chorus discussed all April while the other six stayed off the radar.
Three years of prompt engineering optimized the wrong layer. The good news is that the paper open-sourced the code and the real levers are measurable. The bad news is that the executive committee that does not run the audit this quarter wakes up in the next one, the way teams woke up late to the fact that the ORM layer, not the cheap queries underneath it, was the moat.
Frequently Asked Questions
What is the agentic harness?
In the Agentic Harness Engineering paper from Fudan, Peking, and Shanghai Qiji Zhifeng, published on April 30, 2026, the harness is the seven editable components that surround the base model: system prompt, tool descriptions, tool implementations, middleware, skills, sub-agent configuration, and long-term memory. The paper shows on Terminal-Bench 2 that with the model held fixed, evolving the harness lifts pass@1 from 69.7% to 77.0% in ten iterations and beats Codex-CLI's human-designed harness.
Why did the system prompt regress in Table 3?
Table 3 of the paper shows that adding only the evolved system prompt to the baseline harness produces -2.3 points on pass@1, while adding only memory adds +5.6, only tools adds +3.3, and only middleware adds +2.2. The authors explain that the system prompt encodes 79 lines of universal discipline whose executability depends on the other three components; inserted alone, the discipline lacks the machinery that operationalizes it and the agent burns turns re-checking work that should have been guarded by middleware.
Does the evolved harness transfer to other models?
The paper re-evaluates the AHE-evolved harness, with no further training, on five base models: GPT-5.4 medium and xhigh, deepseek-v4-flash, qwen-3.6-plus, and gemini-3.1-flash-lite. All five runs deliver positive gains between +2.3 and +10.1 points. That means the seven harness files encode general coding-agent experience rather than model-specific tricks, so they survive the next model generation. That portability is the defensible moat.
Why does the loop still need human review?
Section 4.4.2 measures how reliably the agent predicts which tasks each edit will fix and which it will break. Fix precision: 33.7% (five times random baseline). Regression precision: only 11.8% (about twice random baseline). The loop is reasonably reliable at naming what it repairs and blind at naming what it breaks. That is why Howie Liu runs 30 Claude Code instances in parallel on HyperAgent with cross-instance PR review: human review catches the regressions the autonomous loop cannot name.
Related Articles
Nine seconds: the agent confessed, but the failure wasn't its own
Cursor + Claude Opus 4.6 wiped PocketOS production data in 9 seconds. The AI confessed. But the real failure was three architectural sins, not the model.
Code is not cheap: AI productivity is a codebase property
Anthropic writes 100% of its code with AI and Google reacted. Pocock and Huryn explain why: AI productivity is a codebase property, not a model property.