The agentic moat is not the model. It's seven files.
Ricardo Argüello — May 1, 2026
CEO & Founder
General summary
On April 29, Elvis Saravia posted about a paper from Jiahang Lin and colleagues at Fudan, Peking, and Shanghai Qiji Zhifeng on Agentic Harness Engineering. Every thread quoted the headline number: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations with the base model held fixed. Almost no thread quoted the buried number. When the researchers added evolved harness components to the baseline one at a time, the system prompt was the only one that regressed (-2.3pp). Memory, tools, and middleware all rose. Three years of industry prompt engineering optimized the lowest-impact component. The real lever was always in the other six files.
- arXiv 2604.25850v3 (Apr 30, Chinese authors at Fudan + Peking + Qiji Zhifeng) shows that with the base model held constant, an evolved harness beats Codex-CLI's human-designed harness by 5.1 points on Terminal-Bench 2
- Table 3 isolates each component's effect: + memory only +5.6pp, + tool only +3.3pp, + middleware only +2.2pp, + system_prompt only -2.3pp. The prompt was the only component that regressed when added in isolation
- The evolved harness transfers across model families with no further training: +5.1 to +10.1 points across deepseek-v4-flash, qwen-3.6-plus, gemini-3.1-flash-lite. That is the empirical moat that code-is-not-cheap pointed at in the abstract
- The loop carries an asterisk: fix-prediction precision 33.7% (5x random baseline), regression-prediction precision only 11.8% (~2x baseline). The loop is decent at predicting what it fixes, but blind to what it breaks
- I have been in this 36 years and watched the same pattern four times already. Assembly to compiler, hand-tuned SQL to ORM/index, server tuning to orchestration, custom UI primitives to framework integration. The fifth cycle is here: model tuning to harness ownership
Picture sharpening the most expensive knife in the kitchen for three years and then watching a cold measurement reveal that the knife is the only tool that was not doing the work. The pots, the oven, the mise en place, the team's shared vocabulary: all of those moved the dish. The knife in isolation actually subtracted. That is what Table 3 of the Fudan + Peking paper says about the system prompt when it is separated from the rest of the harness. Three years of prompt engineering threads sharpened the knife. The cooking happened in the other six tools.
AI-generated summary
On April 29, Elvis Saravia posted a thread, now at 128K views, about “Agentic Harness Engineering”, a paper from Fudan + Peking + Shanghai Qiji Zhifeng uploaded to arXiv the day before. Every thread that picked it up quoted the climbing number: pass@1 on Terminal-Bench 2 rises from 69.7% to 77.0% in ten loop iterations, with the base model held fixed. The evolved harness beats the human-designed Codex-CLI harness (71.9%).
There is another number. The one no thread quoted. It sits two pages later, in Table 3.
The seven files of the harness
The paper formalizes an idea that had been floating as folklore for two years. A coding agent is not just the model. It is the model plus seven editable components that live as files in a workspace: system prompt, tool descriptions, tool implementations, middleware, skills, sub-agent configuration, and long-term memory.
The contribution is not naming the seven. The contribution is treating each one as a versioned file with line-level diff, instant rollback, and a change manifest that predicts which tasks the edit should fix and which it puts at risk of regression. Each edit becomes a falsifiable contract: the next round of evaluation either confirms or reverts it.
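The paper's released code defines the real schema; as a sketch of the shape only, with illustrative field names that are not lifted from the authors' code, one edit record could be as small as this:

```python
from dataclasses import dataclass

# Illustrative schema, not the paper's released code: one harness edit
# recorded as a falsifiable contract.
@dataclass
class ChangeManifest:
    component: str               # one of the seven files, e.g. "middleware"
    diff: str                    # line-level unified diff of the edit
    predicted_fixes: list[str]   # task IDs the edit should repair
    regression_risks: list[str]  # task IDs the edit puts at risk
    rationale: str = ""          # why the agent expects the prediction to hold
```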
The Table 3 nobody quoted
Table 3 does the boring but crucial thing: it isolates the effect of each component. It takes the baseline harness (NexAU₀, 69.7% pass@1) and adds one evolved component at a time:
- + memory only: +5.6 points
- + tool only: +3.3 points
- + middleware only: +2.2 points
- + system_prompt only: -2.3 points
The system prompt was the only component that regressed when isolated from the others. And the second surprise sits in the same table: on Hard tasks, memory alone beats the full evolved harness.
The authors give the cause in one line: “the system prompt encodes 79 lines of universal discipline whose executability depends on the other three.” Discipline without machinery is noise. The agent reads “verify before publishing” but has no middleware that enforces the verification, so the result is more turns spent re-checking work that should have been guarded automatically.
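To make the cause concrete, here is a minimal sketch of the missing machinery, with hypothetical tool names, showing what it looks like when "verify before publishing" is a gate in the tool-call path rather than a sentence the model may skim past:

```python
# Hypothetical tool names ("run_tests", "submit_patch"); the point is the
# shape: enforcement lives in middleware, not in the prompt's wording.
class VerifyBeforeSubmit:
    def __init__(self, tool_executor):
        self.execute = tool_executor   # underlying tool dispatcher
        self.verified = False

    def __call__(self, tool_name: str, **kwargs) -> dict:
        if tool_name == "run_tests":
            result = self.execute(tool_name, **kwargs)
            self.verified = result.get("exit_code") == 0
            return result
        if tool_name == "submit_patch" and not self.verified:
            # Refuse and explain, instead of hoping the model re-reads
            # a discipline line buried in the system prompt.
            return {"error": "submit blocked: run_tests has not passed"}
        return self.execute(tool_name, **kwargs)
```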
That sentence reads like a delayed description of what Pawel Huryn had already published on X on April 25. Huryn cut his Claude Code monthly bill from $750 to $100 with no model swap, by tuning four levers: cache hit rate, context budget, model routing, and input format. All four map onto middleware, tools, and tool descriptions. None map onto the prompt. What Huryn did intuitively, the paper formalizes with numbers.
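A rough sketch of where those four levers live in code; every name and threshold below is made up for illustration, none of it is from Huryn's thread:

```python
# All four levers sit in request-building code, none in the prompt's wording.
STATIC_PREAMBLE = "You are a coding agent. ..."  # lever 1: keep this byte-identical
                                                 # across calls so prefix caching hits

def build_request(task_text: str, history: list[str]) -> dict:
    context = history[-20:]                      # lever 2: hard context budget,
                                                 # not the whole transcript
    model = ("cheap-model" if len(task_text) < 2_000
             else "frontier-model")              # lever 3: route by task size
                                                 # (threshold is invented)
    prompt = "\n".join([STATIC_PREAMBLE, *context, task_text])  # lever 4: compact
                                                 # plain text over verbose payloads
    return {"model": model, "prompt": prompt}
```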
Cross-model transfer is the empirical moat
Section 4.3 is the one that closes the loop with code-is-not-cheap and runtime-commodity. The authors take the harness evolved on GPT-5.4 high and, with no further training, evaluate it on five different base models. All five runs deliver positive gains between +2.3 and +10.1 points, the largest on deepseek-v4-flash and qwen-3.6-plus.
Across model families. The seven harness files do not encode Claude-specific or GPT-specific tricks. They encode general patterns of how coding-agent work gets done. When Anthropic ships Claude Sonnet 4.7 next quarter and a team decides to swap, a well-designed harness survives the change. A system prompt carefully tuned to the previous model does not.
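The protocol fits in a few lines. A sketch, assuming an evaluate(model, harness) runner that returns pass@1; the paper's released code supplies the real one, and the names here are placeholders:

```python
# evaluate(model, harness) -> pass@1 is injected; the harness paths are
# placeholders for the frozen evolved files and the NexAU0 baseline.
def transfer_gains(evaluate, evolved, baseline, models: list[str]) -> dict[str, float]:
    """Same frozen harness files, different base models, no retraining."""
    return {m: evaluate(m, evolved) - evaluate(m, baseline) for m in models}

# e.g. transfer_gains(run_bench, "harness_v10/", "nexau0/",
#                     ["deepseek-v4-flash", "qwen-3.6-plus", "gemini-3.1-flash-lite"])
```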
The asterisk: regression blindness
The paper is honest about what the loop does not do well. Section 4.4.2: the agent’s precision predicting which tasks an edit will fix is 33.7% (5x random baseline). Its precision predicting which tasks it will break is 11.8% (~2x random baseline). The loop is reasonably reliable at naming what it repairs. It is blind at naming what it breaks.
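For anyone who wants those two percentages grounded, the computation Section 4.4.2 implies is just set intersection over the change manifests, shown here on toy data rather than the paper's:

```python
def precision(predicted: set[str], actual: set[str]) -> float:
    """Share of predicted task IDs that actually changed status."""
    return len(predicted & actual) / len(predicted) if predicted else 0.0

# Toy numbers, not the paper's data: the manifest predicted three fixes, one landed.
predicted_fixes = {"task-04", "task-11", "task-23"}
actually_fixed = {"task-11", "task-30"}
print(precision(predicted_fixes, actually_fixed))  # 0.33...; the paper reports 0.337 for fixes
```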
On April 30, Rohan Paul posted a summary of Microsoft’s DELEGATE-52 paper, which reports the same symptom: even frontier models corrupt around 25% of document content when delegated long edit chains, because they cannot self-attribute regressions. That is why Howie Liu runs 30 Claude Code instances in parallel on HyperAgent with cross-instance PR review. Cross-PR review is not aesthetic. It is the only strategy that catches the regressions the autonomous loop cannot name.
The pattern I have been watching since 1990
I have been in this 36 years. I started in 1990, at 15, on a Commodore 64. I have seen the same cycle four times: assembly to compiler, hand-rolled SQL to ORM and indexes, manual server tuning to declarative orchestration, custom UI primitives to framework integration. Each time, the layer that became cheap stopped being the problem worth solving; the real problem moved to the layer growing on top. The teams that overinvested in the cheap layer lost the cycle. The teams that learned to own the layer above won.
Fifth cycle. The system prompt is the layer the industry is sharpening. The seven harness files are the layer growing above. A team investing this week in prompt templates is repeating the mistake the DBAs were making in 2003 when they kept memorizing Oracle hints.
Five questions before signing the next AI productivity check
If the executive committee is about to approve the next AI productivity invoice, a short test is worth running before the signature.
Start with inventory. Of the seven components, which actually exist in your repo as versioned code with file-level diff and rollback? Any piece that “lives in a Slack channel” is under folklore, not under engineering, and that distinction will matter the next time something regresses.
Then push on falsifiability. Each change should carry a written prediction of what it ought to fix and what it puts at risk, verified against the next run. Without that loop, what teams call prompt engineering is folklore with git on top.
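Assuming each change ships with a manifest like the one sketched earlier in this piece, the verification gate is short; illustrative names again:

```python
# Compare an edit's written predictions against the next evaluation run,
# given the sets of task IDs that passed before and after the edit.
def review_edit(manifest, passed_before: set[str], passed_after: set[str]) -> str:
    broke = passed_before - passed_after
    if broke:
        return f"revert: broke {sorted(broke)}"
    confirmed = set(manifest.predicted_fixes) & (passed_after - passed_before)
    if not confirmed:
        return "revert: fixed nothing it promised"
    return f"keep: confirmed {sorted(confirmed)}"
```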
Transferability is the third question, and the easiest to defer. Have you run the same wrapper on a base model other than the one it was tuned against? Until you do, you cannot tell whether your edits encode general agent experience or a coupling to a single model whose successor ships next quarter.
The fourth is the one most teams skip in production. When a change breaks tasks that previously worked, what catches it before merge? Howie Liu’s HyperAgent setup answers with cross-instance PR review, not because it is elegant but because the autonomous loop’s regression-prediction precision is 11.8%, and no team wants to merge at those odds.
And the closing question is about a name. Who owns these seven files, with a real name attached? “The whole team” and “the AI lead” both translate operationally to nobody, and ownerless wrappers degrade at the cadence of the next model release.
If your team cannot answer the five with clarity, the next useful conversation is two hours long. We map one of your real setups, mark which files are under engineering and which are under folklore, and write down what needs intervention. No sales quote attached. The address is the usual one: info@iqsource.ai.
What we do at IQ Source about this
AI Maestro exists so the seven-file audit happens before the wrapper becomes a load-bearing part of the business. Most executive committees discover during the exercise that four or five of the seven components do not exist as engineering artifacts in their company.
Tech Partner, the other line, applies to software companies whose product lives in the critical zone from day one. For that kind of company, the wrapper stops being an office tool and becomes part of the deliverable. Read through the lens of the Fudan paper, the IQ Source brain I described yesterday is an implementation of one of the seven files: long-term memory, the only layer the public Karpathy chorus discussed all April while the other six stayed off the radar.
Three years of prompt engineering optimized the wrong layer. The good news is that the paper open-sourced the code and the real levers are measurable. The bad news is that the executive committee that does not run the audit this quarter wakes up in the next one, the way teams woke up late to the fact that the ORM layer, not the cheap queries underneath it, was the moat.
Frequently Asked Questions
What is the agentic harness?
In the Agentic Harness Engineering paper from Fudan, Peking, and Shanghai Qiji Zhifeng, published on April 30, 2026, the harness is the seven editable components that surround the base model: system prompt, tool descriptions, tool implementations, middleware, skills, sub-agent configuration, and long-term memory. The paper shows on Terminal-Bench 2 that with the model held fixed, evolving the harness lifts pass@1 from 69.7% to 77.0% in ten iterations and beats Codex-CLI's human-designed harness.
Why did the system prompt regress in Table 3?
Table 3 of the paper shows that adding only the evolved system prompt to the baseline harness produces -2.3 points on pass@1, while adding only memory adds +5.6, only tools adds +3.3, and only middleware adds +2.2. The authors explain that the system prompt encodes 79 lines of universal discipline whose executability depends on the other three components; inserted alone, the discipline lacks the machinery that operationalizes it and the agent burns turns re-checking work that should have been guarded by middleware.
Does the evolved harness transfer to other models?
The paper re-evaluates the AHE-evolved harness, with no further training, on five base models: GPT-5.4 medium and xhigh, deepseek-v4-flash, qwen-3.6-plus, and gemini-3.1-flash-lite. All five runs deliver positive gains between +2.3 and +10.1 points. That means the seven harness files encode general coding-agent experience rather than model-specific tricks, so they survive the next model generation. That portability is the defensible moat.
Why does the loop still need human review?
Section 4.4.2 measures how reliably the agent predicts which tasks each edit will fix and which it will break. Fix precision: 33.7% (five times random baseline). Regression precision: only 11.8% (about twice random baseline). The loop is reasonably reliable at naming what it repairs and blind at naming what it breaks. That is why Howie Liu runs 30 Claude Code instances in parallel on HyperAgent with cross-instance PR review: human review catches the regressions the autonomous loop cannot name.
Related Articles
Nine seconds: the agent confessed, but the failure wasn't its own
Cursor + Claude Opus 4.6 wiped PocketOS production data in 9 seconds. The AI confessed. But the real failure was three architectural sins, not the model.
Code is not cheap: AI productivity is a codebase property
Anthropic writes 100% of its code with AI and Google reacted. Pocock and Huryn explain why: AI productivity is a codebase property, not a model property.