The Prompt Is Temporary. The Eval Is Permanent.

Top AI companies run 12.8 eval experiments daily. Most B2B companies run zero. Evals compound with every model change. Prompts start over.

Ricardo Argüello

CEO & Founder

Business Strategy · 12 min read

Aakash Gupta wrote something this week that I have been thinking about for years but never articulated this cleanly: “The prompt is temporary. The eval is permanent.”

His argument: every AI team is pouring resources into the layer that changes every few months — prompts, agent wiring, orchestration — and underinvesting in the layer that compounds forever. Whenever a new model drops, your prompts need rewriting. Framework updates force you to rewire the architecture. Even small shifts like a new context window or tool-calling format can wipe out work you spent months building at the model layer.

None of that resets your eval.

I’ve been building enterprise software for 25 years, and I’ve watched this play out before. Fine-tuning was supposed to be the moat — companies spent hundreds of thousands to compensate for model limitations that vanished the next year. RAG was next — elaborate retrieval systems built over months that 1M-token windows made largely unnecessary. Now the same thing is happening to prompt engineering and agent wiring. Every wave felt permanent while it lasted.

So the real question for your company: are you building the thing that depreciates, or the thing that compounds?

Your agent wiring is not your moat

Ankur Goyal, founder and CEO of Braintrust — the eval platform behind Vercel, Replit, Ramp, Zapier, Notion, and Airtable, valued at $800M — put it bluntly: “If you believe the way you’ve wired together your agent today is your differentiator, you are highly likely to fail. That wiring will change in a couple of months.”

We just saw this happen. Teams had to scrap their elaborate GPT-4 chain-of-thought prompts the moment they switched to Claude. Others spent months building RAG pipelines, only to watch massive context windows solve the problem out of the box. Even tool-calling logic gets rewritten every time a framework evolves.

That playbook your team perfected six months ago? It’s already wrong for the current model. And the next model will invalidate whatever you build to replace it.

Goyal’s clients — companies running AI at production scale — discovered something different. The companies that invested in understanding what their users actually need and encoded that as data, scores, and eval flows found that investment surviving every model swap. The dataset of real user inputs they built at the start of the year still tested the right things six months later. The scoring function they wrote for accuracy still measured accuracy regardless of whether they were running GPT-5 Nano or Claude Opus.

Unlike the wiring, the eval gains value every time you use it. Everything else around the model is up for replacement.

12.8 experiments a day (and why zero is the problem)

According to Goyal’s operational data across thousands of AI teams, the companies building products that actually work are running roughly 12.8 eval experiments per day.

Most B2B companies I talk to are running zero.

Zero as in they literally have no way of telling whether their AI outputs are correct beyond someone eyeballing them and saying “yeah, that looks right.” No shared standard for what a good output should contain. No historical record to compare against when they switch models next quarter.

That gap adds up fast. A company running one eval per day accumulates 90 data points in a quarter — 90 encoded business judgments about what a correct output looks like for their specific processes. A company running zero accumulates nothing. Nine months in, the first company has 270 test cases that tell them exactly where their AI works and where it breaks. The second company is still guessing.

This connects directly to Anthropic’s learning curves research. Their Economic Index found that experienced AI users iterate 28.2% of the time, while newcomers iterate far less. Evals are how you institutionalize that iteration at the company level — instead of relying on a handful of power users who intuitively know when to push back on the AI, you build an automated system that tests every output against a shared baseline.

Prompts depreciate. Evals compound.

The depreciation cycle

I watched a team spend three months building what they called their “prompt playbook” — carefully tested instructions for every use case, with edge case documentation that would make any QA engineer proud. Genuinely good work. Then the model updated and half of it stopped working the way they expected. They adjusted, spent another few weeks getting things dialed in. A couple months later, a totally new model generation shipped, and it turned out their chain-of-thought prompting was actually making things worse — the model could reason better without it. Back to square one. Shortly after that, the API format for tool calling changed and they had to rewire everything.

AI work doesn’t depreciate gradually. The value disappears through a frustrating cycle of resets that feel productive while they’re happening. Your team is always busy, always shipping changes. But if you look back, the work from Q1 is already gone by summer.

I wrote about this pattern in the context of AI investments with expiration dates. The 10x filter applies perfectly here: if the next model is ten times better, does your prompt library still make sense? Usually not. Does your eval suite? Always — because the eval measures the output, not the technique used to produce it.

The compounding cycle

An eval is a different kind of artifact. When a contract analyst at your company writes: “A correctly classified contract must include the counterparty name, the effective date, the total value, and the governing jurisdiction — and the jurisdiction must match the signing entity’s registered state,” that judgment does not expire when you switch models.

That standard works against GPT-4, Claude Opus, or whatever model drops next quarter. The eval tells you instantly: did the new model get it right, or did it miss the jurisdiction match? The underlying models and techniques keep evolving, but that baseline standard stays intact.
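A judgment like that is small enough to encode directly. Here’s a minimal sketch in Python of the analyst’s standard as a model-agnostic scoring function — the field names and the exact-match jurisdiction rule are illustrative, not a prescribed schema:

```python
# The analyst's standard, encoded as a scorer that inspects only the output.
# Field names below are hypothetical examples.
REQUIRED_FIELDS = {"counterparty", "effective_date", "total_value", "jurisdiction"}

def score_contract_output(output: dict, signing_entity_state: str) -> float:
    """Return 1.0 if the extraction meets the analyst's standard, else 0.0."""
    # Every required field must be present.
    if not REQUIRED_FIELDS.issubset(output):
        return 0.0
    # The jurisdiction must match the signing entity's registered state.
    if output["jurisdiction"] != signing_entity_state:
        return 0.0
    return 1.0
```

Because the scorer checks the output rather than the model or the prompt, it runs unchanged against whichever model produced the result.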

Over time, the eval suite turns into something much more valuable than a test harness. It becomes your company’s institutional record of what “good” means across every AI-assisted process. New hires don’t have to reverse-engineer what the model should be doing — they read the evals and get up to speed in days. Model migrations stop being six-month engineering projects because you already know what to test. And every example you add from production makes the whole suite sharper.

The baseline eval you wrote six months ago still holds up today. By now it has 200 more test cases behind it and your team’s confidence in the system is substantially higher. You don’t get that kind of return from rewriting prompts every quarter.

What this actually looks like in practice

I’ll be honest — when I say “eval system,” most B2B executives picture something expensive and complicated that requires a dedicated team. It doesn’t have to be. Here’s how I think about the progression, from where most companies are today to where the serious ones end up.

Level 0: “Looks good to me”

This is where most companies are, whether they admit it or not. Someone on the team reads the AI output, decides it looks reasonable, and moves on. There’s no record of why it was approved. Six months from now, nobody will remember what “good” meant for this particular process, and there’s certainly no way to test a new model against whatever informal standard existed in someone’s head.

For a hackathon, this is fine. For a process that affects revenue or compliance, it’s a liability waiting to surface.

Level 1: Golden datasets

Collect 50-100 examples of inputs and correct outputs from one real business process. Contract classifications, support ticket resolutions, lead scoring decisions — pick the process where AI is already in use or about to be.

This is your first real asset. Run your current AI setup against it and write down the scores. Next time a model updates or you change a prompt, run it again. Suddenly you have an actual answer to “did the upgrade help or hurt?” instead of relying on gut feel.

The surprising thing: building this usually takes a few days, not months. The data is already sitting in your operations — those are decisions your team has been making by hand. You’re just capturing them in a format that can be scored automatically.
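To make Level 1 concrete, here is a minimal harness — the classifier and the two golden examples are stand-ins, not a real model — that runs any callable against a golden dataset and produces a score you can write down:

```python
from typing import Callable

def run_eval(model: Callable[[str], str], golden: list[tuple[str, str]]) -> float:
    """Score = fraction of golden examples the model answers exactly right."""
    correct = sum(1 for inp, expected in golden if model(inp) == expected)
    return correct / len(golden)

# Illustrative golden dataset: (input, correct output) pairs captured from
# decisions your team already made by hand.
golden = [
    ("Mutual NDA with Acme Corp, New York law", "NDA"),
    ("12-month SaaS subscription order form", "SaaS"),
]

# Stand-in for your current AI setup; swap in a real model call here.
baseline_score = run_eval(lambda text: "NDA" if "NDA" in text else "SaaS", golden)
```

Record `baseline_score` today; after the next model update, run the same dataset again and you have a before/after number instead of gut feel.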

Level 2: Automated pipelines

Every prompt change, model swap, or context modification triggers a re-run against the golden dataset. Regression is caught before it reaches production. This is where 12.8 experiments a day becomes possible — because the experiments are automated, not manual.

At this level, you can also start A/B testing. Run the same inputs through two different configurations, score both, pick the winner. This is how prompt changes go from “let’s try this and see” to “this configuration scores 4% higher on our dataset.”
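A Level 2 A/B comparison can be as small as this sketch — the two configurations and the exact-match accuracy metric are illustrative assumptions, not a specific tool’s API:

```python
from typing import Callable

def ab_test(config_a: Callable[[str], str],
            config_b: Callable[[str], str],
            golden: list[tuple[str, str]]) -> tuple[str, float, float]:
    """Run both configurations over the same golden inputs; return the winner."""
    def accuracy(model: Callable[[str], str]) -> float:
        return sum(model(x) == y for x, y in golden) / len(golden)

    score_a, score_b = accuracy(config_a), accuracy(config_b)
    winner = "A" if score_a >= score_b else "B"
    return winner, score_a, score_b
```

Wire this into CI so every prompt change or model swap triggers a re-run, and “let’s try this and see” becomes a scored comparison.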

Level 3: Evals as organizational knowledge

Your eval suite is now the definitive record of what “good” looks like for each AI-assisted process. New team members study the evals to understand what the system is supposed to produce. Product managers reference the scores when deciding which processes to expand. Model migrations are procurement decisions, not engineering projects — because the acceptance criteria already exist.

This is what Goyal is really talking about when he says evals compound. At this stage, model changes stop being stressful. A new model is just another thing to score against your existing suite — if it does better, you adopt it. If it doesn’t, you wait. The decision takes days, and it’s based on data instead of opinions.

Without evals, sycophancy is invisible

There is a direct connection between eval maturity and your ability to detect AI sycophancy.

Stanford’s SycEval benchmark measured a 58% sycophancy rate across leading AI models. That means more than half the time, the model agrees with whatever the user believes — whether or not it’s correct. In ~15% of interactions, the model actively confirms a wrong answer.

I wrote about this problem in detail in the post on AI sycophancy and enterprise decisions. The short version: if your team uses AI for vendor evaluations, architecture reviews, or strategy recommendations, those outputs carry a built-in confirmation bias that nobody notices.

If you don’t have evals, catching this is nearly impossible. The AI says “your analysis looks correct,” everyone agrees, the decision moves forward. Nobody pushes back because the output sounds confident and matches what the team already believes.

With evals, there’s an objective baseline. The contract classification is either right or wrong — doesn’t matter how confident the model sounded. The vendor comparison either includes the required data points or it doesn’t. That sycophantic “yes” runs into a hard, measurable “no.” It’s the closest thing to an antidote for a yes-machine.

What we build at IQ Source

I watched this exact failure mode play out during the ERP boom of the 2000s, and again during the rush to the cloud a decade later. We’re making the same mistake with AI today: spending big on technology without defining what success looks like before the project starts.

And it always ends the same way. The team demos an AI pilot, the board loves it, the project gets funded. Fast forward six months and nobody can actually prove the AI does a better job than what the team was doing manually. The AI probably is better — but nobody bothered to write down what “better” means in terms you can measure before the pilot started. So now you’re stuck arguing about feelings instead of data.

That’s an eval problem. People just didn’t call it that back then.

Gartner estimated that 30% of generative AI projects would be abandoned after proof of concept. That number makes complete sense once you see it through the eval lens: no success criteria means no way to prove the pilot works means no executive confidence to scale.

At IQ Source, the first thing we ask every new client is: “What does a correct output look like for this process, and how would you measure it?” We ask this before talking about tools, before discussing vendors, before touching architecture. Most teams can’t answer it on the spot, and that’s actually the point — the conversation forces them to get specific about something they’d been leaving vague.

From there, we work in three phases. We start by defining what “good” means as a golden dataset with scored examples — not as a slide deck aspiration. Then we pick tools and configurations based on which ones score highest against that dataset. After that, we keep measuring, adding real production examples to the eval suite so it gets sharper every month.

When a new model drops, the companies that built this way re-run their evals, compare scores, and make a data-backed decision in a few days. The rest go back to “let’s try this and see how it feels” — which is where they started a year ago.


If your company uses AI for any business process and doesn’t have a way to score the output, you’re running on “looks good to me.” That worked when AI was a side project. It doesn’t work when it touches revenue.

Send us one AI-assisted process — what goes in, what comes out, and how your team currently decides if the output is acceptable. We score your eval readiness on the Level 0-3 scale described above and show you what the first 10 test cases should look like. No sales pitch, just a one-page readiness report.

Get my eval readiness score
