
The Untrainable Ground: Owning the AI Benchmark

Anything on a public leaderboard gets trained against. The only AI advantage that doesn't expire is the definition of good that lives exclusively inside your company.


Ricardo Argüello


CEO & Founder

AI & Automation 7 min read

Last week, Ramez Naam quoted a line from Sarah Guo’s “The Untrainable” that I can’t stop thinking about: “anything you can put on a leaderboard, you can train against.”

This is the most precise framing I’ve seen for the central strategy question in enterprise AI right now.

Models keep improving. MMLU is saturated. HumanEval is saturated. The reasoning benchmarks that seemed untouchable two years ago have been beaten, and the ones that seem hard today are likely to fall soon. Every public measure of model capability has a target painted on it the moment it’s published.

The only ground that doesn’t move is the definition of “good” that lives inside your company, built from your actual work, and that nobody outside has ever written down. That definition is untrainable. And most mid-market companies don’t have it as a document yet.

The companies that already wrote their benchmark

Harvey built the legal benchmark. Not as a research project. They built it because they’ve been inside M&A legal work long enough to know exactly what “correct” means in that context: which clauses are critical versus negotiable depending on transaction type, what kind of error in a due diligence document is recoverable versus deal-threatening, what review standard applies to which document type.

That’s not knowledge you acquire by reading case studies. It accumulates from doing the work.

OpenEvidence did the same in clinical reasoning. Years inside actual medical decision-making, building the evaluation criteria that no generalist model can have because it wasn’t there when those decisions were made. A model that scores well on their benchmark doesn’t do so because it was trained on more generic data. It does so because the benchmark criteria were designed by people who understand what an acceptable answer looks like in that specific clinical context.

This week Aaron Levie published Box’s AI evaluation for M&A due diligence. It isn’t a PR exercise. It’s proof that Box knows what “good” means in document review for mergers and acquisitions, and that they can evaluate any model against that standard. A standard that cost years of real domain work to build, and that no competitor can replicate by shipping a better base model.

In every one of these cases, the advantage is not in which model gets used. It’s in knowing what to measure, and having done the years of internal work required to write that down.

Why that benchmark can’t be replicated through training

Guo’s argument is precise: every public benchmark eventually becomes training data. Labs actively look for what models can’t do well and add it to the next training run. If your competitive advantage in AI depends on a public benchmark, that benchmark has an expiration date.

The trap most mid-market companies fall into is believing that evaluating with generic benchmarks tells them something useful about whether AI will work for their operation. It doesn’t.

The logistics company evaluating models on response speed and grammatical coherence isn’t measuring whether the model can reason about its specific distribution constraints: the service-level agreements on particular routes, the rules that determine when a carrier substitution is acceptable, the exceptions the operations team applies when there are border delays or weather events. None of that is in any public benchmark.

The financial services company using a standard evaluation set to select a legal assistant isn’t measuring whether the model understands the exception handling its team applies for strategic reasons. A standard evaluation set doesn’t capture which client gets a different contract structure, which clause is negotiable for certain sectors, or what risk level is acceptable depending on transaction size.

The benchmark that matters for that company isn’t in any paper or model comparison site. It has to be built by the team that knows the operation, from the work they’ve already done.
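As a rough illustration of what "built from the work they’ve already done" can look like in practice, a benchmark case can be as small as a recorded situation plus the decision the team actually made and why. This is only a sketch: the `BenchmarkCase` structure, the case IDs, and the routes and delays below are hypothetical and not drawn from any real company.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """One past situation and the decision the operations team made."""
    case_id: str
    situation: str          # what the team saw at the time
    expected_decision: str  # what the team actually decided
    rationale: str          # why, in the team's own words


# Hypothetical cases for illustration only.
CASES = [
    BenchmarkCase(
        case_id="route-017",
        situation="Primary carrier delayed 36h at the border; 20h remain on the route SLA.",
        expected_decision="substitute_carrier",
        rationale="Waiting would breach the SLA; substitution is allowed on this route.",
    ),
    BenchmarkCase(
        case_id="route-042",
        situation="Primary carrier delayed 6h by weather; 40h remain on the route SLA.",
        expected_decision="hold",
        rationale="The delay fits inside the SLA; substitution cost is not justified.",
    ),
]


def decisions_covered(cases):
    """Return the set of distinct decisions the benchmark exercises."""
    return {case.expected_decision for case in cases}
```

Checking `decisions_covered(CASES)` is a quick way to see whether the written cases exercise the exceptions as well as the standard path.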

I wrote about AI evals as a compounding asset a few months ago. The point applies here: the company that has been running evaluations against its own criteria for three years has a calibration advantage no budget can replicate overnight.

What most mid-market companies still haven’t written down

The knowledge exists. In almost every mid-market company, there are people who know exactly what “good” means for the core processes of the operation. The sales director with fifteen years in the company knows when a price objection is real and when it’s a negotiating tactic. The operations manager knows when to apply the exception and when to hold the standard process. The customer success team knows which signals mean a client is about to escalate and what response de-escalates it.

That knowledge exists. The document doesn’t.

Nobody sat down to write what “correct” looks like in that company’s specific sales closing process, with its real clients, real pricing, and real exceptions. Nobody documented the criteria by which the operations team decides when a situation requires escalation versus standard resolution.

When the time comes to evaluate whether an AI system can help with those processes, that knowledge doesn’t exist in any format that an evaluation can use. The company ends up using the vendor’s benchmarks, which measure generic capabilities that don’t reflect the real work. Or the evaluation becomes a few days of the team trying the system and giving their impression.

“It felt good” isn’t an evaluation. It’s a guess.

The problem compounds over time. Every quarter without written criteria is a quarter of lost calibration, and calibration is the one thing a late budget cannot buy back from scratch.

What AI Maestro builds for this

The first deliverable in AI Maestro is the Process Reality Map. It is exactly this exercise: documenting what “good” means for each process in the operation.

Not the generic best-practice document that any consultant could write in two hours. The specific map of how that operation actually works: what each process does, what tools it uses, who makes which decisions and based on what information, where the legitimate exceptions are and where the signals of failure live, what the actual success criteria are for that process with those clients.

With that document, AI adoption stops being a guess and becomes an evaluation with real criteria. Can this model reason correctly about the specific constraints of this operation? Does it handle exceptions the way the team would handle them? Is its response for this client type consistent with the standard the company applies?

Those questions only have useful answers if the company has a written benchmark. Without one, any model response looks reasonable because there’s no comparison point.
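To make the comparison point concrete, here is a minimal sketch of scoring a model against written criteria. Everything in it is an assumption for illustration: the three cases, the `escalate`/`standard` labels, and `always_standard`, which stands in for a real model call.

```python
from typing import Callable

# Hypothetical written benchmark: (situation, decision the team would make).
WRITTEN_BENCHMARK = [
    ("Client asks for a 15% discount on a first order", "escalate"),
    ("Client reports a late delivery inside the agreed window", "standard"),
    ("Key account threatens to cancel over a billing error", "escalate"),
]


def evaluate(model: Callable[[str], str], benchmark) -> float:
    """Fraction of cases where the model's decision matches the team's."""
    hits = sum(
        1
        for situation, expected in benchmark
        if model(situation).strip().lower() == expected
    )
    return hits / len(benchmark)


def always_standard(situation: str) -> str:
    """Stand-in for a real model call: answers 'standard' to everything."""
    return "standard"


score = evaluate(always_standard, WRITTEN_BENCHMARK)
print(f"agreement with written criteria: {score:.0%}")  # prints 33%
```

A stub that never escalates can sound perfectly reasonable in a demo, yet here it agrees with the team on only one case in three. That gap is only visible because the expected decisions were written down first.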

The written benchmark is the foundation layer. Yesterday I wrote about the verbal layer that builds on top of it: the recorded conversations in which the team made those same decisions over two or three years. The Process Reality Map documents the criteria. The recordings show how they were applied in real situations, with real edge cases, with clients who tested the boundaries.

Most enterprise AI conversations in mid-market companies are still about which model to use. The question that matters more is when the team will sit down and write what “good” means for their operation. That is the advantage that doesn’t expire with the next release.

Build your operation’s AI benchmark with AI Maestro
