The Untrainable Ground: Owning the AI Benchmark
Ricardo Argüello — June 13, 2026
CEO & Founder
General summary
Models will be trained against everything measurable. The only advantage that doesn't expire is the definition of 'good' that lives inside your company and hasn't been written on any public leaderboard.
- Every public benchmark is temporary. Any lab can train against it in the next cycle. The durable advantage is the private benchmark that defines what 'correct' means for your specific operation.
- Harvey owns the legal benchmark. OpenEvidence owns clinical. Box just published an M&A due diligence eval that took years of domain work to build. In each case, the advantage is the definition, not the model.
- Most mid-market companies have no written benchmark for their core processes. When they evaluate AI, they use generic metrics that don't capture what matters in their business.
- IQ Source's AI Maestro builds the Process Reality Map, which is exactly this exercise: documenting what 'good' means for each operation, with real constraints and real exceptions.
Imagine evaluating an AI system for your legal team using generic metrics: response speed, grammatical accuracy, citation precision. Those metrics were designed by someone who has never worked inside your company. They don't capture which clauses are non-negotiable for your client type, which risks the company accepts for strategic reasons, or how your team handles an exception when the client is critical. The system scores well against those metrics, and only later do you realize you weren't measuring what actually mattered. That is what most mid-market companies experience when they evaluate AI using benchmarks that aren't theirs.
AI-generated summary
Last week, Ramez Naam quoted a line from Sarah Guo’s “The Untrainable” that I can’t stop thinking about: “anything you can put on a leaderboard, you can train against.”
This is the most precise framing I’ve seen for the central strategy question in enterprise AI right now.
Models keep improving. MMLU is saturated. HumanEval is saturated. The reasoning benchmarks that seemed untouchable two years ago have been beaten, and the ones that seem hard today are next in line. Every public measure of model capability has a target painted on it the moment it’s published.
The only ground that doesn’t move is the definition of “good” that lives inside your company, built from your actual work, and that nobody outside has ever written down. That definition is untrainable. And most mid-market companies don’t have it as a document yet.
The companies that already wrote their benchmark
Harvey built the legal benchmark. Not as a research project. They built it because they’ve been inside M&A legal work long enough to know exactly what “correct” means in that context: which clauses are critical versus negotiable depending on transaction type, what kind of error in a due diligence document is recoverable versus deal-threatening, what review standard applies to which document type.
That’s not knowledge you acquire by reading case studies. It accumulates from doing the work.
OpenEvidence did the same in clinical reasoning. Years inside actual medical decision-making, building the evaluation criteria that no generalist model can have because it wasn’t there when those decisions were made. A model that scores well on their benchmark doesn’t do so because it was trained on more generic data. It does so because the benchmark criteria were designed by people who understand what an acceptable answer looks like in that specific clinical context.
This week Aaron Levie published Box’s AI evaluation for M&A due diligence. It isn’t a PR exercise. It’s proof that Box knows what “good” means in document review for mergers and acquisitions, and that they can evaluate any model against that standard. A standard that cost years of real domain work to build, and that no competitor can replicate by shipping a better base model.
In every one of these cases, the advantage is not in which model gets used. It’s in knowing what to measure, and having done the years of internal work required to write that down.
Why that benchmark can’t be replicated through training
Guo’s argument is precise: every public benchmark eventually becomes training data. Labs actively look for what models can’t do well and add it to the next training run. If your competitive advantage in AI depends on a public benchmark, that benchmark has an expiration date.
The trap most mid-market companies fall into is believing that evaluating with generic benchmarks tells them something useful about whether AI will work for their operation. It doesn’t.
The logistics company evaluating models on response speed and grammatical coherence isn’t measuring whether the model can reason about its specific distribution constraints: the service-level agreements on particular routes, the rules that determine when a carrier substitution is acceptable, the exceptions the operations team applies when there are border delays or weather events. None of that is in any public benchmark.
The financial services company using a standard evaluation set to select a legal assistant isn’t measuring whether the model understands the exceptions its team makes for strategic reasons. That set doesn’t test which client gets a different contract structure, which clause is negotiable for certain sectors, or what risk level is acceptable depending on transaction size.
The benchmark that matters for that company isn’t in any paper or model comparison site. It has to be built by the team that knows the operation, from the work they’ve already done.
I wrote about AI evals as a compounding asset a few months ago. The point applies here: the company that has been running evaluations against its own criteria for three years has a calibration advantage no budget can replicate overnight.
What most mid-market companies still haven’t written down
The knowledge exists. In almost every mid-market company, there are people who know exactly what “good” means for the core processes of the operation. The sales director with fifteen years in the company knows when a price objection is real and when it’s a negotiating tactic. The operations manager knows when to apply the exception and when to hold the standard process. The customer success team knows which signals mean a client is about to escalate and what response de-escalates it.
That knowledge exists. The document doesn’t.
Nobody sat down to write what “correct” looks like in that company’s specific sales closing process, with its real clients, real pricing, and real exceptions. Nobody documented the criteria by which the operations team decides when a situation requires escalation versus standard resolution.
When the time comes to evaluate whether an AI system can help with those processes, that knowledge doesn’t exist in any format that an evaluation can use. The company ends up using the vendor’s benchmarks, which measure generic capabilities that don’t reflect the real work. Or the evaluation becomes a few days of the team trying the system and giving their impression.
“It felt good” isn’t an evaluation. It’s a guess.
The problem compounds over time. Every quarter without written criteria is a quarter of lost calibration, and calibration accumulated over years is not something a budget can buy from scratch.
What AI Maestro builds for this
The first deliverable in AI Maestro is the Process Reality Map. It is exactly this exercise: documenting what “good” means for each process in the operation.
Not the generic best-practice document that any consultant could write in two hours. The specific map of how that operation actually works: what each process does, what tools it uses, who makes which decisions and based on what information, where the legitimate exceptions are and where the signals of failure live, what the actual success criteria are for that process with those clients.
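As a rough illustration, one entry in such a map could be captured as structured data. The sketch below is hypothetical: the `ProcessEntry` class, its field names, and the example values are assumptions made for illustration, not AI Maestro's actual format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProcessEntry:
    """One process in a hypothetical Process Reality Map (illustrative fields only)."""
    name: str
    purpose: str                      # what the process does
    tools: list[str]                  # systems the team actually uses
    decision_owners: dict[str, str]   # decision -> role that makes it
    legitimate_exceptions: list[str] = field(default_factory=list)
    failure_signals: list[str] = field(default_factory=list)
    success_criteria: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the sections that are still undocumented for this process."""
        sections = {
            "legitimate_exceptions": self.legitimate_exceptions,
            "failure_signals": self.failure_signals,
            "success_criteria": self.success_criteria,
        }
        return [name for name, items in sections.items() if not items]


# Invented example values, for illustration only.
entry = ProcessEntry(
    name="carrier substitution",
    purpose="replace a carrier when the planned one cannot meet the route SLA",
    tools=["TMS", "email"],
    decision_owners={"approve substitution": "operations manager"},
    legitimate_exceptions=["border delay", "weather event"],
)
print(entry.missing_sections())
```

The `missing_sections` check makes the gap visible: here the exceptions are written down, but the failure signals and success criteria are not, so there is nothing yet to evaluate a model against.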
With that document, AI adoption stops being a guess and becomes an evaluation with real criteria. Can this model reason correctly about the specific constraints of this operation? Does it handle exceptions the way the team would handle them? Is its response for this client type consistent with the standard the company applies?
Those questions only have useful answers if the company has a written benchmark. Without one, any model response looks reasonable because there’s no comparison point.
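To make the comparison point concrete, here is a minimal sketch of scoring a model against written criteria. Everything in it is an assumption for illustration: the `BenchmarkCase` structure, the two invented cases, and the `stub_model` function that stands in for a real model call. Keyword matching is used only to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    """One written criterion: a situation plus what a correct answer must and must not say."""
    situation: str
    must_mention: list[str]
    must_not_mention: list[str]


def passes(case: BenchmarkCase, answer: str) -> bool:
    """A case passes only if every required phrase appears and no forbidden one does."""
    text = answer.lower()
    has_required = all(p.lower() in text for p in case.must_mention)
    has_forbidden = any(p.lower() in text for p in case.must_not_mention)
    return has_required and not has_forbidden


def pass_rate(cases: list[BenchmarkCase], model: Callable[[str], str]) -> float:
    """Fraction of cases the model answers in line with the written criteria."""
    if not cases:
        raise ValueError("no written benchmark: there is nothing to compare against")
    return sum(passes(c, model(c.situation)) for c in cases) / len(cases)


# Invented cases; a real benchmark would come from the team's own criteria.
cases = [
    BenchmarkCase(
        situation="Border delay on a route with a 48-hour SLA. What do we do?",
        must_mention=["notify the client", "approved carrier"],
        must_not_mention=["cancel the order"],
    ),
    BenchmarkCase(
        situation="Client asks for a discount beyond the standard limit.",
        must_mention=["escalate"],
        must_not_mention=["approve the discount"],
    ),
]


def stub_model(situation: str) -> str:
    """Stand-in for a real model call, so the sketch runs without any API."""
    if "Border delay" in situation:
        return "Notify the client and switch to an approved carrier."
    return "Approve the discount to keep the client happy."


print(pass_rate(cases, stub_model))  # 0.5
```

The second answer reads as perfectly reasonable on its own. It only fails because the written criterion says the team escalates instead. In practice the grading would more likely be done by the people who own the process, or by a rubric they wrote, than by phrase matching.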
The written benchmark is the foundation layer. Yesterday I wrote about the verbal layer that builds on top of it: the recorded conversations in which the team made those same decisions over two or three years. The Process Reality Map documents the criteria. The recordings show how they were applied in real situations, with real edge cases, with clients who tested the boundaries.
Most enterprise AI conversations in mid-market companies are still about which model to use. The conversation that matters more is when the team is going to sit down and write what “good” means for their operation. That is the advantage that doesn’t expire with the next release.
Build your operation’s AI benchmark with AI Maestro
Frequently Asked Questions
What is a proprietary AI benchmark?
A proprietary AI benchmark is the set of criteria that defines what 'correct' means for a company's specific processes. Unlike public benchmarks, it can't be replicated by competitors or made obsolete by the next model release, because it captures internal operational knowledge and the real constraints of that specific organization.
Why aren't public benchmarks enough to evaluate AI for an enterprise?
Public benchmarks measure generic capabilities that any lab can train against. An enterprise benchmark must capture the specific success criteria of that operation: which exceptions are valid, which errors are critical, what response is acceptable given the real business context. No public benchmark captures that because it wasn't built from inside that company.
Which companies have already built their own benchmark?
Harvey built the benchmark for legal AI in M&A work. OpenEvidence developed the standard for clinical reasoning. Box published an AI evaluation for due diligence in 2026 built from years of domain work. In each case, it took years inside the work to write down what 'correct' means in that specific vertical.
How does AI Maestro help build that benchmark?
AI Maestro's Process Reality Map documents how each operation actually works: what it does, what tools it uses, where the real constraints sit, and what the actual success criteria are. That document is the foundation for evaluating any AI system against your own criteria rather than generic metrics that don't reflect your business.