Evals Are the New Process Documentation
Ricardo Argüello — June 28, 2026
CEO & Founder
General summary
Aaron Levie said it directly: almost all AI agent progress is downstream from evals. Garrett Lord refounded Handshake as an evals company after talking with hundreds of executives whose AI programs stall. The reason they stall: the company can't articulate what good looks like, and without that, there's no way to know if the agent is improving.
- Enterprise AI programs stall at the pilot stage not because of technology limitations but because companies haven't defined what a quality output looks like for their own processes.
- Evals are not a thumbs-up or thumbs-down review. They are a criteria system that captures the nuances of judgment, tone, and business context and makes them consistently evaluable.
- A company that can evaluate its own workflows has an advantage no model vendor can provide: the success criteria belong to the company and scale with the agent.
- The most common error is confusing external benchmarks with business evals. A model can score high on MMLU and still perform poorly on your specific sales or customer service process.
- The AI Opportunity Score that AI Maestro produces is a pre-eval: it identifies which processes have sufficiently articulated success criteria for an eval to make sense.
Imagine hiring a new employee and sending them to serve customers without telling them what good service looks like at your company. After a month, how do you know if they're improving? You can't, because you never defined the standard. With AI agents it's identical. Evals are the act of writing down what a good response is, what's acceptable, and what's unacceptable. Without that, the agent can't improve because nobody knows what standard to measure it against.
AI-generated summary
Garrett Lord refounded Handshake as an evals company after spending months talking with hundreds of executives. The diagnosis he found in almost every conversation was the same: the AI program is stuck in the pilot stage, the team has been trying to scale to production for weeks or months, and nobody is sure why it’s not moving.
The reason he found, reinforced this week by Aaron Levie at Box: companies don’t have a defined view of what a quality output looks like for their own processes. Without that, there’s no way to know if the agent is improving, if it’s worse than the manual process it replaced, or if its errors are systematic.
Levie said it directly: “almost all AI model and agent progress is downstream from evals.” Advances in models, agent architectures, tool capabilities: all of it is measured against evals. The companies that win in AI won’t be the ones with access to the best model. They’ll be the ones with the best evals for their own workflows.
Why AI Programs Stall
The stalling mechanism Lord describes is one I recognize from working with teams that have been trying to scale AI for months. It’s not a technology problem. It’s not a model access problem. It’s that the company can’t articulate what good looks like.
An AI pilot that “works” without a defined evaluation standard is a pilot where someone looked at the output and said “seems fine.” That works for the demo. It doesn’t work for scaling. In production, edge cases appear, errors accumulate, and if there’s no explicit standard defining acceptable versus not, the team can’t even agree on whether there’s a problem.
What Lord calls effective evals isn’t a thumbs up/down review or a user survey. It’s a criteria system that captures the nuances of judgment, tone, and business context that matter in each process and makes them consistently evaluable. Deterministic verification for the objective: does the output match the defined scope, are the data points correct, is the format right? LLM-as-judge for the subjective: is the tone appropriate for this customer segment, is the recommendation relevant to the specific context?
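As a rough illustration of that two-layer structure, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the field names (`order_id`, `refund_amount`, `reply_text`), the 600-character limit, the rubric wording, and the `judge` callable. The judge stands in for whatever LLM call a team would actually use; here it is a plain function so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical output schema for a customer-service agent.
REQUIRED_FIELDS = ("order_id", "refund_amount", "reply_text")


@dataclass
class EvalResult:
    passed: bool
    failures: list[str]


def deterministic_checks(output: dict, expected: dict) -> list[str]:
    """Objective layer: format, scope, and data correctness."""
    failures = []
    for name in REQUIRED_FIELDS:
        if name not in output:
            failures.append(f"missing field: {name}")
    if output.get("refund_amount") != expected.get("refund_amount"):
        failures.append("refund_amount does not match the record")
    if len(output.get("reply_text", "")) > 600:  # assumed length limit
        failures.append("reply_text exceeds 600 characters")
    return failures


def evaluate(output: dict, expected: dict,
             judge: Callable[[str, str], bool]) -> EvalResult:
    failures = deterministic_checks(output, expected)
    # Subjective layer: delegated to a judge (an LLM in practice, assumed here).
    rubric = "Tone is appropriate for a long-standing enterprise customer."
    if not judge(rubric, output.get("reply_text", "")):
        failures.append("judge: tone not appropriate")
    return EvalResult(passed=not failures, failures=failures)


def stub_judge(rubric: str, text: str) -> bool:
    # Placeholder only: a real judge would send the rubric and text to a model.
    return "sorry" in text.lower()


good = {"order_id": "A1", "refund_amount": 40,
        "reply_text": "Sorry for the delay. Your refund is on its way."}
result = evaluate(good, {"refund_amount": 40}, stub_judge)
```

The design point is that both layers report into the same list of named failures, so a team can see which criterion failed rather than just that something did.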
The Benchmark vs. Business Eval Confusion
Almost every technical team falls into the same trap when evaluating models: using external benchmarks as a substitute for business evals. MMLU, GPQA, HumanEval, whatever. Benchmarks are useful for comparing general model capabilities. They’re a poor substitute for knowing whether the agent is executing well on your specific sales or customer service process.
A model can score high on every external benchmark and perform poorly on your specific process, because what matters there is your company’s own criteria, not the model’s general capability. As I’ve argued about evals as a compounding asset: the eval that matters isn’t the vendor’s. It’s yours.
And here’s the part Levie emphasizes most forcefully: evals as strategic IP. A company that has built a solid evaluation system for its processes has something no vendor can provide: the success criteria codified for its specific business. Those criteria are portable. They work with any model. They scale with the agent. And they compound over time as standards get refined.
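One way to picture that portability is a sketch in which the eval cases know nothing about the model behind the agent. The cases, the keyword checks, and the two stand-in agents below are all invented for illustration; in practice each agent would wrap a different model or vendor, and the checks would encode the company's real criteria.

```python
from typing import Callable

# Hypothetical eval cases: a prompt plus a check encoding the company's criterion.
CASES = [
    ("Where is my order?", lambda reply: "tracking" in reply.lower()),
    ("I want a refund.", lambda reply: "refund" in reply.lower()),
    ("Cancel my plan.", lambda reply: "cancel" in reply.lower()),
]


def pass_rate(agent: Callable[[str], str]) -> float:
    """Run every case through the agent and return the fraction that pass."""
    passed = sum(1 for prompt, check in CASES if check(agent(prompt)))
    return passed / len(CASES)


# Stand-in agents; neither calls a real model.
def agent_a(prompt: str) -> str:
    return "Here is your tracking link."


def agent_b(prompt: str) -> str:
    replies = {
        "Where is my order?": "Your tracking number is 123.",
        "I want a refund.": "Your refund has been issued.",
    }
    return replies.get(prompt, "Please contact support.")


# The same suite scores both agents, so results are directly comparable.
scores = {"agent_a": pass_rate(agent_a), "agent_b": pass_rate(agent_b)}
```

Because `pass_rate` only needs a function from prompt to reply, swapping the underlying model changes nothing in the eval suite itself.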
The Prerequisite Nobody Mentions
There’s a step before evals that almost all the discussion skips: you need to know what you want to evaluate before you can evaluate it.
That sounds obvious. In practice, most companies can’t articulate their own quality criteria until someone asks them the right questions. What makes a customer service response excellent versus just acceptable at your company? What signals indicate a lead should go to sales today versus next week? What criteria does your operations team use to decide when to escalate a problem?
Those answers live in the most experienced people on the team. They’re rarely written down. And until they’re written down, they can’t become evals.
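To make "written down" concrete, here is a hypothetical sketch of how a team might record criteria and spot the ones that are still unwritten. The three levels echo the good, acceptable, and unacceptable framing used earlier; the `Criterion` structure, the criterion names, and the example wording are assumptions made for this illustration, not a prescribed format.

```python
from dataclasses import dataclass

LEVELS = ("excellent", "acceptable", "unacceptable")


@dataclass
class Criterion:
    name: str
    # Maps each level to the description an experienced team member gave.
    levels: dict[str, str]

    def is_articulated(self) -> bool:
        """True only when every level has a non-empty written description."""
        return all(self.levels.get(level, "").strip() for level in LEVELS)


def unarticulated(criteria: list[Criterion]) -> list[str]:
    """Names of criteria that are not yet fully written down."""
    return [c.name for c in criteria if not c.is_articulated()]


# Illustrative content only; the wording is invented for this example.
support_criteria = [
    Criterion("resolution", {
        "excellent": "Solves the issue and anticipates the next question.",
        "acceptable": "Solves the issue as asked.",
        "unacceptable": "Leaves the issue open or gives wrong information.",
    }),
    Criterion("escalation", {
        "excellent": "Escalates with full context attached.",
        "acceptable": "",  # nobody has written this down yet
    }),
]

gaps = unarticulated(support_criteria)
```

Here `gaps` lists the criteria that cannot yet back an eval, which is the articulation work that has to happen first.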
That documentation work is exactly the first phase of AI Maestro: mapping real processes, surfacing the quality criteria the team uses implicitly, and converting them into the Opportunity Score that prioritizes where to build first. That Score is a pre-eval: it identifies which processes have sufficiently articulated success criteria for an eval to make sense, and which ones still need the articulation work.
Without that prior diagnostic, the evals you build measure the process as it’s described on paper — which rarely matches the process as it actually runs.
Frequently Asked Questions
AI evals are criteria systems that measure whether an agent is performing well against a company's specific standards, not generic benchmarks. A model can score high on MMLU or GPQA and still perform poorly on a company's specific sales or customer service process. Business evals capture the criteria specific to that company.
AI programs stall at the pilot stage because companies can't articulate what good looks like in their own processes. Without that, there is no way to know if the agent is improving, if it's in a local minimum, or if its errors are systematic. The absence of explicit quality criteria is the most common reason AI pilots don't scale to production.
An effective eval combines deterministic verification for objective checks (does the output match the defined scope, are the data points correct) with LLM-as-judge for subjective quality (is the tone appropriate, is the recommendation relevant to the specific context). The prior work is writing down the quality criteria for each process before building the eval.
Manual review requires human intervention on every output and doesn't scale. An eval codifies the quality criterion once and applies it automatically to every agent execution. The advantage is that the standard is consistent, auditable, and improves over time as criteria are refined. An eval is process documentation converted into infrastructure.