Evals Are the New Process Documentation

Aaron Levie says it's all evals. Garrett Lord refounded Handshake around evals after talking to hundreds of executives whose AI programs stall at the pilot stage. The reason they stall is always the same: the company can't define what good looks like.

Ricardo Argüello

CEO & Founder

Business Strategy 4 min read

Garrett Lord refounded Handshake as an evals company after spending months talking with hundreds of executives. The diagnosis he found in almost every conversation was the same: the AI program is stuck in the pilot stage, the team has been trying to scale to production for weeks or months, and nobody is sure why it’s not moving.

The reason he found, reinforced this week by Aaron Levie at Box: companies don’t have a defined view of what a quality output looks like for their own processes. Without that, there’s no way to know if the agent is improving, if it’s worse than the manual process it replaced, or if its errors are systematic.

Levie said it directly: “almost all AI model and agent progress is downstream from evals.” Advances in models, agent architectures, tool capabilities: all of it is measured against evals. The companies that win in AI won’t be the ones with access to the best model. They’ll be the ones with the best evals for their own workflows.

Why AI Programs Stall

The stalling mechanism Lord describes is one I recognize from working with teams that have been trying to scale AI for months. It’s not a technology problem. It’s not a model access problem. It’s that the company can’t articulate what good looks like.

An AI pilot that “works” without a defined evaluation standard is a pilot where someone looked at the output and said “seems fine.” That works for the demo. It doesn’t work for scaling. In production, edge cases appear, errors accumulate, and if there’s no explicit standard defining acceptable versus not, the team can’t even agree on whether there’s a problem.

What Lord calls effective evals isn’t a thumbs up/down review or a user survey. It’s a criteria system that captures the nuances of judgment, tone, and business context that matter in each process and makes them consistently evaluable. Deterministic verification for the objective: does the output match the defined scope, are the data points correct, is the format right? LLM-as-judge for the subjective: is the tone appropriate for this customer segment, is the recommendation relevant to the specific context?
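To make the two layers concrete, here is a minimal sketch in Python, not anyone's production system. Every name in it (Criterion, evaluate, the sample checks, the order ID) is hypothetical and invented for illustration. The judge is a keyword stub standing in for a real LLM-as-judge call, which would send the output and a written rubric to a model and parse the verdict.

```python
from dataclasses import dataclass
from typing import Callable

# All names here are hypothetical; none come from Lord, Levie, or a library.

@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # returns True when the output passes
    kind: str                     # "deterministic" or "judge"

def within_length(output: str) -> bool:
    # Deterministic format rule: the reply must be at most 80 words.
    return len(output.split()) <= 80

def mentions_order_id(output: str) -> bool:
    # Deterministic data check: the reply must cite the ticket's order ID.
    return "ORD-1042" in output

def tone_judge(output: str) -> bool:
    # Stand-in for an LLM-as-judge call (assumption: a real version would
    # ask a model to grade tone against a rubric). The keyword test below
    # only exists so the sketch runs offline.
    lowered = output.lower()
    return "sorry" in lowered or "apologize" in lowered

CRITERIA = [
    Criterion("length", within_length, "deterministic"),
    Criterion("order_id", mentions_order_id, "deterministic"),
    Criterion("tone", tone_judge, "judge"),
]

def evaluate(output: str, criteria=CRITERIA) -> dict[str, bool]:
    # Run every criterion and report pass/fail per criterion name.
    return {c.name: c.check(output) for c in criteria}

reply = "We apologize for the delay. Order ORD-1042 ships tomorrow."
results = evaluate(reply)
print(results)
```

The point of the structure is that each criterion is named and written down, so a failure points at a specific standard rather than at a general sense that the output "seems off."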

The Benchmark vs. Business Eval Confusion

Almost every technical team falls into the same trap when evaluating models: using external benchmarks as a substitute for business evals. MMLU, GPQA, HumanEval, whatever. Benchmarks are useful for comparing general model capabilities. They’re a poor substitute for knowing whether the agent is executing well on your specific sales or customer service process.

A model can score high on every external benchmark and perform poorly on your specific process, because what matters in your process is your company’s own criteria, not the general capability of the model. As I’ve argued about evals as a compounding asset: the eval that matters isn’t the vendor’s. It’s yours.

And here’s the part Levie emphasizes most forcefully: evals as strategic IP. A company that has built a solid evaluation system for its processes has something no vendor can provide: the success criteria codified for its specific business. Those criteria are portable. They work with any model. They scale with the agent. And they compound over time as standards get refined.
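One way to picture the portability claim is a rough Python sketch in which the criteria are written once and each model is just a function from prompt to text. The two "models" below are hard-coded stand-ins, and every name (model_a, model_b, CHECKS, pass_rate) is an assumption made for illustration. Real versions would wrap calls to whichever vendors are being compared.

```python
from typing import Callable

# Hypothetical stand-ins for two vendors' models: prompt in, text out.
def model_a(prompt: str) -> str:
    return "Refund approved. Reference: ORD-1042."

def model_b(prompt: str) -> str:
    return "Thanks for reaching out!"

# The company's criteria, defined once and independent of any model.
CHECKS: dict[str, Callable[[str], bool]] = {
    "cites_order_id": lambda out: "ORD-1042" in out,
    "states_decision": lambda out: "approved" in out.lower()
    or "denied" in out.lower(),
}

def pass_rate(model: Callable[[str], str], prompts: list[str]) -> float:
    # Fraction of (prompt, check) pairs the model's outputs satisfy.
    passed = 0
    total = 0
    for prompt in prompts:
        output = model(prompt)
        for check in CHECKS.values():
            passed += int(check(output))
            total += 1
    return passed / total

prompts = ["Customer asks for a refund on order ORD-1042."]
scores = {
    name: pass_rate(fn, prompts)
    for name, fn in [("model_a", model_a), ("model_b", model_b)]
}
print(scores)
```

Swapping vendors changes only the functions passed in. The checks, and the judgment they encode, stay with the company.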

The Prerequisite Nobody Mentions

There’s a step before evals that almost all the discussion skips: you need to know what you want to evaluate before you can evaluate it.

That sounds obvious. In practice, most companies can’t articulate their own quality criteria until someone asks them the right questions. What makes a customer service response excellent versus just acceptable at your company? What signals indicate a lead should go to sales today versus next week? What criteria does your operations team use to decide when to escalate a problem?

Those answers live in the most experienced people on the team. They’re rarely written down. And until they’re written down, they can’t become evals.

That documentation work is exactly the first phase of AI Maestro: mapping real processes, surfacing the quality criteria the team uses implicitly, and converting them into the Opportunity Score that prioritizes where to build first. That Score is a pre-eval: it identifies which processes have sufficiently articulated success criteria for an eval to make sense, and which ones still need the articulation work.

Without that prior diagnostic, the evals you build measure the process as it’s described on paper — which rarely matches the process as it actually runs.
