Shopify and Anthropic: AI Agents in Real Production
Ricardo Argüello — March 13, 2026
CEO & Founder
General summary
Five days after analyzing autoresearch as a concept, two production cases confirm the thesis. Shopify's CEO ran autoresearch on Liquid and achieved 53% faster combined parse+render with 61% fewer object allocations — 29 experiments, 10 kept, 21 files changed. Anthropic runs its entire growth marketing — 6 channels, $19B annualized revenue — with a single non-technical person. The pattern is the same: human directs, agent iterates, human reviews.
- Tobi Lutke ran autoresearch on Shopify's Liquid engine: 53% faster combined parse+render, 61% fewer object allocations
- Anthropic operates 6 growth marketing channels with one person — work that typically requires 15 to 20 people
- The pattern is identical in both cases: human sets constraints, agent runs dozens of experiments, human decides what to keep
- Tobi warned results are 'somewhat overfit' — the numbers are real but interpretation needs nuance
- Preconditions matter more than tools: clear metrics, ready infrastructure, and technical leadership willing to experiment
Imagine the CEO of a $120 billion company telling a program: 'make this code faster — it processes billions of requests per day.' The program tests 29 different ideas overnight, discards 19, and keeps 10 that make everything run 53% faster. Now imagine a $380 billion company running all its digital marketing — paid ads, email, SEO, social media — with one person who uses AI agents to do the work of a team of 20. That happened this week. Not as a pilot. In production.
AI-generated summary
Five days ago we analyzed autoresearch as a concept: 630 lines of code, an agent that iterates without human intervention, and a direction + constraints pattern that applies beyond academic research. That week was theory. This week we have production numbers.
On Tuesday, Tobi Lutke — CEO of Shopify, a $120B market cap company with 5.6 million stores — ran autoresearch on the Liquid rendering engine. Result: 53% faster, 61% fewer object allocations, 21 files changed. Aakash Gupta broke down the numbers the same day.
Two days earlier, Anthropic revealed that a single non-technical person had been running their entire growth marketing for 10 months — 6 channels, $19B annualized revenue, $380B valuation.
Same week, two scales, one identical pattern.
Liquid: 53% Faster, 21 Files, One Night
Liquid is the templating engine behind every Shopify storefront. Every time a buyer loads a product page, Liquid parses the template and renders the HTML. Multiply that by 5.6 million stores and billions of daily requests. A 1% improvement in Liquid performance has measurable impact on Shopify’s global infrastructure.
Tobi didn’t aim for 1%. He pointed autoresearch — Karpathy’s tool we analyzed last week — directly at Liquid’s codebase and let it run.
What the agent did technically: replaced byte-scan matching with compiled regular expressions, eliminated intermediate method dispatches via inlining, and swapped each/while loops for optimized for iterations. None of these are glamorous changes. They’re the optimizations a senior engineer would make given three weeks and full profiler access. The agent found them in one overnight session.
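Liquid itself is written in Ruby and the actual diffs aren't public, so as a rough illustration only, here is a toy Python sketch of the byte-scan-versus-compiled-regex trade-off the agent exploited (the template and function names are invented):

```python
import re

# A toy template, repeated to make the scan non-trivial.
TEMPLATE = "Hello {{ user }}, you have {{ count }} items. " * 1000

def find_tags_bytescan(src):
    """Naive scan: walk the string one character at a time looking for '{{'."""
    tags, i = [], 0
    while i < len(src) - 1:
        if src[i] == "{" and src[i + 1] == "{":
            end = src.find("}}", i + 2)
            if end == -1:
                break
            tags.append(src[i + 2:end].strip())
            i = end + 2
        else:
            i += 1
    return tags

# Compiling once moves the matching loop into optimized native code.
TAG_RE = re.compile(r"\{\{\s*(.*?)\s*\}\}")

def find_tags_regex(src):
    """Single pass through the compiled pattern's matching machinery."""
    return TAG_RE.findall(src)

assert find_tags_bytescan(TEMPLATE) == find_tags_regex(TEMPLATE)
```

Both functions return the same tag names; the compiled-pattern version simply does far less per-character work in the interpreter, which is the class of win described above.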
The numbers: 29 experiments run. 19 discarded. 10 kept. 21 files changed. Combined result: 53% faster parse+render, 61% fewer object allocations.
This wasn’t the first run. In an earlier session on a query-expansion model, autoresearch executed 37 experiments and achieved 19% improvement — with a 0.8B parameter model outperforming a 1.6B one. The tool had a track record before it touched Liquid.
There’s something Tobi himself noted: the results are “somewhat overfit.” He flagged it as an honest warning — the benchmarks measure a specific scenario, and real production performance may differ. That’s the kind of nuance most AI tweets skip. That the CEO of a $120B company proactively mentions it says something about how maturely he’s evaluating the tool.
And there’s a detail worth noticing: the CEO of a public company at that scale personally running an autonomous research tool against production code on an ordinary weekday afternoon. He didn’t delegate to a research team. He didn’t put it on a quarterly roadmap. He just ran it.
A One-Person Marketing Team
On the opposite end of the spectrum — marketing operations instead of infrastructure code — Anthropic revealed a parallel case.
Context: Anthropic, the company behind Claude, went from $9B to $19B in annualized revenue in three months. Its valuation reached $380B. It has over 3,000 employees. This is not an early-stage startup.
Austin Lau is one person. Non-technical. For 10 months, he managed all 6 of Anthropic’s growth marketing channels, including paid search, paid social, app stores, email, and SEO. For reference, the industry benchmark for operating those 6 channels is a team of 15 to 20 people with $3M to $5M in annual payroll.
His workflow: Claude Code exports campaign data and flags anomalies automatically, while two specialized agents generate 100 copy and creative variations in 0.5 seconds. For visual assets, a Figma plugin swaps ad templates without manual intervention. On the data side, an MCP server queries Meta’s Ads API directly for real-time metrics — and a memory system feeds learnings from each cycle into the next.
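Anthropic hasn't published this pipeline, so purely as a hedged sketch, the "flag anomalies automatically" step could look something like the following in Python — the campaign names, numbers, and 2-sigma threshold are all invented for illustration:

```python
from statistics import mean, stdev

def flag_anomalies(history, today, threshold=2.0):
    """Flag campaigns whose metric today deviates more than `threshold`
    standard deviations from that campaign's own recent history."""
    flagged = []
    for campaign, series in history.items():
        mu, sigma = mean(series), stdev(series)
        value = today[campaign]
        if sigma > 0 and abs(value - mu) > threshold * sigma:
            flagged.append((campaign, value))
    return flagged

# Hypothetical cost-per-acquisition figures (USD) over the last 5 days.
history = {
    "search_brand": [2.1, 2.0, 2.2, 2.1, 2.0],
    "social_retarget": [4.0, 4.2, 3.9, 4.1, 4.0],
}
today = {"search_brand": 2.1, "social_retarget": 7.5}

print(flag_anomalies(history, today))  # social_retarget spikes, so it gets flagged
```

The point of the sketch: once the metrics are exported in a structured form, the anomaly check itself is a few lines — the hard part is the data plumbing behind it.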
The result: one person operating the marketing engine of a $380B-valued company. Not as a pilot. For 10 months in production.
The caveat needs to be clear: this is Anthropic talking about its own product. The case was published as part of their communications strategy. The numbers haven’t been externally audited. It’s like Toyota publishing a study about its own vehicle reliability — probably true, but the source deserves context. Still, the scale and duration make it hard to dismiss entirely.
Same Pattern, Two Scales
Place these two cases side by side and the structure is identical:
| | Shopify | Anthropic |
|---|---|---|
| Domain | Infrastructure code | Marketing operations |
| Metric | Parse+render time | Campaign performance |
| Experiments | 29 run, 10 kept | Continuous cycle |
| Human role | Evaluate & merge | Direct & scale |
| Caveat | "Somewhat overfit" | Self-reported |
In both cases, the human doesn’t disappear. They shift position. From execution to direction. Tobi didn’t write the Liquid optimizations — he evaluated them and decided which to keep. Austin didn’t design every ad or adjust every bid — he defined the channels, budget constraints, and success criteria.
This is exactly what we described in the agent operator analysis: both Tobi and Austin are acting as conductors of autonomous agents. The value is no longer in repetitive execution — it’s in knowing what to ask for, how to constrain it, and when to step in.
And the autoresearch loop we analyzed in 630 lines is no longer an academic concept. It’s running at $120B and $380B scale.
The Conditions Nobody Mentions
The easy takeaway from these two cases is: “I need to deploy autonomous agents.” The hard part — and what separates thoughtful adoption from groundless enthusiasm — is asking why they worked.
Both cases share three preconditions:
Clean, quantifiable metrics. Liquid has performance benchmarks: parse time, render time, memory allocations. Anthropic’s marketing has cost per acquisition, CTR, conversion by channel. In both cases, the agent knows exactly what to compare each iteration against. Without that metric, the agent iterates blind.
Ready infrastructure. Neither agent built its own foundation. Shopify already had a mature codebase with test suites and profilers; Anthropic already had advertising platform APIs and connected data systems. The agents operated on top of what existed. If your process lives in shared spreadsheets and forwarded emails, there’s nothing for an agent to operate on.
Leadership willing to experiment. This one is less about technology and more about organizational permission. Tobi ran autoresearch personally — no committee, no quarterly roadmap. Austin had 10 months of operational autonomy because someone at the top signed off. Authorization to experiment came from leadership, not from a six-month approval cycle.
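The loop both cases share — baseline, candidate experiments, keep only what beats the metric — can be sketched in a few lines of Python. The experiment names and timings below are made up; a real run would execute code or campaigns and measure them:

```python
def run_experiments(baseline_ms, experiment_results):
    """Keep only experiments that beat the current best metric.
    `experiment_results` pairs an experiment name with its measured
    time in milliseconds (lower is better)."""
    best = baseline_ms
    kept = []
    for name, ms in experiment_results:
        if ms < best:
            best = ms
            kept.append(name)   # human still reviews this list before merging
    return kept, best

# Hypothetical overnight session against a 100 ms baseline.
experiments = [
    ("inline_dispatch", 95),
    ("regex_scan", 80),
    ("slower_variant", 120),   # discarded: worse than current best
    ("for_loop_swap", 72),
]
kept, best = run_experiments(100, experiments)
print(kept, best)
```

Note what the loop does not contain: the decision to merge. The agent produces a candidate list and a number; the human supplies the judgment, exactly as in both cases above.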
In our experience at IQ Source, the bottleneck is never tool access. Autoresearch is open source. Claude Code is available via subscription. Meta and Google APIs are public. What distinguishes teams that get results is preparation: mapped processes, defined metrics, and documented constraints.
This connects directly to what we analyzed about context engineering: the quality of an agent’s output is proportional to the quality of context it receives. Shopify and Anthropic didn’t use magic tools — they gave existing tools high-quality context.
For B2B companies in Latin America, the question isn’t “can I replicate what Shopify did?” The tools are already available. The right question is: “do I have the preconditions?”
Your First Experiment Doesn’t Have to Be Liquid
Liquid processes billions of requests. Anthropic’s marketing operates with $19B in revenue. The numbers are striking, but you don’t need to operate at that scale to apply the pattern.
What you need is one process with one clear metric. One process. One metric.
Concrete examples for mid-market companies: purchase order accuracy (how many go through without manual correction?), document processing time (from receipt to classification), quote response speed (from request to proposal sent), inventory shrinkage (gap between system count and physical count).
All of these produce a number that goes up or down. Today, improving that number depends on human hands to test adjustments, measure results, and decide what to keep. But an agent can run those iterations overnight while the team rests.
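Taking purchase order accuracy as the example, the metric itself is trivial to compute once the data exists. This hypothetical Python sketch shows the single number an agent would iterate against (field names are invented):

```python
def po_accuracy(orders):
    """Share of purchase orders processed without manual correction.
    `orders` is a list of dicts with a boolean 'needed_correction' field."""
    clean = sum(1 for o in orders if not o["needed_correction"])
    return clean / len(orders)

# Toy sample: 4 orders, 1 needed a human fix.
orders = [
    {"id": 1, "needed_correction": False},
    {"id": 2, "needed_correction": True},
    {"id": 3, "needed_correction": False},
    {"id": 4, "needed_correction": False},
]
print(po_accuracy(orders))  # 0.75 — the one number an agent can try to push up
```

If your order data can't produce this number today, that gap — not the agent tooling — is the first thing to fix.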
We’re not talking about replacing teams. We’re talking about the team arriving the next morning with 20 tested variations instead of having manually tested 2 last week.
Send us the name of one process and the metric you measure it by today. We’ll send back a quick diagnostic: what’s ready, what’s missing, and what the first viable experiment would look like. No meeting — just one email exchange.
Send your process and metric →

Frequently Asked Questions
What exactly happened at Shopify?

Shopify CEO Tobi Lutke ran Karpathy's autoresearch tool on the Liquid templating engine — the code that renders every Shopify storefront. The agent ran 29 autonomous experiments, kept 10, and achieved 53% faster combined parse and render time with 61% fewer object allocations. The tool is open source on GitHub.
How does one person run Anthropic's growth marketing?

Austin Lau, a non-technical marketer, managed all 6 of Anthropic's growth marketing channels for 10 months, including paid search, paid social, app stores, email, and SEO. He uses Claude Code to export campaign data, two specialized agents for ad copy variations, a Figma plugin for template swapping, and an MCP server connected to Meta's ads API.
What pattern do the two cases share?

Both follow the same pattern: a human sets direction and constraints, an AI agent runs dozens of experiments at machine speed, and the human reviews results to decide what to keep. At Shopify, Tobi evaluated 29 experiments and kept 10. At Anthropic, every result feeds a memory system for the next cycle. The human directs, not executes.
What preconditions does this approach require?

Three preconditions: a clean, quantifiable metric (response time, error rate, cost per acquisition), enough historical data for the agent to benchmark against, and well-defined constraints the agent cannot cross. The tools are accessible to everyone — what distinguishes successful cases is the quality of the preparation.
Related Articles
LiteLLM Attack: Your AI Trust Chain Just Broke
LiteLLM, the AI API key proxy with 97 million monthly downloads, was poisoned via PyPI. Your security scanner was the entry point.
Google Stitch + AI Studio: Design-to-Code Without Engineers
Google shipped a full design-to-production pipeline with Stitch and AI Studio. Where it works for B2B prototypes and where you still need real engineering.