
Shopify and Anthropic: AI Agents in Real Production

Shopify's CEO got 53% faster Liquid rendering via autoresearch. Anthropic runs 6 marketing channels with one person. From theory to production numbers.


Ricardo Argüello

CEO & Founder

AI & Automation · 7 min read

Five days ago we analyzed autoresearch as a concept: 630 lines of code, an agent that iterates without human intervention, and a direction + constraints pattern that applies beyond academic research. That week was theory. This week we have production numbers.

On Tuesday, Tobi Lütke — CEO of Shopify, a $120B market cap company with 5.6 million stores — ran autoresearch on the Liquid rendering engine. Result: 53% faster, 61% fewer object allocations, 21 files changed. Aakash Gupta broke down the numbers the same day.

Two days earlier, Anthropic revealed that a single non-technical person had been running their entire growth marketing for 10 months — 6 channels, $19B annualized revenue, $380B valuation.

Same week, two scales, one identical pattern.

Liquid: 53% Faster, 21 Files, One Night

Liquid is the templating engine behind every Shopify storefront. Every time a buyer loads a product page, Liquid parses the template and renders the HTML. Multiply that by 5.6 million stores and billions of daily requests. A 1% improvement in Liquid performance has measurable impact on Shopify’s global infrastructure.

Tobi didn’t aim for 1%. He pointed autoresearch — Karpathy’s tool we analyzed last week — directly at Liquid’s codebase and let it run.

What the agent did technically: replaced byte-scan matching with compiled regular expressions, eliminated intermediate method dispatches via inlining, and swapped each/while loops for optimized for iterations. None of these are glamorous changes. They’re the optimizations a senior engineer would make given three weeks and full profiler access. The agent found them in one overnight session.
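To make the first of those changes concrete: this is a toy illustration of swapping a hand-rolled byte scan for a precompiled regex, written in Python for readability (Liquid itself is Ruby, and this is not Shopify's actual code — the template string and function names are invented).

```python
import re

TEXT = "Hello {{ product.title }}, now {{ product.price }}!"

# Hand-rolled scan: walk the string looking for "{{ ... }}" delimiters.
def find_tags_scan(s):
    tags, i = [], 0
    while True:
        start = s.find("{{", i)
        if start == -1:
            return tags
        end = s.find("}}", start + 2)
        if end == -1:
            return tags
        tags.append(s[start + 2:end].strip())
        i = end + 2

# Compiled version: the pattern is parsed once, and matching runs in
# the regex engine's optimized loop instead of interpreted code.
TAG_RE = re.compile(r"\{\{\s*(.*?)\s*\}\}")

def find_tags_regex(s):
    return TAG_RE.findall(s)

# Both return the same tags; the compiled pattern is what you benchmark.
assert find_tags_scan(TEXT) == find_tags_regex(TEXT)
```

Same output, different cost profile — exactly the kind of change that only looks worthwhile once a profiler (or an agent running benchmarks all night) shows where the time actually goes.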

The numbers: 29 experiments run. 19 discarded. 10 kept. 21 files changed. Combined result: 53% faster parse+render, 61% fewer object allocations.

This wasn’t the first run. In an earlier session on a query-expansion model, autoresearch executed 37 experiments and achieved 19% improvement — with a 0.8B parameter model outperforming a 1.6B one. The tool had a track record before it touched Liquid.

There’s something Tobi himself noted: the results are “somewhat overfit.” He flagged it as an honest warning — the benchmarks measure a specific scenario, and real production performance may differ. That’s the kind of nuance most AI tweets skip. That the CEO of a $120B company proactively mentions it says something about how maturely he’s evaluating the tool.

And there’s a detail worth noticing: the CEO of a public company at that scale personally running an autonomous research tool against production code on a Wednesday afternoon. He didn’t delegate to a research team. He didn’t put it on a quarterly roadmap. He just ran it.

A One-Person Marketing Team

On the opposite end of the spectrum — marketing operations instead of infrastructure code — Anthropic revealed a parallel case.

Context: Anthropic, the company behind Claude, went from $9B to $19B in annualized revenue in three months. Its valuation reached $380B. It has over 3,000 employees. This is not an early-stage startup.

Austin Lau is one person. Non-technical. For 10 months, he managed all 6 of Anthropic’s growth marketing channels, including paid search, paid social, app stores, email, and SEO. For reference, the industry benchmark for operating those channels is a team of 15 to 20 people with $3M to $5M in annual payroll.

His workflow: Claude Code exports campaign data and flags anomalies automatically, while two specialized agents generate 100 copy and creative variations in 0.5 seconds. For visual assets, a Figma plugin swaps ad templates without manual intervention. On the data side, an MCP server queries Meta’s Ads API directly for real-time metrics — and a memory system feeds learnings from each cycle into the next.
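The “flags anomalies automatically” step is mechanically simple. A minimal sketch of the idea — hypothetical channel names, numbers, and threshold, not Anthropic’s actual pipeline — flags any channel whose latest daily value deviates sharply from its recent history:

```python
from statistics import mean, stdev

# Hypothetical daily cost-per-acquisition history per channel.
history = {
    "paid_search": [42, 40, 45, 41, 43, 44, 88],   # last value spikes
    "paid_social": [30, 31, 29, 32, 30, 31, 30],   # stable
}

def flag_anomalies(history, z_threshold=3.0):
    """Flag channels whose latest value sits more than z_threshold
    standard deviations from the mean of the preceding days."""
    flagged = []
    for channel, values in history.items():
        past, latest = values[:-1], values[-1]
        mu, sigma = mean(past), stdev(past)
        if sigma > 0 and abs(latest - mu) / sigma > z_threshold:
            flagged.append(channel)
    return flagged

print(flag_anomalies(history))  # the spiking channel gets flagged
```

The value isn’t in the statistics — it’s that the check runs on every export without anyone remembering to look.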

The result: one person operating the marketing engine of a $380B-valued company. Not as a pilot. For 10 months in production.

The caveat needs to be clear: this is Anthropic talking about its own product. The case was published as part of their communications strategy. The numbers haven’t been externally audited. It’s like Toyota publishing a study about its own vehicle reliability — probably true, but the source deserves context. Still, the scale and duration make it hard to dismiss entirely.

Same Pattern, Two Scales

Place these two cases side by side and the structure is identical:

              Shopify                Anthropic
Domain        Infrastructure code    Marketing operations
Metric        Parse+render time      Campaign performance
Experiments   29 run, 10 kept        Continuous cycle
Human role    Evaluate & merge       Direct & scale
Caveat        “Somewhat overfit”     Self-reported

In both cases, the human doesn’t disappear. They shift position. From execution to direction. Tobi didn’t write the Liquid optimizations — he evaluated them and decided which to keep. Austin didn’t design every ad or adjust every bid — he defined the channels, budget constraints, and success criteria.

This is exactly what we described in the agent operator analysis: both Tobi and Austin are acting as conductors of autonomous agents. The value is no longer in repetitive execution — it’s in knowing what to ask for, how to constrain it, and when to step in.

And the autoresearch loop we analyzed in 630 lines is no longer an academic concept. It’s running at $120B and $380B scale.

The Conditions Nobody Mentions

The easy takeaway from these two cases is: “I need to deploy autonomous agents.” The hard part — and what separates thoughtful adoption from groundless enthusiasm — is asking why they worked.

Both cases share three preconditions:

Clean, quantifiable metrics. Liquid has performance benchmarks: parse time, render time, memory allocations. Anthropic’s marketing has cost per acquisition, CTR, conversion by channel. In both cases, the agent knows exactly what to compare each iteration against. Without that metric, the agent iterates blind.
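The loop that metric enables is the whole mechanism. Here is a toy sketch of the pattern — not autoresearch’s actual code; the fake benchmark and config flags are invented — showing how the metric, not intuition, decides what survives:

```python
import random

random.seed(7)

def run_benchmark(config):
    # Stand-in for a real benchmark (e.g. parse+render time in ms).
    # Lower is better; each enabled optimization shaves some time off.
    return (100
            - 5 * config.get("inline", 0)
            - 8 * config.get("regex", 0)
            + random.uniform(-1, 1))  # measurement noise

best = {"inline": 0, "regex": 0}
best_score = run_benchmark(best)
kept, discarded = 0, 0

for _ in range(29):  # same experiment budget the article cites
    candidate = dict(best)
    candidate[random.choice(["inline", "regex"])] = random.randint(0, 1)
    score = run_benchmark(candidate)
    if score < best_score:      # the metric decides, not intuition
        best, best_score = candidate, score
        kept += 1
    else:
        discarded += 1

print(kept, discarded, round(best_score, 1))
```

Strip away the toy benchmark and this is the shape of every case in this article: propose, measure, keep or discard. Without a trustworthy number in `run_benchmark`, the loop optimizes noise.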

Ready infrastructure. Neither agent built its own foundation. Shopify already had a mature codebase with test suites and profilers; Anthropic already had advertising platform APIs and connected data systems. The agents operated on top of what existed. If your process lives in shared spreadsheets and forwarded emails, there’s nothing for an agent to operate on.

Leadership willing to experiment. This one is less about technology and more about organizational permission. Tobi ran autoresearch personally — no committee, no quarterly roadmap. Austin had 10 months of operational autonomy because someone at the top signed off. Authorization to experiment came from leadership, not from a six-month approval cycle.

In our experience at IQ Source, the bottleneck is never tool access. Autoresearch is open source. Claude Code is available via subscription. Meta and Google APIs are public. What distinguishes teams that get results is preparation: mapped processes, defined metrics, and documented constraints.

This connects directly to what we analyzed about context engineering: the quality of an agent’s output is proportional to the quality of context it receives. Shopify and Anthropic didn’t use magic tools — they gave existing tools high-quality context.

For B2B companies in Latin America, the question isn’t “can I replicate what Shopify did?” The tools are already available. The right question is: “do I have the preconditions?”

Your First Experiment Doesn’t Have to Be Liquid

Liquid processes billions of requests. Anthropic’s marketing operates with $19B in revenue. The numbers are striking, but you don’t need to operate at that scale to apply the pattern.

What you need is one process with one clear metric. One process. One metric.

Concrete examples for mid-market companies: purchase order accuracy (how many go through without manual correction?), document processing time (from receipt to classification), quote response speed (from request to proposal sent), and inventory shrinkage (the gap between system count and physical count).

All of these produce a number that goes up or down. Today, improving that number depends on human hands to test adjustments, measure results, and decide what to keep. But an agent can run those iterations overnight while the team rests.

We’re not talking about replacing teams. We’re talking about the team arriving the next morning with 20 tested variations instead of having manually tested 2 last week.

Send us the name of one process and the metric you measure it by today. We’ll send back a quick diagnostic: what’s ready, what’s missing, and what the first viable experiment would look like. No meeting — just one email exchange.

Send your process and metric →


Related Articles

LiteLLM Attack: Your AI Trust Chain Just Broke
AI & Automation · 7 min read

LiteLLM, the AI API key proxy with 97 million monthly downloads, was poisoned via PyPI. Your security scanner was the entry point.

Google Stitch + AI Studio: Design-to-Code Without Engineers
AI & Automation · 7 min read

Google shipped a full design-to-production pipeline with Stitch and AI Studio. Where it works for B2B prototypes and where you still need real engineering.
