
The Cheap Model Trap: How AI Providers Capture Ecosystems

Google at $0.25/M tokens, OpenAI at $0.05/M. Not charity — it's platform capture applied to AI. What the pricing war means for your B2B independence.


Ricardo Argüello


CEO & Founder

Business Strategy · 8 min read

Google launched Gemini 3.1 Flash-Lite at $0.25 per million input tokens. OpenAI has GPT-5 Nano at $0.05. And the usual LinkedIn take is: “AI is becoming a commodity, prices are going to zero.”

That’s not what’s happening.

What’s happening is a pattern we’ve already seen in maps, cloud storage, and databases: offer the entry layer nearly free to capture the layer where the real money lives. And if your company is adopting AI at scale, you should understand the mechanics before the hook becomes an invoice.

The numbers that don't add up on their own

Here are the input prices per million tokens for the cheap inference models available in March 2026:

Model                    Input / 1M tokens   Output / 1M tokens   Provider
GPT-5 Nano               $0.05               $0.40                OpenAI
Gemini 2.0 Flash-Lite    $0.075              $0.30                Google
GPT-4.1 Nano             $0.10               $0.40                OpenAI
Gemini 3.1 Flash-Lite    $0.25               $1.50                Google
Claude Haiku 4.5         $1.00               $5.00                Anthropic
Gemini 2.5 Pro           $1.25               $10.00               Google

The strategic question isn’t “which one is cheapest?” — that’s obvious. The question is: if Google and OpenAI are losing money or barely covering inference costs on these tiers, what are they gaining in return?

Look at the bottom of the table. The premium models — that’s where the money is.

It’s not competition. It’s platform construction.

This pattern has a name in business strategy: loss leader pricing. You offer a product below cost to pull customers into your ecosystem, where the profitable products are waiting.

Google did it with Maps. Between 2013 and 2018, the Google Maps API was essentially free. Thousands of startups and companies built it into their products. When the installed base was large enough, Google raised prices by 1,400%. The companies that had built their entire product on that API had nowhere to go — the migration cost exceeded the price increase.

Amazon did the same with AWS. S3 launched with prices that made competition impossible. Once your data, your pipelines, and your engineering team lived in AWS, the cost of leaving was prohibitive. Amazon’s cloud computing margins keep climbing year after year.

Now the same logic applies to AI models. The cheap model attracts volume. That volume needs data pipelines. Those pipelines live on Vertex AI, Azure OpenAI, the provider’s cloud platform. And once the infrastructure is there, demand for premium models — where the real margins are — generates itself.

Nobody’s plotting anything. It’s just a really effective sales funnel.

The three workloads moving the real money

Enterprise AI adoption is going from ~10% to ~50% of organizations. But the growth isn’t coming from complex reasoning tasks — it’s coming from three high-volume workload categories that fit cheap models perfectly.

Moderation and classification

Every company with a digital channel needs to classify content: support tickets, comments, forms, requests. It’s the perfect workload for a cheap model — high volume, low complexity, minimal latency tolerance.

If your company processes 300,000 classifications per month and each one consumes ~500 input tokens and ~100 output tokens:

  • With Claude Haiku 4.5: ~$150/mo input + ~$150/mo output = $300/mo
  • With GPT-5 Nano: ~$7.50/mo input + ~$12/mo output = $19.50/mo

The delta is ~$280/mo, or $3,360/year. Multiply by five similar workloads, and that’s $16,800/year. The difference isn’t trivial — and it’s exactly what makes the switch attractive.
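That math is easy to sanity-check in a few lines of Python (the volumes and prices are the ones assumed above):

```python
# Worked example: monthly cost of 300,000 classifications at
# ~500 input and ~100 output tokens each, using the table's prices.
def monthly_cost(calls, in_tokens, out_tokens, in_price, out_price):
    """in_price / out_price are USD per million tokens."""
    input_cost = calls * in_tokens / 1_000_000 * in_price
    output_cost = calls * out_tokens / 1_000_000 * out_price
    return input_cost + output_cost

haiku = monthly_cost(300_000, 500, 100, 1.00, 5.00)   # $300.00/mo
nano = monthly_cost(300_000, 500, 100, 0.05, 0.40)    # $19.50/mo
print(f"Delta: ${haiku - nano:.2f}/mo, ${(haiku - nano) * 12:.0f}/yr")
```

Swap in your own call volumes and token counts; the shape of the result rarely changes, only the magnitude.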

But that switch pulls you into the ecosystem. Your team configures pipelines in OpenAI’s console. Logs go to their dashboard. Fine-tuning uses their format. And six months later, when OpenAI raises prices on the premium tier you started using for complex tasks, the cost of switching is higher than the increase.

Translation and catalogs

Google has an unfair advantage in translation workloads, and it has nothing to do with model quality. Google Translate already has the commercial relationship with many companies doing translation at scale. Gemini Flash-Lite positions itself as the natural upgrade — “you already use our translation, now use our model for everything else.”

For a B2B company with a catalog of 50,000 products in three languages, the difference between $0.25/M and $1.00/M input tokens can mean $15,000-$20,000 per year. That’s enough to justify the migration. And once you migrate, the catalog, custom terminologies, and approval workflows live on Google’s platform.

Intent routing

What matters in intent routing isn’t which model you pick — it’s where the decision logic lives. This workload (classifying what the user wants before acting) is ideal for cheap models: short responses, critical latency, high volume. But the architecture defines your level of dependency:

Aspect                  Provider-coupled routing            Decoupled routing
Where the logic lives   Provider console (Vertex, Azure)    Your code / your abstraction layer
Switching models        Reconfigure pipelines + retrain     Change a config variable
Observability           Provider dashboard                  Your monitoring system
Vendor lock-in          High                                Low

The second column takes more upfront work. But your stack stays under your control.
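A minimal sketch of the decoupled column, with model choice as a config entry behind your own interface. The provider names, model IDs, and the `complete` helper are illustrative placeholders, not any real SDK:

```python
# Decoupled routing sketch: the workload-to-model mapping lives in your
# code, so switching providers is a config change, not a rewrite.
MODEL_CONFIG = {
    "intent_routing": {"provider": "openai", "model": "gpt-5-nano"},
    "complex_reasoning": {"provider": "google", "model": "gemini-2.5-pro"},
}

def complete(provider: str, model: str, prompt: str) -> str:
    """Thin adapter: the ONE place that knows how to call each SDK."""
    # Each provider branch would wrap its real SDK call; stubbed here.
    raise NotImplementedError(f"wire up {provider}/{model}")

def route(workload: str, prompt: str) -> str:
    cfg = MODEL_CONFIG[workload]  # your config, your control
    return complete(cfg["provider"], cfg["model"], prompt)
```

Changing the model for a workload is now an edit to `MODEL_CONFIG`, observable and reviewable in your own repo.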

Cascade architecture: the technical piece

We covered the cost math of tiered models in our article on AI economics in 2026. One piece we didn't address there: where the routing logic lives matters more than which model you pick.

Cascade architecture works like this: 90% of requests go to the cheap model (Flash-Lite, GPT-5 Nano) and only the 10% requiring complex reasoning scales up to the premium model (Gemini Pro, GPT-5, Claude Sonnet).

The arithmetic is convincing. Assume a mix of 100,000 requests at ~1,000 input tokens each:

  • No cascade (everything to the premium model at $3/M input tokens): cost ~$300
  • With cascade (90% at $0.10/M + 10% at $3/M): cost ~$39

That’s an ~87% cost reduction. The engineering is solid.

But there’s a detail omitted from nearly every provider presentation: where the routing logic lives.

If the cascade logic is in the Vertex AI console or OpenAI’s playground, the provider controls what scales and what doesn’t. If the logic is in your code — a function that evaluates complexity and decides which endpoint to call — you’re in control.

Technically, it’s maybe 50 lines of code. Strategically, it’s the difference between choosing your next provider and having it chosen for you.
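For illustration, a stripped-down version of that function: a crude complexity heuristic in your own codebase deciding which endpoint gets the call. The heuristic, threshold, and model names are assumptions for the sketch, not a recommendation:

```python
# Cascade routing sketch: the cheap-vs-premium decision lives in your
# repo, not in a provider console.
CHEAP_MODEL = "gpt-5-nano"        # the ~$0.10/M-class tier
PREMIUM_MODEL = "gemini-2.5-pro"  # the ~$3/M-class tier

def estimate_complexity(prompt: str) -> float:
    """Crude heuristic: long prompts and reasoning keywords score higher."""
    score = min(len(prompt) / 4000, 1.0)
    if any(kw in prompt.lower() for kw in ("explain why", "compare", "plan")):
        score += 0.5
    return min(score, 1.0)

def pick_model(prompt: str, threshold: float = 0.6) -> str:
    """Escalate to the premium tier only above the complexity threshold."""
    return PREMIUM_MODEL if estimate_complexity(prompt) >= threshold else CHEAP_MODEL

print(pick_model("Classify this ticket: 'password reset'"))  # gpt-5-nano
```

In production you would replace the keyword heuristic with something better calibrated, but the strategic point survives any heuristic: because `pick_model` is yours, swapping either constant is a one-line change.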

The counterargument worth making

“Multi-provider” sounds great in an architecture presentation. In reality, it’s expensive and complex.

Maintaining active integrations with three AI providers means three SDKs, three response formats, three pricing models, three sets of rate limits. For a mid-market company, the operational overhead can exceed the savings.

The goal isn’t to use every model on the market. The goal is being able to switch if you need to. There’s a difference between having the door open and walking through it every day.

Three signals that lock-in is already a real risk for your company — beyond the specific provider you use, as we detail in our vendor selection guide:

  1. Your team says “the Vertex pipeline” or “the Azure endpoint” without thinking about it. If the platform is already part of the operational vocabulary, migration isn’t just technical — it’s cultural.
  2. Switching the cheap model in your application would take more than one sprint. If the answer is “we’d have to touch twelve services,” the coupling is already there.
  3. You’ve never tested the same workload on another provider. You don’t know how much it would cost, how long it would take, or whether quality would be acceptable. You’re assuming switching cost is high without measuring it.

What to do if you’re already using cheap models at scale

Four concrete actions you can execute this week:

1. Search for provider SDK dependencies in your code. A grep -r "openai\|vertexai\|google.generativeai" src/ tells you how many files depend directly on the provider. If it’s more than three, you need an abstraction layer. An MCP server can work as that intermediate abstraction layer between your application and the models.

2. Measure your real “switching delta.” Take your highest-volume workload. Run it on the most viable alternative provider. Measure cost, latency, and quality. If you’ve never done this, you don’t have data to negotiate or decide — you just have inertia.

3. Get the pricing agreement in writing. If your current price is “negotiated” with the provider, ask: what happens if I double the volume? What if I cut it in half? How long does this price last? Discounts without contracts are retention disguised as generosity.

4. Name your fallback for each workload. For each workload in production, your team should be able to answer: “if the current provider doubles prices or goes down tomorrow, we use X.” If the answer is silence, that’s your priority.

What we tell our clients

Here’s the rule of thumb we use with our clients:

  • Under 50,000 calls/day: Use whichever cheap model you prefer. Lock-in isn’t your problem yet — the cost of abstracting is higher than the risk. Focus on making the product work.

  • 50,000 to 500,000 calls/day: Implement the adapter pattern now. A common interface between your business logic and the provider SDK. You don’t need active multi-provider — you need the switch to be possible in days, not months.

  • Over 500,000 calls/day: At this volume, you’re spending real money. The difference between $0.05 and $0.25 per million tokens is tens of thousands of dollars a year — and your provider’s sales team knows your numbers better than you think. This is where those “aggressive discounts” come with 12-24 month commitments attached.
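The adapter pattern from the middle tier can be as small as one interface your business logic is allowed to depend on. A sketch, using a stand-in adapter where a real one would wrap a provider SDK:

```python
from typing import Protocol

class ChatAdapter(Protocol):
    """The only completion interface business code may import."""
    def complete(self, prompt: str) -> str: ...

class EchoAdapter:
    """Stand-in for tests; a real adapter wraps a provider SDK call."""
    def complete(self, prompt: str) -> str:
        return prompt.upper()

def classify_ticket(adapter: ChatAdapter, ticket: str) -> str:
    # Business logic sees only the adapter, never a provider SDK.
    return adapter.complete(f"Classify: {ticket}")

print(classify_ticket(EchoAdapter(), "refund request"))  # CLASSIFY: REFUND REQUEST
```

With this seam in place, "switch the cheap model" means writing one new adapter class, which is the days-not-months property the middle tier needs.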

If you’re in the second or third category, reach out. A 45-minute conversation is usually enough to map your actual exposure — and figure out whether you need to act now or just keep it on your radar.


