The Hidden Cost Lever in Enterprise AI: Timing
Ricardo Argüello — March 15, 2026
CEO & Founder
General summary
Most enterprises focus on which AI model to use, but ignore when and how they call it. Batch APIs offer 50% discounts, prompt caching cuts input costs 90%, and these savings stack. Three real workload scenarios show 40-70% cost reductions using published Anthropic and OpenAI pricing — without changing models or sacrificing output quality.
- Anthropic's off-peak GPU promotion signals utility-style pricing for AI — the same demand shaping that transformed electricity and cloud compute
- Batch APIs offer 50% off with a 24-hour processing window, and most enterprise workloads (report generation, contract analysis, data enrichment) qualify
- Prompt caching reduces input token costs by 90% on repeated prefixes — system prompts, templates, and standard instructions become nearly free
- Batch and caching discounts stack: a cached batch request pays half of the already-reduced cache-read price, cutting the worked contract analysis pipeline's monthly bill by 59%
- A five-step afternoon audit — pull logs, classify by latency, identify cache candidates, check token ratios, map batch endpoints — reveals where 40-70% savings hide
Enterprise AI costs follow patterns that most teams never analyze. Just like electricity providers charge more during peak hours and less at night, AI providers are starting to price by demand. Anthropic already offers 50% batch discounts and 90% prompt caching savings — and those stack. This post walks through three real cost scenarios using published API pricing and shows how to audit your own AI spend in an afternoon.
In February, Claude’s official account posted a promotion: double your usage if you use it outside peak hours. A product analyst named Aakash Gupta broke down the math: Anthropic is spending roughly $7B on inference infrastructure, and their GPUs sit idle about 75% of the week.
That’s not a marketing gimmick. That’s demand shaping — the same mechanism your electric company uses when it charges less for running the dryer at midnight. And it signals something most enterprise AI teams haven’t internalized yet: when you call a model matters as much as which model you call.
In our enterprise AI economics analysis, we covered which models make financial sense and when to build vs. buy. That post answered, “Can we afford AI?” This one answers a different question: “How do we spend less on AI we’re already running?”
What Anthropic’s Off-Peak Promotion Actually Tells Us
The consumer-facing promotion (more usage outside peak hours) is the tip. The enterprise signal underneath is bigger.
AI inference infrastructure follows the same economics as any capacity-constrained utility. GPUs are expensive. They depreciate whether they’re computing or idle. Providers need to flatten the demand curve to improve unit economics — exactly what electricity companies figured out decades ago.
Anthropic’s move mirrors an evolution we’ve already seen in cloud computing. AWS launched spot instances in 2009 as a way to sell spare EC2 capacity at a discount. That experiment became a pricing layer that massive enterprise workloads now run on. Reserved instances followed. Savings plans followed those.
AI providers are walking the same path. Today it’s an off-peak consumer promotion. Tomorrow it’s tiered enterprise pricing by time-of-day, committed usage discounts, and capacity reservations. The enterprises that structure their workloads now will be positioned to capture those savings as they arrive.
But you don’t have to wait for future pricing tiers. The levers already exist.
Five Operational Levers Most Enterprises Don’t Know Exist
Every week at IQ Source we review enterprise AI architectures where the team spent months selecting the right model but zero time optimizing how they call it. The API bill arrives, someone panics, and the first instinct is to downgrade to a cheaper model. That’s the wrong move. Before changing what you call, change how you call it.
Batch APIs: the 50% discount hiding in plain sight
Both Anthropic and OpenAI offer batch processing endpoints with a straightforward deal: accept a 24-hour processing window instead of real-time responses, and pay half price on every token.
The question to ask about each workload: does the user wait for this result, or does it show up in a dashboard, report, or inbox later? If it’s the latter, it’s a batch candidate.
Report generation, contract analysis, data enrichment pipelines, content moderation queues, nightly summarizations, email drafts for morning review — these workloads don’t need sub-second latency. They need results by tomorrow morning. That’s exactly what batch APIs deliver, at 50% off.
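Qualifying a workload is the conceptual step; the mechanical one is small. Here is a minimal sketch of building a Message Batches payload in the `custom_id`/`params` shape Anthropic documents. The model id and helper name are illustrative, and the submission call is left commented since it requires the `anthropic` SDK and an API key — check the current SDK docs before relying on exact field names.

```python
# Sketch: turning a nightly document queue into a single batch submission.
# Each entry in the batch is billed at half the real-time token price.

def build_batch_requests(documents, system_prompt, model="claude-sonnet-4-5"):
    """Build one batch entry per document (model id is illustrative)."""
    return [
        {
            "custom_id": f"doc-{i}",  # your key for matching results later
            "params": {
                "model": model,
                "max_tokens": 1024,
                "system": system_prompt,
                "messages": [{"role": "user", "content": doc}],
            },
        }
        for i, doc in enumerate(documents)
    ]

requests = build_batch_requests(
    ["Contract text A...", "Contract text B..."],
    "Extract parties, dates, and obligations as JSON.",
)
# Submission (requires an API key):
# client.messages.batches.create(requests=requests)
print(len(requests), requests[0]["custom_id"])
```

Results come back asynchronously, keyed by `custom_id`, which is why the prompt format itself doesn't change when a workload moves from the real-time endpoint to batch.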
Prompt caching: 90% off your most repeated instructions
Every API call to a language model includes a system prompt — the instructions that tell the model how to behave. For enterprise applications, that system prompt is often the same across thousands of calls: the same template, the same few-shot examples, the same formatting rules.
Prompt caching stores that repeated prefix so subsequent calls pay only 10% of the base input price. For Claude Sonnet 4.6, that drops cached input tokens from $3.00/MTok to $0.30/MTok.
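Opting in is a one-field change on the Anthropic API: the system prompt becomes a content block carrying a `cache_control` marker on the shared prefix. A minimal sketch, with field names per Anthropic's docs at the time of writing (verify against current docs; the prompt text is a placeholder):

```python
# Sketch: marking a shared system prompt as cacheable so repeat calls
# pay the cache-read rate instead of the full input rate.

SYSTEM_PROMPT = "You are a contract analyst. Extract parties, dates, obligations."

def build_cached_request(document, model="claude-sonnet-4-5"):
    return {
        "model": model,  # model id is illustrative
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [{"role": "user", "content": document}],
    }

req = build_cached_request("Full contract text here...")
print(req["system"][0]["cache_control"]["type"])
```

The cached prefix must be byte-identical across calls, which is exactly why static system prompts and templates are the natural candidates.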
If your application sends a 2,000-token system prompt with every request, and you make 10,000 requests/day, that’s 20M tokens/day in system prompts alone. Without caching: $60/day. With caching: $6/day. Same model, same output quality, same system prompt — $1,620/month saved on just that one component.
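The arithmetic above fits in a few lines, using the post's prices of $3.00/MTok standard input and $0.30/MTok for cache reads:

```python
# Reproducing the caching math: 2,000-token system prompt, 10,000 calls/day.
BASE_INPUT = 3.00    # $/MTok, standard input price used in this post
CACHED_INPUT = 0.30  # $/MTok, 10% of base on cache reads

def daily_prompt_cost(prompt_tokens, calls_per_day, price_per_mtok):
    return prompt_tokens * calls_per_day / 1_000_000 * price_per_mtok

uncached = daily_prompt_cost(2_000, 10_000, BASE_INPUT)
cached = daily_prompt_cost(2_000, 10_000, CACHED_INPUT)
monthly_savings = (uncached - cached) * 30
print(uncached, cached, monthly_savings)
```

Swap in your own prompt length and call volume to see whether a given prefix clears the threshold where caching is worth wiring up.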
Off-peak scheduling: positioning for the pricing curve
The off-peak promotion for consumers is a preview of where enterprise pricing is heading. Even without formal off-peak enterprise tiers today, structuring your workloads to run during low-demand windows reduces queue times and positions you for time-based pricing when it arrives.
For workloads that already use batch APIs (24-hour window), this happens naturally. For near-real-time workloads that can tolerate a few hours of delay — think overnight report compilation, early-morning data enrichment, weekend batch runs — scheduling them outside business hours in US time zones is a practical hedge.
Request consolidation: fewer calls, better cache performance
Ten separate API calls with the same system prompt don’t cache as efficiently as one consolidated call processing ten items. Each call has overhead — network latency, token parsing, cache lookup. Consolidating where possible reduces per-unit cost and improves cache hit rates.
This doesn’t mean cramming everything into a single enormous prompt. It means looking at your request patterns: if you’re calling the API once per row in a spreadsheet, you can probably batch 20-50 rows per call. If you’re generating individual email drafts one at a time, you can generate a batch and distribute.
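The mechanical part of consolidation is just chunking — a sketch, with the chunk size of 25 chosen arbitrarily for illustration:

```python
# Sketch: consolidating per-row API calls into one call per chunk of rows.
def chunk(items, size):
    """Split items into lists of at most `size`, one API call each."""
    return [items[i:i + size] for i in range(0, len(items), size)]

rows = [f"row-{n}" for n in range(103)]
batches = chunk(rows, 25)
# 103 rows become 5 calls instead of 103; each call formats its chunk
# into a single prompt and shares one system-prompt cache lookup.
print(len(batches), len(batches[0]), len(batches[-1]))
```

The judgment call is the chunk size: large enough to amortize the per-call overhead, small enough that the model handles every item reliably and a single failed call doesn't invalidate too much work.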
Output optimization: paying for tokens you don’t read
This is the lever most teams overlook entirely. Output tokens cost 3-5x more than input tokens across every major provider. Claude Sonnet 4.6 charges $3/MTok for input but $15/MTok for output — a 5x multiplier.
Three quick wins:
- Structured JSON output instead of verbose prose. If the downstream system parses the response programmatically, you don’t need the model to write paragraphs. Specify `response_format: json` and define the schema.
- Set `max_tokens` intentionally. If your classification task needs a one-word answer, don’t leave the default at 4,096 tokens. You won’t pay for unused tokens, but an unconstrained model sometimes produces longer outputs than necessary.
- Shorter system prompts. Rewrite instructions for density. “You are a helpful assistant that always responds in JSON format with the following fields…” can usually be compressed by 40% without losing behavior. Fewer instruction tokens = lower cost per call, especially before caching kicks in.
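The first two wins are request parameters, not prompt engineering. A sketch of a cost-constrained classification request — note that `response_format` is the OpenAI-style field (Anthropic achieves structured output differently, e.g. via tool schemas), so treat these field names as illustrative:

```python
# Sketch: capping output spend per call for a one-word classification task.
def classification_params(text):
    return {
        "messages": [{"role": "user", "content": text}],
        "max_tokens": 5,  # a label, not the 4,096-token default
        "response_format": {"type": "json_object"},  # parseable, not prose
    }

p = classification_params("Ticket: my invoice total is wrong")
print(p["max_tokens"], p["response_format"]["type"])
```

With output priced at 5x input, a deliberate `max_tokens` is the cheapest guardrail available: it bounds the worst case of a model that decides to explain itself.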
The Math: Before and After for Three Common Workloads
Theory is easy. Let’s run numbers using published Anthropic pricing for Claude Sonnet 4.6 (the model most enterprises use for production workloads).
Contract analysis pipeline: 500 contracts/month
Each contract averages 8,000 input tokens (document text) plus a 2,000-token system prompt (extraction template). Output averages 1,500 tokens (structured JSON with key clauses, dates, parties, obligations).
Before optimization (standard API):
| Component | Tokens/call | Calls/month | Monthly tokens | Cost/MTok | Monthly cost |
|---|---|---|---|---|---|
| Input (document + system prompt) | 10,000 | 500 | 5M | $3.00 | $15.00 |
| Output (extracted data) | 1,500 | 500 | 750K | $15.00 | $11.25 |
| Total | | | | | $26.25 |
After optimization (batch + caching + output tuning):
The 2,000-token system prompt is identical across all 500 calls — a prime caching candidate. The extraction doesn’t need real-time results — batch candidate. The output is already structured JSON, but tightening the schema eliminates ~20% of output tokens.
| Component | Tokens/call | Calls/month | Monthly tokens | Cost/MTok | Monthly cost |
|---|---|---|---|---|---|
| Cached input (system prompt, batch) | 2,000 | 500 | 1M | $0.15 | $0.15 |
| Non-cached input (document, batch) | 8,000 | 500 | 4M | $1.50 | $6.00 |
| Output (tighter schema, batch) | 1,200 | 500 | 600K | $7.50 | $4.50 |
| Total | | | | | $10.65 |
Savings: $15.60/month per pipeline — a 59% reduction. For a legal department processing 5,000 contracts/month across multiple templates, multiply that by 10 and the numbers start to matter.
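The tables above reduce to one formula: token volume in MTok times price per MTok, summed over components. A small cost model reproduces both totals:

```python
# Reproducing the contract-pipeline tables from this section.
def monthly_cost(components):
    """components: list of (tokens_per_call, calls_per_month, price_per_mtok)."""
    return sum(t * c / 1_000_000 * p for t, c, p in components)

before = monthly_cost([
    (10_000, 500, 3.00),   # input (document + system prompt), standard
    (1_500, 500, 15.00),   # output, standard
])
after = monthly_cost([
    (2_000, 500, 0.15),    # system prompt: cached + batched
    (8_000, 500, 1.50),    # document: batched
    (1_200, 500, 7.50),    # tightened output: batched
])
print(round(before, 2), round(after, 2), round(1 - after / before, 2))
```

The same three-tuple structure models the other two scenarios; only the volumes and which discounts apply change.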
Customer support triage: 10,000 tickets/month
Each ticket: 500 tokens of customer text, 1,500-token system prompt (triage rules, category definitions, priority matrix), 200-token output (category, priority, routing, summary).
Before optimization:
| Component | Tokens/call | Calls/month | Monthly tokens | Cost/MTok | Monthly cost |
|---|---|---|---|---|---|
| Input | 2,000 | 10,000 | 20M | $3.00 | $60.00 |
| Output | 200 | 10,000 | 2M | $15.00 | $30.00 |
| Total | | | | | $90.00 |
This workload has a critical characteristic: 75% of input tokens are the same system prompt repeated 10,000 times. And triage doesn’t need hard real-time — batch jobs allow up to 24 hours but typically finish much sooner, and a short delay is invisible in a support queue. (If your SLA genuinely requires minutes, verify batch turnaround before migrating.)
After optimization (batch + caching):
| Component | Tokens/call | Calls/month | Monthly tokens | Cost/MTok | Monthly cost |
|---|---|---|---|---|---|
| Cached input (system prompt, batch) | 1,500 | 10,000 | 15M | $0.15 | $2.25 |
| Non-cached input (ticket text, batch) | 500 | 10,000 | 5M | $1.50 | $7.50 |
| Output (batch) | 200 | 10,000 | 2M | $7.50 | $15.00 |
| Total | | | | | $24.75 |
That’s $65.25/month back — 72% less than before. System prompt caching does most of the heavy lifting here: 15M tokens/month of instructions that were being re-sent and re-billed on every single call.
Daily report generation: 200 reports/day
Each report: 3,000 tokens of data input, 2,500-token system prompt (report template, formatting rules, section structure), 2,000-token output (formatted report text).
Before optimization:
Monthly volume: 200 × 22 business days = 4,400 reports.
| Component | Tokens/call | Calls/month | Monthly tokens | Cost/MTok | Monthly cost |
|---|---|---|---|---|---|
| Input | 5,500 | 4,400 | 24.2M | $3.00 | $72.60 |
| Output | 2,000 | 4,400 | 8.8M | $15.00 | $132.00 |
| Total | | | | | $204.60 |
Reports don’t need real-time generation. They need to be ready by 8 AM. The system prompt is identical across all reports. And the output — currently prose paragraphs — could be structured data that a frontend renders, cutting output tokens by ~35%.
After optimization:
| Component | Tokens/call | Calls/month | Monthly tokens | Cost/MTok | Monthly cost |
|---|---|---|---|---|---|
| Cached input (template, batch) | 2,500 | 4,400 | 11M | $0.15 | $1.65 |
| Non-cached input (data, batch) | 3,000 | 4,400 | 13.2M | $1.50 | $19.80 |
| Output (structured, batch) | 1,300 | 4,400 | 5.72M | $7.50 | $42.90 |
| Total | | | | | $64.35 |
Result: $140.25/month in savings, a 69% drop. Output optimization is the biggest contributor here — switching from prose to structured JSON that the frontend formats saves more than the batch discount alone on output tokens.
The pattern across all three
The savings range from 59% to 72%. Not because of a round number in a slide deck — because the math consistently reveals the same thing: most enterprise workloads share common traits. System prompts repeat across thousands of calls, which makes them prime for caching. Results don’t need to be instant, so batch pricing applies. And output tokens are often inflated with prose that could be structured JSON. Stack those three optimizations and the reductions compound.
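The stacking rule itself is worth writing down, because it explains why the compounding works: cache reads cost 10% of the base input price, and batch halves whatever price applies after that.

```python
# The input-price stacking rule used throughout this post.
def effective_input_price(base, cached=False, batched=False):
    price = base * (0.10 if cached else 1.0)   # cache read: 10% of base
    return price * (0.5 if batched else 1.0)   # batch: half of whatever applies

base = 3.00  # $/MTok, the standard input price used in this post
for cached in (False, True):
    for batched in (False, True):
        print(cached, batched,
              round(effective_input_price(base, cached, batched), 2))
```

At $0.15/MTok for cached batched input, the document text (uncacheable, since it differs per call) becomes the dominant input cost, which is why the worked examples' remaining spend is mostly documents and output.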
How to Audit Your Current AI Spend in One Afternoon
You don’t need a consultant for step one. You need your API dashboard and about four hours.
Start with your API usage logs. Every major provider (Anthropic, OpenAI, Google) has a usage dashboard showing requests by model, token counts (input vs. output), and spend by time period. Export the last 30 days. If you’re using multiple models, separate the data by model.
Next, classify each workload by latency sensitivity. Go through your API integrations and tag each one:
- 🔴 Real-time — user is waiting for the response (chatbot, live assistant, autocomplete)
- 🟡 Near-real-time — result needed within minutes (support triage, alert processing)
- 🟢 Async — result needed within hours or by next morning (reports, analysis, enrichment, overnight agent processing)
In our experience, most teams discover that 60-70% of their API calls fall into the 🟡 or 🟢 categories. Those are your batch candidates.
Now search your codebase for repeated prompt prefixes. How many distinct system prompts do you have? How many API calls share the same one? Any prompt sent more than 100 times/day is a high-value caching candidate. This directly connects to agent orchestration patterns — batch APIs work well for the overnight processing steps in agent workflows.
Then look at your input-to-output token ratio for each workload. If you’re sending 5,000 tokens in and getting 200 back (classification tasks), your cost is input-dominated — caching is your biggest lever. If you’re sending 2,000 in and getting 3,000 back (content generation), output optimization matters more.
Finally, map your 🟡 and 🟢 workloads to batch endpoints. Check whether your provider offers a batch API for that model. Both Anthropic and OpenAI do. The migration is usually straightforward — same prompt format, different endpoint, results returned asynchronously.
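Steps two through four of the audit can be scripted against your exported log. A sketch over a toy export — the field names (`workload`, `input_tokens`, `output_tokens`) and the latency tags are hypothetical; adapt them to whatever your provider's dashboard actually exports:

```python
# Audit sketch: batch-candidate share and input-to-output ratio per workload.
LATENCY = {"chatbot": "real-time", "triage": "near-real-time", "reports": "async"}

log = [
    {"workload": "chatbot", "input_tokens": 1_200, "output_tokens": 400},
    {"workload": "triage",  "input_tokens": 2_000, "output_tokens": 200},
    {"workload": "reports", "input_tokens": 5_500, "output_tokens": 2_000},
    {"workload": "triage",  "input_tokens": 2_000, "output_tokens": 180},
]

# Near-real-time and async calls are the batch candidates.
batch_candidates = [r for r in log
                    if LATENCY[r["workload"]] in ("near-real-time", "async")]
share = len(batch_candidates) / len(log)

# Input-heavy workloads point to caching; output-heavy ones to output tuning.
ratios = {w: sum(r["input_tokens"] for r in log if r["workload"] == w)
           / sum(r["output_tokens"] for r in log if r["workload"] == w)
          for w in {r["workload"] for r in log}}
print(share, ratios)
```

On real data the same two numbers tell you where to start: the batch-candidate share sizes the 50% lever, and the per-workload ratio tells you whether caching or output optimization is the bigger remaining lever.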
What This Means for Your AI Architecture
Choosing the right model is the decision that gets all the attention. In our analysis of tiered model architectures, we showed how routing 90% of requests to efficient models and 10% to premium ones cuts costs dramatically. That’s the “what” lever.
The levers in this post — batch APIs, prompt caching, off-peak scheduling, output optimization — are the “when and how” levers. They stack on top of model selection. A tiered architecture using batch APIs and prompt caching compounds the savings from both approaches.
The enterprises that treat AI inference as a utility cost — analyzing demand curves, optimizing scheduling, caching repeated operations — will spend 40-70% less than those that treat every API call the same way. That gap widens as usage scales.
At IQ Source we run AI cost audits. The process is direct: you share your provider dashboard and API usage logs, we identify which workloads should move to batch endpoints, which prompts are caching candidates, where output tokens are inflated, and how your request patterns could be consolidated. We’ve done this for teams spending $5K/month on API calls and for teams spending $50K — the optimization patterns are the same, the absolute savings just have more zeros.
If your API bill has been climbing and the response has been “switch to a cheaper model” — that’s the wrong conversation. The right one starts with your usage data.
Share your API usage numbers with us — we’ll show you what changes →

Frequently Asked Questions

How much do batch APIs save, and what’s the trade-off?
Batch APIs from Anthropic and OpenAI offer a 50% discount on both input and output tokens. The trade-off is a 24-hour processing window instead of real-time responses. Most enterprise workloads — report generation, contract analysis, data enrichment, overnight agent steps — don't need real-time results, making them immediate batch candidates.

How does prompt caching reduce input costs?
Prompt caching stores repeated prompt prefixes (system prompts, templates, few-shot examples) so subsequent calls pay only 10% of the base input token price — a 90% reduction. For enterprise workloads with standardized templates, this turns the largest cost component (input tokens for instructions) into a near-negligible line item.

Do batch and prompt caching discounts stack?
Yes. Anthropic's documentation confirms batch and caching discounts stack. A cached batch request pays 50% of the already-reduced cache read price. For Claude Sonnet 4.6, that means input tokens drop from $3.00/MTok (standard) to $0.15/MTok (cached + batched) — a 95% reduction on cached input portions.

How do I audit my current AI spend?
Five steps in one afternoon: pull API usage logs from your provider dashboard, classify each workload by latency sensitivity (real-time vs. can-wait), identify repeated prompt prefixes as caching candidates, calculate your input-to-output token ratio per workload, and map non-urgent workloads to batch API endpoints. Most enterprises find 40-70% savings hiding in workloads already running.