Voice Agent Cost Optimization: Real Numbers from Production

Pradeep Jindal · March 24, 2026 · 7 min read

Part of the series: What It Takes to Build Production-Grade Voice AI Agents

TL;DR: A well-architected voice agent costs $0.015-0.025 per call – 50-100x cheaper than a human agent doing the same work. Model tier selection is the single biggest lever (up to 17x impact), prompt caching is essentially free savings (up to 50% on input tokens), and STT accuracy matters more than per-minute price. We tracked every token, audio second, and character across thousands of production calls to arrive at these numbers.

"What does it actually cost per call?"

This is the first question in every conversation we have about production voice agents. And the answer surprises people – not because it's high, but because it's low, and because real production numbers are nearly impossible to find online.

Most voice AI content either hand-waves with "it depends" or quotes theoretical pricing without accounting for how real conversations consume resources. We've run thousands of production calls across multiple deployments – outbound campaigns, inbound support, structured data collection – and tracked every token, audio second, and character.

Compare that to what you're replacing. A human agent making the same two-minute call costs $0.50-2.00 when you factor in salary, training, management overhead, and the time spent on calls that go nowhere – wrong numbers, voicemails, immediate hang-ups. The voice agent handles those zero-value calls at $0.003 each. And the cost structure scales linearly: your 10,000th call costs the same as your first.

Here's what the numbers actually look like.

The Real Cost Breakdown

Across our production campaigns, we observed three broad call profiles:

Call Profile                      Duration       Total Cost
Short (quick screen-out)          ~30 seconds    ~$0.003
Medium (standard conversation)    ~2 minutes     ~$0.016
Long (detailed conversation)      ~5 minutes     ~$0.046

The majority fall in the medium range. Campaign average: $0.015-0.025.

For a typical two-minute call, the per-component breakdown:

Component                      Cost Range
LLM (reasoning)                ~$0.005-0.008
STT (transcription)            ~$0.004-0.006
TTS (speech synthesis)         ~$0.004-0.006
Recording (platform-level)     Included in infra
Total                          ~$0.016-0.020

The cost splits roughly evenly across three AI components. That evenness matters – no single component dominates, so you need a strategy for each.

Here's a real cost log from a production call:

Cost Summary:
  LLM Input:  $0.006 (39K tokens)
  LLM Output: $0.000 (207 tokens)
  LLM Cached: $0.002 (26K tokens)
  STT:        $0.005 (66s audio)
  TTS:        $0.005 (544 chars, 28s audio)
  TOTAL:      $0.018

Two things jump out. First, output tokens are nearly free – it's input tokens that drive LLM cost. Second, over 26,000 tokens were served from cache, which we'll get to shortly.
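To make the arithmetic concrete, here is a minimal sketch of that per-call LLM calculation. The rates are illustrative mini-tier figures chosen to roughly match the log above, not the actual provider pricing used in production:

```python
# Assumed illustrative mini-tier rates (your provider's pricing will differ):
RATE_INPUT = 0.15 / 1_000_000    # $ per uncached input token
RATE_CACHED = 0.075 / 1_000_000  # $ per cached input token (~50% discount)
RATE_OUTPUT = 0.60 / 1_000_000   # $ per output token

def llm_call_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Total LLM cost for one call, in dollars."""
    return (input_tokens * RATE_INPUT
            + cached_tokens * RATE_CACHED
            + output_tokens * RATE_OUTPUT)

# Matches the shape of the log above: input tokens dominate,
# output tokens are nearly free.
cost = llm_call_cost(input_tokens=39_000, cached_tokens=26_000, output_tokens=207)
print(f"${cost:.3f}")  # ≈ $0.008
```

Plugging in the log's numbers, the 207 output tokens contribute barely a ten-thousandth of a dollar, while the 65K input tokens (cached plus uncached) account for essentially all of the LLM cost.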

The Model Tier Lever: 17x and Counting

This is the only lever that can change your cost by an order of magnitude.

The pricing spread across LLM tiers is enormous. Mini-tier models run at a fraction of the cost of flagship models – we're talking a 10-17x difference. And for voice agents handling structured conversations – where interaction follows a defined flow with known question types – mini-tier models perform remarkably well.

We match model to task across our deployments:

  • Structured conversation agents (defined flow, predictable questions): Mini-tier model at moderate temperature. Warm and conversational, not deeply analytical.
  • Persuasion-heavy agents (objection handling, nuanced responses): A slightly more capable mini-tier variant at lower temperature. The 2-3x cost increase buys meaningfully better objection handling.
  • Post-call analysis (offline summarization): Mini-tier model. Runs after the call, latency irrelevant, task well-structured.

We've never needed a flagship model for any voice agent task. The 10-17x premium buys reasoning capabilities that structured voice conversations simply don't require. The lesson isn't "always use the cheapest model" – it's "match model capability to task complexity."
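The matching described above can be expressed as a simple lookup table. Model names and temperatures here are illustrative placeholders, not the deployment's actual configuration:

```python
# Sketch of matching model tier to task type. All names and
# temperatures are illustrative assumptions.
MODEL_CONFIG = {
    # Defined flow, predictable questions: cheapest tier, warmer output.
    "structured_conversation": {"model": "mini", "temperature": 0.7},
    # Objection handling: slightly stronger mini variant, tighter output.
    "persuasion": {"model": "mini-plus", "temperature": 0.3},
    # Offline summarization: latency irrelevant, cheapest tier.
    "post_call_analysis": {"model": "mini", "temperature": 0.2},
}

def pick_model(task_type: str) -> dict:
    """Return the model config for a task, defaulting to the cheapest tier."""
    return MODEL_CONFIG.get(task_type, MODEL_CONFIG["structured_conversation"])
```

Keeping this mapping in one place makes the "match model to task" decision explicit and easy to revisit when pricing or model quality changes.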

Prompt Token Dominance: Where Money Actually Goes

On a typical call, input tokens vastly outnumber output tokens. Our system prompts range from 350 to 1,200+ lines depending on the agent persona. On every conversational turn, the full system prompt plus the accumulated conversation history is sent to the LLM. By mid-call, you're sending tens of thousands of input tokens per turn.

This is where prompt caching becomes significant. Most LLM providers now cache repeated prompt prefixes at a ~50% discount on input token costs. The system prompt is identical across turns, so it caches perfectly.

In a typical call, roughly two-thirds of input tokens are served from cache. That's a substantial cost reduction with zero engineering effort – your framework just needs to pass the system prompt as a stable prefix (which most modern voice agent frameworks do automatically).

This is essentially free optimization. No prompt restructuring, no response caching layer, no infrastructure. Just a lower bill.
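The arithmetic behind "roughly two-thirds cached" is worth spelling out: with a 50% discount on the cached fraction, the effective input spend drops by half of that fraction.

```python
def input_cost_multiplier(cached_fraction: float, discount: float = 0.5) -> float:
    """Effective multiplier on input-token spend when a fraction of
    input tokens is served from cache at the given discount."""
    return (1 - cached_fraction) + cached_fraction * (1 - discount)

# Two-thirds of input tokens cached at a 50% discount cuts
# total input spend by about a third.
m = input_cost_multiplier(2 / 3)
print(f"{1 - m:.0%} saved")  # prints "33% saved"
```

So "50% off cached tokens" translates to roughly a one-third reduction on the overall input bill at a two-thirds cache hit rate, which is the dominant LLM cost component.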

STT: Cheapest Per-Minute Does Not Equal Cheapest Per-Call

When we evaluated STT providers for Hindi, we found that a specialized provider built for Indian languages cost ~25% more per minute than our primary general-purpose provider. But the specialized provider demonstrated meaningfully better accuracy on Hindi/Hinglish speech – particularly with regional accents and domain terminology.

Better accuracy means fewer misunderstandings. Fewer misunderstandings mean fewer clarification loops. Fewer clarification loops mean shorter calls and better data quality.

Optimize for per-call cost, not per-minute rate. A provider that costs 25% more per minute but reduces average call duration by 15% through fewer misunderstandings is the cheaper option in practice.
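A quick model shows why. Assume every component's cost scales with call duration, and STT is roughly a third of the total per-call cost (per the even three-way split shown earlier – an illustrative share, not a measured constant):

```python
def relative_call_cost(stt_rate_premium: float,
                       duration_factor: float,
                       stt_share: float = 1 / 3) -> float:
    """Relative total call cost vs. the baseline provider (baseline = 1.0).

    Assumes all components scale with call duration; STT additionally
    pays the per-minute premium. stt_share is illustrative, based on
    the roughly even three-way cost split described earlier.
    """
    non_stt = (1 - stt_share) * duration_factor
    stt = stt_share * duration_factor * stt_rate_premium
    return non_stt + stt

# 25% pricier STT, 15% shorter calls: ~8% cheaper per call overall.
print(f"{relative_call_cost(1.25, 0.85):.3f}")  # ≈ 0.921
```

The STT line item alone goes up (1.25 × 0.85 ≈ 1.06), but the shorter calls shrink the LLM and TTS components too, so the whole call comes out cheaper.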

TTS: Characters, Not Duration

Our TTS provider charges per character, not per second of generated audio. This means the cost lever is response length, not how fast the voice speaks.

The implication: prompt engineering that produces concise responses is simultaneously a cost optimization and a UX improvement. We tuned prompts for brevity not for cost reasons, but because phone users don't want monologues. The cost savings were a side effect.

Hindi responses are naturally shorter than English equivalents for the same semantic content – a minor but real cost advantage for Hindi-first agents.
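Per-character billing makes the TTS cost model trivially linear in response length. The rate below is an assumed illustrative figure that roughly matches the cost log shown earlier:

```python
# Per-character TTS pricing: cost scales with response length,
# not speaking rate. Rate is an assumed illustrative figure.
TTS_RATE = 9.0 / 1_000_000  # $ per character (assumption)

def tts_cost(chars: int) -> float:
    return chars * TTS_RATE

# A 30% shorter response is 30% cheaper to synthesize.
full = tts_cost(544)               # ≈ $0.005, near the log above
trimmed = tts_cost(int(544 * 0.7))
```

This is why prompt tuning for brevity flows straight through to the TTS line item, with no provider-side knobs involved.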

The Cost Tracking System

Early on, we realized cost data without auditability is useless. We built a centralized system with:

A centralized pricing registry. All pricing constants in a single module, organized by component and model name. One place to update when providers change pricing or we add models. Each entry includes per-token or per-second rates and the provider name for the audit trail.

Automatic usage collection hooks. Our framework's built-in usage collector aggregates token counts, audio durations, and character counts throughout the call. No per-request instrumentation needed.

End-of-call batch calculation. Costs computed once per call, not per turn. Zero overhead during the conversation.

Full audit trail. Every cost record stores raw usage, computed costs, model identifiers, and exact pricing rates. When we migrated between model versions for some agents, we could immediately distinguish "costs changed because the model is more expensive" from "costs changed because conversations got longer." That kind of auditability pays for itself the first time pricing shifts.

Analysis costs tracked separately. Post-call LLM analysis (typically $0.002-0.005 per call) is a distinct line item from conversation costs.
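The core of that design fits in a few dozen lines. The sketch below shows a centralized pricing registry plus one end-of-call batch calculation with the rates stored alongside the costs for auditability; all model names, provider names, and rates are illustrative assumptions, not the production values:

```python
from dataclasses import dataclass

# Centralized pricing registry: one module to update when rates change.
# Keys and rates are illustrative placeholders.
PRICING = {
    ("llm", "mini-model"): {"input": 0.15e-6, "cached": 0.075e-6,
                            "output": 0.60e-6, "provider": "llm-vendor"},
    ("stt", "hindi-stt"):  {"per_second": 0.000075, "provider": "stt-vendor"},
    ("tts", "voice-v1"):   {"per_char": 9.0e-6, "provider": "tts-vendor"},
}

@dataclass
class CallUsage:
    # Aggregated by the framework's usage collector during the call.
    llm_model: str
    input_tokens: int
    cached_tokens: int
    output_tokens: int
    stt_model: str
    stt_seconds: float
    tts_model: str
    tts_chars: int

def compute_call_cost(u: CallUsage) -> dict:
    """End-of-call batch calculation. Stores the exact rates used
    alongside the computed costs so the record stays auditable
    after pricing changes or model migrations."""
    llm = PRICING[("llm", u.llm_model)]
    stt = PRICING[("stt", u.stt_model)]
    tts = PRICING[("tts", u.tts_model)]
    costs = {
        "llm": (u.input_tokens * llm["input"]
                + u.cached_tokens * llm["cached"]
                + u.output_tokens * llm["output"]),
        "stt": u.stt_seconds * stt["per_second"],
        "tts": u.tts_chars * tts["per_char"],
    }
    costs["total"] = sum(costs.values())
    costs["rates_used"] = {"llm": llm, "stt": stt, "tts": tts}  # audit trail
    return costs
```

Because the calculation runs once per call against usage the framework already aggregates, it adds zero latency to the conversation itself.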

Cost Levers, Ranked by Impact

  1. Model tier selection (up to 17x) - The only lever that moves cost by an order of magnitude.
  2. Prompt caching (up to 50% on input tokens) - Free if your framework supports it.
  3. Concise responses (proportional TTS reduction) - 30% shorter responses = 30% cheaper synthesis.
  4. STT feature stripping (marginal but free) - Disable punctuation, diarization, smart formatting if you're not using them.
  5. Platform-level recording (eliminates a cost category) - Use your real-time framework's native recording instead of processing audio through additional AI pipelines.

What We Deliberately Didn't Do

No response caching. Voice conversations are inherently unique. Cache hit rates would be negligible.

No prompt compression. Our system prompts are large (1,200+ lines for some agents), and there's a temptation to compress them to reduce input tokens. We decided against this. The prompts encode critical behavioral rules – when to end a call, how to distinguish negative feedback from a genuine goodbye, domain-specific vocabulary mappings. Compressing these risks subtle behavioral regressions that are hard to detect and expensive to debug in production. The 50% cache discount on repeated prompt tokens makes the uncompressed cost acceptable.

No cheaper STT fallback routing. The complexity of maintaining two STT integrations and a routing classifier wasn't justified by potential savings.

No per-turn cost monitoring. Costs computed at call end, not mid-conversation. Per-turn tracking would add latency for negligible benefit.

Key Takeaways

  1. $0.015-0.025 per call for a well-architected agent handling 1-3 minute structured conversations.
  2. Model tier is the dominant variable. Mini vs. flagship = 10-17x difference. Match model to task.
  3. Prompt caching is free money. Track cached tokens to verify it's working.
  4. Accuracy beats cheapness in STT. Higher accuracy shortens conversations, even if per-minute rate is higher.
  5. Build auditable cost tracking from day one. You can't optimize what you can't measure.

Don't over-optimize. At two cents per call, shaving another 10% is negligible. Invest in call quality instead. A call that collects data in 90 seconds is cheaper than one that takes 4 minutes because the agent misunderstood twice.

This post is part of our series on building production-grade voice AI agents. Stay tuned for the other deep dives: Build vs Buy, Latency Tuning, Prompt Engineering for Voice, Analytics & Observability, Testing Strategies, and Production Hardening.

Want a cost projection for your voice AI use case? Let's run the numbers together – we can model per-call economics for your specific conversation type, language, and volume, and identify which levers will have the most impact before you write a line of code.

Pradeep Jindal

Pradeep started by reverse engineering programs and writing assembly code out of sheer curiosity. That curiosity turned into two decades of building and scaling global product engineering teams at Yahoo!, InMobi, Helpshift, and Fairmatic — from distributed data pipelines and mobile SDKs serving 2 billion devices (as VP Engineering at Helpshift) to AI-driven insurance platforms.

A polyglot engineer who's as comfortable debugging a race condition as designing a system architecture, he's spent his career solving hard problems at every level of the stack and mentoring engineers to do the same. He co-founded ByondLabs with fellow engineering leaders from LinkedIn and Helpshift to help companies build the systems they can't afford to get wrong.
