We've shipped voice agents handling thousands of real calls across deployments and languages. This is what we learned.
TL;DR: Building a Voice AI demo takes an afternoon. Shipping one that handles real phone calls – across noisy environments, regional accents, mixed languages, and the full spectrum of telephony edge cases – took us months of systematic engineering.
This series covers the full landscape: architecture choices, latency tuning, prompt engineering, cost optimization, testing, production hardening, and observability. Average all-in cost: ~$0.03 per call.
Voice AI has a demo problem. You wire up a speech-to-text engine, connect an LLM, pipe it through text-to-speech, and in an afternoon you have something that sounds impressive on a Loom video.
Then you try to call real people with it.
The user pauses mid-sentence to think. The agent interrupts them. Someone says "nahi lunga" (I won't buy it) and the agent interprets it as a goodbye. Background noise from a truck depot triggers phantom transcriptions. Your recording has a 3-second gap right where the user gave critical feedback. The SIP call fails with a 480 and your system silently marks it as "completed."
The gap between demo and production in voice AI is measured in months, not days.
Most teams underestimate this by an order of magnitude.
At ByondLabs, we've built production voice agents for enterprise clients – systems handling thousands of outbound and inbound calls, collecting structured data, navigating multi-step conversations in Hindi, Hinglish, and English, across regions with varying accents and connectivity. One deployment alone has processed thousands of calls across multiple production campaigns, with every call recorded, analyzed, and costed down to the fraction of a cent.
This post is the first in a series covering what we've learned. Here, we cover the full landscape. The deep dives go into each dimension in detail.
The Architecture Decision That Shapes Everything Else
The first fork in the road: speech-to-speech models (like OpenAI's Realtime API) versus a modular pipeline where you choose and control STT, LLM, and TTS independently.
We chose modular – using a real-time communication framework as the orchestration layer, with dedicated STT, LLM, and TTS providers selected independently for each job.

Why modular wins in production
You can optimize each component independently. When your users speak Hinglish – a fluid mix of Hindi and English with regional slang – no single speech-to-speech model matches dedicated providers tuned for each job. A truck driver might say "Eicher ka Pro 2049 liya hai, lekin spare parts nahi milte" (I bought an Eicher Pro 2049, but can't get spare parts). We needed an STT provider purpose-built for Indian languages, an LLM that could handle domain-specific reasoning, and a neural TTS engine with a natural Hindi voice. Modular lets you pick best-in-class for each.
Cost drops dramatically. A cost-efficient mini-tier LLM is 10-17x cheaper than its flagship counterpart and performs excellently for structured voice conversations. With a bundled speech-to-speech model, you pay a single price regardless of task complexity. Modular lets you right-size per component.
When something breaks, you know where. Our per-turn instrumentation tracks VAD inference time, STT duration, LLM time-to-first-token, and TTS time-to-first-byte for every single conversational turn. When latency spikes or quality drops, we know exactly which component to investigate. With a black-box speech-to-speech model, debugging is guesswork.
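The shape of that per-turn instrumentation can be sketched as a simple record. This is an illustrative minimal version, not our production schema; the field names and the helper methods are assumptions for the sketch:

```python
from dataclasses import dataclass

@dataclass
class TurnMetrics:
    """Latency breakdown for one conversational turn (all times in ms)."""
    vad_ms: float = 0.0        # voice-activity / turn-detection inference
    stt_ms: float = 0.0        # final transcript latency after end-of-speech
    llm_ttft_ms: float = 0.0   # LLM time-to-first-token
    tts_ttfb_ms: float = 0.0   # TTS time-to-first-byte of audio

    def total_ms(self) -> float:
        return self.vad_ms + self.stt_ms + self.llm_ttft_ms + self.tts_ttfb_ms

    def slowest_stage(self) -> str:
        # When a user reports "slow response", this points at the culprit
        stages = {
            "vad": self.vad_ms, "stt": self.stt_ms,
            "llm_ttft": self.llm_ttft_ms, "tts_ttfb": self.tts_ttfb_ms,
        }
        return max(stages, key=stages.get)
```

Emitting one of these per turn is what makes "which component regressed?" a lookup rather than an investigation.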
The trade-off is real: more moving parts, more integration surface. For production systems where you need control, observability, and cost efficiency – modular wins.
This landscape is evolving fast – native speech-to-speech models with Indian language support are emerging, and the modular-vs-integrated calculus might shift as they mature. But the production engineering challenges covered in this series – latency tuning, prompt design, telephony handling, silence detection, observability – exist on both sides of the architecture decision. The model is one component. Everything around it is where the months go.
Deep dive: Build vs. Buy for Voice AI: Why We Chose Neither (and Both)
Latency: The Invisible Make-or-Break
Users don't consciously think about response time. They just hang up, talk over the agent, or repeat themselves. Sub-2-second response time is the bar for natural conversation, and that budget gets split across five pipeline stages.
The biggest latency killer isn't any single component – it's the orchestration between them.

We spent three months tuning latency on a single deployment and reversed our own decisions multiple times. ML-based turn detection caused high latency with default settings, so we disabled it. Conversations got choppy. We re-enabled it with an aggressively tuned threshold. That worked. We tried preemptive LLM generation to overlap inference with user speech. It backfired for multilingual conversations (partial transcriptions in code-switching languages are unreliable). Disabled.
The uncomfortable truth: there is no magic configuration. Every parameter interacts with every other parameter. The only path is systematic measurement and iteration. Which is why per-turn metrics aren't a nice-to-have – they're the prerequisite for all optimization.
Deep dive: Voice Agent Latency: The Sub-Second Tuning Playbook — coming soon
Prompt Engineering Is 10x Harder for Voice
Our system prompts across multiple agent personas total over 2,700 lines.
Every section earned its place through a production failure.
If that sounds excessive, consider: a 200-line prompt produces an agent that works in demos. Thousands of lines of explicit state tracking, anti-hallucination rules, domain vocabulary, and edge-case handling produces an agent that works when a truck driver in Rajasthan calls from a noisy depot and says a brand name in his regional accent.
Three problems that don't exist in chat
Hallucination is audibly destructive. When a user says "Pro 2049 liya hai" (I bought a Pro 2049), the LLM desperately wants to be helpful and say "Eicher Pro 2049" – adding the brand name the user never mentioned. In a chat interface, that's a minor annoyance. In a voice conversation, they hear it immediately and lose trust. We drove hallucination rates from ~15-20% of calls to near zero through layered anti-hallucination rules – but it took hundreds of prompt lines to get there.
Goodbye vs. negative feedback. An LLM's instinct is to wrap up when it hears sustained negativity. But negative feedback is often the most valuable data you're collecting. "Spare parts nahi milte" (I can't get spare parts) – keep talking, this is gold. "Time nahi hai, baad mein" (no time, call later) – end the call politely. Teaching the model the precise difference between "nahi lunga" (I'll never buy this again) and an actual desire to hang up required explicit enumeration of dozens of phrases – not a general rule.
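In the prompt this lives as explicit phrase enumeration, but the distinction is concrete enough to unit-test. A minimal sketch, where the phrase lists are illustrative examples rather than our production set:

```python
# Illustrative phrase enumeration -- complaints to probe vs. genuine
# end-of-call signals. The real lists are maintained in the system prompt.
NEGATIVE_FEEDBACK = {          # keep talking: this is the data we want
    "spare parts nahi milte",  # "I can't get spare parts"
    "nahi lunga",              # "I won't buy it" -- a complaint, not a goodbye
}
END_CALL_SIGNALS = {           # wrap up politely
    "time nahi hai, baad mein",  # "no time, call later"
    "baad mein call karna",      # "call me later"
}

def call_action(utterance: str) -> str:
    u = utterance.lower().strip()
    if u in END_CALL_SIGNALS:
        return "end_politely"
    if u in NEGATIVE_FEEDBACK:
        return "probe_deeper"
    return "continue"
```

The same enumeration doubles as a regression fixture: every misclassified call adds a phrase to one of the two sets.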
Domain vocabulary is infrastructure, not decoration. A single brand name can have six phonetic spellings in Hindi STT output. Numeric product codes follow specific Hindi pronunciation conventions – "pachpan tees" means 5530, not "fifty-five thirty" – and the LLM doesn't know this without being taught. Without 200+ lines of domain terminology in the prompt, the agent can't understand a significant portion of what users say.
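The vocabulary layer reduces to two lookup tables: collapsing phonetic STT variants to a canonical form, and expanding Hindi number words into the codes they denote. A sketch with illustrative entries (the variants and codes shown are examples, not the full 200+ line vocabulary):

```python
# Several STT spellings of one brand collapse to a canonical form.
# Variant spellings here are illustrative examples.
PHONETIC_VARIANTS = {
    "aicher": "Eicher", "aichar": "Eicher",
    "eichar": "Eicher", "icher": "Eicher",
}

# Hindi compound-number convention: "pachpan tees" = 55 + 30 -> "5530"
HINDI_NUMBERS = {"pachpan": "55", "tees": "30"}

def canonical_brand(word: str) -> str:
    return PHONETIC_VARIANTS.get(word.lower(), word)

def normalize_code(tokens: list[str]) -> str:
    """Join Hindi number words into the numeric product code they denote."""
    return "".join(HINDI_NUMBERS.get(t.lower(), t) for t in tokens)
```

In our stack this knowledge sits in the prompt so the LLM can apply it in context, but the mapping itself is the same either way.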
We chose prompt engineering over fine-tuning deliberately. Prompts deploy in minutes; fine-tuning takes days. Prompts express precise rules; fine-tuned models learn patterns. And prompt caching makes the token cost of large prompts manageable. Fine-tuning is on our roadmap once we've accumulated enough production transcripts – but prompts get you to production faster and let you iterate without retraining.
Deep dive: Prompt Engineering for Multilingual Voice Agents — coming soon
Cost: $0.03 Per Call, Not $2.00
Here's the number everyone asks about:
| Call Profile | Duration | Cost |
|---|---|---|
| Short (screen-out) | ~30 seconds | ~$0.003 |
| Medium (standard) | ~2 minutes | ~$0.016 |
| Long (detailed) | ~5 minutes | ~$0.046 |
| Campaign average | 1-3 minutes | ~$0.03 |
Compare that to a human agent making the same call: $0.50-2.00 when you factor in salary, training, management, and time spent on calls that go nowhere. The voice agent handles zero-value calls (wrong numbers, voicemails, immediate hang-ups) for pennies each.
The single biggest cost lever is model tier selection. A mini-tier LLM can be 10-17x cheaper than its flagship counterpart and handles structured voice conversations excellently. We've never needed a flagship model for any voice agent task.
Prompt caching is free money. Most modern LLM APIs cache repeated prompt prefixes at a ~50% discount. With a large system prompt that's identical across turns, roughly two-thirds of input tokens get served from cache. Zero engineering effort, just a lower bill.
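The cache math is worth making explicit. A back-of-envelope per-call estimator, where every rate below is an illustrative placeholder rather than actual provider pricing:

```python
# Illustrative per-call cost sketch. All rates are placeholder values,
# not real provider pricing -- substitute your own.
RATES = {
    "stt_per_min": 0.0050,
    "llm_in_per_1k_tok": 0.00015,    # mini-tier input rate
    "llm_out_per_1k_tok": 0.00060,
    "tts_per_1k_chars": 0.0150,
    "telephony_per_min": 0.0060,
}
CACHE_DISCOUNT = 0.5   # cached prefix tokens billed at roughly half price

def call_cost(minutes, in_tok, cached_tok, out_tok, tts_chars):
    cached = min(cached_tok, in_tok)
    llm_in = ((in_tok - cached) + cached * CACHE_DISCOUNT) / 1000 \
             * RATES["llm_in_per_1k_tok"]
    llm_out = out_tok / 1000 * RATES["llm_out_per_1k_tok"]
    return (minutes * (RATES["stt_per_min"] + RATES["telephony_per_min"])
            + llm_in + llm_out
            + tts_chars / 1000 * RATES["tts_per_1k_chars"])
```

With a large, turn-stable system prompt, most of `in_tok` lands in `cached_tok`, which is why caching shows up as a straight discount on the biggest input-token line item.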
We built per-call cost tracking from day one – every call has an auditable breakdown by component, model, and pricing rate.
Cost anomalies are often the first signal of a behavioral regression.
Deep dive: Voice Agent Cost Optimization: Real Numbers from Production — coming soon
The Parts That Break in Production
The AI components are the easy part. Here's what actually breaks:
Telephony lies to you – and the lies compound. SIP error codes don't mean what the RFC says they mean. Carrier behavior diverges from spec. We expanded our SIP error mappings from 8 to nearly 20 based purely on observed production behavior. But the real cost isn't the unmapped codes themselves – it's the downstream blindness. When your system can't distinguish "customer was busy" from "number doesn't exist" from "number switched off" from "carrier rejected the call," you can't tell whether your contact list is stale, your trunk provider has a routing issue, or customers are genuinely unavailable. That data quality gap feeds back into everything: campaign strategy, retry logic, call scheduling, and the reporting you hand to stakeholders. Telephony also adds its own latency budget that no amount of LLM optimization can claw back – SIP call setup time and first-audio delay vary by carrier and region, eating into the user's patience before the agent even speaks.
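The fix is a disposition table that maps SIP responses to operationally meaningful labels, with unmapped codes surfaced instead of silently bucketed. A sketch using nominal RFC 3261 semantics; the disposition labels are illustrative, and in production the table is built from observed carrier behavior rather than the spec:

```python
# Nominal RFC 3261 meanings with illustrative disposition labels.
# In production, entries come from observed carrier behavior, which
# frequently diverges from what these codes are supposed to mean.
SIP_DISPOSITION = {
    486: "user_busy",                 # Busy Here
    480: "temporarily_unavailable",   # often: phone switched off
    404: "invalid_number",            # Not Found -- stale contact list
    408: "no_answer",                 # Request Timeout
    403: "carrier_rejected",          # Forbidden -- often trunk/routing
    503: "carrier_congestion",        # Service Unavailable
}

def classify_failure(sip_code: int) -> str:
    # Unmapped codes are flagged for review, never marked "completed"
    return SIP_DISPOSITION.get(sip_code, f"unmapped_{sip_code}")
```

The `unmapped_*` bucket is how the table grows: ours went from 8 entries to nearly 20 by reviewing exactly those flags.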
Audio quality on real phone networks breaks your pipeline. Speech recognition that works perfectly in testing breaks on production telephony audio. Carrier codec compression, spotty rural connectivity, background noise from real environments – these degrade transcription accuracy in ways lab testing never reveals. A brand name that transcribes perfectly over a clean connection comes through as three different phonetic variants depending on the carrier, the caller's accent, and the ambient noise. The damage cascades: bad transcription means the LLM misunderstands, responds to the wrong thing, and the user hangs up – or worse, stays on the line while the agent confidently discusses something they never said.
Silence is ambiguous. Users think, get distracted, put the phone down. Your agent needs to handle all of these without being aggressive or robotic. The right silence thresholds came from analyzing hundreds of real calls – not from guessing in development.
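Handling ambiguous silence usually means tiered escalation rather than a single timeout. A minimal sketch; the thresholds and step names here are illustrative placeholders, not the calibrated production values:

```python
# Tiered silence handling: escalate gently instead of hanging up at the
# first timeout. Thresholds below are illustrative, not calibrated values.
SILENCE_STEPS = [
    (4.0,  "gentle_prompt"),     # "Are you still there?"
    (8.0,  "repeat_question"),   # user may have been distracted
    (15.0, "polite_goodbye"),    # assume the user has left
]

def silence_action(seconds_silent: float):
    """Return the escalation step for the current silence, or None."""
    action = None
    for threshold, step in SILENCE_STEPS:
        if seconds_silent >= threshold:
            action = step
    return action
```

Each tier gives a thinking user room to come back before the agent does anything irreversible.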
Interruption handling is a cultural problem, not a technical one. Most frameworks give you duration-based and word-count interruption thresholds. But in Hindi conversation, "haan" and "achha" are constant backchannel signals – the speaker is saying "I'm listening," not trying to take the floor. Duration-based detection cuts the agent off on every listening sound. The right interruption configuration depends on the language, the culture, and the conversation style – and you only find it by analyzing hundreds of real calls.
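The practical fix is to filter backchannels before the duration and word-count gates ever fire. A sketch of that ordering; the word list and thresholds are illustrative assumptions:

```python
# Backchannel-aware interruption check. The word list and the gates are
# illustrative -- the real values are tuned per language from call analysis.
BACKCHANNELS = {"haan", "achha", "hmm", "ok", "ji", "theek hai"}

def is_real_interruption(transcript: str, duration_s: float) -> bool:
    utterance = transcript.lower().strip()
    if utterance in BACKCHANNELS:
        return False   # listening signal: never cut the agent's audio
    # Only non-backchannel speech is run through duration/word-count gates
    return duration_s >= 0.6 and len(utterance.split()) >= 2
```

The ordering matters: check the backchannel list first, so a quick "haan" never trips the duration gate no matter how the thresholds are tuned.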
Deep dive: The Parts of Voice Agents That Break in Production — coming soon
Testing: You Can't QA a Non-Deterministic System the Traditional Way
Ask your voice agent the same question ten times and you'll get ten different responses. All might be correct. Traditional test assertions break completely.
Our solution: LLM-as-judge evaluation with frozen conversation snapshots. We use PromptFoo to define what a correct response should accomplish (not what it should say), and let a capable evaluator model score the agent's output.
We built 100+ test cases across a dozen configurations. But our most valuable tests didn't come from planning – they came from turning every production bug into a permanent regression test. When a real call reveals that the agent confused two competitor brands, that transcript becomes a frozen test case. After two production campaigns, regression tests caught more issues than our original hand-written suite.
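Stripped to its essentials, the pattern is: freeze a real transcript, state a rubric about what the reply must accomplish, and let an evaluator score it. A minimal sketch of that shape (we use PromptFoo in practice; the frozen case, rubric, and stand-in judge below are illustrative, with a string check standing in for the evaluator model):

```python
# LLM-as-judge pattern in miniature. The `judge` argument stands in for a
# call to an evaluator model; here a string check plays that role so the
# pattern is testable offline. Case content is illustrative.
FROZEN_CASE = {
    # snapshot from the production bug: agent added a brand name
    # the user never said
    "user_turn": "Pro 2049 liya hai",
    "rubric": "Must not name a brand the user did not mention.",
}

def evaluate(agent_reply: str, rubric: str, judge=None) -> bool:
    """Score what the reply accomplishes, not its exact wording."""
    if judge is None:
        # offline stand-in for an evaluator model scoring the rubric
        judge = lambda reply, _rubric: "Eicher" not in reply
    return judge(agent_reply, rubric)
```

Because the assertion targets the rubric rather than exact wording, ten different-but-correct replies all pass, and the one hallucinating reply fails.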
Deep dive: Testing Voice Agents Without Going Insane — coming soon
Observability: Three Layers, Three Audiences
Every production call generates data across three layers:
Turn metrics (for engineers): per-turn latency breakdowns – VAD, STT, LLM TTFT, TTS TTFB. When someone reports "slow response," we pinpoint the component in seconds.
Call lifecycle (for operations): state machine tracking from initiated through completed/failed, with SIP error mapping, termination reasons, and full transcripts with speaker attribution.
Post-call AI analysis (for the business): dual-layer intelligence – structured JSON extraction (outcome, sentiment, key topics) for dashboards and CRM, plus narrative summaries for human review. This is what turns thousands of calls into actionable intelligence.
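The dual-layer output reduces to one record with a structured face for machines and a narrative face for humans. A sketch with illustrative field names, not our production schema:

```python
import json
from dataclasses import dataclass, asdict

# Illustrative dual-layer call analysis record. Field names are
# assumptions for the sketch, not our production schema.
@dataclass
class CallAnalysis:
    outcome: str            # e.g. "completed", "callback_requested"
    sentiment: str          # e.g. "negative"
    key_topics: list        # e.g. ["spare parts availability"]
    summary: str            # narrative layer for human review

    def to_crm_json(self) -> str:
        # structured layer: dashboards and CRM consume this directly
        return json.dumps(asdict(self))
```

The narrative `summary` and the structured fields come from the same post-call analysis pass, so the two layers never disagree about what happened on the call.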
Analytics never blocks the agent. If the analytics service goes down, calls continue normally. We've never dropped a call due to an analytics failure.
Deep dive: Voice Agent Analytics & Observability — coming soon
The Series
Each dimension has its own deep dive:
- Build vs Buy – Why we chose to build a modular pipeline instead of using an existing voice platform
- Latency Tuning – Turn detection, endpointing, and why we reversed our own decisions three times
- Cost Optimization – Real per-call numbers, the model tier lever, and what we deliberately didn't optimize
- Prompt Engineering – State machines, anti-hallucination, and why 2,700 lines isn't overkill
- Testing Strategies – LLM-as-judge, frozen snapshots, and regression testing from production bugs
- Production Hardening – SIP errors, audio quality, silence detection, and interruption handling
- Analytics & Observability – Per-turn metrics, post-call analysis, and the data pipeline
Evaluating Voice AI? Let's Talk.
Building production voice agents is hard – but the problems are solvable, and the unit economics are compelling.
Everything in this series represents learning that compounds. The three months of latency tuning, the 2,700 lines of prompt engineering, the SIP mappings built from thousands of production calls, the silence detection calibrated across real conversations – that's discovery we've already done. New deployments don't start from zero. They start from a proven baseline: instrumentation already wired, turn detection tuned for the model family, interruption thresholds calibrated for the language, prompt architecture tested across campaigns. What remains is adapting to the specific domain and carrier characteristics – days of targeted work, not months of exploration.
If you're a technical leader evaluating voice AI, you're probably asking one of these questions:
- Should we build in-house, use an off-the-shelf platform, or hire a team that's done this before?
- What will it actually cost – not just per call, but in engineering time to get to production?
- How do we handle our specific language, domain, or compliance requirements?
We can help you shortcut months of iteration – whether that means building with you, auditing your architecture, or helping you evaluate the build-vs-buy tradeoff for your specific use case.
Tell us about your use case – we'll give you an honest assessment of what it takes.
ByondLabs builds production AI systems for enterprises. We specialize in Voice AI, LLM applications, and real-time communication systems.
