Part of the series: What It Takes to Build Production-Grade Voice AI Agents
TL;DR: We didn't use a Voice AI platform (Vapi, Bolna, Bland, Retell), and we didn't build from scratch. We assembled a modular pipeline on top of an open-source real-time communication framework, with dedicated best-in-class providers for STT, LLM, and TTS.
The result: per-call costs 2-10x lower than platforms, full control over every parameter that matters in production, and the ability to swap any component without rewriting the system. Platforms get you to a demo in hours. Getting from demo to production-grade requires control over both the components and the parameters that platforms abstract away by design.
Every team evaluating Voice AI eventually asks the same question: should we use a platform, or build it ourselves?
The honest answer is that "build vs. buy" is the wrong framing. The real question is: where do you draw the abstraction boundary?
Platforms like Vapi, Bolna, Bland, and Retell bundle STT, LLM, TTS, and telephony into a single API. You configure an agent, point it at a phone number, and start making calls. Some offer hosted LLM selection, knowledge base integrations, and pre-built telephony. The pitch is compelling: voice agents in minutes, not months.
We evaluated several platforms before our first production deployment. We chose to build on a modular stack instead – an open-source real-time communication framework for media orchestration, with independently selected providers for each AI component.
The Cost Math
This is where the numbers speak clearly.
Platform per-minute pricing typically ranges from $0.05 to $0.30/min depending on the provider, model selections, and usage tier. Most include platform fees, AI services, and telephony in a bundled rate. Pricing changes frequently – check current rates before modeling your costs.
Our modular stack: ~$0.03 per minute all-in – AI components, infrastructure, and telephony. Campaign average: ~$0.03 per call.
The difference: 2-10x depending on which platform you're comparing against.
| Approach | Per-minute cost | Per-call average | 10,000 calls |
|---|---|---|---|
| Platform (typical range) | $0.07-0.30 | $0.07-0.30 | $700-3,000 |
| Our modular stack | ~$0.03 | ~$0.03 | ~$300 |

At 10,000 calls, platforms cost $700-3,000. We spend ~$300. That delta funds engineering. And it compounds at scale – platform pricing is per-minute, so your 100,000th call costs the same as your first. With a modular stack, you negotiate volume pricing directly with each provider.
The Model Tier Lever You Can't Pull
The single biggest cost lever in voice AI is LLM model tier selection – mini-tier models are 10-17x cheaper than flagship models and handle structured voice conversations well. We've never needed a flagship model for any voice agent task.
Most platforms either lock you into a specific model, charge a markup on model usage, or give you model selection but bundle it with their own per-minute fee on top. You're paying twice: once for the platform's infrastructure, and again for the AI services underneath.
With a modular stack, model selection is a configuration change. When a new cost-efficient model releases, we switch the same day. No platform approval, no pricing renegotiation, no waiting for the platform to "support" the new model.
The Control Gap
The demo-to-production gap in voice AI is measured in months, not days. And the hardest production problems live in the parameters that platforms abstract away.
Turn detection tuning. We spent three months tuning turn detection for Hindi/Hinglish speakers and reversed our own decisions multiple times. ML-based turn detection with default parameters caused high latency. We disabled it. Conversations got choppy. We re-enabled it with an aggressively permissive threshold. That worked. This level of parameter control – specific threshold values, toggling between endpointing strategies, disabling duration-based interruption detection while keeping word-count detection – is exactly what platforms abstract away. It's also exactly what determines whether your agent sounds natural or robotic.
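To make the shape of this control concrete, here is a hypothetical turn-detection configuration and decision function. The parameter names and values are illustrative, not any specific framework's API – the point is that each knob is individually tunable:

```python
# Hypothetical turn-detection config. Names/values are illustrative,
# not a real framework's API; they show the kind of parameter-level
# control discussed above.
TURN_DETECTION = {
    "mode": "ml_endpointing",      # vs. duration-based VAD silence
    "eou_threshold": 0.15,         # aggressively permissive end-of-utterance cutoff
    "min_endpointing_delay": 0.3,  # seconds to wait even when confident
    "max_endpointing_delay": 3.0,  # hard cap so latency stays bounded
}

def should_end_turn(eou_probability: float, silence_s: float) -> bool:
    """End the user's turn when the ML endpointer is confident enough,
    or when silence exceeds the hard cap regardless of confidence."""
    cfg = TURN_DETECTION
    if silence_s >= cfg["max_endpointing_delay"]:
        return True
    return (eou_probability >= cfg["eou_threshold"]
            and silence_s >= cfg["min_endpointing_delay"])
```

A "sensitivity slider" collapses all four of these parameters into one number; tuning them independently is what the three months of iteration actually consisted of.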
Interruption handling. We disabled duration-based interruption detection entirely and kept only word count (minimum 2 words) because Hindi backchannel signals – "haan," "achha" – were triggering false interruptions on every utterance. This is a cultural and linguistic tuning problem. A platform that gives you a "sensitivity slider" cannot express this.
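As a sketch, the filter logic looks something like this – the backchannel word list and the two-word threshold come from the description above, but the exact list is tuned per language and is illustrative here:

```python
# Word-count-based interruption detection with a backchannel filter.
# The backchannel set is an illustrative sample, tuned per language.
BACKCHANNELS = {"haan", "achha", "hmm", "ok", "theek"}
MIN_INTERRUPT_WORDS = 2

def is_real_interruption(partial_transcript: str) -> bool:
    """Treat user speech as an interruption only if it is at least
    MIN_INTERRUPT_WORDS long and not purely backchannel signals."""
    words = partial_transcript.lower().split()
    if len(words) < MIN_INTERRUPT_WORDS:
        return False
    # "haan haan" is still a backchannel, not an interruption
    return not all(w.strip(".,!") in BACKCHANNELS for w in words)
```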
STT provider selection. We migrated from a general-purpose STT provider to one purpose-built for Indian languages because a single brand name was getting transcribed six different ways by the general-purpose provider. The specialized provider costs 25% more per minute but reduces misunderstandings, shortens calls, and produces better data. On a platform, you get whatever STT the platform offers.
Silence detection. We built a custom silence detection plugin with an 8-second threshold (we started at 3-4 seconds – far too aggressive in production), configurable nudge prompts in Hindi, grace periods that account for agent processing time, and nudge counter management that resets only on user speech, not agent speech. Try configuring this through a platform API.
The Observability Gap
We emit per-turn latency metrics for every conversational turn: VAD inference time, end-of-utterance delay, STT duration, LLM time-to-first-token, LLM total duration, TTS time-to-first-byte. In production, every turn emits a structured log line:
```
turn latencies: VAD=45ms, EOU(eos/trans)=120ms/890ms, STT=950ms,
LLM=1200ms (TTFT=450ms), TTS=650ms (TTFB=120ms)
```
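A minimal dataclass that emits a log line in this shape might look like the following – the field names are illustrative, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class TurnLatencies:
    """Per-turn latency metrics (milliseconds). Field names are
    illustrative; the format matches the log line shown above."""
    vad_ms: int
    eou_eos_ms: int     # end-of-utterance: end-of-speech signal
    eou_trans_ms: int   # end-of-utterance: transcript-confirmed
    stt_ms: int
    llm_total_ms: int
    llm_ttft_ms: int    # time to first token
    tts_total_ms: int
    tts_ttfb_ms: int    # time to first byte

    def log_line(self) -> str:
        return (f"turn latencies: VAD={self.vad_ms}ms, "
                f"EOU(eos/trans)={self.eou_eos_ms}ms/{self.eou_trans_ms}ms, "
                f"STT={self.stt_ms}ms, "
                f"LLM={self.llm_total_ms}ms (TTFT={self.llm_ttft_ms}ms), "
                f"TTS={self.tts_total_ms}ms (TTFB={self.tts_ttfb_ms}ms)")
```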
When someone reports "the agent was slow," we pinpoint the component in seconds. We track per-call cost breakdowns with a pricing audit trail that distinguishes "costs went up because the model is more expensive" from "costs went up because conversations got longer." We run post-call AI analysis with agent-specific extraction schemas that produce structured JSON for dashboards and CRM sync.
Platforms give you call logs, maybe transcripts, sometimes analytics dashboards. What they don't give you is the ability to instrument at the turn level, build custom metrics aggregators, or pipe data into your own analytics infrastructure. When your agent misbehaves on call #4,387 of a 10,000-call campaign, the question isn't "did the call fail" – it's "which component caused the failure, on which turn, and why." Platforms can't answer that.
The Integration Surface Argument
The strongest argument for platforms: fewer integration points. Wire up one API instead of five.
This is real. A modular stack means integrating STT, LLM, TTS, telephony (SIP trunks), and a real-time media framework. Each has its own SDK, authentication, error handling, and failure modes. We maintain SIP error code mappings for nearly 20 status codes because carriers reinterpret RFC codes liberally. We handle audio codec quirks across Indian phone networks. We manage recording pipelines, analytics services, and cost tracking systems.
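As an illustration of the SIP mapping layer, here is a small made-up slice – carriers reinterpret RFC codes, so real tables are per-deployment and considerably longer:

```python
# Illustrative slice of a SIP status -> business status mapping.
# These entries are examples, not our full ~20-code table.
SIP_STATUS_MAP = {
    486: "busy",            # Busy Here
    480: "unreachable",     # Temporarily Unavailable
    487: "no_answer",       # Request Terminated (often a ring timeout)
    404: "invalid_number",  # Not Found
    503: "carrier_failure", # Service Unavailable
    603: "declined",        # Decline
}

def business_status(sip_code: int) -> str:
    """Map a SIP final response code to a campaign-level call status."""
    return SIP_STATUS_MAP.get(sip_code, "unknown_failure")
```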
But here's what we've observed: the integration complexity is front-loaded. Once the pipeline is wired, it's stable. Components communicate through well-defined interfaces. Swapping an STT provider means changing a configuration and updating the provider adapter – not rewriting the agent. The ongoing engineering cost is in tuning, prompt iteration, and production hardening – work you'd be doing on a platform too, just with less control.
On a platform, the integration is simpler, but the ceiling is lower. When you hit a limitation – and you will, the moment your use case deviates from the platform's assumptions – you're either requesting a feature, building a workaround, or migrating.
Migration from a platform to a custom stack is a rewrite. Migration between STT providers on a modular stack is a configuration change.
When Platforms Make Sense
We're not arguing that platforms are never the right choice. They make sense when:
You're validating a use case. If you need to test whether voice AI works for your business before investing in engineering, a platform gets you there in days. Run a pilot. Measure outcomes. If the use case validates, you can make the build decision with data.
Your use case is straightforward. English-only, standard conversation flows, no domain-specific vocabulary, moderate volume. If your agent is essentially a scripted IVR with an LLM, platforms handle that well.
You don't have voice infrastructure expertise. SIP trunks, real-time media, audio codecs – these are specialized domains. If your team doesn't have this expertise and the use case doesn't justify acquiring it, a platform abstracts away real complexity.
Volume is low enough that cost isn't a constraint. At 1,000 calls per month, the 2-10x cost difference is hundreds of dollars. At 100,000 calls per month, it's tens of thousands.
When Platforms Break Down
Multilingual and code-switching. Hinglish speakers don't just switch languages between sentences – they switch mid-sentence. Tuning STT, turn detection, interruption handling, and prompt behavior for this requires parameter-level control that platforms don't expose.
Domain-heavy conversations. When your users speak in industry slang, use occupation-specific terminology, and pronounce numeric codes using language-specific conventions, you need 200+ lines of domain vocabulary in your prompt and an STT provider selected specifically for your language and domain. Platforms give you a knowledge base upload; you need prompt infrastructure.
Production-grade reliability requirements. When you need to map every SIP error code to a specific business status, track every call through a deterministic state machine, ensure analytics never blocks a live call, and produce an auditable cost breakdown for every call – you need infrastructure, not a platform.
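The deterministic state machine mentioned above can be sketched in a few lines – states and transitions here are illustrative, not our actual lifecycle:

```python
# Deterministic call state machine: only listed transitions are legal,
# so a call can never land in an ambiguous status. Illustrative states.
TRANSITIONS = {
    "queued":      {"dialing"},
    "dialing":     {"ringing", "failed"},
    "ringing":     {"in_progress", "no_answer", "busy", "failed"},
    "in_progress": {"completed", "dropped"},
}

def advance(state: str, event: str) -> str:
    """Move to the next state, or raise on an illegal transition."""
    if event not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {event}")
    return event
```

The value is the `ValueError`: an impossible transition is a bug surfaced immediately, not a call silently stuck in the wrong status.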
Cost sensitivity at scale. At ~$0.03 per call versus $0.07-0.30 per call, the economics diverge sharply beyond a few thousand calls per month.
What We Actually Built
Our stack isn't "built from scratch." It's assembled from best-in-class components:
- Real-time communication framework – open-source, handles media routing, participant management, and recording
- STT – dedicated provider purpose-built for Indian languages, with a general-purpose fallback
- LLM – mini-tier model via a cloud provider, selected for cost-performance on structured conversations
- TTS – neural voice engine with natural Hindi voice
- Telephony – SIP trunks from a major telephony provider
- Analytics – custom service (FastAPI + PostgreSQL) for call lifecycle, transcripts, recordings, cost tracking, and AI analysis
What we built ourselves: the orchestration layer that wires these together, the prompt infrastructure (2,700+ lines across agent personas), the silence detection plugin, the per-turn metrics aggregator, the cost tracking system, and the analytics pipeline. This is the production hardening that platforms skip and that determines whether an agent works in demos or in the real world.
The framework handles the hard real-time media problems. The providers handle STT, LLM, and TTS. We handle everything in between – which turns out to be where all the production engineering lives.
The Decision Framework
When evaluating build vs. buy for voice AI:
- Run the cost model. Calculate per-call cost on a platform versus a modular stack at your expected volume. If the delta is material, the engineering investment in a modular stack pays for itself.
- Assess your language and domain complexity. If you need multilingual support, domain-specific vocabulary, or cultural tuning of conversation dynamics, you'll outgrow a platform quickly.
- Evaluate the observability requirement. If "the agent sounds fine" is sufficient quality assurance, platforms work. If you need per-turn metrics, component-level debugging, and auditable cost tracking, you need infrastructure.
- Consider the migration cost. Starting on a platform and migrating later means rewriting your agent. Starting modular and adding platform-like conveniences is incremental. The asymmetry favors starting modular if you expect to need control eventually.
- Be honest about your team. Real-time media, SIP, and audio processing are specialized domains. If your team doesn't have this expertise, factor in the learning curve – or hire a team that's done it.
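The first step – running the cost model – is a few lines of arithmetic. The rates below are the ranges quoted earlier in this post; substitute your own:

```python
# Back-of-the-envelope cost model for the build-vs-buy decision.
# Default rates are illustrative, taken from the ranges in this post.
def monthly_delta(calls_per_month: int,
                  platform_per_call: float = 0.15,
                  modular_per_call: float = 0.03) -> float:
    """Dollars saved per month by the modular stack at a given volume."""
    return calls_per_month * (platform_per_call - modular_per_call)

# At 10,000 calls/month and a mid-range platform rate,
# the delta is roughly $1,200/month.
```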

The voice AI platform market is young and moving fast. Today's limitations may be tomorrow's features. But the fundamental tension remains: platforms optimize for time-to-demo, and production voice agents are defined by the months of engineering after the demo. The parameters that matter most in production – turn detection thresholds, interruption handling, STT provider selection, silence detection behavior, prompt architecture, cost tracking – are exactly the parameters that platforms abstract away.
Your mileage will depend on where your use case falls on the complexity and scale spectrum. We chose the modular path because we needed production-grade control at production-grade cost. For our use case – multilingual, domain-heavy, high-volume, enterprise-grade – it wasn't a close decision.
This post is part of our series on building production-grade voice AI agents. Stay tuned for the other deep dives: Latency Tuning, Cost Optimization, Prompt Engineering for Voice, Analytics & Observability, Testing Strategies, and Production Hardening.
Evaluating whether to build or buy your voice AI stack? Tell us about your use case – we've been through this decision across multiple deployments and can help you model the cost, complexity, and timeline tradeoffs before you commit to a path.
