
Voice Agent Latency: The Sub-Second Tuning Playbook

Pradeep Jindal · March 12, 2026 · 8 min read

Part of the series: What It Takes to Build Production-Grade Voice AI Agents

TL;DR: Latency tuning in production voice agents isn't a checklist – it's a system dynamics problem where every parameter interacts with every other. We spent three months tuning and reversed our own decisions multiple times. The key lessons: smart turn detection beats raw silence thresholds (but only when tuned aggressively), preemptive generation backfires for code-switching languages, STT features designed for transcription products are overhead for voice agents, and per-turn instrumentation is the prerequisite for all of it. If you can't measure every component of every conversational turn, you're tuning blind.

If you're evaluating voice AI for your business, latency is the metric that will determine whether users engage or hang up. Humans are ruthless judges of conversational pacing – a 3-second pause feels like an eternity. Get response time wrong and it doesn't matter how good your AI is. Users won't stay on the line to find out.

This post is about what latency tuning actually looks like in practice: not a list of optimizations, but a system dynamics problem where the "optimal" configuration can only be found through systematic measurement and iteration. We'll share the specific decisions we made – and reversed – across three months of production tuning on a Hindi/Hinglish voice agent handling thousands of telephony calls. The patterns generalize across stacks, but the specifics matter.

The Latency Budget

A voice agent's responsiveness lives or dies within a 2-3 second window. That's the total budget from the moment a user stops speaking to the moment they hear the agent's first syllable. Exceed it, and users hang up, repeat themselves, or talk over the agent.

Here's how that budget breaks down:

User stops speaking
  --> VAD (Voice Activity Detection): ~30-50ms
  --> EOU (End-of-Utterance detection): ~100-900ms
  --> STT (Speech-to-Text finalization): ~200-950ms
  --> LLM (Language Model generation): ~800-1500ms total, TTFT ~300-500ms
  --> TTS (Text-to-Speech synthesis): ~400-700ms total, TTFB ~80-150ms
Agent starts speaking

The critical insight: TTFT (Time to First Token) and TTFB (Time to First Byte) matter far more than total processing time. Because everything streams, the user hears the agent begin responding as soon as the first TTS audio chunk is ready – not when the entire response is generated. Total LLM time might be 1200ms, but if TTFT is 450ms, the user only experiences that 450ms delay (plus TTS TTFB).
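The perceived-vs-total distinction can be made concrete with a small sketch. The numbers are the illustrative ones from the budget above; the helper names are ours, not any framework's API:

```python
# Sketch: perceived latency vs. summed pipeline time in a streaming voice agent.
# Values in ms, taken from the example budget above.

def perceived_latency_ms(eou_ms: int, llm_ttft_ms: int, tts_ttfb_ms: int) -> int:
    """What the user actually waits for: end-of-utterance detection,
    first LLM token, first TTS audio byte. Everything after that streams."""
    return eou_ms + llm_ttft_ms + tts_ttfb_ms

def summed_pipeline_ms(eou_ms: int, stt_ms: int, llm_total_ms: int, tts_total_ms: int) -> int:
    """Sum of component spans -- most of this work hides behind streaming."""
    return eou_ms + stt_ms + llm_total_ms + tts_total_ms

print(perceived_latency_ms(120, 450, 120))          # 690 -> what the user feels
print(summed_pipeline_ms(120, 950, 1200, 650))      # 2920 -> what the pipeline does
```

The gap between 690ms felt and 2920ms of summed component time is exactly why TTFT and TTFB are the numbers to optimize first.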

A per-turn metrics log line from our system:

turn latencies: VAD=45ms, EOU(eos/trans)=120ms/890ms, STT=950ms,
  LLM=1200ms (TTFT=450ms), TTS=650ms (TTFB=120ms)

If you can't produce a per-turn latency log for every single turn, you're tuning blind.

Turn Detection: The Biggest Latency Trap

Turn detection – determining when a user has actually finished speaking versus just pausing to think – is the single most consequential latency decision. Get it wrong in one direction and you cut users off. Get it wrong in the other and they sit in silence.

Starting point: We used the framework's ML-based turn detection model with default parameters, plus conservative endpointing delays (1 second minimum, 5 seconds maximum). Result: high latency. Users waited over 2 seconds in silence before the agent responded.

First move – disable it entirely: We dropped the ML model and fell back to pure endpointing with a tight minimum delay. Faster, but crude. Hindi speakers regularly pause mid-thought for a second or more. The agent started jumping into natural pauses. Users complained about interruptions instead of slowness.

The reversal (three weeks later): We re-enabled the ML model with an extremely permissive threshold – making it trigger on almost any pause. At the same time, we removed manual endpointing delays entirely, letting the model handle turn boundaries on its own.

The lesson:

Don't disable smart features when they cause latency. Tune their sensitivity way down instead.

A turn detection model at a permissive threshold adds minimal latency while still catching cases where the user is clearly mid-utterance despite a brief silence. The model's value at this setting isn't in being conservative – it's in preventing the worst false-positive interruptions.
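A minimal sketch of the permissive-threshold idea, assuming a hypothetical turn-detection model that returns P(user is done) for the current transcript – the threshold value and function names are illustrative, not a real framework API:

```python
# Permissive end-of-turn decision: the model's job is only to veto the worst
# false positives, not to be conservative. Threshold chosen illustratively low.

END_OF_TURN_THRESHOLD = 0.15  # permissive: fire on almost any pause

def should_end_turn(model_prob_done: float, silence_ms: float) -> bool:
    if silence_ms < 150:  # debounce: ignore micro-pauses entirely
        return False
    # A very low probability (user clearly mid-utterance) holds the turn open
    # even after a pause; anything else ends the turn quickly.
    return model_prob_done >= END_OF_TURN_THRESHOLD
```

At this setting the model adds almost no delay on clean turn endings but still holds the floor when the speaker is obviously mid-sentence.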

Preemptive Generation: A Latency Win That Backfired

Preemptive generation starts the LLM generating a response before the user finishes speaking, using partial transcription. The theory: overlap LLM inference with the tail end of speech, saving hundreds of milliseconds.

We enabled it. Within a day, we disabled it.

Our conversations happen in Hinglish – Hindi-English code-switching where a user might start a sentence in Hindi and finish in English. Partial transcriptions mid-utterance are frequently wrong. When preemptive generation fires on an incorrect partial, the LLM produces a response to the wrong input, which gets discarded when the final transcription arrives. The "optimization" became double-generation most of the time.

For monolingual English agents with predictable patterns, this probably works well. For multilingual agents with code-switching, it's a trap. This is the kind of thing you can only discover through production measurement – it looks like a pure win on paper.
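The failure mode is easy to see in a sketch. Assuming a hypothetical `generate()` call, the discard-on-mismatch check is where the "optimization" turns into double-generation:

```python
# Sketch of why preemptive generation double-generates under code-switching.
# generate() is a stand-in for an LLM call; not a real framework API.

def respond(partial: str, final: str, generate) -> tuple[str, int]:
    """Returns (response, number of LLM generations paid for)."""
    calls = 0
    speculative = generate(partial)        # fired before the user finished
    calls += 1
    if partial.strip() == final.strip():   # partial was right: latency win
        return speculative, calls
    # Partial was wrong (common mid-utterance in Hinglish): discard the
    # speculative response and pay for a second, authoritative generation.
    return generate(final), calls + 1
```

When partials are usually right, `calls` stays at 1 and you pocket the overlap. When they're usually wrong – our case – you pay for two generations on most turns.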

STT Optimization: Death by a Thousand Flags

On the same day we disabled the ML turn detection model (before the reversal), we also overhauled our STT configuration. The first change was switching from a telephony-optimized model variant to the general-purpose variant. The telephony-specific model wasn't providing meaningfully better accuracy for Hindi telephony audio, and we needed every millisecond.

The second set of changes was more granular. Most STT providers offer a battery of optional processing features: punctuation, filler word detection, smart formatting, language detection, profanity filtering, numeral conversion. We disabled all of them.

Each flag represents a post-processing step on the provider's side. Individually, each adds only single-digit milliseconds. Collectively, they add up. More importantly, none provide value in a voice agent pipeline. The LLM doesn't need punctuated input. Filler word detection is irrelevant when the LLM handles disfluencies natively. Smart formatting is wasted work. Language detection is overhead when you've already specified your target language.

Let the STT do the minimum: convert audio to text, fast. Your LLM is the understanding layer.
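The resulting request shape looks roughly like this – the flag names here are illustrative, since they vary by provider; the point is the pattern, not the exact keys:

```python
# Illustrative streaming-STT options for a voice agent pipeline: every
# optional post-processing step off, language pinned. Flag names are
# hypothetical stand-ins for whatever your provider exposes.

stt_options = {
    "language": "hi",           # pinned: skips provider-side language detection
    "interim_results": True,    # keep streaming partials for responsiveness
    "punctuate": False,         # the LLM doesn't need punctuated input
    "filler_words": False,      # the LLM handles disfluencies natively
    "smart_format": False,      # formatting is wasted work here
    "profanity_filter": False,
    "numerals": False,          # raw text is fine; the LLM normalizes
}
```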

Later, we migrated to an STT provider purpose-built for Indian languages. The Hindi accuracy improvement was substantial. This is a reminder that the fastest STT isn't always the best – accuracy matters because STT errors compound through the LLM and into response quality, creating correction turns that add their own latency to the overall conversation.

Interruption Handling: The Balancing Act

Most voice agent frameworks give you two interruption knobs: duration-based (seconds of overlapping speech) and word-count (recognized words).

Duration-based was too aggressive for us. A few hundred milliseconds of audio is roughly one short Hindi word or a brief vocalization. It cut off the agent when users said "haan" or "achha" (the Hindi "uh-huh" and "okay") – sounds that indicate engagement, not an intent to take the floor.

We disabled duration and kept only word count (minimum 2 words). Word count is a better interrupt signal than duration for multilingual speakers. Two recognizable words means someone is genuinely trying to say something. A few hundred milliseconds of audio might just be breathing or vocalizing agreement.

The asymmetry matters: false-positive interruptions (agent stops when it shouldn't) are far more disruptive than false-negatives (agent briefly talks over the user). We accept occasional half-second overlaps in exchange for never cutting off the agent due to a stray "hmm."
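As a sketch, the gate reduces to almost nothing – which is the point. It assumes the STT streams a running list of recognized words while the agent is speaking; the names are illustrative:

```python
# Word-count-only interruption gate. Duration-based triggering is
# deliberately absent: a lone "haan" or "achha" never stops the agent.

MIN_INTERRUPT_WORDS = 2  # the setting described above

def should_interrupt(recognized_words: list[str]) -> bool:
    # Two recognizable words is a genuine bid for the floor; one word or a
    # stray vocalization is treated as backchannel engagement.
    return len(recognized_words) >= MIN_INTERRUPT_WORDS
```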

Per-Turn Metrics: The Non-Negotiable

We built a metrics aggregator that tracks every component of every conversational turn: VAD inference time, EOU delay, STT duration, LLM wall-clock span and TTFT, TTS wall-clock span and TTFB.

Every turn produces a log line:

turn latencies: VAD=45ms, EOU(eos/trans)=120ms/890ms, STT=950ms,
  LLM=1200ms (TTFT=450ms), TTS=650ms (TTFB=120ms)

How to read this: the user experienced roughly 120ms (EOU end-of-speech) + 450ms (LLM TTFT) + 120ms (TTS TTFB) = ~690ms from the moment they stopped speaking to the moment they heard the agent start responding. The remaining times (890ms transcription delay, 1200ms total LLM, 650ms total TTS) represent work happening in the streaming pipeline that the user doesn't directly perceive as delay.

Rules of thumb from thousands of production turns:

  • LLM TTFT > 300ms consistently? Users perceive the agent as slow regardless of other metrics.
  • EOU transcription delay >> EOU end-of-speech delay? Your STT is the bottleneck, not turn detection.
  • TTS TTFB spikes above 200ms? Check your TTS provider's load – this is often the canary for provider-side degradation.
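These rules of thumb can be folded into a per-turn check. The thresholds are the ones quoted above; the "much greater than" test is approximated here as a 2x ratio, which is our assumption rather than a stated rule:

```python
# Flag bottlenecks from one turn's metrics (all values in ms).
# Thresholds from the rules of thumb; the 2x ratio for "EOU trans >> EOU eos"
# is an illustrative choice.

def flag_bottlenecks(ttft: int, eou_trans: int, eou_eos: int, tts_ttfb: int) -> list[str]:
    flags = []
    if ttft > 300:
        flags.append("LLM TTFT high: agent will feel slow regardless of other metrics")
    if eou_trans > 2 * eou_eos:
        flags.append("STT is the bottleneck, not turn detection")
    if tts_ttfb > 200:
        flags.append("TTS TTFB spike: check provider-side load")
    return flags
```

Run against the example log line above (TTFT=450, EOU=120/890, TTS TTFB=120), this flags the LLM TTFT and the STT transcription delay – which matches what that turn actually needed tuning on.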

The aggregator handles edge cases that naive metrics miss: multiple LLM steps per turn (function calls), multiple TTS segments per response, and incomplete metric sets from interrupted turns.
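The multi-step edge case matters because naive summing gets TTFT wrong. A sketch of the aggregation rule we settled on (function and field names are ours, not a framework's): spans add up, but TTFT/TTFB come from the first step only, since that is what the user perceived.

```python
# Aggregate one turn's LLM steps (e.g. function calls) and TTS segments.
# llm_steps: list of (total_ms, ttft_ms); tts_segments: list of (total_ms, ttfb_ms).

def aggregate_turn(llm_steps, tts_segments):
    if not llm_steps or not tts_segments:
        return None  # interrupted turn with an incomplete metric set
    return {
        "llm_total": sum(total for total, _ in llm_steps),
        "llm_ttft": llm_steps[0][1],    # first token of the first step
        "tts_total": sum(total for total, _ in tts_segments),
        "tts_ttfb": tts_segments[0][1], # first audio byte of the first segment
    }
```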

We wired this instrumentation on the very first day of latency work. Every optimization decision in this post was validated by these per-turn numbers. Without them, we would have been guessing.

The Tuning Treadmill

The real story of latency tuning is one that no architecture diagram can tell. Over a three-week period focused on turn detection and STT alone, we made five significant configuration changes. Two were reversals of earlier decisions. We declared "best config so far" twice, and both times it was superseded within days.

The sequence went roughly like this:

  1. Disable ML turn detection, switch STT models, tighten endpointing, enable preemptive generation.
  2. Strip all optional STT processing flags.
  3. Disable preemptive generation (it increased latency), adjust endpointing further.
  4. Revert STT flag overrides (defaults turned out to be fine), revert endpointing changes.
  5. Re-enable ML turn detection with extremely permissive threshold, remove manual endpointing, disable duration-based interruptions.

This isn't indecisiveness. This is what systematic latency tuning looks like. Each change was measured, evaluated in production, and kept or rolled back based on per-turn metrics and real user feedback. The codebase carries the scars: a comment reading "ML turn detection causes high latency" sits directly above the line that enables it with a tuned threshold. The comment is historically accurate but currently misleading. We kept it as a warning – the default configuration does cause high latency, but the tuned configuration doesn't.

Noise Cancellation: We Turned It Off

Our real-time communication framework provides built-in noise cancellation optimized for telephony audio. We had it available and chose not to use it.

Three reasons. First, it adds processing latency to every audio frame. Second, it changes the audio energy profile that VAD models are trained on. Third, our STT handles noisy telephony audio natively – Indian phone networks are noisy by default, and any STT deployed for this use case handles noise as a baseline capability.

If your STT is good enough to handle your audio environment, noise cancellation is overhead. If it isn't, noise cancellation is a band-aid. Either way, it's not the right lever for a latency-constrained agent.

The Uncomfortable Truth

After three months of tuning, we don't have a magic configuration. We have a configuration that works well for Hindi/Hinglish telephony conversations with specific models for each pipeline component. Change any variable and the optimal configuration shifts.

What we have is a process: instrument everything, measure every turn, change one parameter at a time, and accept that you'll reverse yourself.

Latency in voice agents is not about any single component. It's about the orchestration between them.

A faster STT doesn't help if your turn detection is too conservative. A faster LLM doesn't help if your interruption handling forces users to repeat themselves. Preemptive generation – a pure win on paper – made things worse because of how our language mix interacted with partial transcription.

The only path forward: build the instrumentation first, then start turning knobs.


This post is part of our series on building production-grade voice AI agents. Read the other deep dives: Cost Optimization, Prompt Engineering for Voice, Analytics & Observability, Testing Strategies, and Production Hardening.

Struggling with voice agent latency? Let's look at your metrics together. We can audit your pipeline configuration and per-turn metrics in a focused session, identifying the specific bottlenecks and parameter interactions that are costing you responsiveness. We've been through the tuning treadmill across multiple deployments and can help you skip months of iteration.

Pradeep Jindal

Pradeep started by reverse engineering programs and writing assembly code out of sheer curiosity. That curiosity turned into two decades of building and scaling global product engineering teams at Yahoo!, InMobi, Helpshift, and Fairmatic — from distributed data pipelines and mobile SDKs serving 2 billion devices (as VP Engineering at Helpshift) to AI-driven insurance platforms.

A polyglot engineer who's as comfortable debugging a race condition as designing a system architecture, he's spent his career solving hard problems at every level of the stack and mentoring engineers to do the same. He co-founded ByondLabs with fellow engineering leaders from LinkedIn and Helpshift to help companies build the systems they can't afford to get wrong.

Tags: Voice AI Agent, AI Agents, Performance Engineering

