Testing Voice Agents Without Going Insane

Pradeep Jindal · May 14, 2026 · 13 min read

Part of the series: What It Takes to Build Production-Grade Voice AI Agents

TL;DR: Traditional software testing breaks down for voice agents because outputs are non-deterministic. We use LLM-as-judge evaluation with PromptFoo and frozen conversation snapshots to test intent rather than exact text. After thousands of production calls across multiple deployments, our most valuable tests came not from planning - they came from turning every real-world bug into a permanent regression test. But response correctness is only half the story. We also test STT/TTS provider quality before selection, benchmark end-to-end latency before every pipeline change, test RAG retrieval accuracy independently from the LLM, and validate spoken output quality for names, numbers, and addresses.

The caller said one thing. The agent heard something else entirely and hallucinated a different brand into the conversation. Here's how it plays out: you ship your voice agent, it sounds great in the demo, and then someone calls from a noisy truck stop and says a brand name in their regional accent. The agent confidently responds with a competitor's name.

This is not a contrived example. It happened to us in production. And the only reason we caught it was because we had a testing system that turned every production bug into a frozen, repeatable test case.

So here's the question your engineering team will hit early: how do you test a system whose outputs are different every time?

The Fundamental Problem

Traditional software testing rests on a simple assumption: given the same input, you get the same output. You assert that add(2, 3) returns 5, and if it ever returns 6, something's broken.

Voice agents violate this assumption completely.

Ask your agent the same question ten times and you'll get ten different responses. All might be perfectly correct. The agent might say "Achha, aapne [Brand X] liya hai" or "Bahut achhi choice hai" or simply "Samajh gayi, [Brand X]." All three are valid responses to a user identifying their purchase. An exact text match fails on two out of three.

You can't manually call your agent a hundred times after every prompt change. You can't write string equality assertions against non-deterministic outputs. And you can't rely on "it sounded fine when I tested it" as a quality gate for a system making thousands of real phone calls.

You need a way to test whether the agent did the right thing, not whether it said these exact words.

LLM-as-Judge: Testing Intent, Not Text

The solution: use an LLM as the evaluator. Instead of checking exact text matches, define what a correct response should accomplish, and let a capable model judge whether the agent's output satisfies that rubric.

We use PromptFoo as our testing framework. PromptFoo supports "llm-rubric" assertions – you write a natural language description of what a passing response looks like, and an evaluator LLM scores the agent's actual output against it.

A real assertion from our test suite:

assert:
  - type: llm-rubric
    value: "The response should ask for specific details such as
            brand/company and product identifier when the user has recently
            made a purchase"

This handles non-determinism naturally. Whether the agent says "Kaun si company ka product liya?" or "Achha, kaunsa brand tha?" or "Company aur model bata dijiye," all three pass. The evaluator understands that each response accomplishes the same conversational goal.

We also layer in deterministic assertions where they matter. For anti-hallucination checks, we add not-contains-style assertions to ensure the agent doesn't invent information:

assert:
  - type: llm-rubric
    value: "When user mentions ONLY a brand name without any product identifier,
            the bot must NOT add or assume any specific identifier.
            The bot should acknowledge the brand and ask for details."
  - type: not-contains-all
    value: ["<known identifier A>", "<known identifier B>", "<known identifier C>"]
  - type: contains
    value: "model"

The combination of semantic rubric assertions and deterministic string checks gives us both flexibility and precision. The LLM rubric handles non-determinism. The string assertions catch specific failure modes we've seen before and never want to see again.

Test Architecture: Frozen Conversation Snapshots

The key architectural decision: test per-step, not end-to-end.

Our agents follow multi-step conversation flows: introduction, permission, primary intent, entity details, customer segment, feedback collection, gap analysis, and closing. Testing the entire flow end-to-end would mean running a full multi-turn conversation for every test case – slow, expensive, and imprecise when something fails.

Instead, each test case is a frozen JSON file representing a specific moment in the conversation. The file contains the conversation history up to the point we want to test, and we ask the agent to generate just the next response.

A test case for "user just confirmed a purchase":

[
    {
        "role": "assistant",
        "content": "Namaste, main [agent name] bol rahi hoon, [company] ki taraf se. Kya main aapka 2 minute ka samay le sakti hoon?"
    },
    {
        "role": "user",
        "content": "Haan, theek hai"
    },
    {
        "role": "assistant",
        "content": "Dhanyawad! Kya aapne haal hi main koi [product category] liya hai ya koi plan hai?"
    },
    {
        "role": "user",
        "content": "Haan, liya hai"
    }
]

The test config feeds this conversation history plus the system prompt to the LLM, captures the response, and evaluates it against the rubric. Each test runs independently. If entity detail tests fail but feedback tests pass, you know exactly where the regression is.

The PromptFoo template that ties it together:

prompts:
  - |
    {{ system_prompt }}

    Here is the conversation history:
    {{ conversation_history }}

    Your response:

providers:
  - id: <your-llm-provider>
    config:
      temperature: 0.5

tests:
  - vars:
      system_prompt: file://path/to/system_prompt.md
      conversation_history: file://path/to/test_case.json
    assert:
      - type: llm-rubric
        value: "The response should ask for specific details..."

This template is the entire test harness — one file to rule them all.

This gives us reproducibility (context is frozen), isolation (each step tested independently), and speed (no multi-turn orchestration needed).

What We Test and Why

Over multiple production campaigns, our test suite grew to over a dozen configuration files covering distinct behavioral categories:

Base cases. Happy-path flows covering every step of the conversation. These are the tests you write first and should always pass.

Abrupt ending cases. The single most critical behavioral distinction in our agent. When a user says "nahi lunga" (I won't buy it), that's valuable negative feedback – keep the conversation going. When a user says "bye" or "phone rakh do" (hang up), that's a signal to end the call. We test both directions roughly equally.

Production regression tests. Tests created directly from bugs found in real calls. More below – this became our most valuable test category.

Brand name transcription variations. A single brand name can get transcribed by the speech-to-text system in half a dozen different ways from regional Hindi speakers. We test each phonetic variant for every major brand.
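
Generating these variant cases by hand gets tedious, so scripting it helps. A minimal sketch, assuming the variant lists and output paths shown here (they're illustrative, not our actual fixtures): expand each known STT spelling of a brand into its own frozen conversation snapshot.

# Illustrative sketch: expand known STT spellings of a brand into frozen,
# per-variant test cases. Variant lists and paths are hypothetical.
import json
from pathlib import Path

# Phonetic variants the STT has produced for one (placeholder) brand.
BRAND_VARIANTS = {
    "brand_x": ["brand ex", "brandex", "brand eks", "braand x"],
}

# Conversation up to the turn we want to test, minus the user's last utterance.
CONTEXT = [
    {"role": "assistant", "content": "Kya aapne haal hi mein koi [product category] liya hai?"},
]

def write_variant_cases(out_dir: str = "tests/cases/brand_variants") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for brand, variants in BRAND_VARIANTS.items():
        for i, spoken in enumerate(variants):
            case = CONTEXT + [{"role": "user", "content": f"Haan, {spoken} liya hai"}]
            (out / f"{brand}_variant_{i}.json").write_text(
                json.dumps(case, ensure_ascii=False, indent=2)
            )

if __name__ == "__main__":
    write_variant_cases()

Each generated file feeds the same PromptFoo template shown earlier, with a rubric asserting that the agent acknowledges the intended brand.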

Numeric identifier pronunciation. Hindi speakers pronounce multi-digit codes using a specific number-splitting convention: "pachpan tees" (fifty-five, thirty) means the code 5530. We test that the agent acknowledges these correctly and doesn't hallucinate details when only partial information is given.

Anti-repetition. If the user already provided their brand three turns ago, the agent must not ask again. These tests verify information tracking across conversation history.

Per-step scenario coverage. Each major conversation step has its own test configuration covering distinct sub-scenarios. The entity detail step, for example, has cases for: both brand and identifier provided, only brand, only identifier, neither, and confused between multiple options.

Edge cases. Products described by physical characteristics rather than official names, out-of-scope categories, and domain-specific boundaries.

Production Regression Testing: Where the Real Value Lives

If we had to keep only one category of tests, we'd keep the production regression tests. Not the base cases, not the happy paths – the bugs.

The process: our analytics system records every call transcript. When a reviewer flags a problematic call, we extract the conversation history up to the exact moment the agent misbehaved, save it as a JSON file, and write an assertion describing what the agent should have done. That frozen moment becomes a permanent test case.
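
The freeze step is mechanical enough to script. A minimal sketch, assuming transcripts arrive as a list of role/content turns (the function name, fields, and paths are illustrative, not our actual analytics schema):

# Minimal sketch: freeze the moment a flagged call went wrong into a
# regression test case. Transcript format and paths are illustrative.
import json
from pathlib import Path

def freeze_regression_case(transcript: list[dict], bad_turn_index: int,
                           case_name: str,
                           cases_dir: str = "tests/cases/regressions") -> Path:
    """Save conversation history up to (not including) the agent's bad turn."""
    history = transcript[:bad_turn_index]          # everything the agent saw
    out = Path(cases_dir) / f"{case_name}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(history, ensure_ascii=False, indent=2))
    return out

# Example: the agent's fifth turn (index 4) confused Brand A with Brand B.
transcript = [
    {"role": "assistant", "content": "Namaste ..."},
    {"role": "user", "content": "Haan, theek hai"},
    {"role": "assistant", "content": "Kaun si company ka product liya?"},
    {"role": "user", "content": "[Brand A] liya hai"},
    {"role": "assistant", "content": "Achha, [Brand B]!"},   # the bug
]
freeze_regression_case(transcript, bad_turn_index=4, case_name="brand_a_misidentified")

The matching assertion then goes into a PromptFoo config: an llm-rubric describing what the agent should have done, plus a not-contains check on the specific wrong output when it's deterministic enough to pin down.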

Real examples from our regression suite:

Brand A misidentified as Brand B. User said they chose one brand. The agent responded with a completely different one. The Hindi transcription was ambiguous, and the model hallucinated a brand name. Test assertion: agent must not confuse competitor brands.

Listening acknowledgment confused with confirmation. In Hindi, "haan" can mean "yes, I agree" or simply "uh-huh, I'm listening." The agent treated a listening acknowledgment as a confirmation and jumped ahead in the conversation flow.

Partial transcription triggering premature call exit. User was mid-sentence when the agent interpreted a partial transcription as a goodbye signal.

Hindi numeral misinterpretation. User said a product code using Hindi numerals (e.g., "pachpan tees" for 5530). The STT transcribed it ambiguously, and the agent interpreted it as an ordinal number.

Generic category treated as specific entity. User said they bought a generic product type. The agent treated it as a complete identification instead of asking for specifics.

Implicit goodbye not recognized. User said "baad mein baat karte hain" (let's talk later) – a polite Hindi ending the agent didn't recognize as a goodbye.

Each of these was weird in a way we'd never have invented in a synthetic test case. Real users produce edge cases that are stranger, more ambiguous, and more culturally specific than anything a test author sitting at a desk would dream up. After two campaigns, our production regression tests caught more regressions than the original hand-written test suite. The pattern is now baked into our workflow: flag a transcript, extract the problematic moment, freeze as JSON, write assertion, never regress.

Testing Across Model Migrations

One unexpected benefit: the test suite became essential during LLM version upgrades.

When upgrading from one mini-tier model to a newer version, we ran the full suite against the new model before deploying. The tests caught behavioral differences we wouldn't have noticed in manual testing – the newer model was stricter in some areas and more permissive in others. We adjusted generation parameters and prompt sections to account for how the new model interpreted instructions differently.

Your test suite becomes a regression harness for model version upgrades — not just prompt changes.

We also experimented with using a smaller, cheaper model as the test evaluator (to reduce evaluation costs). That failed – the evaluator needs sufficient reasoning capability to judge whether a Hindi-English code-mixed response satisfies a nuanced behavioral rubric. Cheap evaluators produce cheap evaluations. We use a capable model for the judge.

The lesson: your test evaluator model matters as much as the model you're testing.

Beyond Response Correctness

Everything above tests what the agent says. But a voice agent has an entire pipeline before and after the LLM - STT transcription, TTS synthesis, RAG retrieval, latency across every component - and each needs its own testing discipline. Across multiple deployments, we've built testing systems for each of these layers.

STT and TTS Provider Evaluation

Before you test what the agent says, you need to know whether it hears correctly and speaks clearly. We evaluate STT and TTS providers with dedicated test suites before selection, and whenever we consider switching.

STT evaluation involves running real telephony audio - not clean test recordings - through candidate providers and measuring word error rate (WER) on domain-critical terms. Brand names, product codes, personal names, and addresses are the terms that matter most, and they're exactly the terms that differ most between providers. One provider transcribed a common brand name six ways depending on background noise level. Another consistently mistranscribed a vehicle category term that users say dozens of times per campaign. These differences don't show up in generic WER benchmarks - they only emerge when you test with audio from your actual use case.
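
Because generic WER hides exactly these failures, term-level accuracy needs to be reported alongside overall WER. A self-contained sketch of both measurements (the reference/hypothesis pair and the term list are placeholders):

# Self-contained sketch: overall WER plus hit rate on domain-critical terms.
# Reference transcripts and term lists are placeholders.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard word-level edit distance (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def critical_term_recall(reference: str, hypothesis: str, terms: list[str]) -> float:
    """Fraction of domain-critical terms from the reference that survive transcription."""
    present = [t for t in terms if t.lower() in reference.lower()]
    if not present:
        return 1.0
    return sum(t.lower() in hypothesis.lower() for t in present) / len(present)

ref = "maine brand x ka model 5530 liya hai"
hyp = "maine brandex ka model 5530 liya"
print(word_error_rate(ref, hyp), critical_term_recall(ref, hyp, ["brand x", "5530"]))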

TTS pronunciation testing covers the output side: does the synthesized speech correctly pronounce phone numbers, email addresses, personal names, and domain-specific terms? We maintain a test suite of structured data readbacks - phone numbers in digit-spaced form, customer names, vehicle model identifiers, service center addresses - and verify each TTS candidate handles them correctly. A TTS engine that sounds natural on general text but mispronounces a customer's name or reads a product code as an ordinal number is unusable regardless of voice quality.
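
A sketch of what a readback case can look like, assuming a reviewer or an STT round trip checks the synthesized audio against a normalized expected form (the helper names and cases are illustrative, not a specific TTS API):

# Illustrative readback test cases for TTS evaluation. No real TTS API is
# called here; "expected_heard" is what a listener (or STT round trip)
# should recover from the audio after normalization.
import re
from dataclasses import dataclass

def digit_spaced(number: str) -> str:
    """Format a code so the TTS reads it digit by digit: '5530' -> '5 5 3 0'."""
    return " ".join(number)

def normalize(text: str) -> str:
    return re.sub(r"[^a-z0-9@.]+", "", text.lower())

@dataclass
class ReadbackCase:
    tts_input: str       # what we send to the TTS candidate
    expected_heard: str  # what a listener should recover

CASES = [
    ReadbackCase(f"Aapka code hai {digit_spaced('5530')}", "aapkacodehai5530"),
    ReadbackCase("Aapka service center hai M G Road, Pune", "aapkaservicecenterhaimgroadpune"),
    ReadbackCase("Aapka email hai support@example.com", "aapkaemailhaisupport@example.com"),
]

def passes(case: ReadbackCase, heard: str) -> bool:
    return normalize(heard) == case.expected_heard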

These evaluations run before provider selection and again whenever a provider releases a new model version. The TTS test suite in particular has caught regressions that provider changelogs didn't mention.

End-to-End Latency Benchmarking

We now run every pipeline change through a standardized latency benchmarking harness before deploying to production. The harness operates on a single principle: change one knob at a time, measure per-turn latency, compare against the baseline.

Each test is an env-file containing only the configuration deltas from the baseline - a different STT provider, a different endpointing threshold, noise cancellation on vs. off. A shell script applies the configuration, another captures per-turn metrics during a live test call (EOU delay, LLM TTFT, TTS TTFB, cache hit rate), and a summary script compiles all results into a comparison table.
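
The summary step is plain aggregation. A sketch of the kind of table it produces, assuming each benchmark run writes one JSON line per turn with the metric names above (the file layout and field names are illustrative):

# Sketch of the summary step: aggregate per-turn latency records into a
# per-configuration comparison table. One .jsonl file per configuration,
# one JSON object per turn; file layout and field names are illustrative.
import json
import statistics
from pathlib import Path
from collections import defaultdict

METRICS = ["eou_delay_ms", "llm_ttft_ms", "tts_ttfb_ms"]

def summarize(results_dir: str = "bench/results") -> None:
    per_config = defaultdict(lambda: defaultdict(list))
    for path in Path(results_dir).glob("*.jsonl"):
        config = path.stem                             # e.g. "stt_provider_b"
        for line in path.read_text().splitlines():
            turn = json.loads(line)
            for metric in METRICS:
                if metric in turn:
                    per_config[config][metric].append(turn[metric])

    print("config".ljust(24) + "".join(f"{m} p50/p95".rjust(26) for m in METRICS))
    for config, metrics in sorted(per_config.items()):
        row = config.ljust(24)
        for metric in METRICS:
            values = sorted(metrics[metric]) or [0]
            p50 = statistics.median(values)
            p95 = values[int(0.95 * (len(values) - 1))]
            row += f"{p50:.0f} / {p95:.0f}".rjust(26)
        print(row)

if __name__ == "__main__":
    summarize()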

We've run 50+ test configurations this way, across multiple STT providers, multiple languages, and multiple turn detection strategies.

The results consistently show that latency optimizations that look good on paper can backfire in practice - a faster STT provider that introduces more transcription errors creates correction turns that add more latency than the faster transcription saves. The only way to know is to measure the full pipeline end-to-end.

The harness also handles multi-tester isolation. Each engineer gets a tagged logger prefix, so multiple people can benchmark different configurations concurrently against the same deployment without contaminating each other's results.

RAG Pipeline Testing

When your agent retrieves context from a knowledge base, the retriever needs its own test suite - separate from the LLM correctness tests.

We test the RAG pipeline at two levels. First, retriever-only evaluation: given a user query, does the search return relevant documents with acceptable confidence scores? We run the retriever in isolation, without the LLM, and check that the right documents surface and that confidence scores exceed minimum thresholds. This catches retrieval regressions that would be invisible in end-to-end LLM tests - the LLM might compensate for a bad retrieval result by hallucinating a plausible answer, which passes an LLM rubric but is factually wrong.
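
A minimal sketch of a retriever-only check, assuming a search(query, top_k) interface that returns (doc_id, score) pairs - the interface, document ids, and thresholds are illustrative, not a specific vector store API:

# Retriever-only evaluation sketch. The retriever interface is assumed:
# search(query, top_k) -> list of (doc_id, score), highest score first.
MIN_CONFIDENCE = 0.55   # below this we treat the query as out of scope

RETRIEVAL_CASES = [
    {"query": "Kya service free hai?", "expected_doc_ids": {"warranty_free_service"}},
    {"query": "Warranty kitne saal ki hai?", "expected_doc_ids": {"warranty_duration"}},
]

def evaluate_retriever(retriever, top_k: int = 3) -> list[tuple[str, str]]:
    failures = []
    for case in RETRIEVAL_CASES:
        results = retriever.search(case["query"], top_k=top_k)
        retrieved_ids = {doc_id for doc_id, _ in results}
        top_score = results[0][1] if results else 0.0
        if not case["expected_doc_ids"] & retrieved_ids:
            failures.append((case["query"], "expected document not retrieved"))
        if top_score < MIN_CONFIDENCE:
            failures.append((case["query"], f"top score {top_score:.2f} below threshold"))
    return failures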

Second, full pipeline evaluation: query goes through retrieval, retrieved context feeds the LLM, and the response is checked against expected answer criteria, expected keywords, and fail keywords. A real test case: the query "Is service free?" was returning "service is recommended" when the correct answer from the knowledge base is "free service is included in your warranty." The test encodes both expected keywords ("free", "warranty", "included") and fail keywords ("recommended", "optional", "paid") - the LLM must get the answer right and must not soften it.
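
The keyword side of that check is plain string logic. A sketch using the "Is service free?" case above, with the answer-generation step left abstract:

# Full-pipeline keyword assertions for one RAG test case: the response must
# contain every expected keyword and none of the fail keywords.
CASE = {
    "query": "Is service free?",
    "expected_keywords": ["free", "warranty", "included"],
    "fail_keywords": ["recommended", "optional", "paid"],
}

def check_keywords(response: str, case: dict) -> list[str]:
    text = response.lower()
    problems = [f"missing expected keyword: {kw}"
                for kw in case["expected_keywords"] if kw not in text]
    problems += [f"contains fail keyword: {kw}"
                 for kw in case["fail_keywords"] if kw in text]
    return problems

# The softened answer fails; the correct answer passes with no problems.
print(check_keywords("Regular service is recommended for your vehicle.", CASE))
print(check_keywords("Free service is included in your warranty.", CASE))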

We also test out-of-scope boundary detection: queries like "mujhe pizza order karna hai" (I want to order pizza) should return low confidence scores and trigger the agent to decline helpfully rather than hallucinate an answer from training data.

Without boundary tests, a RAG agent is one ambiguous query away from confidently answering a question it has no business answering.

Voice Testing with Real-World Audio

Synthetic test fixtures - typed text simulating what a user might say - only go so far. Real-world telephony audio produces STT output that's fundamentally different from anything you'd type into a test case. Carrier codec compression, regional accents, background noise, mid-sentence language switching, and crosstalk all create transcription artifacts that synthetic tests can't replicate.

We place real test calls through the actual telephony pipeline as part of our evaluation workflow. These aren't automated - someone calls the agent, follows a scripted conversation with natural speech, and the system captures per-turn metrics and transcripts. The value isn't in automation; it's in exposing the gap between how you think users speak and how they actually sound on a phone line.

Frameworks like Hamming and Cekura attempt to bridge this gap with simulated callers, and they're useful for catching structural issues - tool routing errors, conversation flow breaks, timeout handling. But they can't replicate the phonetic unpredictability of a real person speaking Hinglish with a regional accent from a noisy truck depot. Both synthetic and live testing have their place; neither alone is sufficient.

What We Learned

Nine principles from building and maintaining testing systems across multiple voice agent deployments:

  1. Test per-step, not end-to-end. Isolating each conversation step makes tests faster, easier to debug, and more precise when they fail.
  2. Freeze conversation context as JSON files. Reproducibility is everything. A frozen JSON file produces the same test conditions every time.
  3. LLM rubrics over exact text matching. Non-determinism is a feature of language models, not a bug. Test what the response accomplishes, not what it says.
  4. Production bugs beat synthetic test cases. The weird, culturally specific, ambiguous edge cases from real calls are more valuable than any test you'll write at your desk. Build a pipeline that turns every bug into a test case.
  5. Test anti-hallucination rules specifically. Dedicated test configurations that verify the agent doesn't invent information. Hallucination in voice is audibly wrong – callers notice immediately.
  6. Test phonetic variations of domain terms. If your STT produces variable transcriptions, test every known variant of every important term.
  7. Run tests before AND after prompt changes. Before to establish the baseline. After to verify no regressions. This sounds obvious until you're iterating quickly and the temptation to "just ship it" is strong.
  8. Test each pipeline layer independently. STT accuracy, RAG retrieval quality, LLM response correctness, TTS pronunciation, and end-to-end latency are five different testing problems. Conflating them means you can't tell which layer is broken when something fails.
  9. Gate deployments on latency benchmarks. Every pipeline change - provider swap, model upgrade, configuration tweak - runs through a standardized latency harness before reaching production. The harness caught regressions that looked like improvements on paper.

What We Don't Test (Yet)

Honesty about what we haven't solved yet matters — here's the list:

Automated end-to-end voice regression. Our live call testing is manual and scripted. We don't yet have an automated system that places a call, speaks with natural audio, captures the transcript, and fails a CI pipeline if the response regresses. The infrastructure exists (simulated callers, telephony APIs), but wiring it into a reliable automated pipeline that handles the inherent variability of voice is a non-trivial problem we haven't prioritized.

Multi-turn coherence. Our per-step approach tests each step in isolation but doesn't catch subtle coherence issues that emerge only over many turns.

Load testing. No automated tests for concurrent call load and resource contention effects on quality.

Each is a real gap we'll address as the system matures.

The Takeaway

Voice agent testing is not traditional software testing with a voice skin. The non-determinism of language models, combined with the variability of real-world speech transcription, creates a testing problem that exact-match assertions cannot solve.

The approach that works: LLM-as-judge evaluation for semantic correctness, deterministic assertions for known failure modes, frozen conversation snapshots for reproducibility, a disciplined pipeline that turns every production bug into a permanent regression test, provider evaluation suites for STT/TTS quality, end-to-end latency benchmarking for every pipeline change, and independent RAG testing for retrieval accuracy.

After multiple production campaigns and thousands of calls across deployments in multiple languages, our test suites have grown to hundreds of test cases across dozens of configurations. The production regression tests - born from real bugs, not imagined scenarios - remain the most valuable. They encode hard-won knowledge about how real users actually talk to voice agents, in all their ambiguous, multilingual, culturally specific glory.

The agents keep getting better. Not because we write better prompts (though we do), but because we never let the same bug happen twice.


This post is part of our series on building production-grade voice AI agents. Read the other deep dives: Latency Tuning, Cost Optimization, Prompt Engineering for Voice, Analytics & Observability, and Production Hardening.

Building a voice agent and wondering how to test it? Let's design your test suite. We can help you set up evaluation pipelines for every layer - LLM response correctness, STT/TTS provider selection, RAG retrieval quality, and end-to-end latency benchmarking - tailored to your domain, language, and conversation flow.

Pradeep Jindal

Pradeep started by reverse engineering programs and writing assembly code out of sheer curiosity. That curiosity turned into two decades of building and scaling global product engineering teams at Yahoo!, InMobi, Helpshift, and Fairmatic — from distributed data pipelines and mobile SDKs serving 2 billion devices (as VP Engineering at Helpshift) to AI-driven insurance platforms.

A polyglot engineer who's as comfortable debugging a race condition as designing a system architecture, he's spent his career solving hard problems at every level of the stack and mentoring engineers to do the same. He co-founded ByondLabs with fellow engineering leaders from LinkedIn and Helpshift to help companies build the systems they can't afford to get wrong.
