Prompt Engineering for Multilingual Voice Agents

Pradeep Jindal · April 16, 2026 · 11 min read

Part of the series: What It Takes to Build Production-Grade Voice AI Agents

TL;DR: Voice prompt engineering is fundamentally different from chat. Our production multilingual voice agents' system prompts total 2,700+ lines across multiple personas – every line earned through a real call failure. The key patterns: treat conversation flow as a state machine (not a script), enumerate edge cases explicitly instead of relying on general rules, build extensive domain vocabulary into the prompt, and repeat critical anti-hallucination rules at every decision point. General instructions work ~80% of the time; in voice, the other 20% destroys user trust faster than you can recover it. We chose prompt engineering over fine-tuning deliberately, and we'd make that choice again.

If you're building voice AI, the LLM is the easy part. The hard part is telling it what to do.

In our overview post, we noted that our system prompts grew to 2,700+ lines. This post explains why, and walks through the specific patterns that got us from a 200-line prompt that worked in demos to a production system that handles thousands of real calls across India in Hindi and Hinglish.

The patterns generalize to any voice agent, but the specifics of multilingual and domain-heavy conversations are where prompt engineering gets genuinely hard.

Why Voice Is 10x Harder Than Chat

If you've shipped chat-based LLM applications, you might assume the prompt engineering carries over. It doesn't.

Voice changes the problem in fundamental ways: there's no visual formatting, no "scroll up" to check what was said, no time to re-read a response. The user hears your agent once, in real time, and judges in seconds.

But there's a deeper difference that most teams miss. With a general-purpose chatbot or assistant, the user chose to use it. They saw the "AI can make mistakes" disclaimer. They opted in. If the model hallucinates, they might shrug – they knew the deal. Production voice agents don't get that luxury.

In an outbound call, the user picked up the phone expecting a professional interaction from a brand they know. In an inbound call, they called a business expecting competent service. Neither consented to being on the receiving end of AI hallucinations. There is no disclaimer. The agent is representing your company to someone whose tolerance for being confidently told something wrong is zero.

That context shapes everything below.

Three problems that don't exist in chat:

Hallucination is audibly destructive. When a user mentions a product and the LLM helpfully fills in the brand name they never said, they hear it immediately and lose trust. In chat, a user might skim past an inaccuracy. In voice, it's jarring. We drove hallucination rates from ~15-20% of calls to near zero – but it took approximately 500 lines of layered anti-hallucination rules.

Goodbye vs. negative feedback. An LLM's instinct is to wrap up when it hears negativity. But negative feedback is often the most valuable data you're collecting. Teaching the model the difference between "I'll never buy this again" (keep talking – this is gold) and "I don't have time" (end the call politely) nearly derailed our first production campaign.

Domain vocabulary is infrastructure, not decoration. A single brand name can have six phonetic spellings in Hindi STT output. Numeric product codes follow specific Hindi pronunciation conventions. Without 200+ lines of domain terminology in the prompt, the agent can't understand a significant portion of what users say.

Why Prompts Over Fine-Tuning

We chose prompt engineering over fine-tuning deliberately:

Speed of iteration. Prompts deploy in minutes. Fine-tuning takes days. When you discover a production failure at 2 PM, you want it fixed by 3 PM, not next week.

Precision of rules. Prompts express exact rules. Fine-tuned models learn patterns. When you need "never add a brand name the user didn't say," you want a rule, not a pattern.

Auditability. Every prompt line traces to a specific production failure. With fine-tuning, behavioral changes are opaque.

Prompt caching economics. Most LLM providers cache repeated prompt prefixes at ~50% discount. With large system prompts identical across turns, roughly two-thirds of input tokens get served from cache. This makes the token cost of even 1,000+ line prompts manageable.
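The caching arithmetic above can be sketched in a few lines. This is illustrative only – the token counts, per-million-token price, and flat 50% discount are assumed numbers, and real provider pricing and cache mechanics vary:

```python
def effective_input_cost(total_tokens: int, cached_fraction: float,
                         price_per_mtok: float, cache_discount: float = 0.5) -> float:
    """Blended input cost per call when a fraction of tokens is served
    from the prompt cache at a discounted rate. Illustrative sketch only."""
    cached = total_tokens * cached_fraction
    uncached = total_tokens - cached
    return (uncached * price_per_mtok +
            cached * price_per_mtok * cache_discount) / 1_000_000

# With ~2/3 of a 10k-token prompt cached at a 50% discount,
# the blended cost is ~2/3 of the fully uncached cost.
full = effective_input_cost(10_000, 0.0, 1.0)
blended = effective_input_cost(10_000, 2 / 3, 1.0)
```

With two-thirds of tokens cached at half price, the blended rate works out to roughly 67% of the uncached cost, which is what makes large per-turn prompts economically tolerable.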

Fine-tuning is on our roadmap once we've accumulated enough production transcripts for training data. But for getting to production quickly and iterating once there, prompts win.

The Scale of Production Prompts

We started with 200 lines. It worked beautifully in demos. Then we deployed.

A user mentioned a product identifier and the agent added the brand name – something the user never said. A user expressed dissatisfaction and the agent ended the call, interpreting negativity as goodbye. A user spoke a numeric code using Hindi number-splitting conventions and the agent heard it as an ordinal number.

Each failure added lines. Not because we wanted a longer prompt, but because LLMs require explicit instruction for edge cases that humans handle through cultural context.

Our production deployment now involves multiple distinct agent personas:

Agent Type      Purpose                                                Temperature
Open-ended      Qualitative feedback through exploratory conversation  ~0.5
Diagnostic      Specific reasons behind a decision or outcome          ~0.3
Transactional   Structured information delivery, confirm next steps    ~0.3

The temperature differences are deliberate. Open-ended agents need enough creativity to respond naturally to unpredictable inputs. Transactional agents need precision and tight script adherence. This isn't a one-size-fits-all parameter.

The primary feedback agent runs at roughly 1,200 lines. A diagnostic agent comes in at 630 lines. A transactional agent at 350 lines. Combined with shared domain recognition guides, anti-hallucination rules, and vocabulary references, the total prompt infrastructure exceeds 2,700 lines across the system.

State Machine, Not Script

The most common mistake in voice agent design is treating conversations as linear scripts. Real conversations don't work that way. Users answer out of order, provide partial information, go off-topic, circle back, and give multi-part answers spanning multiple turns.

Our agents implement multi-flag state machines. At any point, the agent tracks collected information categories: primary intent, entity name, entity identifier, customer segment, permission status, feedback category, and closing intent. Before asking any question, the agent runs a mandatory search protocol against the conversation history – scanning all previous turns for keywords that indicate information was already provided.

SEARCH ALL USER TURNS FOR:
Keywords: [known entities in the domain]

IF FOUND:
- entity_collected = YES
- RESULT: DO NOT ask for entity

IF NOT FOUND:
- entity_collected = NO
- RESULT: May ask for entity (if not already asked)

This search runs every time the agent prepares a response. It's verbose. It adds dozens of prompt lines. But without it, the agent re-asks questions the user already answered – and in voice, that's an instant credibility killer. Users hang up when they feel unheard.

The state machine also enforces strict step ordering with explicit validation checkpoints – entry requirements that must be verified before proceeding, with error-check instructions that redirect the agent back if a step was skipped. This seems heavy-handed, but mini-tier LLMs will skip steps if you let them. They optimize for efficiency. In structured data collection, skipping a step means losing data.

Anti-Hallucination: Layered Defense

Our most persistent problem: entity name insertion.

When a user mentions a product identifier without specifying the brand, the LLM wants to helpfully fill in the brand name from its training data. But the user never said it. Maybe they bought a competitor's product. We can't know, and the agent must not assume.

We addressed this with a zero-tolerance policy expressed through multiple layers:

Explicit negative examples for every major entity in the domain. Not just "don't hallucinate" as a general rule, but specific patterns:

If user says "[identifier] liya hai" -> DO NOT add [Brand]
-> User NEVER mentioned [Brand]

This pattern is repeated for every major entity-brand combination in the domain.

A domain recognition guide with its own warning. The guide helps the agent understand what users say, but explicitly instructs that it does NOT give permission to assume missing information:

CRITICAL WARNING - DO NOT USE THIS GUIDE TO MAKE ASSUMPTIONS:
- This guide helps you UNDERSTAND what users say
- This guide does NOT give you permission to ASSUME missing information
- If a user mentions only an identifier -> ASK for the brand, do not infer it

Two-part validation logic directly in the prompt. Before proceeding, the agent checks: did the user mention the brand? Did they mention the identifier? If either is missing, ask for it.

Repetition at decision points. The anti-hallucination policy appears at least four times in different formulations – at the top as general rules, as specific examples, as validation checklists, and as pre-check instructions before responses. LLMs process long contexts imperfectly, and critical rules need reinforcement where they're most likely to be violated.

After implementing this layered approach, hallucination rates dropped from ~15-20% to near zero. The key insight: a single general rule ("don't hallucinate") works 80% of the time. The other 20% requires explicit enumeration at every decision point.
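The two-part validation logic can be sketched as a simple check over what the user actually said. The brand and identifier values below are placeholders, and a real implementation would match phonetic variants rather than exact substrings:

```python
def validate_product_reference(user_turns: list[str],
                               known_brands: set[str],
                               known_identifiers: set[str]) -> dict:
    """Two-part check: did the user themselves say a brand? an identifier?
    A missing half is returned as None -- ask for it, never infer it."""
    text = " ".join(user_turns).lower()
    brand = next((b for b in known_brands if b in text), None)
    identifier = next((i for i in known_identifiers if i in text), None)
    return {
        "brand": brand,              # None means: ASK, do not fill in
        "identifier": identifier,
        "ask_for": [name for name, value in
                    (("brand", brand), ("identifier", identifier))
                    if value is None],
    }
```

A user who says only an identifier yields `brand: None` – which routes the agent to a follow-up question instead of a guess from training data.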

The Goodbye Problem

This nearly derailed our first campaign.

Negative feedback – complaints, dissatisfaction, reasons for choosing a competitor – is the entire point of our feedback calls. But LLMs have a strong bias toward politeness and conversation termination. When a user expresses negativity, the model's instinct is to wrap up.

The problem is particularly subtle in Hindi/Hinglish:

Not goodbye (valuable feedback to explore): "nahi lunga" (won't buy it), "dobara nahi lenge" (won't buy again), "spare parts nahi milte" (can't get spare parts), "bekar hai" (it's useless), "satisfied nahi" (not satisfied).

Actual goodbye (end the call): "time nahi hai" (don't have time), "baad mein" (later), "disconnect karo" (disconnect the call), "baat nahi karni" (don't want to talk).

The distinction is cultural, not linguistic. "Nahi lunga" is a statement about a product – exactly the signal the business needs. But an LLM trained primarily on English conversation patterns reads sustained negativity as a cue to disengage.

We solved this through explicit enumeration – listing every known negative feedback phrase as "NOT goodbye" and every actual goodbye phrase as a termination signal. We'd tried a general rule first ("don't end the call based on negative feedback"). It worked 80% of the time. The other 20% cost us calls where users were explaining exactly why they chose a competitor – the most valuable data – and the agent thanked them and hung up.

Enumeration over inference. It's verbose, requires maintenance as new phrases emerge from transcripts, and is the only approach that reliably works.
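The enumeration approach can be sketched directly from the phrase lists above. The action labels are illustrative, and a production list would be far longer and maintained from transcripts:

```python
# Enumeration-based goodbye detection -- phrases from the examples above.
NOT_GOODBYE = {   # negative feedback: keep the conversation going
    "nahi lunga", "dobara nahi lenge", "spare parts nahi milte",
    "bekar hai", "satisfied nahi",
}
GOODBYE = {       # genuine termination signals: end politely
    "time nahi hai", "baad mein", "disconnect karo", "baat nahi karni",
}

def classify_turn(utterance: str) -> str:
    text = utterance.lower()
    # Check termination phrases first, so "time nahi hai" is never
    # misread as generic negativity worth exploring.
    if any(phrase in text for phrase in GOODBYE):
        return "END_CALL"
    if any(phrase in text for phrase in NOT_GOODBYE):
        return "EXPLORE_FEEDBACK"   # the most valuable data on the call
    return "CONTINUE"
```

In the actual system this distinction lives in the prompt as enumerated rules rather than code, but the logic the LLM is asked to follow is exactly this lookup-before-inference shape.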

Domain Vocabulary as Infrastructure

Users in specialized industries speak a rich vernacular that no general-purpose LLM understands. This isn't just about language – it's an entire occupational vocabulary with no English equivalent, phonetic ambiguities that break standard STT, and number systems that follow industry conventions.

Our prompts include 200+ lines of domain vocabulary:

Brand name variations. A single manufacturer might be referenced six different ways in spoken Hindi – regional accents produce different vowel placements, consonant substitutions, and transliteration variants. Without explicit mapping, our STT providers transcribe one variant, the LLM fails to recognize it, and the conversation derails.

Numeric identifier pronunciations. Multi-digit product codes follow a specific Hindi convention – split into digit pairs, each pronounced as a complete Hindi number. This creates critical disambiguation challenges:

"pachpan tees" (55-30) = 5530  -- product code
NOT "panchvan" (fifth)          -- ordinal number

"pachas tees" (50-30) = 5030   -- DIFFERENT product code
"atthais chhabbis" (28-26) = 2826  -- yet another code

Note how "pachpan tees" (5530) and "pachas tees" (5030) sound nearly identical to untrained ears but refer to completely different products. We include a complete Hindi number vocabulary from 0-99, with phonetic spellings, specifically to prevent these misinterpretations.

Industry slang. Every domain has informal terminology with specific meanings. In our case, operators use terms like "ghoda" (literally "horse") to mean truck, "chakka" for wheel count – so "chhakka" is a 6-wheeler, "dassi" is a 10-wheeler. "Maal" means cargo. "Average" means fuel efficiency.

Without this vocabulary, the agent can't parse domain-specific utterances. With it, the agent processes "Chhakka ghoda liya hai, average achhi nahi hai" (I bought a 6-wheeler truck, the fuel efficiency isn't good) without missing a beat.

Vehicle specification disambiguation. Users describe vehicles by body length ("22 footer"), seating capacity ("17 seater"), or wheel count. These are NOT model numbers, but the agent needs to understand them as vehicle attributes. The prompt explicitly teaches the difference:

"Maine 22 footer truck liya hai" = User bought a truck with 22-foot body length
"Mujhe 17 seater chahiye" = User wants a 17-seater vehicle

This vocabulary serves a dual purpose. It helps the LLM understand inputs, and it constrains outputs. The agent speaks professional Hinglish while understanding colloquial Hindi. That asymmetry is intentional – accessible to users while maintaining professionalism.

Context Injection: Pre-filling What You Know

Every outbound call starts with information from the CRM – customer name, product details, account history, relevant dates. Injecting this into the prompt shortens calls (no asking questions you know the answers to) and reduces hallucination (verified facts anchor the conversation).

We chose direct prompt injection over RAG deliberately. The context is structured (key-value pairs, not documents), known before the call starts (not discovered during it), and fits comfortably in the prompt. RAG would add retrieval latency to every turn in a system where the total response budget is under two seconds – for data we already have in a database row. Direct injection is simpler, faster, and more cache-friendly.

Our prompts include a placeholder populated at call time with structured customer data. The prompt provides detailed instructions for using it – reference actual dates rather than vague timeframes, adapt the conversational opening based on status fields.

When no customer data is available, the system injects a fallback instruction: use generic greetings, don't reference specific details. This prevents the agent from hallucinating customer information it doesn't have.

Context injection also handles the multi-product problem. When a customer says "Which one?" indicating multiple products or accounts, the agent references the specific identifier from injected context instead of stumbling.

Voice-Specific Prompt Patterns

Beyond domain challenges, voice as a medium demands different patterns:

Keep responses short. Chat users skim. Voice users can't. Every unnecessary word adds seconds and cognitive load. Our prompts instruct: don't monotonously acknowledge each response, don't repeat information back, capture information silently and move forward.

One question per turn. In chat, multiple questions work. In voice, users answer one and forget the other. Single-question turns with explicit routing logic.

Handle interruptions gracefully. Don't restart the entire question, don't skip ahead, complete the thought briefly.

Acknowledge listening sounds correctly. Hindi conversation includes constant backchannel signals – "hmm", "haan", "achha." These are listening sounds, NOT answers. The prompt teaches the agent to wait for complete responses.

TTS pronunciation awareness. Account for how text renders in speech. Ordinal numbers like "1st" get mispronounced – use Hindi words like "Pehli" instead. Numeric identifiers must follow industry pronunciation conventions, not English number reading.

Negation detection in Hindi. "Nahi" doesn't always mean refusal. "Nahi, main bata raha hoon..." means "No, I'm telling you..." – the user is explaining, not refusing. The prompt includes explicit pattern matching for contextual negation.
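The contextual-negation pattern above can be sketched with a simple rule: "nahi" followed by an explanatory continuation is a correction, not a refusal. The cue words below are illustrative assumptions, not the production pattern list:

```python
import re

# Hypothetical explanation cues: "nahi" followed by "main"/"mera"/etc.
# usually means the user is correcting or elaborating, not refusing.
EXPLANATION_CUES = re.compile(r"^nahi,?\s+(main|mera|maine|wo|ye)\b")

def is_refusal(utterance: str) -> bool:
    text = utterance.strip().lower()
    if not text.startswith("nahi"):
        return False
    # "Nahi, main bata raha hoon..." -> explaining, not refusing
    return EXPLANATION_CUES.match(text) is None
```

As with the goodbye problem, the prompt expresses this as enumerated examples rather than regexes – but the decision boundary the LLM must learn is the same.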

What We Learned

After shipping through multiple production campaigns and thousands of calls:

Enumerate, don't generalize. When you need the LLM to make a distinction, list every specific case. General rules work 80% of the time. In voice, that remaining 20% destroys trust.

Repeat critical rules. Anti-hallucination rules appear at least four times in different formulations. LLMs process long contexts imperfectly. Critical rules need reinforcement at the points where they're most likely to be violated.

Match temperature to task. Open-ended conversations need moderate temperature. Transactional agents need lower temperature. The right value depends on creative latitude needed.

Invest in vocabulary. Domain-specific vocabulary is core infrastructure. 200+ lines of terminology, number guides, and slang dictionaries are what make the agent functional in its domain. Without them, the agent is a Hindi speaker who doesn't understand the people it's talking to.

Design for the medium. Voice prompts must account for TTS pronunciation, interruption handling, backchannel signals, and the constraint that users hear output exactly once.

Prompt engineering is not a one-time activity. Our prompts evolve with every campaign. We review transcripts, identify failures, add targeted instructions. The prompt files accumulate numbered fix annotations – each referencing a specific production failure. This is how production voice prompts work.

The prompt is not just instructions. For a voice agent operating in a specialized domain, in a language with complex code-switching patterns, speaking to users who judge it in five seconds – the prompt is the product.

It encodes business logic, cultural knowledge, domain expertise, and conversational design. Getting it right is the hardest part of building a voice agent, and there are no shortcuts.


This post is part of our series on building production-grade voice AI agents. Read the other deep dives: Latency Tuning, Cost Optimization, Analytics & Observability, Testing Strategies, and Production Hardening.

Wrestling with voice agent prompts? Let's talk prompt strategy – we've written 2,700+ lines of production voice prompt across multiple agent personas and languages. We can help you design your prompt architecture: state machine structure, anti-hallucination layers, domain vocabulary, and the testing pipeline that catches regressions before they reach production.

Pradeep Jindal

Pradeep started by reverse engineering programs and writing assembly code out of sheer curiosity. That curiosity turned into two decades of building and scaling global product engineering teams at Yahoo!, InMobi, Helpshift, and Fairmatic — from distributed data pipelines and mobile SDKs serving 2 billion devices (as VP Engineering at Helpshift) to AI-driven insurance platforms.

A polyglot engineer who's as comfortable debugging a race condition as designing a system architecture, he's spent his career solving hard problems at every level of the stack and mentoring engineers to do the same. He co-founded ByondLabs with fellow engineering leaders from LinkedIn and Helpshift to help companies build the systems they can't afford to get wrong.

