ADR: Multilingual Agent Strategy¶
IMPLEMENTED — v3.0 Dual-Track Architecture
The dual-track bilingual strategy is live in production. The Router detects the caller's language from the first utterance and routes the call to the EN or ZH track. See Squad Architecture.
Date: January 2026 (v1.0), updated February 2026 (v3.0)
Status: Updated
Decision: v1.0-v2.3.0: ONE multilingual agent per clinic with auto-detection. v3.0: Dual-track with explicit language gate.
Context¶
VitaraVox v1.0 supports English and Mandarin for BC healthcare clinics. The question: should we use separate agents per language or a single multilingual agent?
Decision¶
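Use ONE multilingual Vapi.ai assistant per clinic with Deepgram Nova-2 "multi" mode for automatic language detection. This decision applied from v1.0 through v2.3.0 and is superseded by the v3.0 update later in this record.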
Architecture¶
+----------------------------------------------------------------------+
|                                                                      |
|  Patient Call Flow                                                   |
|  =================                                                   |
|                                                                      |
|  Patient dials clinic number                                         |
|          |                                                           |
|          v                                                           |
|  +---------------+                                                   |
|  |    Telnyx     |  <-- Receives call, routes to Vapi.ai             |
|  +-------+-------+                                                   |
|          |                                                           |
|          v                                                           |
|  +----------------------------------------------------------------+  |
|  |                                                                |  |
|  |             SINGLE MULTILINGUAL VAPI.AI ASSISTANT              |  |
|  |                                                                |  |
|  |  +----------------------------------------------------------+  |  |
|  |  |  Transcriber: Deepgram Nova-2 (language: "multi")        |  |  |
|  |  |    --> Auto-detects: English, Mandarin (v1.0)            |  |  |
|  |  |    --> Seamless mid-conversation switching               |  |  |
|  |  +----------------------------------------------------------+  |  |
|  |                                                                |  |
|  |  +----------------------------------------------------------+  |  |
|  |  |  LLM: GPT-4o (streaming enabled)                         |  |  |
|  |  |    --> Multilingual system prompt                        |  |  |
|  |  |    --> Tool calls to OSCAR EMR                           |  |  |
|  |  +----------------------------------------------------------+  |  |
|  |                                                                |  |
|  |  +----------------------------------------------------------+  |  |
|  |  |  Voice: Azure TTS (multilingual-auto)                    |  |  |
|  |  |    --> en-US-AriaNeural (English)                        |  |  |
|  |  |    --> zh-CN-XiaoxiaoNeural (Mandarin)                   |  |  |
|  |  +----------------------------------------------------------+  |  |
|  |                                                                |  |
|  +----------------------------------------------------------------+  |
|          |                                                           |
|          v  Tool Calls (real-time during conversation)               |
|  +---------------+                                                   |
|  |     OSCAR     |  <-- check_availability, book_appointment         |
|  |      EMR      |                                                   |
|  +---------------+                                                   |
|                                                                      |
+----------------------------------------------------------------------+
Comparison¶
| Factor | Single Multilingual | Multiple Language Agents |
|---|---|---|
| Agents per clinic | 1 | 5 (one per language, at five languages) |
| Total for 5 clinics | 5 | 25 |
| Configuration burden | Low | 5x higher |
| Language switching | Seamless | Requires transfer (200-500ms) |
| Mid-conversation mix | Supported | Complex routing needed |
| IVR complexity | None (auto-detect) | "Press 1 for English..." |
| Maintenance | Single prompt | 5 prompts to sync |
Rationale¶
1. Patient Demographics in BC¶
Many BC healthcare patients are bilingual (English/Mandarin). They may:
- Start in one language, switch mid-sentence
- Use English for medical terms, Mandarin for personal details
- Have family members on the call speaking different languages
A single multilingual agent handles all three cases without a transfer.
2. Latency¶
Target: <800ms end-to-end voice response
| Approach | Latency Impact |
|---|---|
| Single agent | Optimal streaming, no transfers |
| Multiple agents | +200-500ms per transfer/handoff |
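To make the comparison concrete, the sketch below adds the handoff penalty to a response time and checks it against the 800 ms target. Only the target and the 200-500 ms range come from this section. The 600 ms baseline and the function names are hypothetical placeholders, not measurements or code from this system.

```python
TARGET_MS = 800                    # end-to-end target stated above
TRANSFER_PENALTY_MS = (200, 500)   # per-handoff range from the table above


def worst_case_latency_ms(baseline_ms: int, handoffs: int) -> int:
    """Baseline response time plus the worst-case penalty for each handoff."""
    return baseline_ms + handoffs * TRANSFER_PENALTY_MS[1]


def within_target(baseline_ms: int, handoffs: int) -> bool:
    """True if the turn still meets the target in the worst case."""
    return worst_case_latency_ms(baseline_ms, handoffs) <= TARGET_MS


# Hypothetical 600 ms baseline (an assumption, not a measured value).
print(within_target(600, handoffs=0))  # True
print(within_target(600, handoffs=1))  # False
```

Under that assumed baseline, a single worst-case handoff is enough to miss the target for the turn in which it happens.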
3. Configuration¶
v1.0 uses manual Vapi.ai setup:
| Approach | Setup Time (5 clinics) |
|---|---|
| Single multilingual | 5 assistants = ~2 hours |
| Per-language agents | 25 assistants = ~10 hours |
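The figures in this table follow from simple arithmetic, sketched below. The per-assistant time of 24 minutes is derived from "5 assistants = ~2 hours" and is only as accurate as that estimate. The function names are illustrative, and the five-language count is implied by the comparison table rather than by the v1.0 language list.

```python
MINUTES_PER_ASSISTANT = 24  # derived: ~2 hours / 5 assistants


def assistants_needed(clinics: int, languages: int, per_language: bool) -> int:
    """One assistant per clinic, or one per clinic per language."""
    return clinics * (languages if per_language else 1)


def setup_hours(assistants: int) -> float:
    """Estimated manual setup time in hours."""
    return assistants * MINUTES_PER_ASSISTANT / 60


single = assistants_needed(clinics=5, languages=5, per_language=False)
split = assistants_needed(clinics=5, languages=5, per_language=True)
print(single, setup_hours(single))  # 5 2.0
print(split, setup_hours(split))    # 25 10.0
```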
4. User Experience¶
Single Multilingual Agent:
Patient: "Hello, I'd like to book an appointment"
Agent: "Hello! I'd be happy to help you book an appointment..."
Patient: "其实我想用中文" (Actually I want to use Chinese)
Agent: "没问题!请问您想预约什么时间?" (No problem! When would you like to book?)
Multiple Language Agents:
IVR: "Press 1 for English, 2 for 中文..."
Patient presses 1
English Agent: "Hello! How can I help?"
Patient: "Actually, can I switch to Chinese?"
Agent: "Let me transfer you..."
[200-500ms silence]
Chinese Agent: "您好,请问有什么可以帮您的?"
Patient: [repeats request]
Vapi.ai Configuration¶
Transcriber (Deepgram)¶
{
  "transcriber": {
    "provider": "deepgram",
    "model": "nova-2",
    "language": "multi",
    "smartFormat": true
  }
}
Voice (Azure TTS)¶
{
  "voice": {
    "provider": "azure",
    "voiceId": "en-US-AriaNeural",
    "multilingualSettings": {
      "enabled": true,
      "fallbackVoices": {
        "zh-CN": "zh-CN-XiaoxiaoNeural"
      }
    }
  }
}
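The voice configuration above pairs a default voice with a per-locale fallback. A minimal sketch of how such a lookup could resolve is shown here in plain Python. It illustrates the mapping only and is not Vapi.ai's implementation; `pick_voice` is a hypothetical helper.

```python
DEFAULT_VOICE = "en-US-AriaNeural"                   # "voiceId" in the config
FALLBACK_VOICES = {"zh-CN": "zh-CN-XiaoxiaoNeural"}  # "fallbackVoices" in the config


def pick_voice(locale: str) -> str:
    """Return the fallback voice for a locale, or the default voice."""
    return FALLBACK_VOICES.get(locale, DEFAULT_VOICE)


print(pick_voice("zh-CN"))  # zh-CN-XiaoxiaoNeural
print(pick_voice("en-US"))  # en-US-AriaNeural
```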
Language Roadmap¶
| Version | Languages | Approach |
|---|---|---|
| v1.0 | English, Mandarin | Single multilingual agent, auto-detect |
| v2.0-v2.3.0 | EN, ZH, FR, PA | Single multilingual agent, auto-detect |
| v3.0 | English, Mandarin | Dual-track with explicit language gate |
| Future | + French, Cantonese, Punjabi | Additional tracks (FR track, etc.) |
v3.0 Update: Dual-Track Architecture¶
Why the Change¶
Production experience with v2.3.0 revealed limitations of the single-multilingual-agent approach:
- GPT-4o language confusion: The LLM occasionally output space-separated Chinese characters or mixed language fragments
- STT accuracy: Deepgram nova-2 "multi" mode sacrifices per-language accuracy for breadth
- TTS quality: ElevenLabs eleven_multilingual_v2 produces adequate but not native-quality Mandarin
- Prompt bloat: Bilingual prompts are 40-60% longer than monolingual equivalents
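The first symptom, space-separated Chinese characters, is easy to detect mechanically. The sketch below shows one way to flag and repair it with a regular expression. It is an illustration only: this record does not say such a check was deployed, the helper names are hypothetical, and the pattern assumes the basic CJK Unified Ideographs block covers the affected text.

```python
import re

# Han ideographs in the basic CJK Unified Ideographs block (an assumption).
_HAN = r"[\u4e00-\u9fff]"
# Whitespace that sits directly between two Han characters.
_SPACED_HAN = re.compile(rf"(?<={_HAN})\s+(?={_HAN})")


def has_spaced_han(text: str) -> bool:
    """True if any two Han characters are separated only by whitespace."""
    return _SPACED_HAN.search(text) is not None


def collapse_spaced_han(text: str) -> str:
    """Remove whitespace between Han characters, leaving other text intact."""
    return _SPACED_HAN.sub("", text)


print(has_spaced_han("请 问 您 想 预 约"))       # True
print(collapse_spaced_han("请 问 您 想 预 约"))  # 请问您想预约
print(collapse_spaced_han("book 预 约 now"))     # book 预约 now
```

Spaces between English words, and between English and Chinese text, are left untouched because the pattern requires a Han character on both sides.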
v3.0 Decision¶
Use an explicit language gate with per-language agent tracks:
+--------------------------------------------------+
|  ROUTER (AssemblyAI Universal = bilingual STT)   |
|  - Detect language from first utterance          |
|  - Route to EN or ZH track                       |
+---------------------+----------------------------+
                      |
          +-----------+-----------+
          v                       v
 EN TRACK (4 agents)     ZH TRACK (4 agents)
 STT: Deepgram en        STT: Deepgram zh
 TTS: ElevenLabs         TTS: Azure Xiaoxiao
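The gate can be sketched as a small routing function over the transcript of the first utterance. This record does not describe how the Router decides the language beyond using AssemblyAI Universal STT, so the character-ratio heuristic, the names `han_ratio` and `route_track`, and the 0.3 threshold are all illustrative assumptions rather than the production logic.

```python
def han_ratio(text: str) -> float:
    """Fraction of non-space characters that are Han ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    han = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return han / len(chars)


def route_track(first_utterance: str, threshold: float = 0.3) -> str:
    """Choose a track once, from the first utterance; default to EN."""
    return "ZH" if han_ratio(first_utterance) >= threshold else "EN"


print(route_track("Hello, I'd like to book an appointment"))  # EN
print(route_track("其实我想用中文"))                            # ZH
print(route_track(""))                                         # EN
```

Because the decision is made once per call, a later switch of language does not change the track, which matches the trade-off this record accepts for v3.0.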
v3.0 vs v1.0 Comparison¶
| Factor | v1.0 Single Agent | v3.0 Dual-Track |
|---|---|---|
| Agents per clinic | 1 | 9 |
| Language detection | Auto (Deepgram multi) | Explicit (Router gate) |
| Mid-call switching | Seamless | Stays in initial track |
| STT accuracy (per language) | Good | Better (dedicated model) |
| TTS quality (Mandarin) | Adequate | Native (Azure) |
| Prompt complexity | High (bilingual) | Low (monolingual) |
| LLM language confusion | Occasional | Eliminated |
| Configuration burden | Low | Higher (9 agents, GitOps manages it) |
Trade-off: Mid-Call Language Switching¶
The main drawback of v3.0 is losing seamless mid-call language switching. If a patient starts in English and switches to Mandarin mid-conversation, v3.0 stays in the English track.
Mitigation: Most BC callers have a clear primary language preference, and truly bilingual callers who switch mid-call are a minority. Because the Router detects the language at the start of every call, a patient who calls back can be routed to the other track.
Consequences¶
v1.0-v2.3.0¶
Positive:
- Simpler architecture (fewer moving parts)
- Better patient experience (no IVR, seamless switching)
- Lower operational overhead
- Faster onboarding for new clinics
Negative:
- Single point of failure per clinic (mitigated by Vapi.ai reliability)
- System prompt complexity (one prompt handles all languages)
- Testing matrix grows with each language added
v3.0 (Additional)¶
Positive:
- Higher per-language STT accuracy
- Native-quality Mandarin TTS
- Simpler, shorter monolingual prompts
- Eliminates LLM language confusion
- Consistent voice personality per language track
Negative:
- 9x agents per clinic (mitigated by Vapi GitOps)
- No mid-call language switching
- Two parallel prompt sets to maintain (EN + ZH)
- Adding a new language requires a full new track (4 agents)
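The agent counts above follow a simple formula, sketched below. Treating the Router as the ninth agent is an inference from 9 = 1 + 4 + 4 and is not stated explicitly in this record. The function name is illustrative.

```python
ROUTER_AGENTS = 1      # assumed: the Router counts as one agent
AGENTS_PER_TRACK = 4   # from the v3.0 diagram


def agents_per_clinic(tracks: int) -> int:
    """Router plus four agents for each language track."""
    return ROUTER_AGENTS + AGENTS_PER_TRACK * tracks


print(agents_per_clinic(2))  # 9  (EN + ZH, as in v3.0)
print(agents_per_clinic(3))  # 13 (for example, after adding an FR track)
```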