ADR: Multilingual Agent Strategy

Date: January 2026 (v1.0); updated February 2026 (v3.0)
Status: Updated
Decision: v1.0-v2.3.0: ONE multilingual agent per clinic with auto-detection. v3.0: dual-track with explicit language gate.

Context

VitaraVox v1.0 supports English and Mandarin for BC healthcare clinics. The question: should we use separate agents per language or a single multilingual agent?

Decision

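For v1.0-v2.3.0: use ONE multilingual Vapi.ai assistant per clinic with Deepgram Nova-2 "multi" mode for automatic language detection. v3.0 replaces this with the dual-track design described under "v3.0 Update: Dual-Track Architecture".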

Architecture

+------------------------------------------------------------------+
|                                                                  |
|   Patient Call Flow                                              |
|   ================                                               |
|                                                                  |
|   Patient dials clinic number                                    |
|           |                                                      |
|           v                                                      |
|   +---------------+                                              |
|   |    Telnyx     |   <-- Receives call, routes to Vapi.ai       |
|   +-------+-------+                                              |
|           |                                                      |
|           v                                                      |
|   +------------------------------------------------------------------+
|   |                                                                  |
|   |          SINGLE MULTILINGUAL VAPI.AI ASSISTANT                   |
|   |                                                                  |
|   |   +----------------------------------------------------------+   |
|   |   |  Transcriber: Deepgram Nova-2 (language: "multi")        |   |
|   |   |  --> Auto-detects: English, Mandarin (v1.0)              |   |
|   |   |  --> Seamless mid-conversation switching                 |   |
|   |   +----------------------------------------------------------+   |
|   |                                                                  |
|   |   +----------------------------------------------------------+   |
|   |   |  LLM: GPT-4o (streaming enabled)                         |   |
|   |   |  --> Multilingual system prompt                          |   |
|   |   |  --> Tool calls to OSCAR EMR                             |   |
|   |   +----------------------------------------------------------+   |
|   |                                                                  |
|   |   +----------------------------------------------------------+   |
|   |   |  Voice: Azure TTS (multilingual-auto)                    |   |
|   |   |  --> en-US-AriaNeural (English)                          |   |
|   |   |  --> zh-CN-XiaoxiaoNeural (Mandarin)                     |   |
|   |   +----------------------------------------------------------+   |
|   |                                                                  |
|   +------------------------------------------------------------------+
|           |                                                      |
|           v  Tool Calls (real-time during conversation)          |
|   +---------------+                                              |
|   |    OSCAR      |   <-- check_availability, book_appointment   |
|   |     EMR       |                                              |
|   +---------------+                                              |
|                                                                  |
+------------------------------------------------------------------+

Comparison

Factor	Single Multilingual	Multiple Language Agents
Agents per clinic	1	5 (one per language)
Total for 5 clinics	5	25
Configuration burden	Low	5x higher
Language switching	Seamless	Requires transfer (200-500ms)
Mid-conversation mix	Supported	Complex routing needed
IVR complexity	None (auto-detect)	"Press 1 for English..."
Maintenance	Single prompt	5 prompts to sync

Rationale

1. Patient Demographics in BC

Many BC healthcare patients are bilingual (English/Mandarin). They may:

Start in one language, switch mid-sentence
Use English for medical terms, Mandarin for personal details
Have family members on the call speaking different languages

A single multilingual agent handles all three cases without a transfer.

2. Latency

Target: <800ms end-to-end voice response

Approach	Latency Impact
Single agent	Optimal streaming, no transfers
Multiple agents	+200-500ms per transfer/handoff
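
The latency figures above can be read as a simple budget. The following is a minimal sketch, assuming transfer overhead adds linearly to the turn in which the handoff happens; the function name `remaining_budget_ms` is illustrative and not part of the system.

```python
TARGET_MS = 800  # end-to-end voice response target stated in this ADR


def remaining_budget_ms(transfers: int, per_transfer_ms: int,
                        target_ms: int = TARGET_MS) -> int:
    """Budget left for STT + LLM + TTS after transfer overhead (may go negative)."""
    return target_ms - transfers * per_transfer_ms


# Single agent: no transfers, so the whole budget is available.
single = remaining_budget_ms(transfers=0, per_transfer_ms=0)

# Per-language agents: one handoff at each end of the 200-500 ms range.
best_case = remaining_budget_ms(transfers=1, per_transfer_ms=200)
worst_case = remaining_budget_ms(transfers=1, per_transfer_ms=500)

print(single, best_case, worst_case)
```

Under these assumptions a single handoff consumes between a quarter and more than half of the 800 ms target on that turn.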

3. Configuration

v1.0 assistants are configured manually in Vapi.ai, so setup time scales with the number of assistants:

Approach	Setup Time (5 clinics)
Single multilingual	5 assistants = ~2 hours
Per-language agents	25 assistants = ~10 hours
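
The setup estimates in the table follow from a flat per-assistant cost. As a rough sketch, assuming the ~24 minutes per assistant implied by "5 assistants = ~2 hours" holds for every assistant (the helper `setup_hours` is hypothetical):

```python
# Derived from the table above: 5 assistants in ~2 hours is ~24 minutes each.
MINUTES_PER_ASSISTANT = 24


def setup_hours(clinics: int, assistants_per_clinic: int,
                minutes_per_assistant: int = MINUTES_PER_ASSISTANT) -> float:
    """Estimated manual setup time in hours, assuming a flat per-assistant cost."""
    total_assistants = clinics * assistants_per_clinic
    return total_assistants * minutes_per_assistant / 60


print(setup_hours(5, 1))  # single multilingual: 5 assistants -> 2.0
print(setup_hours(5, 5))  # per-language: 25 assistants -> 10.0
```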

4. User Experience

Single Multilingual Agent:

Patient: "Hello, I'd like to book an appointment"
Agent: "Hello! I'd be happy to help you book an appointment..."
Patient: "其实我想用中文" (Actually I want to use Chinese)
Agent: "没问题！请问您想预约什么时间？" (No problem! When would you like to book?)

Multiple Language Agents:

IVR: "Press 1 for English, 2 for 中文..."
Patient presses 1
English Agent: "Hello! How can I help?"
Patient: "Actually, can I switch to Chinese?"
Agent: "Let me transfer you..."
[200-500ms silence]
Chinese Agent: "您好，请问有什么可以帮您的？"
Patient: [repeats request]

Vapi.ai Configuration

Transcriber (Deepgram)

{
  "transcriber": {
    "provider": "deepgram",
    "model": "nova-2",
    "language": "multi",
    "smartFormat": true
  }
}

Voice (Azure TTS)

{
  "voice": {
    "provider": "azure",
    "voiceId": "en-US-AriaNeural",
    "multilingualSettings": {
      "enabled": true,
      "fallbackVoices": {
        "zh-CN": "zh-CN-XiaoxiaoNeural"
      }
    }
  }
}
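
The two fragments above, together with the GPT-4o model named in the architecture diagram, might be assembled into one assistant payload as sketched below. This is an illustration only: `build_assistant_config` and the `name` field are hypothetical, the shape of the `model` block is an assumption, and the `transcriber` and `voice` values are copied from the fragments above. No API call is made.

```python
import json


def build_assistant_config(clinic_name: str) -> dict:
    """Assemble one multilingual assistant definition for a clinic (sketch)."""
    return {
        "name": f"{clinic_name} multilingual assistant",  # hypothetical field value
        "transcriber": {
            "provider": "deepgram",
            "model": "nova-2",
            "language": "multi",  # auto-detect English and Mandarin
            "smartFormat": True,
        },
        # Assumed shape; the ADR only states "GPT-4o (streaming enabled)".
        "model": {"provider": "openai", "model": "gpt-4o"},
        "voice": {
            "provider": "azure",
            "voiceId": "en-US-AriaNeural",
            "multilingualSettings": {
                "enabled": True,
                "fallbackVoices": {"zh-CN": "zh-CN-XiaoxiaoNeural"},
            },
        },
    }


config = build_assistant_config("Example Clinic")
print(json.dumps(config, ensure_ascii=False, indent=2))
```

Keeping the payload in one function per clinic reflects the "single prompt, single assistant" maintenance argument from the comparison table.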

Language Roadmap

Version	Languages	Approach
v1.0	English, Mandarin	Single multilingual agent, auto-detect
v2.0-v2.3.0	English, Mandarin, French, Punjabi	Single multilingual agent, auto-detect
v3.0	English, Mandarin	Dual-track with explicit language gate
Future	+ French, Cantonese, Punjabi	Additional tracks (FR track, etc.)

v3.0 Update: Dual-Track Architecture

Why the Change

Production experience with v2.3.0 revealed limitations of the single-multilingual-agent approach:

GPT-4o language confusion: The LLM occasionally produced space-separated Chinese characters or mixed-language fragments
STT accuracy: Deepgram Nova-2 "multi" mode trades per-language accuracy for breadth
TTS quality: ElevenLabs eleven_multilingual_v2 produces adequate but not native-quality Mandarin
Prompt bloat: Bilingual prompts are 40-60% longer than monolingual equivalents
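
The first limitation above, space-separated Chinese characters, is easy to recognise mechanically. The sketch below shows one way such output could be detected and collapsed; it is a hypothetical guard, not something the ADR says was deployed, and it only covers the basic CJK Unified Ideographs block.

```python
import re

# CJK Unified Ideographs (basic block); covers common Mandarin characters.
_CJK = r"[\u4e00-\u9fff]"
# Whitespace that sits directly between two CJK characters.
_SPACED_CJK = re.compile(rf"(?<={_CJK})\s+(?={_CJK})")


def has_spaced_chinese(text: str) -> bool:
    """True if any two Chinese characters are separated only by whitespace."""
    return _SPACED_CJK.search(text) is not None


def collapse_spaced_chinese(text: str) -> str:
    """Remove whitespace between Chinese characters, leaving other spacing intact."""
    return _SPACED_CJK.sub("", text)


print(has_spaced_chinese("没 问 题"))
print(collapse_spaced_chinese("没 问 题"))
```

Spaces between a Chinese character and a Latin word are left alone, so mixed-language fragments would still need a separate check.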

v3.0 Decision

Use an explicit language gate with per-language agent tracks:

+--------------------------------------------------+
| ROUTER (AssemblyAI Universal = bilingual STT)    |
| - Detect language from first utterance           |
| - Route to EN or ZH track                        |
+---------------------+----------------------------+
                      |
          +-----------+-----------+
          v                       v
    EN TRACK (4 agents)      ZH TRACK (4 agents)
    STT: Deepgram en         STT: Deepgram zh
    TTS: ElevenLabs          TTS: Azure Xiaoxiao
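
The gate in the diagram can be sketched as a lookup keyed on the detected language. This is a toy model under stated assumptions: `Track`, `TRACKS`, `detect_language`, and `route` are hypothetical names, and the character-range check merely stands in for the Router's STT-based detection. The agent count assumes the Router itself counts as one agent, which matches the figure of 9 agents per clinic given in the comparison for v3.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    name: str
    stt: str
    tts: str
    agents: int


# Values taken from the diagram above.
TRACKS = {
    "en": Track("EN", stt="Deepgram en", tts="ElevenLabs", agents=4),
    "zh": Track("ZH", stt="Deepgram zh", tts="Azure Xiaoxiao", agents=4),
}


def detect_language(first_utterance: str) -> str:
    """Toy stand-in for STT language detection: 'zh' if any CJK ideograph appears."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in first_utterance)
    return "zh" if has_cjk else "en"


def route(first_utterance: str) -> Track:
    """Pick the track for a call from its first utterance."""
    return TRACKS[detect_language(first_utterance)]


# One Router plus the agents in every track (assumes the Router is one agent).
agents_per_clinic = 1 + sum(track.agents for track in TRACKS.values())

print(route("Hello, I'd like to book an appointment").name)
print(route("其实我想用中文").name)
print(agents_per_clinic)
```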

v3.0 vs v1.0 Comparison

Factor	v1.0 Single Agent	v3.0 Dual-Track
Agents per clinic	1	9
Language detection	Auto (Deepgram multi)	Explicit (Router gate)
Mid-call switching	Seamless	Stays in initial track
STT accuracy (per language)	Good	Better (dedicated model)
TTS quality (Mandarin)	Adequate	Native (Azure)
Prompt complexity	High (bilingual)	Low (monolingual)
LLM language confusion	Occasional	Eliminated
Configuration burden	Low	Higher (9 agents, managed via GitOps)

Trade-off: Mid-Call Language Switching

The main drawback of v3.0 is losing seamless mid-call language switching. If a patient starts in English and switches to Mandarin mid-conversation, v3.0 stays in the English track.

Mitigation: In BC demographics, most callers have a clear primary language preference, and true bilingual callers who switch mid-call are a minority. Because the Router detects language from the first utterance of every call, a patient who calls back and opens in the other language is routed to the other track.
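
The trade-off can be made concrete with a small session model: the track is chosen once per call and later utterances do not change it. This is a sketch only; `CallSession` and `toy_detector` are hypothetical, and the detector is injected so that any language-detection function could be substituted.

```python
from typing import Callable, Optional


class CallSession:
    """Holds the track chosen for one call; the choice is made once per call."""

    def __init__(self, detector: Callable[[str], str]) -> None:
        self._detector = detector
        self.track: Optional[str] = None

    def on_utterance(self, text: str) -> str:
        # Only the first utterance is used for routing; later ones keep the track.
        if self.track is None:
            self.track = self._detector(text)
        return self.track


def toy_detector(text: str) -> str:
    """Placeholder detection: 'zh' if any CJK ideograph appears, else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


call = CallSession(toy_detector)
first = call.on_utterance("Hello, I'd like to book an appointment")
second = call.on_utterance("其实我想用中文")  # mid-call switch is not followed

callback = CallSession(toy_detector)  # a new call runs detection again
third = callback.on_utterance("您好")

print(first, second, third)
```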

Consequences

v1.0-v2.3.0

Positive:

Simpler architecture (fewer moving parts)
Better patient experience (no IVR, seamless switching)
Lower operational overhead
Faster onboarding for new clinics

Negative:

Single point of failure per clinic (mitigated by Vapi.ai reliability)
System prompt complexity (one prompt handles all languages)
Testing matrix grows with each language added

v3.0 (Additional)