ADR: Multilingual Agent Strategy

IMPLEMENTED — v3.0 Dual-Track Architecture

The dual-track bilingual strategy is live in production. The Router detects the language from the caller's first utterance and routes the call to the EN or ZH track. See Squad Architecture.

Date: January 2026 (v1.0), Updated February 2026 (v3.0)
Status: Updated
Decision: v1.0-v2.3.0: ONE multilingual agent per clinic with auto-detection. v3.0: Dual-track with explicit language gate.


Context

VitaraVox v1.0 supports English and Mandarin for BC healthcare clinics. The question: should we use separate agents per language or a single multilingual agent?


Decision

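Use ONE multilingual Vapi.ai assistant per clinic, with Deepgram Nova-2 in "multi" mode for automatic language detection. This is the v1.0-v2.3.0 decision; v3.0 replaces it with the dual-track design described under "v3.0 Update: Dual-Track Architecture".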


Architecture

+------------------------------------------------------------------+
|                                                                  |
|   Patient Call Flow                                              |
|   ================                                               |
|                                                                  |
|   Patient dials clinic number                                    |
|           |                                                      |
|           v                                                      |
|   +---------------+                                              |
|   |    Telnyx     |   <-- Receives call, routes to Vapi.ai       |
|   +-------+-------+                                              |
|           |                                                      |
|           v                                                      |
|   +------------------------------------------------------------------+
|   |                                                                  |
|   |          SINGLE MULTILINGUAL VAPI.AI ASSISTANT                   |
|   |                                                                  |
|   |   +----------------------------------------------------------+   |
|   |   |  Transcriber: Deepgram Nova-2 (language: "multi")        |   |
|   |   |  --> Auto-detects: English, Mandarin (v1.0)              |   |
|   |   |  --> Seamless mid-conversation switching                 |   |
|   |   +----------------------------------------------------------+   |
|   |                                                                  |
|   |   +----------------------------------------------------------+   |
|   |   |  LLM: GPT-4o (streaming enabled)                         |   |
|   |   |  --> Multilingual system prompt                          |   |
|   |   |  --> Tool calls to OSCAR EMR                             |   |
|   |   +----------------------------------------------------------+   |
|   |                                                                  |
|   |   +----------------------------------------------------------+   |
|   |   |  Voice: Azure TTS (multilingual-auto)                    |   |
|   |   |  --> en-US-AriaNeural (English)                          |   |
|   |   |  --> zh-CN-XiaoxiaoNeural (Mandarin)                     |   |
|   |   +----------------------------------------------------------+   |
|   |                                                                  |
|   +------------------------------------------------------------------+
|           |                                                      |
|           v  Tool Calls (real-time during conversation)          |
|   +---------------+                                              |
|   |    OSCAR      |   <-- check_availability, book_appointment   |
|   |     EMR       |                                              |
|   +---------------+                                              |
|                                                                  |
+------------------------------------------------------------------+

Comparison

Factor               | Single Multilingual | Multiple Language Agents
---------------------+---------------------+------------------------------
Agents per clinic    | 1                   | 5 (one per language)
Total for 5 clinics  | 5                   | 25
Configuration burden | Low                 | 5x higher
Language switching   | Seamless            | Requires transfer (200-500ms)
Mid-conversation mix | Supported           | Complex routing needed
IVR complexity       | None (auto-detect)  | "Press 1 for English..."
Maintenance          | Single prompt       | 5 prompts to sync

Rationale

1. Patient Demographics in BC

Many BC healthcare patients are bilingual (English/Mandarin). They may:

  • Start in one language, switch mid-sentence
  • Use English for medical terms, Mandarin for personal details
  • Have family members on the call speaking different languages

A single multilingual agent handles all three cases without a transfer.

2. Latency

Target: <800ms end-to-end voice response

Approach        | Latency Impact
----------------+---------------------------------
Single agent    | Optimal streaming, no transfers
Multiple agents | +200-500ms per transfer/handoff
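
To put the transfer overhead in proportion, the arithmetic can be sketched as below. The 800 ms target and the 200-500 ms transfer range are taken from this section; the function name `budget_share` is illustrative only, and the comparison assumes the transfer delay lands on a single conversational turn.

```python
TARGET_MS = 800            # end-to-end voice response target stated in this ADR
TRANSFER_MS = (200, 500)   # per-transfer overhead range stated in this ADR


def budget_share(transfer_ms: int, target_ms: int = TARGET_MS) -> float:
    """Return the fraction of the latency target consumed by one transfer."""
    return transfer_ms / target_ms


low, high = (budget_share(ms) for ms in TRANSFER_MS)
print(f"One transfer uses {low:.1%} to {high:.1%} of the {TARGET_MS} ms target")
```

Under these assumptions, a single handoff consumes between a quarter and well over half of the response budget for the turn in which it happens.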

3. Configuration

v1.0 assistants are set up manually in Vapi.ai, so setup time scales with the number of assistants:

Approach            | Setup Time (5 clinics)
--------------------+---------------------------
Single multilingual | 5 assistants = ~2 hours
Per-language agents | 25 assistants = ~10 hours
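
The setup estimates follow from a flat per-assistant cost. The sketch below assumes roughly 24 minutes per assistant, a figure derived here from "5 assistants = ~2 hours" rather than stated in the ADR; `setup_hours` is a hypothetical helper.

```python
MINUTES_PER_ASSISTANT = 24  # assumption: ~2 hours / 5 assistants


def setup_hours(clinics: int, agents_per_clinic: int) -> float:
    """Estimate manual setup time in hours for a given rollout."""
    return clinics * agents_per_clinic * MINUTES_PER_ASSISTANT / 60


print(setup_hours(5, 1))  # single multilingual agent per clinic
print(setup_hours(5, 5))  # one agent per language, five languages
```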

4. User Experience

Single Multilingual Agent:

Patient: "Hello, I'd like to book an appointment"
Agent: "Hello! I'd be happy to help you book an appointment..."
Patient: "其实我想用中文" (Actually I want to use Chinese)
Agent: "没问题!请问您想预约什么时间?" (No problem! When would you like to book?)

Multiple Language Agents:

IVR: "Press 1 for English, 2 for 中文..."
Patient presses 1
English Agent: "Hello! How can I help?"
Patient: "Actually, can I switch to Chinese?"
Agent: "Let me transfer you..."
[200-500ms silence]
Chinese Agent: "您好,请问有什么可以帮您的?"
Patient: [repeats request]


Vapi.ai Configuration

Transcriber (Deepgram)

{
  "transcriber": {
    "provider": "deepgram",
    "model": "nova-2",
    "language": "multi",
    "smartFormat": true
  }
}

Voice (Azure TTS)

{
  "voice": {
    "provider": "azure",
    "voiceId": "en-US-AriaNeural",
    "multilingualSettings": {
      "enabled": true,
      "fallbackVoices": {
        "zh-CN": "zh-CN-XiaoxiaoNeural"
      }
    }
  }
}
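
As a rough illustration, the two fragments above could be assembled into one assistant payload as follows. The field names are copied from the snippets in this ADR and have not been checked against the current Vapi.ai API schema; `build_assistant_config` is a hypothetical helper, and the payload is only built and printed, not sent anywhere.

```python
import json


def build_assistant_config(primary_voice="en-US-AriaNeural", fallback_voices=None):
    """Combine the transcriber and voice fragments from this ADR into one dict."""
    if fallback_voices is None:
        fallback_voices = {"zh-CN": "zh-CN-XiaoxiaoNeural"}
    return {
        # Keys mirror the ADR snippets; treat them as unverified against Vapi.ai.
        "transcriber": {
            "provider": "deepgram",
            "model": "nova-2",
            "language": "multi",
            "smartFormat": True,
        },
        "voice": {
            "provider": "azure",
            "voiceId": primary_voice,
            "multilingualSettings": {
                "enabled": True,
                "fallbackVoices": dict(fallback_voices),
            },
        },
    }


config = build_assistant_config()
print(json.dumps(config, indent=2))
```

Generating the payload from one function keeps the per-clinic assistants identical, which is the main maintenance benefit the single-agent approach claims.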

Language Roadmap

Version     | Languages                    | Approach
------------+------------------------------+----------------------------------------
v1.0        | English, Mandarin            | Single multilingual agent, auto-detect
v2.0-v2.3.0 | EN, ZH, FR, PA               | Single multilingual agent, auto-detect
v3.0        | English, Mandarin            | Dual-track with explicit language gate
Future      | + French, Cantonese, Punjabi | Additional tracks (FR track, etc.)

v3.0 Update: Dual-Track Architecture

Why the Change

Production experience with v2.3.0 revealed limitations of the single-multilingual-agent approach:

  1. GPT-4o language confusion: The LLM occasionally output space-separated Chinese characters or mixed language fragments
  2. STT accuracy: Deepgram Nova-2 "multi" mode trades per-language accuracy for breadth
  3. TTS quality: ElevenLabs eleven_multilingual_v2 produces adequate but not native-quality Mandarin
  4. Prompt bloat: Bilingual prompts are 40-60% longer than monolingual equivalents
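
The first symptom, space-separated Chinese characters, is easy to detect mechanically. The sketch below is one possible check, not the fix adopted in v3.0 (which removes the problem by using monolingual prompts). It assumes the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) covers the affected output; both function names are hypothetical.

```python
import re

# One character from the basic CJK Unified Ideographs block (assumed sufficient).
_CJK = r"[\u4e00-\u9fff]"
# Whitespace that sits directly between two CJK characters.
_GAP = re.compile(rf"(?<={_CJK})\s+(?={_CJK})")


def has_spaced_cjk(text: str) -> bool:
    """Return True if any whitespace separates two adjacent CJK characters."""
    return bool(_GAP.search(text))


def collapse_spaced_cjk(text: str) -> str:
    """Remove whitespace between CJK characters, leaving other spacing alone."""
    return _GAP.sub("", text)


sample = "请 问 您 想 预 约 什么时间?"
print(has_spaced_cjk(sample))
print(collapse_spaced_cjk(sample))
```

Spaces between an English word and a Chinese character are left untouched, so mixed-language output is flagged only when the Chinese portion itself is broken up.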

v3.0 Decision

Use an explicit language gate with per-language agent tracks:

+--------------------------------------------------+
| ROUTER (AssemblyAI Universal = bilingual STT)    |
| - Detect language from first utterance           |
| - Route to EN or ZH track                        |
+---------------------+----------------------------+
                      |
          +-----------+-----------+
          v                       v
    EN TRACK (4 agents)      ZH TRACK (4 agents)
    STT: Deepgram en         STT: Deepgram zh
    TTS: ElevenLabs          TTS: Azure Xiaoxiao
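
The gate's behaviour can be sketched as a "decide once, then stay" rule. This is a simplified stand-in: the production Router relies on bilingual STT, whereas the sketch below inspects transcript text for CJK characters. `detect_track` and `CallSession` are hypothetical names, not part of any Vapi.ai or AssemblyAI API.

```python
from dataclasses import dataclass
from typing import Optional


def detect_track(transcript: str) -> str:
    """Return "ZH" if the transcript contains a CJK ideograph, else "EN"."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in transcript)
    return "ZH" if has_cjk else "EN"


@dataclass
class CallSession:
    track: Optional[str] = None  # unset until the first utterance arrives

    def route(self, transcript: str) -> str:
        # The gate fires once; later utterances never change the track.
        if self.track is None:
            self.track = detect_track(transcript)
        return self.track


call = CallSession()
print(call.route("Hello, I'd like to book an appointment"))
print(call.route("其实我想用中文"))
```

Both calls return the same track because the second utterance arrives after the gate has already fired, which is exactly the mid-call switching limitation of the dual-track design.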

v3.0 vs v1.0 Comparison

Factor                      | v1.0 Single Agent     | v3.0 Dual-Track
----------------------------+-----------------------+----------------------------------------
Agents per clinic           | 1                     | 9
Language detection          | Auto (Deepgram multi) | Explicit (Router gate)
Mid-call switching          | Seamless              | Stays in initial track
STT accuracy (per language) | Good                  | Better (dedicated model)
TTS quality (Mandarin)      | Adequate              | Native (Azure)
Prompt complexity           | High (bilingual)      | Low (monolingual)
LLM language confusion      | Occasional            | Eliminated
Configuration burden        | Low                   | Higher (9 agents, GitOps manages it)

Trade-off: Mid-Call Language Switching

The main drawback of v3.0 is losing seamless mid-call language switching. If a patient starts in English and switches to Mandarin mid-conversation, v3.0 stays in the English track.

Mitigation: Most BC callers have a clear primary language, and truly bilingual callers who switch mid-call are a minority. Because the Router detects the language at the start of every call, a patient who calls back speaking the other language is routed to the other track.


Consequences

v1.0-v2.3.0

Positive:

  • Simpler architecture (fewer moving parts)
  • Better patient experience (no IVR, seamless switching)
  • Lower operational overhead
  • Faster onboarding for new clinics

Negative:

  • Single point of failure per clinic (mitigated by Vapi.ai reliability)
  • System prompt complexity (one prompt handles all languages)
  • Testing matrix grows with each language added

v3.0 (Additional)

Positive:

  • Higher per-language STT accuracy
  • Native-quality Mandarin TTS
  • Simpler, shorter monolingual prompts
  • Eliminates LLM language confusion
  • Consistent voice personality per language track

Negative:

  • 9 agents per clinic instead of 1 (mitigated by Vapi GitOps)
  • No mid-call language switching
  • Two parallel prompt sets to maintain (EN + ZH)
  • Adding a new language requires a full new track (4 agents)
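
The agent-count cost of each extra language can be estimated with a one-line formula. The sketch assumes one shared Router agent plus four agents per track, which is inferred from the figure of 9 agents for two tracks rather than stated outright; `agents_per_clinic` is a hypothetical helper.

```python
ROUTER_AGENTS = 1      # assumption: 9 total = 1 router + 2 tracks x 4 agents
AGENTS_PER_TRACK = 4   # from the v3.0 diagram: "EN TRACK (4 agents)"


def agents_per_clinic(tracks: int) -> int:
    """Return the number of agents a clinic needs for a given number of tracks."""
    return ROUTER_AGENTS + AGENTS_PER_TRACK * tracks


for tracks in (2, 3):
    print(tracks, "tracks ->", agents_per_clinic(tracks), "agents")
```

Under that assumption, adding a French track would take a clinic from 9 agents to 13.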