Multilingual Voice Agent Tech Stack Advisory

VitaraVox Enterprise Readiness Analysis

Date: February 17, 2026

Research: Global best-in-class multilingual voice agent architecture


Multilingual Voice Agent Tech Stack: Complete Research Report

Executive Summary

After extensive research across voice orchestration platforms, STT/TTS/LLM providers, and open-source frameworks, here is a thorough analysis for building a 5-language (English, Mandarin, Cantonese, Punjabi, French-Canadian) healthcare voice agent with maximum control, lowest latency, and Canadian data residency.


1. Voice Orchestration Platforms

Comparison Matrix

| Platform | Control | Mix-and-Match STT/TTS | Multilingual | Latency | Self-Host | HIPAA | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pipecat | Highest | Yes (full pipeline) | Provider-dependent | Lowest | Yes (MIT) | Via Pipecat Cloud | Low at scale (~$0.01/min + infra) |
| LiveKit Agents | Very High | Yes (plugin system) | Provider-dependent | Very Low | Yes (Apache 2.0) | Yes | Low at scale |
| Vapi.ai | Medium | Limited (per-assistant) | 100+ languages | Medium (+50-100ms overhead) | No | Claimed | $0.05-0.13/min |
| Retell.ai | Low | Limited | Broad | Medium (+50-100ms) | No | Yes | $0.07-0.15/min |
| Bland.ai | Medium | Limited | Broad | Medium | No | Yes | Enterprise pricing |
| Vocode | Medium-High | Yes | Provider-dependent | Medium | Yes (open-source) | DIY | Low |
| PolyAI | Low | No (vertically integrated) | 35+ languages | Low | No | Yes | ~$150K+/year |
| Parloa | Medium | Limited | 35+ languages (EU focus) | Medium | No | GDPR | Enterprise |
| Custom WebRTC | Maximum | Yes | Unlimited | Lowest possible | Yes | DIY | Highest dev cost |

Key Findings

Pipecat (Daily.co) gives the MOST control. It is an MIT-licensed Python framework with full pipeline composability: you can swap any STT, LLM, or TTS provider per conversation turn. It uses Daily.co for transport (the same transport Vapi uses under the hood). SOC2/HIPAA compliance is available through Pipecat Cloud, and you can self-host the identical code. Cost is $0.01/min for transport plus your provider costs.

LiveKit Agents is the strongest alternative. Apache 2.0 licensed, it has a built-in multilingual guide showing how to dynamically switch STT/TTS providers based on detected language mid-conversation. LiveKit has explicit HIPAA compliance, end-to-end encryption, and self-hosted deployment documentation. The plugin architecture supports Deepgram, AssemblyAI, Google, Azure, ElevenLabs, Cartesia, and more.

Vapi.ai is where you are today. While it covers 100+ languages and allows per-assistant STT/TTS configuration, you cannot dynamically switch providers mid-conversation based on detected language within a single assistant. The squad architecture partially addresses this (separate assistants per language track), but adds handoff latency and complexity. Vapi adds 50-100ms orchestration overhead on top of provider latency.

The verdict for orchestration: Pipecat or LiveKit Agents are definitively superior for your use case. Both allow per-language-track provider routing at the code level, self-hosting for Canadian data residency, and eliminate the managed platform overhead.
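The per-language provider routing both frameworks enable can be sketched framework-agnostically. The sketch below is illustrative, not a real SDK: the string identifiers are labels standing in for actual provider clients (e.g. a Deepgram or Azure plugin instance in Pipecat or LiveKit Agents).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineConfig:
    """One language track's provider trio. The string identifiers are
    illustrative labels, not real SDK handles."""
    stt: str
    llm: str
    tts: str

# Per-language routing table, mirroring the recommendations in this report.
ROUTES = {
    "en":    PipelineConfig("deepgram/nova-3-medical", "gpt-4o-mini", "elevenlabs/v3"),
    "zh":    PipelineConfig("deepgram/nova-3",         "gpt-4o",      "qwen3-tts"),
    "yue":   PipelineConfig("google/chirp-3",          "gpt-4o",      "qwen3-tts"),
    "pa":    PipelineConfig("sarvam/saaras-v3",        "gpt-4o",      "sarvam/bulbul-v3"),
    "fr-CA": PipelineConfig("deepgram/nova-3",         "gpt-4o-mini", "azure/fr-CA-sylvie-hd"),
}

def pipeline_for(language: str) -> PipelineConfig:
    """Resolve a detected language code to its provider trio, falling back
    to the English track for anything unrecognized."""
    return ROUTES.get(language, ROUTES["en"])
```

This is exactly the degree of control a managed platform withholds: the routing table lives in your code, so adding a provider for one underserved language is a one-line change.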

Sources:
- Softcery: 11 Voice Agent Platforms Compared
- Modal: LiveKit vs Vapi
- AssemblyAI: 6 Best Orchestration Tools
- Pipecat GitHub
- LiveKit: Build a Multilingual Voice Agent
- LiveKit Self-Hosted Deployments


2. STT (Speech-to-Text) — Per-Language Analysis

Language Support Matrix

| Provider | English | Mandarin | Cantonese | Punjabi | French-CA | Streaming | Medical | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deepgram Nova-3 | Best (3.45% WER medical) | Yes (zh) | NO (Nova-2 only, via zh-HK) | NO | Yes (fr-CA) | Yes | Nova-3 Medical | <300ms |
| Deepgram Nova-2 | Good | Yes (zh, zh-CN) | Yes (zh-HK) | NO | Yes (fr-CA) | Yes | No | <300ms |
| Google Chirp 2/3 | Good | Yes (cmn-Hans-CN) | Yes (yue-Hant-HK) | Yes (pa-Guru-IN) | Yes (fr-CA) | Yes (Chirp 3) | No | Medium |
| Azure Speech | Good | Yes | Yes (yue-CN, zh-HK) | Yes (pa-IN) | Yes (fr-CA) | Yes | No | Medium |
| AssemblyAI Universal | Very Good | Yes | Unclear/Limited | Yes | Yes | Yes (42 streaming languages) | No | ~250ms |
| Whisper large-v3 | Very Good (7.4% WER avg) | Yes | Yes (yue, new token) | Yes | Yes | Batch only | No | High (batch) |
| Speechmatics | Very Good (90%+) | Yes | Yes | Unclear | Yes | Yes (<250ms partial) | Healthcare focus | <250ms |
| Sarvam AI Saaras v3 | Indian English | No | No | Yes (best-in-class) | No | Yes | No | Low |
| Gladia Solaria | 94% word accuracy rate | Yes | Unclear | Unclear | Yes | Yes | No | 270ms |
| AWS Transcribe | Good | Yes | Unclear | Unclear | Yes (fr-CA) | Yes | Amazon Transcribe Medical | Medium |

Per-Language Recommendations

English: Deepgram Nova-3 Medical is the clear winner — 3.45% median WER, sub-300ms latency, HIPAA compliant, medical terminology support with keyterm prompting. For healthcare, this is unbeatable.

Mandarin Chinese: Deepgram Nova-3 (best streaming latency) or Google Chirp 3 (broadest feature set). Both strong performers. Speechmatics is also competitive.

Cantonese (HARDEST): This is the critical gap. Your best options are:
1. Google Chirp 2/3 — Confirmed support for yue-Hant-HK with streaming. The best-documented option.
2. Azure Speech — Supports yue-CN and zh-HK for recognition.
3. Whisper large-v3 — Added Cantonese as a dedicated language token. Self-hostable. ~8% WER on the Fleurs dataset.
4. Deepgram Nova-2 — Supports zh-HK, but only on the older Nova-2 model, not Nova-3.
5. Speechmatics — Lists Cantonese among 55+ languages.

Punjabi (SECOND HARDEST): Your best options are:
1. Sarvam AI Saaras v3 — Purpose-built for Indian languages; consistently ranks #1-2 across Indian-language benchmarks. Covers 22 Indian languages including Punjabi, and Indo-Aryan languages like Punjabi achieve ~5-6% WER.
2. Google Chirp 2/3 — Confirmed support for pa-Guru-IN.
3. Azure Speech — Supports pa-IN with both STT and TTS.
4. AssemblyAI Universal — Lists Punjabi among 99 supported languages.
5. Whisper large-v3 — Supports Punjabi.

French-Canadian: Well supported across most providers. Deepgram Nova-3 (fr-CA), Google Chirp 3, Azure, and AssemblyAI all provide strong fr-CA support.
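Because several tracks have a clear primary and fallback STT provider, a small failover wrapper keeps a track usable when the primary degrades. This is a sketch under one assumption: `transcribe_fn` is a hypothetical adapter around the real SDK call that raises on provider error.

```python
def transcribe_with_failover(audio: bytes, providers: list[str], transcribe_fn):
    """Try each STT provider in order; return (provider, transcript) from
    the first that succeeds. `transcribe_fn(provider, audio)` is a
    hypothetical adapter that raises on provider error or timeout."""
    failures = []
    for provider in providers:
        try:
            return provider, transcribe_fn(provider, audio)
        except Exception as exc:  # in production, narrow to timeout/API errors
            failures.append((provider, str(exc)))
    raise RuntimeError(f"all STT providers failed: {failures}")
```

For the Cantonese track, for instance, the provider list would be `["google/chirp-3", "azure/zh-HK"]`, matching the primary/fallback pairing recommended later in this report.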

Sources:
- Deepgram Models & Languages Overview
- Deepgram Nova-3 Medical
- Google Cloud STT Supported Languages
- Azure Language and Voice Support
- AssemblyAI 99 Languages
- Sarvam AI Speech to Text
- Whisper large-v3 on HuggingFace


3. LLM Options for Multilingual Tool-Calling

Comparison Matrix

| Model | English Tool-Call | Chinese Tool-Call | Punjabi/Hindi | French | Latency | Self-Host | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | Excellent | Very Good | Good | Excellent | ~350-700ms TTFT | No | $2.50/$10 per 1M tokens |
| GPT-4o-mini | Very Good | Good | Good | Very Good | ~200-400ms | No | $0.15/$0.60 per 1M |
| Claude Opus 4.6 | Excellent | Good | Good | Excellent | ~500-1000ms | No | $15/$75 per 1M |
| Claude Sonnet 4.6 | Excellent | Good | Good | Very Good | ~300-600ms | No | $3/$15 per 1M |
| Gemini 2.5 Flash | Very Good (71.5% ComplexFuncBench) | Very Good | Good | Very Good | ~200-400ms | No | $0.15/$0.60 per 1M |
| Gemini 2.5 Pro | Excellent | Excellent | Good | Excellent | ~400-800ms | No | $1.25/$10 per 1M |
| Qwen3 (72B) | Very Good | Best (native) | Good (119 languages) | Good | ~234ms cold-start | Yes (open) | Free self-hosted |
| DeepSeek V3.1 | Good (improved) | Excellent (native) | Fair | Good | Medium | Yes (open) | Very cheap |
| Mistral Large 2 | Very Good | Good | Good (Hindi) | Excellent (native French) | Medium | Partial | $2/$6 per 1M |
| Command R+ | Good | Fair | Fair | Good | Medium | No | $2.50/$10 per 1M |

Per-Language LLM Recommendations

English: GPT-4o or Claude Sonnet 4.6. Both excellent at tool-calling in English. For voice agents, GPT-4o-mini or Gemini 2.5 Flash offer the best latency-to-quality ratio.

Mandarin/Cantonese: Qwen3 is the strongest for Chinese — purpose-built by Alibaba with native Chinese training data. Supports Cantonese specifically. DeepSeek V3.1 is also very strong on Chinese but has documented tool-calling instability ("looped calls or empty responses"). GPT-4o is the safe enterprise choice.

Punjabi: No model excels specifically at Punjabi tool-calling. GPT-4o is the most reliable general-purpose option. Qwen3 supports 119 languages including Punjabi.

French-Canadian: Mistral Large 2 is purpose-built for French. GPT-4o and Claude are also very strong.

Critical Insight: DeepSeek V3 Tool-Calling

VitaraVox MEMORY.md already captures this: "DeepSeek V3 tool_choice:'auto' unreliable (3-15% failure) — do NOT use as launch LLM for ZH." This remains true. While DeepSeek V3.1 improved stability, it is still not reliable enough for production healthcare tool-calling where a missed function call could mean a missed appointment.
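The failure mode described (looped or empty tool calls) argues for a validation-and-retry guard around whichever launch LLM is chosen, not just DeepSeek. A minimal sketch, assuming a hypothetical `invoke()` adapter that returns an OpenAI-style response dict with a "tool_calls" list:

```python
def invoke_with_tool_guard(invoke, max_attempts: int = 3):
    """Retry an LLM call until it produces well-formed tool calls.
    Guards against the empty/looped tool-call failure mode; `invoke()` is
    a hypothetical adapter returning a dict with a "tool_calls" list."""
    last = None
    for _ in range(max_attempts):
        last = invoke()
        calls = last.get("tool_calls") or []
        # A valid response has at least one tool call, each with a name.
        if calls and all(c.get("name") for c in calls):
            return last
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts: {last!r}")
```

In a healthcare flow, the final `RuntimeError` should trigger a graceful fallback (e.g. transfer to a human) rather than a silent miss, since a dropped function call can mean a dropped appointment.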

Emerging Option: Gemini 2.5 Flash Native Audio

Gemini 2.5 Flash with native audio can process speech directly without STT, call tools, and generate speech output — all in a single model with 90% instruction adherence. It achieved 71.5% on ComplexFuncBench Audio. This could eventually eliminate the entire STT-LLM-TTS pipeline, but it currently limits provider flexibility and language-specific optimization.

Sources:
- Qwen3 Blog
- Qwen3-Omni: 119 Languages
- DeepSeek Function Calling Docs
- Gemini 2.5 Flash Native Audio


4. TTS (Text-to-Speech) — Per-Language Analysis

Comparison Matrix

| Provider | English | Mandarin | Cantonese | Punjabi | French-CA | Latency (TTFA) | Voice Cloning | Self-Host |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ElevenLabs v3 | Best | Good | Supported (limited) | Supported | Good (via French) | 75ms (Flash v2.5) | Yes | No |
| Azure Neural | Very Good | Very Good | Yes (3 voices: HiuGaai, HiuMaan, WanLung) | Yes (new GA voices) | Yes (4 voices: Sylvie, Jean, Antoine, Thierry + HD) | ~100ms | No | No |
| Google Chirp 3 HD | Very Good | Very Good | Yes (Preview) | Yes (Preview) | Good | Low | No | No |
| Cartesia Sonic 3 | Very Good | Supported | Unclear | Unclear | Supported | 40ms TTFA | No | No |
| LMNT | Very Good | Limited | Unlikely | Unlikely | Limited | <300ms | Yes (5s clip) | No |
| Sarvam Bulbul v3 | Indian English | No | No | Yes (best for Punjabi) | No | Low | No | No |
| Qwen3-TTS | Good | Best | Yes (9 Chinese dialects incl. Cantonese) | No | No | 97ms | Yes (3s clip) | Yes (Apache 2.0) |
| PlayHT 3.0 Mini | Very Good | Supported | Likely (142 languages) | Likely | Likely | Low | Yes | No |
| Coqui XTTS-v2 | Good | Good (zh-cn) | No (16 languages only) | No | Yes (fr) | <150ms | Yes (6s clip) | Yes (AGPL) |
| Amazon Polly | Good | Good | No | No | Yes (fr-CA) | Medium | No | No |

Per-Language TTS Recommendations

English: ElevenLabs v3 remains the gold standard for naturalness and expressiveness. Cartesia Sonic 3 at 40ms TTFA is best for ultra-low latency. For healthcare, ElevenLabs Multilingual v2 with professional voices is recommended.

Mandarin Chinese: Qwen3-TTS is the standout — open-source (Apache 2.0), self-hostable, 97ms latency, and purpose-built for Chinese with the best prosody and naturalness in Chinese. Azure is the strong cloud alternative with multiple zh-CN voices including HD.

Cantonese (HARDEST for TTS):
1. Qwen3-TTS — Explicitly supports 9 Chinese dialects including Cantonese. Self-hostable. This is the strongest option.
2. Azure Speech — 3 dedicated Cantonese voices (zh-HK): HiuGaaiNeural (female), HiuMaanNeural (female), WanLungNeural (male), plus XiaoxiaoDialectsNeural with a yue-CN secondary locale.
3. Google Chirp 3 HD — Cantonese (yue-HK) in Preview.
4. ElevenLabs — Supports Cantonese, but quality for Cantonese specifically is unvalidated compared to Mandarin.

Punjabi (SECOND HARDEST for TTS):
1. Sarvam AI Bulbul v3 — Purpose-built for 11 Indian languages including Punjabi. Lowest character error rate across Indian domains and the best prosody for Indian languages.
2. Azure Speech — New Punjabi (pa-IN) neural voices, both male and female, now GA.
3. Google Chirp 3 HD — Punjabi (pa-IN) in Preview.
4. ElevenLabs — Supports Punjabi, but quality is rated "Good" (10-25% error range).

French-Canadian:
1. Azure Speech — 4 dedicated fr-CA voices, including Dragon HD Latest versions for Sylvie and Thierry. The most mature option with the highest-quality HD voices.
2. ElevenLabs — Strong French support through multilingual models.
3. Google Chirp 3 HD — fr-CA supported.

Sources:
- ElevenLabs Models
- Azure Speech Language Support
- Google Chirp 3 HD Voices
- Cartesia Sonic 3
- Sarvam AI Bulbul v3
- Qwen3-TTS GitHub


5. Open-Source / Self-Hosted Options for Canadian Healthcare

Framework Comparison

| Framework | License | Self-Host | HIPAA | Multilingual Routing | Maturity | Community |
| --- | --- | --- | --- | --- | --- | --- |
| LiveKit Agents | Apache 2.0 | Full | Yes | Yes (documented guide) | Production | Large (28K+ stars) |
| Pipecat | MIT (BSD for Cloud) | Full (framework); Daily.co for transport | Via Pipecat Cloud | Yes (pipeline swappable) | Production | Growing (8K+ stars) |
| Vocode | MIT | Full | DIY | Yes (composable) | Maturing | Medium |
| Ultravox | Apache 2.0 | Full (model weights on HuggingFace) | DIY | 42 languages natively | Research-to-production | Growing |
| Moshi (Kyutai) | Apache 2.0 | Full | DIY | Limited | Research | Small |

Canadian Data Residency Architecture

For PHIPA/PIPA/PIPEDA compliance:

  1. Compute: Deploy on AWS Canada Central (ca-central-1), Azure Canada Central/East, or Google Cloud northamerica-northeast1 (Montreal).
  2. Voice Transport: Self-hosted LiveKit server or Daily.co (Pipecat) in Canadian region.
  3. STT Processing: Google Chirp (available in regional endpoints), Azure Speech (Canada regions), or self-hosted Whisper/Sarvam on Canadian GPU instances.
  4. LLM: Azure OpenAI Service in Canada East (GPT-4o available), or self-hosted Qwen3 on Canadian GPU.
  5. TTS: Azure Speech (Canada regions), self-hosted Qwen3-TTS, or Google Cloud TTS (Montreal endpoint).
  6. Key Principle: All PHI must be encrypted at rest and in transit. No PHI should leave Canadian borders.
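The residency principle can also be enforced mechanically: refuse to boot if any PHI-touching service is configured outside a Canadian region. The region identifiers below are the real cloud region names cited above; the startup check itself is an illustrative sketch.

```python
# Canadian regions across the three major clouds (as listed above).
CANADIAN_REGIONS = {
    "ca-central-1",                  # AWS Canada Central
    "canadacentral", "canadaeast",   # Azure Canada Central / East
    "northamerica-northeast1",       # Google Cloud (Montreal)
}

def require_canadian_region(service: str, region: str) -> str:
    """Fail fast at startup if a PHI-touching service is configured
    outside a Canadian region."""
    if region not in CANADIAN_REGIONS:
        raise ValueError(f"{service} configured outside Canada: {region}")
    return region
```

Running this over every configured endpoint at process start turns a compliance policy into a deploy-time invariant rather than an audit finding.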

Best Self-Hosted Stack for Maximum Control

LiveKit Agents is the recommendation for Canadian healthcare deployment:
- Full self-hosting on Canadian infrastructure
- Documented HIPAA compliance
- Built-in multilingual voice agent guide with per-language provider switching
- Plugin ecosystem supporting all major STT/TTS/LLM providers
- WebRTC transport handling telephony, web, and mobile
- Hardware-accelerated VAD for interruption handling
- Production-ready, with a large enterprise customer base

Sources:
- LiveKit Agents GitHub
- LiveKit Self-Hosted Deployments
- Pipecat GitHub
- Canadian Data Residency and Cloud


6. THE "UNBEATABLE" STACK RECOMMENDATION

Architecture: LiveKit Agents on Canadian Infrastructure

Current Infrastructure

VitaraVox currently runs on OCI ARM (Toronto). The dev OSCAR instance runs on AWS EC2 (ca-central-1). The recommendations below assume migration to AWS ca-central-1 as part of the enterprise stack migration.

Orchestrator: LiveKit Agents (self-hosted on AWS ca-central-1 or Azure Canada Central)

Language Detection & Routing: Router pattern similar to the current Vapi v3.0 squad, but implemented in code:
- Initial language detection via AssemblyAI Universal or Deepgram Nova-3 multi-language mode
- Route to a language-specific pipeline with optimized STT/LLM/TTS per track
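A sketch of that router pattern. `detect_language` and `start_pipeline` are hypothetical adapters: the first wraps a multi-language STT pass (e.g. AssemblyAI Universal), the second spins up the chosen track's STT/LLM/TTS trio.

```python
# The five supported language tracks; anything else falls back to English.
SUPPORTED = {"en", "zh", "yue", "pa", "fr-CA"}

def route_call(first_utterance: bytes, detect_language, start_pipeline):
    """Detect the caller's language on the first utterance, then hand the
    call to that language's dedicated pipeline. `detect_language` and
    `start_pipeline` are hypothetical adapters around real services."""
    lang = detect_language(first_utterance)  # e.g. "yue" for Cantonese
    if lang not in SUPPORTED:
        lang = "en"
    return start_pipeline(lang)
```

Unlike a Vapi squad handoff, this routing decision happens once, in-process, before the first agent turn, so it adds no per-turn latency.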

Optimal STT Per Language

| Language | Primary STT | Fallback STT | Rationale |
| --- | --- | --- | --- |
| English | Deepgram Nova-3 Medical | Google Chirp 3 | 3.45% WER, medical terminology, sub-300ms, HIPAA |
| Mandarin | Deepgram Nova-3 (zh) | Google Chirp 3 (cmn-Hans-CN) | Best latency + accuracy for Mandarin |
| Cantonese | Google Chirp 3 (yue-Hant-HK) | Azure Speech (zh-HK) | Only tier-1 providers with confirmed Cantonese streaming |
| Punjabi | Sarvam AI Saaras v3 | Google Chirp 3 (pa-Guru-IN) | Sarvam ranks #1 for Indian languages; Google as cloud fallback |
| French-CA | Deepgram Nova-3 (fr-CA) | Google Chirp 3 (fr-CA) | Deepgram best latency; Google best for accent robustness |

Optimal LLM Per Language

| Language | Primary LLM | Rationale |
| --- | --- | --- |
| English | GPT-4o-mini or Gemini 2.5 Flash | Best latency-to-quality for voice agents |
| Mandarin | GPT-4o (launch) / Qwen3-72B (post-launch) | GPT-4o reliable for tool-calling; Qwen3 stronger on Chinese but needs validation |
| Cantonese | GPT-4o | Most reliable tool-calling; system prompt instructs Cantonese output |
| Punjabi | GPT-4o | No model specializes in Punjabi tool-calling; GPT-4o is safest |
| French-CA | GPT-4o-mini or Mistral Large 2 | Mistral is natively French, but GPT-4o-mini is faster for voice |

Optimal TTS Per Language

| Language | Primary TTS | Fallback TTS | Rationale |
| --- | --- | --- | --- |
| English | ElevenLabs v3 / Cartesia Sonic 3 | Azure Neural HD | ElevenLabs = best quality; Cartesia = lowest latency (40ms TTFA) |
| Mandarin | Qwen3-TTS (self-hosted) | Azure (zh-CN-XiaoxiaoNeural) | Qwen3-TTS best Chinese quality, self-hostable, 97ms latency |
| Cantonese | Qwen3-TTS (self-hosted) | Azure (zh-HK-HiuGaaiNeural) | Qwen3-TTS explicitly supports the Cantonese dialect; Azure has 3 dedicated voices |
| Punjabi | Sarvam AI Bulbul v3 | Azure (pa-IN Neural) | Sarvam purpose-built for Indian languages with best CER |
| French-CA | Azure (fr-CA-SylvieNeural HD) | ElevenLabs Multilingual v2 | Azure has 4 dedicated fr-CA voices with Dragon HD; best quality for Quebec French |

Should You Stay on Vapi or Migrate?

Migrate. Here is why:

  1. Vapi cannot dynamically route STT/TTS per detected language within a single call. Your current v3.0 squad architecture with 9 assistants is a workaround, not a solution. Each handoff adds latency and complexity.

  2. Vapi adds 50-100ms overhead per turn on top of provider latency. At scale, this compounds.

  3. Vapi cannot use Sarvam AI for Punjabi STT/TTS or Qwen3-TTS for Cantonese. These are your best-in-class providers for underserved languages, and Vapi only integrates with a fixed set of providers.

  4. Canadian data residency is impossible with Vapi. You cannot control where Vapi processes audio. With self-hosted LiveKit/Pipecat, you control every data flow.

  5. Cost at scale. At 50K+ minutes/month, self-hosting saves 60-80% versus Vapi.

  6. Healthcare compliance. Self-hosted gives you full audit trails, encryption control, and PHIPA compliance documentation that a managed platform cannot provide.
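The cost argument (point 5) is easy to check with a back-of-envelope model. The per-minute rates below are assumptions for illustration, chosen to be consistent with the figures cited elsewhere in this report; they are not quoted prices.

```python
def monthly_cost(minutes: int, platform_per_min: float,
                 provider_per_min: float, fixed_infra: float = 0.0) -> float:
    """Back-of-envelope monthly cost: per-minute fees plus fixed infrastructure."""
    return minutes * (platform_per_min + provider_per_min) + fixed_infra

MINUTES = 50_000
# Assumed rates: Vapi platform fee mid-range (~$0.07/min) plus ~$0.04/min of
# provider usage; self-hosted pays providers directly (~$0.03/min) plus
# ~$500/month of infrastructure. Both figures are illustrative assumptions.
vapi = monthly_cost(MINUTES, platform_per_min=0.07, provider_per_min=0.04)
self_hosted = monthly_cost(MINUTES, platform_per_min=0.0,
                           provider_per_min=0.03, fixed_infra=500)
savings = 1 - self_hosted / vapi  # roughly 64% under these assumptions
```

Under these assumptions the model lands near the low end of the 60-80% savings range claimed above; heavier use of self-hosted models (Qwen3-TTS, Whisper) pushes the provider rate down and the savings up.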

Migration Strategy

Phase 1 (Month 1-2): Build English-only LiveKit Agent with Deepgram Nova-3 Medical + GPT-4o-mini + ElevenLabs. Validate latency, tool-calling, and call quality against Vapi v3.0 baseline.

Phase 2 (Month 2-3): Add Mandarin track with Deepgram Nova-3 (zh) + GPT-4o + Qwen3-TTS. Implement language detection router.

Phase 3 (Month 3-4): Add Cantonese track with Google Chirp 3 (yue) + GPT-4o + Qwen3-TTS (Cantonese). This is the hardest track — validate Cantonese STT accuracy extensively.

Phase 4 (Month 4-5): Add Punjabi track with Sarvam Saaras v3 + GPT-4o + Sarvam Bulbul v3. Add French-CA track with Deepgram Nova-3 (fr-CA) + GPT-4o-mini + Azure fr-CA HD.

Phase 5 (Month 5-6): Full regression testing across all 5 languages. PHIPA compliance audit. Production cutover.

Keep Vapi v3.0 running in parallel throughout this period as your production fallback.
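Running both stacks in parallel works best with a deterministic traffic split, so a given caller always lands on the same stack during the migration window. A minimal sketch (the stack names are illustrative labels):

```python
import hashlib

def assign_stack(call_id: str, livekit_fraction: float) -> str:
    """Deterministically bucket a call onto the new or old stack.
    Hashing the call ID keeps the assignment stable across retries, so
    the same caller is never bounced between stacks mid-migration."""
    bucket = int(hashlib.sha256(call_id.encode()).hexdigest(), 16) % 100
    return "livekit" if bucket < livekit_fraction * 100 else "vapi"
```

Ramping `livekit_fraction` from 0.05 upward per phase gives a controlled canary: any regression in latency or tool-calling shows up on a small, fixed slice of traffic while Vapi continues to serve the rest.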

The "Wild Card" Option: Gemini 2.5 Flash Native Audio

Google's Gemini 2.5 Flash with native audio processing is a potential game-changer that could eventually collapse the entire STT-LLM-TTS pipeline into a single model:
- Processes speech natively (no STT step)
- 71.5% on ComplexFuncBench Audio (leading)
- 90% instruction adherence
- Supports 70+ languages with mid-conversation switching
- Built-in speech-to-speech translation

However, it is NOT recommended as the primary architecture today because:
- You lose per-language STT/TTS optimization (no Sarvam for Punjabi, no Qwen3-TTS for Cantonese)
- Canadian data residency is difficult with Google's API
- Medical-domain vocabulary cannot be customized (unlike Deepgram Nova-3 Medical keyterm prompting)
- It is still in active development; behavior may change

Monitor this closely. In 12-18 months, a hybrid approach using Gemini Native Audio for common languages and specialized providers for underserved languages may be optimal.

Estimated Per-Language Latency (Self-Hosted Stack)

| Component | Latency | Cumulative |
| --- | --- | --- |
| VAD + end-of-turn | ~200-400ms | 200-400ms |
| STT (Deepgram streaming) | ~150-300ms | 350-700ms |
| LLM (GPT-4o-mini TTFT) | ~200-400ms | 550-1100ms |
| TTS (Qwen3-TTS/ElevenLabs TTFA) | ~75-100ms | 625-1200ms |
| Total response time | | 625ms-1.2s |
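The cumulative column is just the running sum of the per-component ranges, which is easy to keep honest in code as providers are swapped:

```python
# Per-turn latency budget in milliseconds (low, high), matching the table above.
BUDGET = {
    "vad_end_of_turn": (200, 400),
    "stt_streaming":   (150, 300),
    "llm_ttft":        (200, 400),
    "tts_ttfa":        (75, 100),
}

def total_latency_ms(budget: dict) -> tuple[int, int]:
    """Sum the low and high ends of each component's range."""
    return (sum(lo for lo, _ in budget.values()),
            sum(hi for _, hi in budget.values()))
```

Wiring this into CI against measured p50/p95 per component turns the latency target into a regression test rather than a one-time estimate.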

This is competitive with human conversation response times (300-500ms for listening + 300-500ms for formulation) and significantly better than Vapi's typical 1.5-2.5s response times.

Cost Comparison (at 50K minutes/month)

| Stack | Monthly Cost |
| --- | --- |
| Vapi v3.0 (current) | ~$5,000-6,500 (platform + providers) |
| Self-hosted LiveKit + best-in-class providers | ~$1,500-2,500 (infra + provider APIs) |
| Self-hosted LiveKit + self-hosted models (Qwen3-TTS, Whisper) | ~$800-1,500 (GPU infra + minimal APIs) |

Final Architecture Diagram

ORCHESTRATOR:  LiveKit Agents (self-hosted, Canadian region)
               OR Pipecat (if you prefer Python-first)

TRANSPORT:     LiveKit Server (self-hosted, WebRTC)
               + Twilio/Telnyx for PSTN telephony

LANGUAGE       AssemblyAI Universal (initial detection)
DETECTION:     then route to language-specific pipeline

         +-----------+--------------+-------------------+----------------+-----------+
         | ENGLISH   | MANDARIN     | CANTONESE         | PUNJABI        | FRENCH-CA |
STT:     | Deepgram  | Deepgram     | Google Chirp 3    | Sarvam AI      | Deepgram  |
         | Nova-3    | Nova-3 (zh)  | (yue-Hant-HK)     | Saaras v3      | Nova-3    |
         | Medical   |              |                   |                | (fr-CA)   |
         |           |              |                   |                |           |
LLM:     | GPT-4o    | GPT-4o       | GPT-4o            | GPT-4o         | GPT-4o-   |
         | mini      |              |                   |                | mini      |
         |           |              |                   |                |           |
TTS:     | ElevenLabs| Qwen3-TTS    | Qwen3-TTS         | Sarvam AI      | Azure     |
         | v3 or     | (self-hosted)| (self-hosted,     | Bulbul v3      | fr-CA-    |
         | Cartesia  |              | Cantonese dialect)|                | Sylvie HD |
         +-----------+--------------+-------------------+----------------+-----------+

INFRA:   AWS ca-central-1 (or Azure Canada Central)
         GPU instances for Qwen3-TTS self-hosting
         All PHI stays in Canada

This stack gives you the highest accuracy per language, lowest latency, maximum control, Canadian data residency, and 60-80% cost reduction versus Vapi at scale.