# Multilingual Voice Agent Tech Stack Advisory

VitaraVox Enterprise Readiness Analysis
Date: February 17, 2026
Research: Global best-in-class multilingual voice agent architecture

# Multilingual Voice Agent Tech Stack: Complete Research Report

## Executive Summary
After extensive research across voice orchestration platforms, STT/TTS/LLM providers, and open-source frameworks, here is a thorough analysis for building a 5-language (English, Mandarin, Cantonese, Punjabi, French-Canadian) healthcare voice agent with maximum control, lowest latency, and Canadian data residency.
## 1. Voice Orchestration Platforms

### Comparison Matrix
| Platform | Control | Mix-Match STT/TTS | Multilingual | Latency | Self-Host | HIPAA | Cost |
|---|---|---|---|---|---|---|---|
| Pipecat | Highest | Yes - full pipeline | Provider-dependent | Lowest | Yes (MIT) | Via Pipecat Cloud | Low at scale (~$0.01/min transport + providers) |
| LiveKit Agents | Very High | Yes - plugin system | Provider-dependent | Very Low | Yes (Apache 2.0) | Yes | Low at scale |
| Vapi.ai | Medium | Limited per-assistant | 100+ langs | Medium (+50-100ms overhead) | No | Claimed | $0.05-0.13/min |
| Retell.ai | Low | Limited | Broad | Medium (+50-100ms) | No | Yes | $0.07-0.15/min |
| Bland.ai | Medium | Limited | Broad | Medium | No | Yes | Enterprise pricing |
| Vocode | Medium-High | Yes | Provider-dependent | Medium | Yes (open-source) | DIY | Low |
| PolyAI | Low | No (vertically integrated) | 35+ langs | Low | No | Yes | ~$150K+/year |
| Parloa | Medium | Limited | 35+ langs (EU focus) | Medium | No | GDPR | Enterprise |
| Custom WebRTC | Maximum | Yes | Unlimited | Lowest possible | Yes | DIY | Highest dev cost |
### Key Findings
Pipecat (Daily.co) gives the MOST control. It is an MIT-licensed Python framework with full pipeline composability. You can literally swap any STT, LLM, or TTS provider per conversation turn. It supports Daily.co for transport (which is what Vapi itself uses under the hood). SOC2/HIPAA compliance is available through Pipecat Cloud, and you can self-host the identical code. Cost is $0.01/min for transport plus your provider costs.
LiveKit Agents is the strongest alternative. Apache 2.0 licensed, it has a built-in multilingual guide showing how to dynamically switch STT/TTS providers based on detected language mid-conversation. LiveKit has explicit HIPAA compliance, end-to-end encryption, and self-hosted deployment documentation. The plugin architecture supports Deepgram, AssemblyAI, Google, Azure, ElevenLabs, Cartesia, and more.
Vapi.ai is where you are today. While it covers 100+ languages and allows per-assistant STT/TTS configuration, you cannot dynamically switch providers mid-conversation based on detected language within a single assistant. The squad architecture partially addresses this (separate assistants per language track), but adds handoff latency and complexity. Vapi adds 50-100ms orchestration overhead on top of provider latency.
The verdict for orchestration: Pipecat or LiveKit Agents are definitively superior for your use case. Both allow per-language-track provider routing at the code level, self-hosting for Canadian data residency, and eliminate the managed platform overhead.
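The pipeline composability described above can be sketched framework-agnostically. The class and provider identifiers below are illustrative labels, not actual Pipecat or LiveKit SDK types; the point is that the STT/LLM/TTS trio is selected in application code, per language track:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineConfig:
    """Provider trio for one language track (illustrative type, not a real SDK class)."""
    stt: str
    llm: str
    tts: str

# Per-language pipelines, swappable at the code level; this is the control
# that Pipecat/LiveKit expose and managed platforms do not. IDs are labels only.
PIPELINES = {
    "english":   PipelineConfig(stt="deepgram/nova-3-medical", llm="gpt-4o-mini", tts="elevenlabs-v3"),
    "cantonese": PipelineConfig(stt="google/chirp-3", llm="gpt-4o", tts="qwen3-tts"),
}

def build_pipeline(language: str) -> PipelineConfig:
    """Select the provider trio for a detected language; default to English."""
    return PIPELINES.get(language, PIPELINES["english"])
```

With a managed platform, this selection happens behind the vendor's API; here it is an ordinary dictionary lookup you can change per deployment, per clinic, or per call.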
Sources:
- Softcery: 11 Voice Agent Platforms Compared
- Modal: LiveKit vs Vapi
- AssemblyAI: 6 Best Orchestration Tools
- Pipecat GitHub
- LiveKit: Build a Multilingual Voice Agent
- LiveKit Self-Hosted Deployments
## 2. STT (Speech-to-Text) — Per-Language Analysis

### Language Support Matrix
| Provider | English | Mandarin | Cantonese | Punjabi | French-CA | Streaming | Medical | Latency |
|---|---|---|---|---|---|---|---|---|
| Deepgram Nova-3 | Best (3.45% WER medical) | Yes (zh) | NO (Nova-2 only via zh-HK) | NO | Yes (fr-CA) | Yes | Nova-3 Medical | <300ms |
| Deepgram Nova-2 | Good | Yes (zh, zh-CN) | Yes (zh-HK) | NO | Yes (fr-CA) | Yes | No | <300ms |
| Google Chirp 2/3 | Good | Yes (cmn-Hans-CN) | Yes (yue-Hant-HK) | Yes (pa-Guru-IN) | Yes (fr-CA) | Yes (Chirp 3) | No | Medium |
| Azure Speech | Good | Yes | Yes (yue-CN, zh-HK) | Yes (pa-IN) | Yes (fr-CA) | Yes | No | Medium |
| AssemblyAI Universal | Very Good | Yes | Unclear/Limited | Yes | Yes | Yes (42 streaming langs) | No | ~250ms |
| Whisper large-v3 | Very Good (7.4% WER avg) | Yes | Yes (yue - new token) | Yes | Yes | Batch only (no native streaming) | No | High (batch) |
| Speechmatics | Very Good (90%+) | Yes | Yes | Unclear | Yes | Yes (<250ms partial) | Healthcare focus | <250ms |
| Sarvam AI Saaras v3 | Indian English | No | No | Yes (best-in-class) | No | Yes | No | Low |
| Gladia Solaria | 94% word accuracy | Yes | Unclear | Unclear | Yes | Yes | No | 270ms |
| AWS Transcribe | Good | Yes | Unclear | Unclear | Yes (fr-CA) | Yes | Amazon Transcribe Medical | Medium |
### Per-Language Recommendations
English: Deepgram Nova-3 Medical is the clear winner — 3.45% median WER, sub-300ms latency, HIPAA compliant, medical terminology support with keyterm prompting. For healthcare, this is unbeatable.
Mandarin Chinese: Deepgram Nova-3 (best streaming latency) or Google Chirp 3 (broadest feature set). Both strong performers. Speechmatics is also competitive.
Cantonese (HARDEST): This is the critical gap. Your best options are:
1. Google Chirp 2/3 — Confirmed support for yue-Hant-HK with streaming. Best documented option.
2. Azure Speech — Supports yue-CN and zh-HK for recognition.
3. Whisper large-v3 — Added Cantonese as a dedicated language token. Self-hostable. ~8% WER on Fleurs dataset.
4. Deepgram Nova-2 — Supports zh-HK but only on the older Nova-2 model, not Nova-3.
5. Speechmatics — Lists Cantonese among 55+ languages.
Punjabi (SECOND HARDEST): Your best options are:
1. Sarvam AI Saaras v3 — Purpose-built for Indian languages, consistently ranks #1-2 across Indian language benchmarks. 22 Indian languages including Punjabi. Indo-Aryan languages like Punjabi achieve ~5-6% WER.
2. Google Chirp 2/3 — Confirmed support for pa-Guru-IN.
3. Azure Speech — Supports pa-IN with both STT and TTS.
4. AssemblyAI Universal — Lists Punjabi among 99 supported languages.
5. Whisper large-v3 — Supports Punjabi.
French-Canadian: Well supported across most providers. Deepgram Nova-3 (fr-CA), Google Chirp 3, Azure, and AssemblyAI all provide strong fr-CA support.
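One practical consequence of the matrix above: every vendor names the same language differently (zh vs. cmn-Hans-CN, zh-HK vs. yue-Hant-HK). A routing layer therefore needs a locale translation table. The dictionary below collects the codes cited in this section; the `locale_for` helper is an illustrative sketch, not a vendor API:

```python
# Provider-specific locale codes for the five target languages, taken from
# the support matrix above. A routing layer translates an internal language
# key into each vendor's identifier before opening a streaming session.
LOCALE_CODES = {
    "english":   {"deepgram": "en",    "google": "en-US",       "azure": "en-US"},
    "mandarin":  {"deepgram": "zh",    "google": "cmn-Hans-CN", "azure": "zh-CN"},
    "cantonese": {"deepgram": "zh-HK", "google": "yue-Hant-HK", "azure": "zh-HK"},  # Deepgram: Nova-2 only
    "punjabi":   {"sarvam": "pa-IN",   "google": "pa-Guru-IN",  "azure": "pa-IN"},
    "french_ca": {"deepgram": "fr-CA", "google": "fr-CA",       "azure": "fr-CA"},
}

def locale_for(language: str, provider: str) -> str:
    """Look up a vendor's locale code, failing loudly for unsupported pairs
    (e.g. Deepgram has no confirmed Punjabi support)."""
    codes = LOCALE_CODES[language]
    if provider not in codes:
        raise ValueError(f"{provider} has no confirmed {language} support")
    return codes[provider]
```

Failing loudly on unsupported pairs matters here: silently falling back to a wrong locale (say, Mandarin for a Cantonese caller) is worse than an explicit routing error.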
Sources:
- Deepgram Models & Languages Overview
- Deepgram Nova-3 Medical
- Google Cloud STT Supported Languages
- Azure Language and Voice Support
- AssemblyAI 99 Languages
- Sarvam AI Speech to Text
- Whisper large-v3 on HuggingFace
## 3. LLM Options for Multilingual Tool-Calling

### Comparison Matrix
| Model | English Tool-Call | Chinese Tool-Call | Punjabi/Hindi | French | Latency | Self-Host | Cost |
|---|---|---|---|---|---|---|---|
| GPT-4o | Excellent | Very Good | Good | Excellent | ~350-700ms TTFT | No | $2.50/$10 per 1M tokens |
| GPT-4o-mini | Very Good | Good | Good | Very Good | ~200-400ms | No | $0.15/$0.60 per 1M |
| Claude Opus 4.6 | Excellent | Good | Good | Excellent | ~500-1000ms | No | $15/$75 per 1M |
| Claude Sonnet 4.6 | Excellent | Good | Good | Very Good | ~300-600ms | No | $3/$15 per 1M |
| Gemini 2.5 Flash | Very Good (71.5% ComplexFuncBench) | Very Good | Good | Very Good | ~200-400ms | No | $0.15/$0.60 per 1M |
| Gemini 2.5 Pro | Excellent | Excellent | Good | Excellent | ~400-800ms | No | $1.25/$10 per 1M |
| Qwen3 (72B) | Very Good | Best (native) | Good (119 langs) | Good | ~234ms cold-start | Yes (open) | Free self-hosted |
| DeepSeek V3.1 | Good (improved) | Excellent (native) | Fair | Good | Medium | Yes (open) | Very cheap |
| Mistral Large 2 | Very Good | Good | Good (Hindi) | Excellent (native French) | Medium | Partial | $2/$6 per 1M |
| Command R+ | Good | Fair | Fair | Good | Medium | No | $2.50/$10 per 1M |
### Per-Language LLM Recommendations
English: GPT-4o or Claude Sonnet 4.6. Both excellent at tool-calling in English. For voice agents, GPT-4o-mini or Gemini 2.5 Flash offer the best latency-to-quality ratio.
Mandarin/Cantonese: Qwen3 is the strongest for Chinese — purpose-built by Alibaba with native Chinese training data. Supports Cantonese specifically. DeepSeek V3.1 is also very strong on Chinese but has documented tool-calling instability ("looped calls or empty responses"). GPT-4o is the safe enterprise choice.
Punjabi: No model excels specifically at Punjabi tool-calling. GPT-4o is the most reliable general-purpose option. Qwen3 supports 119 languages including Punjabi.
French-Canadian: Mistral Large 2 is purpose-built for French. GPT-4o and Claude are also very strong.
### Critical Insight: DeepSeek V3 Tool-Calling
VitaraVox MEMORY.md already captures this: "DeepSeek V3 tool_choice:'auto' unreliable (3-15% failure) — do NOT use as launch LLM for ZH." This remains true. While DeepSeek V3.1 improved stability, it is still not reliable enough for production healthcare tool-calling where a missed function call could mean a missed appointment.
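Given the 3-15% failure mode noted above, a stack that still wants a Chinese-native model later could wrap tool calls in a retry-then-fallback guard. This is a minimal sketch with an assumed callable interface (each model is a function returning a tool-call dict or `None`), not any provider's SDK:

```python
def call_with_fallback(primary, fallback, messages, max_retries=2):
    """Try the primary LLM's tool call; on empty or malformed responses,
    retry, then hand the turn to the reliable fallback model.
    `primary`/`fallback` are callables returning a tool-call dict or None
    (an assumed interface for illustration, not a real SDK signature)."""
    for _ in range(max_retries):
        result = primary(messages)
        if result:  # got a well-formed tool call
            return result
    # e.g. GPT-4o as the safe enterprise choice when the primary misfires
    return fallback(messages)
```

For healthcare booking, the fallback path is what turns a 3-15% silent failure rate into added latency on a small fraction of turns, which is a much better trade.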
### Emerging Option: Gemini 2.5 Flash Native Audio
Gemini 2.5 Flash with native audio can process speech directly without STT, call tools, and generate speech output — all in a single model with 90% instruction adherence. It achieved 71.5% on ComplexFuncBench Audio. This could eventually eliminate the entire STT-LLM-TTS pipeline, but it currently limits provider flexibility and language-specific optimization.
Sources:
- Qwen3 Blog
- Qwen3-Omni: 119 Languages
- DeepSeek Function Calling Docs
- Gemini 2.5 Flash Native Audio
## 4. TTS (Text-to-Speech) — Per-Language Analysis

### Comparison Matrix
| Provider | English | Mandarin | Cantonese | Punjabi | French-CA | Latency (TTFA) | Voice Cloning | Self-Host |
|---|---|---|---|---|---|---|---|---|
| ElevenLabs v3 | Best | Good | Supported (limited) | Supported | Good (via French) | 75ms (Flash v2.5) | Yes | No |
| Azure Neural | Very Good | Very Good | Yes (3 voices: HiuGaai, HiuMaan, WanLung) | Yes (new GA voices) | Yes (4 voices: Sylvie, Jean, Antoine, Thierry + HD) | ~100ms | No | No |
| Google Chirp 3 HD | Very Good | Very Good | Yes (Preview) | Yes (Preview) | Good | Low | No | No |
| Cartesia Sonic 3 | Very Good | Supported | Unclear | Unclear | Supported | 40ms TTFA | No | No |
| LMNT | Very Good | Limited | Unlikely | Unlikely | Limited | <300ms | Yes (5s clip) | No |
| Sarvam Bulbul v3 | Indian English | No | No | Yes (best for Punjabi) | No | Low | No | No |
| Qwen3-TTS | Good | Best | Yes (9 Chinese dialects incl. Cantonese) | No | No | 97ms | Yes (3s clip) | Yes (Apache 2.0) |
| PlayHT 3.0 Mini | Very Good | Supported | Likely (142 langs) | Likely | Likely | Low | Yes | No |
| Coqui XTTS-v2 | Good | Good (zh-cn) | No (16 langs only) | No | Yes (fr) | <150ms | Yes (6s clip) | Yes (AGPL) |
| Amazon Polly | Good | Good | No | No | Yes (fr-CA) | Medium | No | No |
### Per-Language TTS Recommendations
English: ElevenLabs v3 remains the gold standard for naturalness and expressiveness. Cartesia Sonic 3 at 40ms TTFA is best for ultra-low latency. For healthcare, ElevenLabs Multilingual v2 with professional voices is recommended.
Mandarin Chinese: Qwen3-TTS is the standout — open-source (Apache 2.0), self-hostable, 97ms latency, and purpose-built for Chinese with the best prosody and naturalness in Chinese. Azure is the strong cloud alternative with multiple zh-CN voices including HD.
Cantonese (HARDEST for TTS):
1. Qwen3-TTS — Explicitly supports 9 Chinese dialects including Cantonese. Self-hostable. This is the strongest option.
2. Azure Speech — 3 dedicated Cantonese voices (zh-HK): HiuGaaiNeural (female), HiuMaanNeural (female), WanLungNeural (male). Plus XiaoxiaoDialectsNeural with yue-CN secondary locale.
3. Google Chirp 3 HD — Cantonese (yue-HK) in Preview.
4. ElevenLabs — Supports Cantonese, but quality for Cantonese specifically is unvalidated compared to Mandarin.

Punjabi (SECOND HARDEST for TTS):
1. Sarvam AI Bulbul v3 — Purpose-built for 11 Indian languages including Punjabi. Lowest character error rate across Indian domains. Best prosody for Indian languages.
2. Azure Speech — New Punjabi (pa-IN) neural voices, both male and female, now GA.
3. Google Chirp 3 HD — Punjabi (pa-IN) in Preview.
4. ElevenLabs — Supports Punjabi, but quality is rated "Good" (10-25% error range).

French-Canadian:
1. Azure Speech — 4 dedicated fr-CA voices, including Dragon HD Latest versions of Sylvie and Thierry. The most mature option with the highest-quality HD voices.
2. ElevenLabs — Strong French support through multilingual models.
3. Google Chirp 3 HD — fr-CA supported.
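The primary/fallback pairs above imply a runtime failover chain, sketched here with TTS providers as plain callables (an assumption for illustration; real integrations would wrap vendor SDK clients):

```python
def synthesize(text, providers):
    """Try each TTS provider in order (primary first, then fallbacks).
    Each provider is a callable returning audio bytes or raising on failure;
    this is an illustrative failover pattern, not a specific vendor SDK."""
    last_error = None
    for provider in providers:
        try:
            return provider(text)
        except Exception as exc:  # network error, quota, unsupported voice...
            last_error = exc
    raise RuntimeError("all TTS providers failed") from last_error
```

In practice each language track would carry its own ordered list, e.g. Qwen3-TTS first and the Azure zh-HK voice second for Cantonese, so an outage degrades voice quality instead of dropping the call.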
Sources:
- ElevenLabs Models
- Azure Speech Language Support
- Google Chirp 3 HD Voices
- Cartesia Sonic 3
- Sarvam AI Bulbul v3
- Qwen3-TTS GitHub
## 5. Open-Source / Self-Hosted Options for Canadian Healthcare

### Framework Comparison
| Framework | License | Self-Host | HIPAA | Multilingual Routing | Maturity | Community |
|---|---|---|---|---|---|---|
| LiveKit Agents | Apache 2.0 | Full | Yes | Yes (documented guide) | Production | Large (28K+ stars) |
| Pipecat | MIT (BSD for Cloud) | Full (framework), Daily.co for transport | Via Pipecat Cloud | Yes (pipeline swappable) | Production | Growing (8K+ stars) |
| Vocode | MIT | Full | DIY | Yes (composable) | Maturing | Medium |
| Ultravox | Apache 2.0 | Full (model weights on HuggingFace) | DIY | 42 languages natively | Research-to-Production | Growing |
| Moshi (Kyutai) | Apache 2.0 | Full | DIY | Limited | Research | Small |
### Canadian Data Residency Architecture
For PHIPA/PIPA/PIPEDA compliance:
- Compute: Deploy on AWS Canada Central (ca-central-1), Azure Canada Central/East, or Google Cloud northamerica-northeast1 (Montreal).
- Voice Transport: Self-hosted LiveKit server or Daily.co (Pipecat) in Canadian region.
- STT Processing: Google Chirp (available in regional endpoints), Azure Speech (Canada regions), or self-hosted Whisper/Sarvam on Canadian GPU instances.
- LLM: Azure OpenAI Service in Canada East (GPT-4o available), or self-hosted Qwen3 on Canadian GPU.
- TTS: Azure Speech (Canada regions), self-hosted Qwen3-TTS, or Google Cloud TTS (Montreal endpoint).
- Key Principle: All PHI must be encrypted at rest and in transit. No PHI should leave Canadian borders.
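The residency principle can be enforced mechanically at startup. A minimal sketch, assuming region identifiers are available in service configuration; the allow-list mirrors the regions named above, and passing this check is an engineering control, not a legal determination:

```python
# Canadian cloud regions listed above. Any service endpoint configured
# outside this set would move PHI out of Canada, so the deployment should
# refuse to start. (Membership here is a guardrail, not a compliance ruling.)
CANADIAN_REGIONS = {
    "ca-central-1",                  # AWS Canada (Central)
    "canadacentral", "canadaeast",   # Azure Canada Central / East
    "northamerica-northeast1",       # Google Cloud (Montreal)
}

def assert_canadian(region: str) -> None:
    """Fail fast if any STT/LLM/TTS/transport service is configured outside Canada."""
    if region not in CANADIAN_REGIONS:
        raise ValueError(f"region {region!r} violates Canadian data residency")
```

Running this check once at boot, over every configured endpoint, catches the most common residency regression: a provider SDK silently defaulting to a US region.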
### Best Self-Hosted Stack for Maximum Control
LiveKit Agents is the recommendation for Canadian healthcare deployment:
- Full self-hosting on Canadian infrastructure
- Documented HIPAA compliance
- Built-in multilingual voice agent guide with per-language provider switching
- Plugin ecosystem supports all major STT/TTS/LLM providers
- WebRTC transport handles telephony, web, and mobile
- Hardware-accelerated VAD for interruption handling
- Production-ready with a large enterprise customer base
Sources:
- LiveKit Agents GitHub
- LiveKit Self-Hosted Deployments
- Pipecat GitHub
- Canadian Data Residency and Cloud
## 6. THE "UNBEATABLE" STACK RECOMMENDATION

### Architecture: LiveKit Agents on Canadian Infrastructure
Current Infrastructure
VitaraVox currently runs on OCI ARM (Toronto). The dev OSCAR instance runs on AWS EC2 (ca-central-1). The recommendations below assume migration to AWS ca-central-1 as part of the enterprise stack migration.
Orchestrator: LiveKit Agents (self-hosted on AWS ca-central-1 or Azure Canada Central)

Language Detection & Routing: Router pattern similar to the current Vapi v3.0 squad, but implemented in code:
- Initial language detection via AssemblyAI Universal or Deepgram Nova-3 multi-language mode
- Route to a language-specific pipeline with optimized STT/LLM/TTS per track
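The router pattern can be sketched as follows. The confidence threshold and the `detect` interface (a callable returning a language/confidence pair) are assumptions for illustration, not part of any specific framework:

```python
def route_call(detect, pipelines, default="english", min_confidence=0.8):
    """Run initial language detection, then hand the call to that language's
    pipeline. Below the confidence threshold (0.8 is an assumed value to be
    tuned against real calls), stay on the default English track rather than
    guess. `detect` returns a (language, confidence) pair."""
    language, confidence = detect()
    if confidence < min_confidence or language not in pipelines:
        language = default
    return language, pipelines[language]
```

Defaulting to English on low confidence is a deliberate choice: a mis-routed Cantonese caller on the English track can ask for Cantonese, whereas a mis-routed English caller on the Punjabi track is stuck.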
### Optimal STT Per Language
| Language | Primary STT | Fallback STT | Rationale |
|---|---|---|---|
| English | Deepgram Nova-3 Medical | Google Chirp 3 | 3.45% WER, medical terminology, sub-300ms, HIPAA |
| Mandarin | Deepgram Nova-3 (zh) | Google Chirp 3 (cmn-Hans-CN) | Best latency + accuracy for Mandarin |
| Cantonese | Google Chirp 3 (yue-Hant-HK) | Azure Speech (zh-HK) | Only tier-1 providers with confirmed Cantonese streaming |
| Punjabi | Sarvam AI Saaras v3 | Google Chirp 3 (pa-Guru-IN) | Sarvam ranks #1 for Indian languages; Google as cloud fallback |
| French-CA | Deepgram Nova-3 (fr-CA) | Google Chirp 3 (fr-CA) | Deepgram best latency; Google best for accent robustness |
### Optimal LLM Per Language
| Language | Primary LLM | Rationale |
|---|---|---|
| English | GPT-4o-mini or Gemini 2.5 Flash | Best latency-to-quality for voice agents |
| Mandarin | GPT-4o (launch) / Qwen3-72B (post-launch) | GPT-4o reliable for tool-calling. Qwen3 stronger on Chinese but needs validation. |
| Cantonese | GPT-4o | Most reliable tool-calling. System prompt instructs Cantonese output. |
| Punjabi | GPT-4o | No model specializes in Punjabi tool-calling. GPT-4o is safest. |
| French-CA | GPT-4o-mini or Mistral Large 2 | Mistral is natively French but GPT-4o-mini is faster for voice. |
### Optimal TTS Per Language
| Language | Primary TTS | Fallback TTS | Rationale |
|---|---|---|---|
| English | ElevenLabs v3 / Cartesia Sonic 3 | Azure Neural HD | ElevenLabs = best quality; Cartesia = lowest latency (40ms TTFA) |
| Mandarin | Qwen3-TTS (self-hosted) | Azure (zh-CN-XiaoxiaoNeural) | Qwen3-TTS best Chinese quality, self-hostable, 97ms latency |
| Cantonese | Qwen3-TTS (self-hosted) | Azure (zh-HK-HiuGaaiNeural) | Qwen3-TTS explicitly supports Cantonese dialect. Azure has 3 dedicated voices. |
| Punjabi | Sarvam AI Bulbul v3 | Azure (pa-IN Neural) | Sarvam purpose-built for Indian languages with best CER |
| French-CA | Azure (fr-CA-SylvieNeural HD) | ElevenLabs Multilingual v2 | Azure has 4 dedicated fr-CA voices with Dragon HD. Best quality for Quebec French. |
### Should You Stay on Vapi or Migrate?
Migrate. Here is why:

1. Vapi cannot dynamically route STT/TTS per detected language within a single call. Your current v3.0 squad architecture with 9 assistants is a workaround, not a solution. Each handoff adds latency and complexity.
2. Vapi adds 50-100ms overhead per turn on top of provider latency. At scale, this compounds.
3. Vapi cannot use Sarvam AI for Punjabi STT/TTS or Qwen3-TTS for Cantonese. These are your best-in-class providers for underserved languages, and Vapi only integrates with a fixed set of providers.
4. Canadian data residency is impossible with Vapi. You cannot control where Vapi processes audio. With self-hosted LiveKit/Pipecat, you control every data flow.
5. Cost at scale. At 50K+ minutes/month, self-hosting saves 60-80% versus Vapi.
6. Healthcare compliance. Self-hosting gives you full audit trails, encryption control, and PHIPA compliance documentation that a managed platform cannot provide.
### Migration Strategy
Phase 1 (Month 1-2): Build English-only LiveKit Agent with Deepgram Nova-3 Medical + GPT-4o-mini + ElevenLabs. Validate latency, tool-calling, and call quality against Vapi v3.0 baseline.
Phase 2 (Month 2-3): Add Mandarin track with Deepgram Nova-3 (zh) + GPT-4o + Qwen3-TTS. Implement language detection router.
Phase 3 (Month 3-4): Add Cantonese track with Google Chirp 3 (yue) + GPT-4o + Qwen3-TTS (Cantonese). This is the hardest track — validate Cantonese STT accuracy extensively.
Phase 4 (Month 4-5): Add Punjabi track with Sarvam Saaras v3 + GPT-4o + Sarvam Bulbul v3. Add French-CA track with Deepgram Nova-3 (fr-CA) + GPT-4o-mini + Azure fr-CA HD.
Phase 5 (Month 5-6): Full regression testing across all 5 languages. PHIPA compliance audit. Production cutover.
Keep Vapi v3.0 running in parallel throughout this period as your production fallback.
### The "Wild Card" Option: Gemini 2.5 Flash Native Audio

Google's Gemini 2.5 Flash with native audio processing is a potential game-changer that could eventually collapse the entire STT-LLM-TTS pipeline into a single model:
- Processes speech natively (no STT step)
- 71.5% on ComplexFuncBench Audio (leading)
- 90% instruction adherence
- Supports 70+ languages with mid-conversation switching
- Speech-to-speech translation built in

However, it is NOT recommended as primary architecture today because:
- You lose per-language STT/TTS optimization (no Sarvam for Punjabi, no Qwen3-TTS for Cantonese)
- Canadian data residency is difficult with Google's API
- Medical domain vocabulary cannot be customized (unlike Deepgram Nova-3 Medical keyterm prompting)
- Still in active development; behavior may change
Monitor this closely. In 12-18 months, a hybrid approach using Gemini Native Audio for common languages and specialized providers for underserved languages may be optimal.
### Estimated Per-Language Latency (Self-Hosted Stack)
| Component | Latency | Cumulative |
|---|---|---|
| VAD + End-of-turn | ~200-400ms | 200-400ms |
| STT (Deepgram streaming) | ~150-300ms | 350-700ms |
| LLM (GPT-4o-mini TTFT) | ~200-400ms | 550-1100ms |
| TTS (Qwen3-TTS/ElevenLabs TTFA) | ~75-100ms | 625-1200ms |
| Total response time | | 625ms-1.2s |
This is competitive with human conversation response times (300-500ms for listening + 300-500ms for formulation) and significantly better than Vapi's typical 1.5-2.5s response times.
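The cumulative column is simply a running sum of the component ranges, which is easy to sanity-check:

```python
# Component latency ranges in milliseconds, from the table above.
BUDGET = {
    "vad_end_of_turn": (200, 400),
    "stt_streaming":   (150, 300),
    "llm_ttft":        (200, 400),
    "tts_ttfa":        (75, 100),
}

def total_latency(budget):
    """Sum the per-component (low, high) ranges into a total response window."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

total_latency(BUDGET)  # (625, 1200), matching the table's 625ms-1.2s total
```

Keeping the budget as data like this also makes it easy to re-run the check per language track, since the STT and TTS rows differ by provider.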
### Cost Comparison (at 50K minutes/month)
| Stack | Monthly Cost |
|---|---|
| Vapi v3.0 (current) | ~$5,000-6,500 (platform + providers) |
| Self-hosted LiveKit + best-in-class providers | ~$1,500-2,500 (infra + provider APIs) |
| Self-hosted LiveKit + self-hosted models (Qwen3-TTS, Whisper) | ~$800-1,500 (GPU infra + minimal APIs) |
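The 60-80% savings figure quoted earlier follows directly from the midpoints of the table above, as a quick check shows (stack names are internal labels, and the ranges are the table's estimates at 50K minutes/month, not quotes):

```python
# Monthly cost ranges in USD at ~50K minutes/month, from the table above.
STACKS = {
    "vapi":              (5_000, 6_500),
    "livekit_apis":      (1_500, 2_500),
    "livekit_self_host": (800, 1_500),
}

def savings_vs_vapi(stack):
    """Midpoint savings of a stack relative to the Vapi baseline, as a fraction."""
    mid = lambda pair: sum(pair) / 2
    return round(1 - mid(STACKS[stack]) / mid(STACKS["vapi"]), 2)

savings_vs_vapi("livekit_apis")       # 0.65 -> ~65% savings
savings_vs_vapi("livekit_self_host")  # 0.8  -> ~80% savings
```

The midpoint comparison brackets the claimed 60-80% range; the exact figure depends on call volume mix and how much of the model stack is self-hosted on GPU instances.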
### Final Architecture Diagram

    ORCHESTRATOR: LiveKit Agents (self-hosted, Canadian region)
                  OR Pipecat (if you prefer Python-first)

    TRANSPORT:    LiveKit Server (self-hosted, WebRTC)
                  + Twilio/Telnyx for PSTN telephony

    LANGUAGE      AssemblyAI Universal (initial detection)
    DETECTION:    then route to language-specific pipeline

          +-----------+--------------+-------------------+------------+-----------+
          | ENGLISH   | MANDARIN     | CANTONESE         | PUNJABI    | FRENCH-CA |
    STT:  | Deepgram  | Deepgram     | Google Chirp 3    | Sarvam AI  | Deepgram  |
          | Nova-3    | Nova-3 (zh)  | (yue-Hant-HK)     | Saaras v3  | Nova-3    |
          | Medical   |              |                   |            | (fr-CA)   |
    LLM:  | GPT-4o-   | GPT-4o       | GPT-4o            | GPT-4o     | GPT-4o-   |
          | mini      |              |                   |            | mini      |
    TTS:  | ElevenLabs| Qwen3-TTS    | Qwen3-TTS         | Sarvam AI  | Azure     |
          | v3 or     | (self-hosted)| (self-hosted,     | Bulbul v3  | fr-CA-    |
          | Cartesia  |              | Cantonese dialect)|            | Sylvie HD |
          +-----------+--------------+-------------------+------------+-----------+

    INFRA:        AWS ca-central-1 (or Azure Canada Central)
                  GPU instances for Qwen3-TTS self-hosting
                  All PHI stays in Canada
This stack gives you the highest accuracy per language, lowest latency, maximum control, Canadian data residency, and 60-80% cost reduction versus Vapi at scale.