LLM Control Plane¶
Multi-Model Routing, Per-Clinic Cost Tracking, and Failover¶
Date: February 18, 2026¶
Why You Need an LLM Control Plane¶
Today, VitaraVox has zero visibility into LLM costs, no failover if OpenAI goes down, and no ability to test different models. GPT-4o is hardwired into all 9 Vapi assistant configurations. When you onboard 50 clinics, these questions become urgent:
| Question | Current Answer | With LLM Control Plane |
|---|---|---|
| What does Clinic A cost per call? | No idea | $0.14/call (dashboard shows it) |
| Can we use GPT-4o-mini for simple turns? | No (hardwired) | Yes — route by complexity |
| What if OpenAI has a 2-hour outage? | All 50 clinics down | Auto-failover to Claude in <1s |
| Is Claude better than GPT-4o for Chinese? | Can't test | A/B test with 80/20 split |
| Can Clinic B get a cheaper tier? | All clinics share one model | Budget model per pricing plan |
Architecture: LiteLLM Proxy¶
LiteLLM is an open-source LLM gateway that provides a unified OpenAI-compatible API across 100+ providers. Self-hosted in your VPC, the proxy itself (and all its logs and cost data) stays in ca-central-1; where the underlying model traffic goes depends on the provider you route to (see Tier 5 below for Canadian-residency options).
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ Voice Agent Call Flow (with LLM Control Plane) │
│ │
│ ┌──────────┐ │
│ │ Caller │ │
│ └────┬─────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Vapi Platform (Phase 0-4) / LiveKit (Phase 5) │ │
│ │ │ │
│ │ STT: Deepgram / AssemblyAI / Google │ │
│ │ │ │
│ │ LLM Call ──────────────────────────────────────────┐ │ │
│ │ (custom-llm endpoint) │ │ │
│ │ │ │ │
│ │ TTS: ElevenLabs / Azure / Qwen3-TTS │ │ │
│ └──────────────────────────────────────────────────────┼────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy Server │ │
│ │ (ECS Fargate, ca-central-1) │ │
│ │ │ │
│ │ Request arrives with headers: │ │
│ │ x-clinic-id: "clinic_abc_123" │ │
│ │ x-call-id: "call_xyz_789" │ │
│ │ x-agent-name: "booking-en" │ │
│ │ model: "voice-agent-primary" │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ 1. ROUTING ENGINE │ │ │
│ │ │ │ │ │
│ │ │ Strategy: latency-based (picks fastest healthy provider) │ │ │
│ │ │ │ │ │
│ │ │ Model Group "voice-agent-primary": │ │ │
│ │ │ ├── openai/gpt-4o (weight: 0.8, priority: 1) │ │ │
│ │ │ ├── anthropic/claude-sonnet (weight: 0.2, priority: 2) │ │ │
│ │ │ └── google/gemini-2.5-flash (priority: 3, fallback) │ │ │
│ │ │ │ │ │
│ │ │ Model Group "voice-agent-fast": │ │ │
│ │ │ ├── openai/gpt-4o-mini (priority: 1) │ │ │
│ │ │ └── google/gemini-2.5-flash (priority: 2, fallback) │ │ │
│ │ │ │ │ │
│ │ │ Model Group "voice-agent-zh": │ │ │
│ │ │ ├── openai/gpt-4o (priority: 1) │ │ │
│ │ │ └── qwen3-72b (self-hosted) (priority: 2) │ │ │
│ │ │ │ │ │
│ │ │ Model Group "voice-agent-budget": │ │ │
│ │ │ └── openai/gpt-4.1-nano (priority: 1) │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ 2. COST TRACKER │ │ │
│ │ │ │ │ │
│ │ │ Every request logged: │ │ │
│ │ │ { │ │ │
│ │ │ "clinic_id": "clinic_abc_123", │ │ │
│ │ │ "call_id": "call_xyz_789", │ │ │
│ │ │ "model": "gpt-4o", │ │ │
│ │ │ "input_tokens": 1847, │ │ │
│ │ │ "output_tokens": 312, │ │ │
│ │ │ "cost_usd": 0.0153, │ │ │
│ │ │ "latency_ms": 423, │ │ │
│ │ │ "agent_name": "booking-en" │ │ │
│ │ │ } │ │ │
│ │ │ │ │ │
│ │ │ Aggregations available: │ │ │
│ │ │ GET /spend/tags?tag=clinic_id:abc → monthly spend │ │ │
│ │ │ GET /spend/tags?tag=model:gpt-4o → model breakdown │ │ │
│ │ │ GET /spend/tags?tag=agent:booking-en → per-agent cost │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ 3. FAILOVER ENGINE │ │ │
│ │ │ │ │ │
│ │ │ If primary model fails (timeout, 5xx, rate limit): │ │ │
│ │ │ 1. Retry same model once (0.5s delay) │ │ │
│ │ │ 2. If still failing, route to next priority in group │ │ │
│ │ │ 3. If all models in group fail, return error │ │ │
│ │ │ │ │ │
│ │ │ Health tracking: │ │ │
│ │ │ - Tracks success/failure per model over sliding window │ │ │
│ │ │ - Models with >50% failure rate deprioritized │ │ │
│ │ │ - Automatic recovery when model starts succeeding │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ 4. BUDGET ENFORCEMENT │ │ │
│ │ │ │ │ │
│ │ │ Per-clinic budget limits: │ │ │
│ │ │ clinic_abc_123: max $200/month (Standard plan) │ │ │
│ │ │ clinic_def_456: max $500/month (Enterprise plan) │ │ │
│ │ │ clinic_ghi_789: max $50/month (Starter plan) │ │ │
│ │ │ │ │ │
│ │ │ When budget reached: │ │ │
│ │ │ - Alert admin dashboard │ │ │
│ │ │ - Optionally: downgrade to budget model │ │ │
│ │ │ - Optionally: block new calls with message │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Routes to: │
│ ┌───────────┬───────────┬───────────┬───────────┬──────────┐ │
│ │ OpenAI │ Anthropic │ Google │Azure AOAI │Self-Host │ │
│ │ │ │ Vertex │(CA East) │(Qwen3) │ │
│ │ gpt-4o │ claude- │ gemini- │ gpt-4o │ qwen3 │ │
│ │ gpt-4o- │ sonnet │ 2.5-flash│ (CA data │ -72b │ │
│ │ mini │ claude- │ gemini- │ residency│ │ │
│ │ gpt-4.1- │ haiku │ 2.5-pro │ option) │ │ │
│ │ nano │ │ │ │ │ │
│ └───────────┴───────────┴───────────┴───────────┴──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
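The failover behaviour in box 3 can be sketched as a small loop: retry the failing model once after a short delay, then walk down the priority list. This is a simplified model of the behaviour, not LiteLLM's internal code; `call_model` and the group structure are illustrative:

```python
import time

def complete_with_failover(group, call_model, retry_delay=0.5):
    """Try each deployment in priority order; retry each failure once before
    moving to the next priority. Raises only if the whole group fails."""
    last_err = None
    for model in sorted(group, key=lambda m: m["priority"]):
        for attempt in range(2):  # original attempt + one retry
            try:
                return call_model(model["name"])
            except Exception as err:  # timeout, 5xx, rate limit, ...
                last_err = err
                if attempt == 0:
                    time.sleep(retry_delay)  # brief pause before the retry
    raise RuntimeError(f"all models in group failed: {last_err}")
```

In the real proxy, `num_retries`, `retry_after`, and `fallbacks` in `router_settings` express the same policy declaratively.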
Multi-Model Routing Strategies¶
Strategy 1: Latency-Based (Default for Voice)¶
Picks the model with the lowest current latency from the healthy pool. Critical for voice agents where every 100ms matters.
router_settings:
  routing_strategy: "latency-based-routing"
  num_retries: 2
  retry_after: 0.5
  allowed_fails: 2
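Conceptually, latency-based routing keeps a rolling latency sample per deployment and picks the current minimum. A toy sketch of that idea (not LiteLLM internals):

```python
from collections import defaultdict, deque

class LatencyRouter:
    """Pick the deployment with the lowest recent average latency."""

    def __init__(self, window=20):
        # rolling window of latency samples per deployment
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, model, latency_ms):
        self.samples[model].append(latency_ms)

    def pick(self, candidates):
        def avg(m):
            s = self.samples[m]
            return sum(s) / len(s) if s else 0.0  # unseen models get first shot
        return min(candidates, key=avg)
```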
Strategy 2: Weighted (A/B Testing)¶
Split traffic by percentage. Use for canary deployments of new models.
router_settings:
  routing_strategy: "simple-shuffle"  # weight-proportional random pick
model_list:
  - model_name: "voice-agent-primary"
    litellm_params:
      model: openai/gpt-4o
      weight: 8   # ~80% of traffic
  - model_name: "voice-agent-primary"
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      weight: 2   # ~20% canary
Strategy 3: Cost-Based (Budget Clinics)¶
Route to cheapest model that meets quality threshold. Use for starter-plan clinics.
# Budget routing: gpt-4.1-nano first, escalate if needed
model_list:
  - model_name: "voice-agent-budget"
    litellm_params:
      model: openai/gpt-4.1-nano  # $0.10/$0.40 per 1M tokens
  - model_name: "voice-agent-budget"
    litellm_params:
      model: openai/gpt-4o-mini   # fallback if nano fails
Strategy 4: Complexity-Based (Smart Routing)¶
Route simple turns (greetings, confirmations) to fast/cheap models. Route complex turns (multi-constraint booking, rescheduling) to premium models.
Caller: "Yes, that works!"
→ voice-agent-fast (gpt-4o-mini, ~200ms, $0.0002)
Caller: "I need to see Dr. Chen next Thursday afternoon but not before 2pm
and can you also cancel my Monday appointment?"
→ voice-agent-primary (gpt-4o, ~400ms, $0.015)
Implementation: Vapi's custom LLM endpoint forwards the full prompt (including the system prompt), so LiteLLM can route on prompt length, or on a custom x-complexity: simple|complex header set by a lightweight classifier.
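A lightweight classifier for the x-complexity header can start as a heuristic over the latest user turn. The keyword list and word threshold below are illustrative, not tuned; in practice you would calibrate them against real call transcripts:

```python
import re

# Illustrative markers of multi-constraint requests; tune against real transcripts.
COMPLEX_MARKERS = re.compile(
    r"\b(but|except|reschedul\w*|cancel|instead|before|after|also|and can you)\b",
    re.IGNORECASE,
)

def classify_turn(user_text: str, word_threshold: int = 12) -> str:
    """Return 'complex' or 'simple' for the x-complexity header."""
    if len(user_text.split()) > word_threshold or COMPLEX_MARKERS.search(user_text):
        return "complex"
    return "simple"
```

Misclassifying "simple" as "complex" only costs a few cents, while the reverse risks a failed tool call, so a conservative threshold is the safer default.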
Available Models (Yes, You Can Use Any of These)¶
LiteLLM supports 100+ LLM providers through a single API. Here are the ones relevant to healthcare voice agents:
Tier 1: Premium (Complex Tool-Calling)¶
| Model | Provider | Latency (TTFT) | Cost (1M tokens) | Best For |
|---|---|---|---|---|
| GPT-4o | OpenAI | ~350-700ms | $2.50 / $10.00 | English + Chinese tool-calling |
| Claude Sonnet 4.6 | Anthropic | ~300-600ms | $3.00 / $15.00 | Complex instructions, safety |
| Gemini 2.5 Pro | Google | ~400-800ms | $1.25 / $10.00 | Multilingual, long context |
| Claude Opus 4.6 | Anthropic | ~500-1000ms | $15.00 / $75.00 | Highest quality (overkill for voice) |
Tier 2: Fast (Simple Turns, Confirmations)¶
| Model | Provider | Latency (TTFT) | Cost (1M tokens) | Best For |
|---|---|---|---|---|
| GPT-4o-mini | OpenAI | ~200-400ms | $0.15 / $0.60 | Fast responses, simple logic |
| Gemini 2.5 Flash | Google | ~200-400ms | $0.15 / $0.60 | Multilingual fast tier |
| Claude Haiku 4.5 | Anthropic | ~200-400ms | $0.80 / $4.00 | Safety-focused fast tier |
Tier 3: Budget (Starter Plan Clinics)¶
| Model | Provider | Latency | Cost (1M tokens) | Best For |
|---|---|---|---|---|
| GPT-4.1-nano | OpenAI | ~150-300ms | $0.10 / $0.40 | Lowest cost, basic tool-calling |
Tier 4: Specialized (Language-Specific)¶
| Model | Provider | Latency | Cost | Best For |
|---|---|---|---|---|
| Qwen3-72B | Self-hosted (vLLM) | ~200-400ms | Infra only | Chinese (native), 119 languages |
| Mistral Large 2 | Mistral | ~300-500ms | $2.00 / $6.00 | French (native French company) |
| DeepSeek V3.1 | DeepSeek | ~300-500ms | Very low | Chinese (strong but tool-calling unstable) |
Tier 5: Canadian Data Residency¶
| Model | Provider | Region | Notes |
|---|---|---|---|
| GPT-4o | Azure OpenAI | Canada East | Same model, Canadian data residency |
| GPT-4o-mini | Azure OpenAI | Canada East | Same model, Canadian data residency |
| Claude Sonnet | AWS Bedrock | ca-central-1 | Anthropic via Bedrock in Canada |
| Qwen3-72B | Self-hosted | ca-central-1 | Full control, GPU instance required |
Canadian Data Residency
For PHIPA/PIPEDA compliance, Azure OpenAI (Canada East) and AWS Bedrock (ca-central-1) keep all data in Canada. Direct OpenAI API routes through US servers. At enterprise scale, route through Azure OpenAI or Bedrock for Canadian clinics that require data residency.
Per-Clinic Cost Tracking¶
How It Works¶
Every LLM request is tagged with metadata headers. LiteLLM tracks cost per tag automatically.
Vapi/LiveKit → POST /v1/chat/completions
Headers:
  x-clinic-id: clinic_abc_123
  x-call-id: call_xyz_789
  x-agent-name: booking-en

LiteLLM logs:
{
  "clinic_id": "clinic_abc_123",
  "call_id": "call_xyz_789",
  "agent_name": "booking-en",
  "model": "gpt-4o",
  "input_tokens": 1847,
  "output_tokens": 312,
  "cost_usd": 0.0153,
  "latency_ms": 423,
  "timestamp": "2026-02-18T14:32:00Z"
}
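Because the proxy speaks the OpenAI API, any OpenAI-compatible client can attach the tagging headers. A sketch of the request construction (the proxy URL and virtual key are placeholders):

```python
PROXY_URL = "https://llm-proxy.internal.vitaravox.ca/v1/chat/completions"  # placeholder

def tagged_completion_request(clinic_id, call_id, agent_name, messages,
                              model="voice-agent-primary"):
    """Build kwargs for a cost-tagged chat completion through the LiteLLM proxy."""
    return {
        "url": PROXY_URL,
        "headers": {
            "Authorization": "Bearer <per-clinic virtual key>",  # issued by the proxy
            "x-clinic-id": clinic_id,     # drives per-clinic spend tracking
            "x-call-id": call_id,         # drives per-call cost queries
            "x-agent-name": agent_name,   # drives per-agent breakdowns
        },
        "json": {"model": model, "messages": messages},  # model = group name
    }

# When deployed: requests.post(**tagged_completion_request(...))
```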
Cost Queries¶
# Total spend for a clinic this month
GET /spend/tags?tag=clinic_id:clinic_abc_123&start_date=2026-02-01
→ { "total_spend": 42.17 }
# Breakdown by model
GET /spend/tags?tag=clinic_id:clinic_abc_123&group_by=model
→ { "gpt-4o": 38.50, "gpt-4o-mini": 3.67 }
# Breakdown by agent
GET /spend/tags?tag=clinic_id:clinic_abc_123&group_by=agent_name
→ { "booking-en": 18.20, "modification-en": 12.30, "patient-id-en": 8.00, ... }
# Cost per call
GET /spend/tags?tag=call_id:call_xyz_789
→ { "total_spend": 0.14, "turns": 8, "model": "gpt-4o" }
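If you mirror the proxy's request logs into your own store, the same aggregations reduce to a sum over any tag in the log schema. A minimal sketch (the record shape follows the log example above):

```python
from collections import defaultdict

def spend_by(records, key):
    """Aggregate cost_usd over any tag in the log schema
    (clinic_id, model, agent_name, call_id, ...)."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["cost_usd"]
    return dict(totals)
```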
Typical Cost Per Call¶
| Call Type | Turns | Model | Estimated Cost |
|---|---|---|---|
| Simple booking (confirm slot) | 4-6 | GPT-4o | $0.08-0.12 |
| Complex booking (multiple attempts) | 8-12 | GPT-4o | $0.15-0.25 |
| Reschedule | 6-8 | GPT-4o | $0.10-0.18 |
| Registration (new patient) | 10-15 | GPT-4o | $0.20-0.35 |
| Simple booking (budget) | 4-6 | GPT-4o-mini | $0.01-0.02 |
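The per-turn arithmetic behind these estimates is cost = input_tokens × input_price + output_tokens × output_price, per 1M tokens. A helper using the list prices from the tier tables above (note that actual per-call cost is higher than turns × one-turn cost, since each turn's input grows with the conversation history):

```python
# (input, output) USD per 1M tokens, from the tier tables above
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one LLM turn at list prices (no caching discounts)."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# turn_cost("gpt-4o-mini", 1500, 100) -> 0.000285
```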
SaaS Pricing Implications¶
With per-clinic cost visibility, VitaraVox can offer tiered pricing:
| Plan | LLM Model | Budget Cap | Price |
|---|---|---|---|
| Starter | GPT-4o-mini | $50/month | $99/month |
| Standard | GPT-4o | $200/month | $299/month |
| Enterprise | GPT-4o + Claude failover | $500/month | $599/month |
| Premium | GPT-4o + Canadian data residency | $800/month | $999/month |
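The budget-enforcement options from the architecture diagram map onto a small decision function; the soft-alert threshold and the action taken at the cap are illustrative policy choices, configured per clinic:

```python
def budget_action(spend_usd: float, cap_usd: float,
                  soft_ratio: float = 0.8) -> str:
    """Decide how to handle the next call given month-to-date spend vs. plan cap."""
    if spend_usd >= cap_usd:
        # per-clinic policy: could instead be "block-with-message"
        return "downgrade-to-budget-model"
    if spend_usd >= soft_ratio * cap_usd:
        return "alert-admin"  # surfaced on the admin dashboard
    return "allow"
```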
Integration with Vapi (Phase 4)¶
Vapi supports custom LLM endpoints. Instead of Vapi calling OpenAI directly, point it to your LiteLLM proxy:
# Vapi assistant config (in GitOps YAML frontmatter)
model:
  provider: custom-llm
  url: https://llm-proxy.internal.vitaravox.ca/v1
  model: voice-agent-primary
  headers:
    x-clinic-id: "{{clinicId}}"
    x-call-id: "{{callId}}"
    x-agent-name: "booking-en"
Vapi sends the LLM request to your proxy. Your proxy routes to the appropriate model, tracks cost, handles failover, and returns the response. Vapi never knows or cares which model actually answered.
Configuration Reference¶
Full LiteLLM Config¶
# litellm_config.yaml
model_list:
  # === PRIMARY: GPT-4o (complex tool-calling) ===
  - model_name: "voice-agent-primary"
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 10
      max_retries: 1

  # === PRIMARY FAILOVER: Claude Sonnet ===
  - model_name: "voice-agent-primary"
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: "os.environ/ANTHROPIC_API_KEY"
      timeout: 10

  # === FAST: GPT-4o-mini (simple turns) ===
  - model_name: "voice-agent-fast"
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 8

  # === FAST FAILOVER: Gemini Flash ===
  - model_name: "voice-agent-fast"
    litellm_params:
      model: vertex_ai/gemini-2.5-flash
      vertex_project: "vitaravox-prod"
      vertex_location: "northamerica-northeast1"
      timeout: 8

  # === CHINESE TRACK: GPT-4o (launch) ===
  - model_name: "voice-agent-zh"
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 10

  # === CHINESE TRACK: Qwen3 via vLLM's OpenAI-compatible API (post-bake-off) ===
  - model_name: "voice-agent-zh"
    litellm_params:
      model: openai/qwen3-72b
      api_base: "http://10.0.1.50:8000/v1"
      timeout: 10

  # === BUDGET: GPT-4.1-nano (starter clinics) ===
  - model_name: "voice-agent-budget"
    litellm_params:
      model: openai/gpt-4.1-nano
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 8

  # === CANADIAN DATA RESIDENCY: Azure OpenAI ===
  - model_name: "voice-agent-canada"
    litellm_params:
      model: azure/gpt-4o
      api_base: "os.environ/AZURE_OPENAI_ENDPOINT"
      api_key: "os.environ/AZURE_OPENAI_KEY"
      api_version: "2024-12-01-preview"
      timeout: 10

router_settings:
  routing_strategy: "latency-based-routing"
  num_retries: 2
  retry_after: 0.5
  allowed_fails: 2
  cooldown_time: 60
  fallbacks:
    - voice-agent-primary: ["voice-agent-fast"]
    - voice-agent-zh: ["voice-agent-primary"]

litellm_settings:
  success_callback: ["langfuse"]  # optional: eval tracking
  cache: true  # response caching (Redis-backed)
  cache_params:
    type: "redis"
    host: "os.environ/REDIS_HOST"
    port: 6379

general_settings:
  master_key: "os.environ/LITELLM_MASTER_KEY"
  database_url: "os.environ/DATABASE_URL"
  custom_auth: "custom_auth.auth_handler"  # per-clinic API key validation
Deployment¶
LiteLLM runs as a separate ECS Fargate service alongside the webhook server:
ECS Cluster
├── Service: vitara-admin-api (webhook server)
│ └── Task: 2-10 instances (auto-scaled)
│
├── Service: litellm-proxy (LLM gateway)
│ └── Task: 2 instances (always-on, internal ALB)
│
└── Service: otel-collector (observability)
│ └── Task: 1 instance (standalone collector; a per-task sidecar is an alternative)
Internal routing: webhook server calls http://litellm-proxy.internal:4000/v1/chat/completions. No public exposure. All traffic stays within VPC.