LLM Control Plane¶
Multi-Model Routing, Per-Clinic Cost Tracking, and Failover¶
Date: February 18, 2026¶
Why You Need an LLM Control Plane¶
Today, VitaraVox has zero visibility into LLM costs, no failover if OpenAI goes down, and no ability to test different models. GPT-4o is hardwired into all 9 Vapi assistant configurations. When you onboard 50 clinics, these questions become urgent:
| Question | Current Answer | With LLM Control Plane |
|---|---|---|
| What does Clinic A cost per call? | No idea | $0.14/call (dashboard shows it) |
| Can we use GPT-4o-mini for simple turns? | No (hardwired) | Yes — route by complexity |
| What if OpenAI has a 2-hour outage? | All 50 clinics down | Auto-failover to Claude in <1s |
| Is Claude better than GPT-4o for Chinese? | Can't test | A/B test with 80/20 split |
| Can Clinic B get a cheaper tier? | All clinics share one model | Budget model per pricing plan |
Architecture: LiteLLM Proxy¶
LiteLLM is an open-source LLM gateway that provides a unified OpenAI-compatible API across 100+ providers. Self-hosted in your VPC, the proxy itself (and all its logs and cost data) stays in ca-central-1; where the underlying model traffic goes depends on the provider you route to (see Tier 5 below for Canadian-residency options).
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ Voice Agent Call Flow (with LLM Control Plane) │
│ │
│ ┌──────────┐ │
│ │ Caller │ │
│ └────┬─────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Vapi Platform (Phase 0-4) / LiveKit (Phase 5) │ │
│ │ │ │
│ │ STT: Deepgram / AssemblyAI / Google │ │
│ │ │ │
│ │ LLM Call ──────────────────────────────────────────┐ │ │
│ │ (custom-llm endpoint) │ │ │
│ │ │ │ │
│ │ TTS: ElevenLabs / Azure / Qwen3-TTS │ │ │
│ └──────────────────────────────────────────────────────┼────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ LiteLLM Proxy Server │ │
│ │ (ECS Fargate, ca-central-1) │ │
│ │ │ │
│ │ Request arrives with headers: │ │
│ │ x-clinic-id: "clinic_abc_123" │ │
│ │ x-call-id: "call_xyz_789" │ │
│ │ x-agent-name: "booking-en" │ │
│ │ model: "voice-agent-primary" │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ 1. ROUTING ENGINE │ │ │
│ │ │ │ │ │
│ │ │ Strategy: latency-based (picks fastest healthy provider) │ │ │
│ │ │ │ │ │
│ │ │ Model Group "voice-agent-primary": │ │ │
│ │ │ ├── openai/gpt-4o (weight: 0.8, priority: 1) │ │ │
│ │ │ ├── anthropic/claude-sonnet (weight: 0.2, priority: 2) │ │ │
│ │ │ └── google/gemini-2.5-flash (priority: 3, fallback) │ │ │
│ │ │ │ │ │
│ │ │ Model Group "voice-agent-fast": │ │ │
│ │ │ ├── openai/gpt-4o-mini (priority: 1) │ │ │
│ │ │ └── google/gemini-2.5-flash (priority: 2, fallback) │ │ │
│ │ │ │ │ │
│ │ │ Model Group "voice-agent-zh": │ │ │
│ │ │ ├── openai/gpt-4o (priority: 1) │ │ │
│ │ │ └── qwen3-72b (self-hosted) (priority: 2) │ │ │
│ │ │ │ │ │
│ │ │ Model Group "voice-agent-budget": │ │ │
│ │ │ └── openai/gpt-4.1-nano (priority: 1) │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ 2. COST TRACKER │ │ │
│ │ │ │ │ │
│ │ │ Every request logged: │ │ │
│ │ │ { │ │ │
│ │ │ "clinic_id": "clinic_abc_123", │ │ │
│ │ │ "call_id": "call_xyz_789", │ │ │
│ │ │ "model": "gpt-4o", │ │ │
│ │ │ "input_tokens": 1847, │ │ │
│ │ │ "output_tokens": 312, │ │ │
│ │ │ "cost_usd": 0.0153, │ │ │
│ │ │ "latency_ms": 423, │ │ │
│ │ │ "agent_name": "booking-en" │ │ │
│ │ │ } │ │ │
│ │ │ │ │ │
│ │ │ Aggregations available: │ │ │
│ │ │ GET /spend/tags?tag=clinic_id:abc → monthly spend │ │ │
│ │ │ GET /spend/tags?tag=model:gpt-4o → model breakdown │ │ │
│ │ │ GET /spend/tags?tag=agent:booking-en → per-agent cost │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ 3. FAILOVER ENGINE │ │ │
│ │ │ │ │ │
│ │ │ If primary model fails (timeout, 5xx, rate limit): │ │ │
│ │ │ 1. Retry same model once (0.5s delay) │ │ │
│ │ │ 2. If still failing, route to next priority in group │ │ │
│ │ │ 3. If all models in group fail, return error │ │ │
│ │ │ │ │ │
│ │ │ Health tracking: │ │ │
│ │ │ - Tracks success/failure per model over sliding window │ │ │
│ │ │ - Models with >50% failure rate deprioritized │ │ │
│ │ │ - Automatic recovery when model starts succeeding │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ 4. BUDGET ENFORCEMENT │ │ │
│ │ │ │ │ │
│ │ │ Per-clinic budget limits: │ │ │
│ │ │ clinic_abc_123: max $200/month (Standard plan) │ │ │
│ │ │ clinic_def_456: max $500/month (Enterprise plan) │ │ │
│ │ │ clinic_ghi_789: max $50/month (Starter plan) │ │ │
│ │ │ │ │ │
│ │ │ When budget reached: │ │ │
│ │ │ - Alert admin dashboard │ │ │
│ │ │ - Optionally: downgrade to budget model │ │ │
│ │ │ - Optionally: block new calls with message │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Routes to: │
│ ┌───────────┬───────────┬───────────┬───────────┬──────────┐ │
│ │ OpenAI │ Anthropic │ Google │Azure AOAI │Self-Host │ │
│ │ │ │ Vertex │(CA East) │(Qwen3) │ │
│ │ gpt-4o │ claude- │ gemini- │ gpt-4o │ qwen3 │ │
│ │ gpt-4o- │ sonnet │ 2.5-flash│ (CA data │ -72b │ │
│ │ mini │ claude- │ gemini- │ residency│ │ │
│ │ gpt-4.1- │ haiku │ 2.5-pro │ option) │ │ │
│ │ nano │ │ │ │ │ │
│ └───────────┴───────────┴───────────┴───────────┴──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
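The failover behaviour in box 3 can be sketched as a small loop: retry the failing model once after a short delay, then walk down the priority list. This is a simplified model of the behaviour, not LiteLLM's internal code; `call_model` and the group structure are illustrative:

```python
import time

def complete_with_failover(group, call_model, retry_delay=0.5):
    """Try each deployment in priority order; retry each failure once before
    moving to the next priority. Raises only if the whole group fails."""
    last_err = None
    for model in sorted(group, key=lambda m: m["priority"]):
        for attempt in range(2):  # original attempt + one retry
            try:
                return call_model(model["name"])
            except Exception as err:  # timeout, 5xx, rate limit, ...
                last_err = err
                if attempt == 0:
                    time.sleep(retry_delay)  # brief pause before the retry
    raise RuntimeError(f"all models in group failed: {last_err}")
```

In the real proxy, `num_retries`, `retry_after`, and `fallbacks` in `router_settings` express the same policy declaratively.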
Multi-Model Routing Strategies¶
Strategy 1: Latency-Based (Default for Voice)¶
Picks the model with the lowest current latency from the healthy pool. Critical for voice agents where every 100ms matters.
router_settings:
  routing_strategy: "latency-based-routing"
  num_retries: 2
  retry_after: 0.5
  allowed_fails: 2
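Conceptually, latency-based routing keeps a rolling latency sample per deployment and picks the current minimum. A toy sketch of that idea (not LiteLLM internals):

```python
from collections import defaultdict, deque

class LatencyRouter:
    """Pick the deployment with the lowest recent average latency."""

    def __init__(self, window=20):
        # rolling window of latency samples per deployment
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, model, latency_ms):
        self.samples[model].append(latency_ms)

    def pick(self, candidates):
        def avg(m):
            s = self.samples[m]
            return sum(s) / len(s) if s else 0.0  # unseen models get first shot
        return min(candidates, key=avg)
```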
Strategy 2: Weighted (A/B Testing)¶
Split traffic by percentage. Use for canary deployments of new models.
router_settings:
  routing_strategy: "simple-shuffle"  # weight-proportional random pick
model_list:
  - model_name: "voice-agent-primary"
    litellm_params:
      model: openai/gpt-4o
      weight: 8   # ~80% of traffic
  - model_name: "voice-agent-primary"
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      weight: 2   # ~20% canary
Strategy 3: Cost-Based (Budget Clinics)¶
Route to cheapest model that meets quality threshold. Use for starter-plan clinics.
# Budget routing: gpt-4.1-nano first, escalate if needed
model_list:
  - model_name: "voice-agent-budget"
    litellm_params:
      model: openai/gpt-4.1-nano  # $0.10/$0.40 per 1M tokens
  - model_name: "voice-agent-budget"
    litellm_params:
      model: openai/gpt-4o-mini   # fallback if nano fails
Strategy 4: Complexity-Based (Smart Routing)¶
Route simple turns (greetings, confirmations) to fast/cheap models. Route complex turns (multi-constraint booking, rescheduling) to premium models.
Caller: "Yes, that works!"
→ voice-agent-fast (gpt-4o-mini, ~200ms, $0.0002)
Caller: "I need to see Dr. Chen next Thursday afternoon but not before 2pm
and can you also cancel my Monday appointment?"
→ voice-agent-primary (gpt-4o, ~400ms, $0.015)
Implementation: Vapi's custom LLM endpoint forwards the full prompt (including the system prompt), so LiteLLM can route on prompt length, or on a custom x-complexity: simple|complex header set by a lightweight classifier.
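A lightweight classifier for the x-complexity header can start as a heuristic over the latest user turn. The keyword list and word threshold below are illustrative, not tuned; in practice you would calibrate them against real call transcripts:

```python
import re

# Illustrative markers of multi-constraint requests; tune against real transcripts.
COMPLEX_MARKERS = re.compile(
    r"\b(but|except|reschedul\w*|cancel|instead|before|after|also|and can you)\b",
    re.IGNORECASE,
)

def classify_turn(user_text: str, word_threshold: int = 12) -> str:
    """Return 'complex' or 'simple' for the x-complexity header."""
    if len(user_text.split()) > word_threshold or COMPLEX_MARKERS.search(user_text):
        return "complex"
    return "simple"
```

Misclassifying "simple" as "complex" only costs a few cents, while the reverse risks a failed tool call, so a conservative threshold is the safer default.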
Available Models (Yes, You Can Use Any of These)¶
LiteLLM supports 100+ LLM providers through a single API. Here are the ones relevant to healthcare voice agents:
Tier 1: Premium (Complex Tool-Calling)¶
| Model | Provider | Latency (TTFT) | Cost (1M tokens) | Best For |
|---|---|---|---|---|
| GPT-4o | OpenAI | ~350-700ms | $2.50 / $10.00 | English + Chinese tool-calling |
| Claude Sonnet 4.6 | Anthropic | ~300-600ms | $3.00 / $15.00 | Complex instructions, safety |
| Gemini 2.5 Pro | Google | ~400-800ms | $1.25 / $10.00 | Multilingual, long context |
| Claude Opus 4.6 | Anthropic | ~500-1000ms | $15.00 / $75.00 | Highest quality (overkill for voice) |
Tier 2: Fast (Simple Turns, Confirmations)¶
| Model | Provider | Latency (TTFT) | Cost (1M tokens) | Best For |
|---|---|---|---|---|
| GPT-4o-mini | OpenAI | ~200-400ms | $0.15 / $0.60 | Fast responses, simple logic |
| Gemini 2.5 Flash | Google | ~200-400ms | $0.15 / $0.60 | Multilingual fast tier |
| Claude Haiku 4.5 | Anthropic | ~200-400ms | $0.80 / $4.00 | Safety-focused fast tier |
Tier 3: Budget (Starter Plan Clinics)¶
| Model | Provider | Latency | Cost (1M tokens) | Best For |
|---|---|---|---|---|
| GPT-4.1-nano | OpenAI | ~150-300ms | $0.10 / $0.40 | Lowest cost, basic tool-calling |
Tier 4: Specialized (Language-Specific)¶
| Model | Provider | Latency | Cost | Best For |
|---|---|---|---|---|
| Qwen3-72B | Self-hosted (vLLM) | ~200-400ms | Infra only | Chinese (native), 119 languages |
| Mistral Large 2 | Mistral | ~300-500ms | $2.00 / $6.00 | French (native French company) |
| DeepSeek V3.1 | DeepSeek | ~300-500ms | Very low | Chinese (strong but tool-calling unstable) |
Tier 5: Canadian Data Residency¶
| Model | Provider | Region | Notes |
|---|---|---|---|
| GPT-4o | Azure OpenAI | Canada East | Same model, Canadian data residency |
| GPT-4o-mini | Azure OpenAI | Canada East | Same model, Canadian data residency |
| Claude Sonnet | AWS Bedrock | ca-central-1 | Anthropic via Bedrock in Canada |
| Qwen3-72B | Self-hosted | ca-central-1 | Full control, GPU instance required |
Canadian Data Residency
For PHIPA/PIPEDA compliance, Azure OpenAI (Canada East) and AWS Bedrock (ca-central-1) keep all data in Canada. Direct OpenAI API routes through US servers. At enterprise scale, route through Azure OpenAI or Bedrock for Canadian clinics that require data residency.
Per-Clinic Cost Tracking¶
How It Works¶
Every LLM request is tagged with metadata headers. LiteLLM tracks cost per tag automatically.
Vapi/LiveKit → POST /v1/chat/completions
Headers:
  x-clinic-id: clinic_abc_123
  x-call-id: call_xyz_789
  x-agent-name: booking-en

LiteLLM logs:
{
  "clinic_id": "clinic_abc_123",
  "call_id": "call_xyz_789",
  "agent_name": "booking-en",
  "model": "gpt-4o",
  "input_tokens": 1847,
  "output_tokens": 312,
  "cost_usd": 0.0153,
  "latency_ms": 423,
  "timestamp": "2026-02-18T14:32:00Z"
}
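Because the proxy speaks the OpenAI API, any OpenAI-compatible client can attach the tagging headers. A sketch of the request construction (the proxy URL and virtual key are placeholders):

```python
PROXY_URL = "https://llm-proxy.internal.vitaravox.ca/v1/chat/completions"  # placeholder

def tagged_completion_request(clinic_id, call_id, agent_name, messages,
                              model="voice-agent-primary"):
    """Build kwargs for a cost-tagged chat completion through the LiteLLM proxy."""
    return {
        "url": PROXY_URL,
        "headers": {
            "Authorization": "Bearer <per-clinic virtual key>",  # issued by the proxy
            "x-clinic-id": clinic_id,     # drives per-clinic spend tracking
            "x-call-id": call_id,         # drives per-call cost queries
            "x-agent-name": agent_name,   # drives per-agent breakdowns
        },
        "json": {"model": model, "messages": messages},  # model = group name
    }

# When deployed: requests.post(**tagged_completion_request(...))
```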
Cost Queries¶
# Total spend for a clinic this month
GET /spend/tags?tag=clinic_id:clinic_abc_123&start_date=2026-02-01
→ { "total_spend": 42.17 }
# Breakdown by model
GET /spend/tags?tag=clinic_id:clinic_abc_123&group_by=model
→ { "gpt-4o": 38.50, "gpt-4o-mini": 3.67 }
# Breakdown by agent
GET /spend/tags?tag=clinic_id:clinic_abc_123&group_by=agent_name
→ { "booking-en": 18.20, "modification-en": 12.30, "patient-id-en": 8.00, ... }
# Cost per call
GET /spend/tags?tag=call_id:call_xyz_789
→ { "total_spend": 0.14, "turns": 8, "model": "gpt-4o" }
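If you mirror the proxy's request logs into your own store, the same aggregations reduce to a sum over any tag in the log schema. A minimal sketch (the record shape follows the log example above):

```python
from collections import defaultdict

def spend_by(records, key):
    """Aggregate cost_usd over any tag in the log schema
    (clinic_id, model, agent_name, call_id, ...)."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["cost_usd"]
    return dict(totals)
```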
Typical Cost Per Call¶
| Call Type | Turns | Model | Estimated Cost |
|---|---|---|---|
| Simple booking (confirm slot) | 4-6 | GPT-4o | $0.08-0.12 |
| Complex booking (multiple attempts) | 8-12 | GPT-4o | $0.15-0.25 |
| Reschedule | 6-8 | GPT-4o | $0.10-0.18 |
| Registration (new patient) | 10-15 | GPT-4o | $0.20-0.35 |
| Simple booking (budget) | 4-6 | GPT-4o-mini | $0.01-0.02 |
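The per-turn arithmetic behind these estimates is cost = input_tokens × input_price + output_tokens × output_price, per 1M tokens. A helper using the list prices from the tier tables above (note that actual per-call cost is higher than turns × one-turn cost, since each turn's input grows with the conversation history):

```python
# (input, output) USD per 1M tokens, from the tier tables above
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one LLM turn at list prices (no caching discounts)."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# turn_cost("gpt-4o-mini", 1500, 100) -> 0.000285
```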
SaaS Pricing Implications¶
With per-clinic cost visibility, VitaraVox can offer tiered pricing:
| Plan | LLM Model | Budget Cap | Price |
|---|---|---|---|
| Starter | GPT-4o-mini | $50/month | $99/month |
| Standard | GPT-4o | $200/month | $299/month |
| Enterprise | GPT-4o + Claude failover | $500/month | $599/month |
| Premium | GPT-4o + Canadian data residency | $800/month | $999/month |
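The budget-enforcement options from the architecture diagram map onto a small decision function; the soft-alert threshold and the action taken at the cap are illustrative policy choices, configured per clinic:

```python
def budget_action(spend_usd: float, cap_usd: float,
                  soft_ratio: float = 0.8) -> str:
    """Decide how to handle the next call given month-to-date spend vs. plan cap."""
    if spend_usd >= cap_usd:
        # per-clinic policy: could instead be "block-with-message"
        return "downgrade-to-budget-model"
    if spend_usd >= soft_ratio * cap_usd:
        return "alert-admin"  # surfaced on the admin dashboard
    return "allow"
```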
Integration with Vapi (Phase 4)¶
Vapi supports custom LLM endpoints. Instead of Vapi calling OpenAI directly, point it to your LiteLLM proxy:
# Vapi assistant config (in GitOps YAML frontmatter)
model:
  provider: custom-llm
  url: https://llm-proxy.internal.vitaravox.ca/v1
  model: voice-agent-primary
  headers:
    x-clinic-id: "{{clinicId}}"
    x-call-id: "{{callId}}"
    x-agent-name: "booking-en"
Vapi sends the LLM request to your proxy. Your proxy routes to the appropriate model, tracks cost, handles failover, and returns the response. Vapi never knows or cares which model actually answered.
Configuration Reference¶
Full LiteLLM Config¶
# litellm_config.yaml
model_list:
  # === PRIMARY: GPT-4o (complex tool-calling) ===
  - model_name: "voice-agent-primary"
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 10
      max_retries: 1

  # === PRIMARY FAILOVER: Claude Sonnet ===
  - model_name: "voice-agent-primary"
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: "os.environ/ANTHROPIC_API_KEY"
      timeout: 10

  # === FAST: GPT-4o-mini (simple turns) ===
  - model_name: "voice-agent-fast"
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 8

  # === FAST FAILOVER: Gemini Flash ===
  - model_name: "voice-agent-fast"
    litellm_params:
      model: vertex_ai/gemini-2.5-flash
      vertex_project: "vitaravox-prod"
      vertex_location: "northamerica-northeast1"
      timeout: 8

  # === CHINESE TRACK: GPT-4o (launch) ===
  - model_name: "voice-agent-zh"
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 10

  # === CHINESE TRACK: Qwen3 via vLLM's OpenAI-compatible API (post-bake-off) ===
  - model_name: "voice-agent-zh"
    litellm_params:
      model: openai/qwen3-72b
      api_base: "http://10.0.1.50:8000/v1"
      timeout: 10

  # === BUDGET: GPT-4.1-nano (starter clinics) ===
  - model_name: "voice-agent-budget"
    litellm_params:
      model: openai/gpt-4.1-nano
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 8

  # === CANADIAN DATA RESIDENCY: Azure OpenAI ===
  - model_name: "voice-agent-canada"
    litellm_params:
      model: azure/gpt-4o
      api_base: "os.environ/AZURE_OPENAI_ENDPOINT"
      api_key: "os.environ/AZURE_OPENAI_KEY"
      api_version: "2024-12-01-preview"
      timeout: 10

router_settings:
  routing_strategy: "latency-based-routing"
  num_retries: 2
  retry_after: 0.5
  allowed_fails: 2
  cooldown_time: 60
  fallbacks:
    - voice-agent-primary: ["voice-agent-fast"]
    - voice-agent-zh: ["voice-agent-primary"]

litellm_settings:
  success_callback: ["langfuse"]  # optional: eval tracking
  cache: true  # response caching (Redis-backed)
  cache_params:
    type: "redis"
    host: "os.environ/REDIS_HOST"
    port: 6379

general_settings:
  master_key: "os.environ/LITELLM_MASTER_KEY"
  database_url: "os.environ/DATABASE_URL"
  custom_auth: "custom_auth.auth_handler"  # per-clinic API key validation
Deployment¶
LiteLLM runs as a separate ECS Fargate service alongside the webhook server:
ECS Cluster
├── Service: vitara-admin-api (webhook server)
│ └── Task: 2-10 instances (auto-scaled)
│
├── Service: litellm-proxy (LLM gateway)
│ └── Task: 2 instances (always-on, internal ALB)
│
└── Service: otel-collector (observability)
│ └── Task: 1 instance (standalone collector; a per-task sidecar is an alternative)
Internal routing: webhook server calls http://litellm-proxy.internal:4000/v1/chat/completions. No public exposure. All traffic stays within VPC.