LLM Control Plane

Multi-Model Routing, Per-Clinic Cost Tracking, and Failover

Date: February 18, 2026


Why You Need an LLM Control Plane

Today, VitaraVox has zero visibility into LLM costs, no failover if OpenAI goes down, and no ability to test different models. GPT-4o is hardwired into all 9 Vapi assistant configurations. When you onboard 50 clinics, these questions become urgent:

Question                                  | Current Answer              | With LLM Control Plane
What does Clinic A cost per call?         | No idea                     | $0.14/call (dashboard shows it)
Can we use GPT-4o-mini for simple turns?  | No (hardwired)              | Yes — route by complexity
What if OpenAI has a 2-hour outage?       | All 50 clinics down         | Auto-failover to Claude in <1s
Is Claude better than GPT-4o for Chinese? | Can't test                  | A/B test with 80/20 split
Can Clinic B get a cheaper tier?          | All clinics share one model | Budget model per pricing plan

Architecture: LiteLLM Proxy

LiteLLM is an open-source LLM gateway that exposes a unified OpenAI-compatible API across 100+ providers. Self-hosted in your VPC, it keeps all data in Canada.

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│   Voice Agent Call Flow (with LLM Control Plane)                        │
│                                                                         │
│   ┌──────────┐                                                          │
│   │  Caller  │                                                          │
│   └────┬─────┘                                                          │
│        │                                                                │
│        ▼                                                                │
│   ┌──────────────────────────────────────────────────────────┐          │
│   │  Vapi Platform (Phase 0-4) / LiveKit (Phase 5)           │          │
│   │                                                          │          │
│   │  STT: Deepgram / AssemblyAI / Google                     │          │
│   │                                                          │          │
│   │  LLM Call ──────────────────────────────────────────┐    │          │
│   │  (custom-llm endpoint)                              │    │          │
│   │                                                     │    │          │
│   │  TTS: ElevenLabs / Azure / Qwen3-TTS               │    │          │
│   └──────────────────────────────────────────────────────┼────┘          │
│                                                          │              │
│                                                          ▼              │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                     LiteLLM Proxy Server                         │  │
│   │                (ECS Fargate, ca-central-1)                       │  │
│   │                                                                  │  │
│   │  Request arrives with headers:                                   │  │
│   │    x-clinic-id: "clinic_abc_123"                                 │  │
│   │    x-call-id: "call_xyz_789"                                     │  │
│   │    x-agent-name: "booking-en"                                    │  │
│   │    model: "voice-agent-primary"                                  │  │
│   │                                                                  │  │
│   │  ┌────────────────────────────────────────────────────────────┐  │  │
│   │  │  1. ROUTING ENGINE                                         │  │  │
│   │  │                                                            │  │  │
│   │  │  Strategy: latency-based (picks fastest healthy provider)  │  │  │
│   │  │                                                            │  │  │
│   │  │  Model Group "voice-agent-primary":                        │  │  │
│   │  │    ├── openai/gpt-4o         (weight: 0.8, priority: 1)   │  │  │
│   │  │    ├── anthropic/claude-sonnet (weight: 0.2, priority: 2)  │  │  │
│   │  │    └── google/gemini-2.5-flash (priority: 3, fallback)    │  │  │
│   │  │                                                            │  │  │
│   │  │  Model Group "voice-agent-fast":                           │  │  │
│   │  │    ├── openai/gpt-4o-mini    (priority: 1)                │  │  │
│   │  │    └── google/gemini-2.5-flash (priority: 2, fallback)    │  │  │
│   │  │                                                            │  │  │
│   │  │  Model Group "voice-agent-zh":                             │  │  │
│   │  │    ├── openai/gpt-4o         (priority: 1)                │  │  │
│   │  │    └── qwen3-72b (self-hosted) (priority: 2)              │  │  │
│   │  │                                                            │  │  │
│   │  │  Model Group "voice-agent-budget":                         │  │  │
│   │  │    └── openai/gpt-4.1-nano   (priority: 1)                │  │  │
│   │  └────────────────────────────────────────────────────────────┘  │  │
│   │                                                                  │  │
│   │  ┌────────────────────────────────────────────────────────────┐  │  │
│   │  │  2. COST TRACKER                                           │  │  │
│   │  │                                                            │  │  │
│   │  │  Every request logged:                                     │  │  │
│   │  │    {                                                       │  │  │
│   │  │      "clinic_id": "clinic_abc_123",                        │  │  │
│   │  │      "call_id": "call_xyz_789",                            │  │  │
│   │  │      "model": "gpt-4o",                                    │  │  │
│   │  │      "input_tokens": 1847,                                 │  │  │
│   │  │      "output_tokens": 312,                                 │  │  │
│   │  │      "cost_usd": 0.0153,                                   │  │  │
│   │  │      "latency_ms": 423,                                    │  │  │
│   │  │      "agent_name": "booking-en"                             │  │  │
│   │  │    }                                                       │  │  │
│   │  │                                                            │  │  │
│   │  │  Aggregations available:                                   │  │  │
│   │  │    GET /spend/tags?tag=clinic_id:abc → monthly spend       │  │  │
│   │  │    GET /spend/tags?tag=model:gpt-4o → model breakdown      │  │  │
│   │  │    GET /spend/tags?tag=agent:booking-en → per-agent cost   │  │  │
│   │  └────────────────────────────────────────────────────────────┘  │  │
│   │                                                                  │  │
│   │  ┌────────────────────────────────────────────────────────────┐  │  │
│   │  │  3. FAILOVER ENGINE                                        │  │  │
│   │  │                                                            │  │  │
│   │  │  If primary model fails (timeout, 5xx, rate limit):        │  │  │
│   │  │    1. Retry same model once (0.5s delay)                   │  │  │
│   │  │    2. If still failing, route to next priority in group    │  │  │
│   │  │    3. If all models in group fail, return error            │  │  │
│   │  │                                                            │  │  │
│   │  │  Health tracking:                                          │  │  │
│   │  │    - Tracks success/failure per model over sliding window  │  │  │
│   │  │    - Models with >50% failure rate deprioritized           │  │  │
│   │  │    - Automatic recovery when model starts succeeding       │  │  │
│   │  └────────────────────────────────────────────────────────────┘  │  │
│   │                                                                  │  │
│   │  ┌────────────────────────────────────────────────────────────┐  │  │
│   │  │  4. BUDGET ENFORCEMENT                                     │  │  │
│   │  │                                                            │  │  │
│   │  │  Per-clinic budget limits:                                 │  │  │
│   │  │    clinic_abc_123: max $200/month (Standard plan)          │  │  │
│   │  │    clinic_def_456: max $500/month (Enterprise plan)        │  │  │
│   │  │    clinic_ghi_789: max $50/month  (Starter plan)           │  │  │
│   │  │                                                            │  │  │
│   │  │  When budget reached:                                      │  │  │
│   │  │    - Alert admin dashboard                                 │  │  │
│   │  │    - Optionally: downgrade to budget model                 │  │  │
│   │  │    - Optionally: block new calls with message              │  │  │
│   │  └────────────────────────────────────────────────────────────┘  │  │
│   └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│        Routes to:                                                       │
│        ┌───────────┬───────────┬───────────┬───────────┬──────────┐    │
│        │  OpenAI   │ Anthropic │  Google   │Azure AOAI │Self-Host │    │
│        │           │           │  Vertex   │(CA East)  │(Qwen3)  │    │
│        │  gpt-4o   │  claude-  │  gemini-  │  gpt-4o   │  qwen3  │    │
│        │  gpt-4o-  │  sonnet   │  2.5-flash│  (CA data │  -72b   │    │
│        │  mini     │  claude-  │  gemini-  │  residency│         │    │
│        │  gpt-4.1- │  haiku    │  2.5-pro  │  option)  │         │    │
│        │  nano     │           │           │           │         │    │
│        └───────────┴───────────┴───────────┴───────────┴──────────┘    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Multi-Model Routing Strategies

Strategy 1: Latency-Based (Default for Voice)

Picks the model with the lowest current latency from the healthy pool. Critical for voice agents where every 100ms matters.

router_settings:
  routing_strategy: "latency-based-routing"
  num_retries: 2
  retry_after: 0.5
  allowed_fails: 2

Strategy 2: Weighted (A/B Testing)

Split traffic by percentage. Use for canary deployments of new models.

router_settings:
  routing_strategy: "weighted"
  model_group_weights:
    voice-agent-primary:
      openai/gpt-4o: 0.8
      anthropic/claude-sonnet-4-6: 0.2

Strategy 3: Cost-Based (Budget Clinics)

Route to cheapest model that meets quality threshold. Use for starter-plan clinics.

# Budget routing: gpt-4.1-nano first, escalate if needed
model_list:
  - model_name: "voice-agent-budget"
    litellm_params:
      model: openai/gpt-4.1-nano     # $0.10/$0.40 per 1M tokens
  - model_name: "voice-agent-budget"
    litellm_params:
      model: openai/gpt-4o-mini       # fallback if nano fails
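The escalation this config encodes can be sketched as a retry-then-fallback loop. This is a hypothetical stand-in for LiteLLM's router, with fake provider callables rather than real API clients:

```python
import time

# Sketch of the retry-then-fallback behavior described in this document:
# one retry per model, then fall through to the next priority in the group.
def complete(providers, prompt, retry_delay=0.0):
    """Try each (name, callable) in priority order; retry once before moving on."""
    errors = []
    for name, call in providers:
        for attempt in range(2):          # 1 try + 1 retry per model
            try:
                return name, call(prompt)
            except Exception as exc:      # timeout, 5xx, rate limit, ...
                errors.append((name, attempt, str(exc)))
                time.sleep(retry_delay)
    raise RuntimeError(f"all models in group failed: {errors}")

# Usage: nano times out, so the chain falls through to gpt-4o-mini.
def nano(_prompt):
    raise TimeoutError("upstream timeout")

def mini(prompt):
    return f"ok: {prompt}"

model, reply = complete([("gpt-4.1-nano", nano), ("gpt-4o-mini", mini)], "hi")
# model == "gpt-4o-mini"
```

In production the delay would be the 0.5s `retry_after` from the router settings; it is zeroed here only to keep the sketch fast.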

Strategy 4: Complexity-Based (Smart Routing)

Route simple turns (greetings, confirmations) to fast/cheap models. Route complex turns (multi-constraint booking, rescheduling) to premium models.

Caller: "Yes, that works!"
  → voice-agent-fast (gpt-4o-mini, ~200ms, $0.0002)

Caller: "I need to see Dr. Chen next Thursday afternoon but not before 2pm
         and can you also cancel my Monday appointment?"
  → voice-agent-primary (gpt-4o, ~400ms, $0.015)

Implementation: the Vapi custom-LLM request includes the full system prompt and conversation history, so LiteLLM can route on prompt length, or on a custom x-complexity: simple|complex header set by a lightweight classifier that runs before the LLM call.
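A minimal version of such a classifier might look like this. The keyword list and length threshold are illustrative assumptions, not tuned values:

```python
import re

# Hypothetical pre-LLM complexity classifier: multi-constraint turns
# (rescheduling, cancellations, compound requests) go to the premium group.
COMPLEX_MARKERS = re.compile(
    r"\b(reschedul\w*|cancel|instead|but not|also|change|move|between)\b",
    re.IGNORECASE,
)

def classify_turn(user_text: str) -> str:
    """Return 'complex' for long or multi-constraint turns, else 'simple'."""
    if len(user_text.split()) > 15 or COMPLEX_MARKERS.search(user_text):
        return "complex"
    return "simple"

def model_group_for(user_text: str) -> str:
    return {"simple": "voice-agent-fast",
            "complex": "voice-agent-primary"}[classify_turn(user_text)]

# "Yes, that works!"                        -> voice-agent-fast
# the multi-constraint booking turn above   -> voice-agent-primary
```

A regex pass adds microseconds, so it never touches the voice latency budget; the classifier's output becomes the x-complexity header on the proxy request.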


Available Models (Yes, You Can Use Any of These)

LiteLLM supports 100+ LLM providers through a single API. Here are the ones relevant to healthcare voice agents:

Tier 1: Premium (Complex Tool-Calling)

Model             | Provider  | Latency (TTFT) | Cost (1M tokens) | Best For
GPT-4o            | OpenAI    | ~350-700ms     | $2.50 / $10.00   | English + Chinese tool-calling
Claude Sonnet 4.6 | Anthropic | ~300-600ms     | $3.00 / $15.00   | Complex instructions, safety
Gemini 2.5 Pro    | Google    | ~400-800ms     | $1.25 / $10.00   | Multilingual, long context
Claude Opus 4.6   | Anthropic | ~500-1000ms    | $15.00 / $75.00  | Highest quality (overkill for voice)

Tier 2: Fast (Simple Turns, Confirmations)

Model            | Provider  | Latency (TTFT) | Cost (1M tokens) | Best For
GPT-4o-mini      | OpenAI    | ~200-400ms     | $0.15 / $0.60    | Fast responses, simple logic
Gemini 2.5 Flash | Google    | ~200-400ms     | $0.15 / $0.60    | Multilingual fast tier
Claude Haiku 4.5 | Anthropic | ~200-400ms     | $0.80 / $4.00    | Safety-focused fast tier

Tier 3: Budget (Starter Plan Clinics)

Model        | Provider | Latency    | Cost (1M tokens) | Best For
GPT-4.1-nano | OpenAI   | ~150-300ms | $0.10 / $0.40    | Lowest cost, basic tool-calling

Tier 4: Specialized (Language-Specific)

Model           | Provider           | Latency    | Cost          | Best For
Qwen3-72B       | Self-hosted (vLLM) | ~200-400ms | Infra only    | Chinese (native), 119 languages
Mistral Large 2 | Mistral            | ~300-500ms | $2.00 / $6.00 | French (native French company)
DeepSeek V3.1   | DeepSeek           | ~300-500ms | Very low      | Chinese (strong but tool-calling unstable)

Tier 5: Canadian Data Residency

Model         | Provider     | Region       | Notes
GPT-4o        | Azure OpenAI | Canada East  | Same model, Canadian data residency
GPT-4o-mini   | Azure OpenAI | Canada East  | Same model, Canadian data residency
Claude Sonnet | AWS Bedrock  | ca-central-1 | Anthropic via Bedrock in Canada
Qwen3-72B     | Self-hosted  | ca-central-1 | Full control, GPU instance required

Canadian Data Residency

For PHIPA/PIPEDA compliance, Azure OpenAI (Canada East) and AWS Bedrock (ca-central-1) keep all data in Canada. Direct OpenAI API routes through US servers. At enterprise scale, route through Azure OpenAI or Bedrock for Canadian clinics that require data residency.


Per-Clinic Cost Tracking

How It Works

Every LLM request is tagged with metadata headers. LiteLLM tracks cost per tag automatically.

Vapi/LiveKit → POST /v1/chat/completions
Headers:
  x-clinic-id: clinic_abc_123
  x-call-id: call_xyz_789
  x-agent-name: booking-en

LiteLLM logs:
  {
    "clinic_id": "clinic_abc_123",
    "call_id": "call_xyz_789",
    "agent_name": "booking-en",
    "model": "gpt-4o",
    "input_tokens": 1847,
    "output_tokens": 312,
    "cost_usd": 0.0153,
    "latency_ms": 423,
    "timestamp": "2026-02-18T14:32:00Z"
  }

Cost Queries

# Total spend for a clinic this month
GET /spend/tags?tag=clinic_id:clinic_abc_123&start_date=2026-02-01
  → { "total_spend": 42.17 }

# Breakdown by model
GET /spend/tags?tag=clinic_id:clinic_abc_123&group_by=model
  → { "gpt-4o": 38.50, "gpt-4o-mini": 3.67 }

# Breakdown by agent
GET /spend/tags?tag=clinic_id:clinic_abc_123&group_by=agent_name
  → { "booking-en": 18.20, "modification-en": 12.30, "patient-id-en": 8.00, ... }

# Cost per call
GET /spend/tags?tag=call_id:call_xyz_789
  → { "total_spend": 0.14, "turns": 8, "model": "gpt-4o" }

Typical Cost Per Call

Call Type                           | Turns | Model       | Estimated Cost
Simple booking (confirm slot)       | 4-6   | GPT-4o      | $0.08-0.12
Complex booking (multiple attempts) | 8-12  | GPT-4o      | $0.15-0.25
Reschedule                          | 6-8   | GPT-4o      | $0.10-0.18
Registration (new patient)          | 10-15 | GPT-4o      | $0.20-0.35
Simple booking (budget)             | 4-6   | GPT-4o-mini | $0.01-0.02
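These estimates fall out of simple arithmetic: the input context grows every turn as conversation history accumulates, while spoken replies stay short. A back-of-envelope model (the per-turn token counts are illustrative assumptions for a tool-heavy voice prompt; prices are the per-1M-token rates from the model tables above):

```python
# Rough per-call cost model. Token counts are assumptions, not measurements:
# a large system/tool prompt (~5k tokens) plus growing history each turn.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}  # $/1M in, out

def estimate_call_cost(model: str, turns: int, base_prompt_tokens: int = 5000,
                       reply_tokens: int = 300, growth_per_turn: int = 800):
    """Sum input + output cost over a call; context grows, replies stay short."""
    in_price, out_price = PRICES[model]
    total = 0.0
    for turn in range(turns):
        input_tokens = base_prompt_tokens + turn * growth_per_turn
        total += input_tokens * in_price / 1e6 + reply_tokens * out_price / 1e6
    return round(total, 4)

# estimate_call_cost("gpt-4o", 5) comes out near $0.10, inside the
# simple-booking range above; the mini tier is an order of magnitude cheaper.
```

The dominant term is the repeated system prompt, which is also why the semantic cache and the fast tier for short turns move the per-call number so much.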

SaaS Pricing Implications

With per-clinic cost visibility, VitaraVox can offer tiered pricing:

Plan       | LLM Model                        | Budget Cap | Price
Starter    | GPT-4o-mini                      | $50/month  | $99/month
Standard   | GPT-4o                           | $200/month | $299/month
Enterprise | GPT-4o + Claude failover         | $500/month | $599/month
Premium    | GPT-4o + Canadian data residency | $800/month | $999/month

Integration with Vapi (Phase 4)

Vapi supports custom LLM endpoints. Instead of Vapi calling OpenAI directly, point it to your LiteLLM proxy:

# Vapi assistant config (in GitOps YAML frontmatter)
model:
  provider: custom-llm
  url: https://llm-proxy.internal.vitaravox.ca/v1
  model: voice-agent-primary
  headers:
    x-clinic-id: "{{clinicId}}"
    x-call-id: "{{callId}}"
    x-agent-name: "booking-en"

Vapi sends the LLM request to your proxy. Your proxy routes to the appropriate model, tracks cost, handles failover, and returns the response. Vapi never knows or cares which model actually answered.
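Concretely, what reaches the proxy is a standard OpenAI-style chat completion plus the per-clinic headers from the config above. This sketch builds the request without sending it; the bearer key is a placeholder for a clinic-scoped LiteLLM key:

```python
import json
import urllib.request

# Builds (does not send) the request Vapi would issue to the LiteLLM proxy.
# URL matches this document's config; the API key value is a placeholder.
def build_proxy_request(clinic_id: str, call_id: str, agent_name: str,
                        messages: list) -> urllib.request.Request:
    body = json.dumps({"model": "voice-agent-primary", "messages": messages})
    return urllib.request.Request(
        "https://llm-proxy.internal.vitaravox.ca/v1/chat/completions",
        data=body.encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-clinic-scoped-key",  # placeholder
            "x-clinic-id": clinic_id,
            "x-call-id": call_id,
            "x-agent-name": agent_name,
        },
        method="POST",
    )

req = build_proxy_request("clinic_abc_123", "call_xyz_789", "booking-en",
                          [{"role": "user", "content": "Yes, that works!"}])
```

Because the body is plain OpenAI chat-completions format, the proxy can hand it to any backend; the metadata lives entirely in headers, which is what keeps Vapi model-agnostic.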


Configuration Reference

Full LiteLLM Config

# litellm_config.yaml

model_list:
  # === PRIMARY: GPT-4o (complex tool-calling) ===
  - model_name: "voice-agent-primary"
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 10
      max_retries: 1

  # === PRIMARY FAILOVER: Claude Sonnet ===
  - model_name: "voice-agent-primary"
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: "os.environ/ANTHROPIC_API_KEY"
      timeout: 10

  # === FAST: GPT-4o-mini (simple turns) ===
  - model_name: "voice-agent-fast"
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 8

  # === FAST FAILOVER: Gemini Flash ===
  - model_name: "voice-agent-fast"
    litellm_params:
      model: vertex_ai/gemini-2.5-flash
      vertex_project: "vitaravox-prod"
      vertex_location: "northamerica-northeast1"
      timeout: 8

  # === CHINESE TRACK: GPT-4o (launch) ===
  - model_name: "voice-agent-zh"
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 10

  # === CHINESE TRACK: Qwen3 (post-bake-off) ===
  - model_name: "voice-agent-zh"
    litellm_params:
      model: openai/qwen3-72b
      api_base: "http://10.0.1.50:8000/v1"
      timeout: 10

  # === BUDGET: GPT-4.1-nano (starter clinics) ===
  - model_name: "voice-agent-budget"
    litellm_params:
      model: openai/gpt-4.1-nano
      api_key: "os.environ/OPENAI_API_KEY"
      timeout: 8

  # === CANADIAN DATA RESIDENCY: Azure OpenAI ===
  - model_name: "voice-agent-canada"
    litellm_params:
      model: azure/gpt-4o
      api_base: "os.environ/AZURE_OPENAI_ENDPOINT"
      api_key: "os.environ/AZURE_OPENAI_KEY"
      api_version: "2024-12-01-preview"
      timeout: 10

router_settings:
  routing_strategy: "latency-based-routing"
  num_retries: 2
  retry_after: 0.5
  allowed_fails: 2
  cooldown_time: 60
  fallbacks:
    - voice-agent-primary: ["voice-agent-fast"]
    - voice-agent-zh: ["voice-agent-primary"]

litellm_settings:
  success_callback: ["langfuse"]     # optional: eval tracking
  cache: true                         # response caching (exact-match, Redis-backed)
  cache_params:
    type: "redis"
    host: "os.environ/REDIS_HOST"
    port: 6379

general_settings:
  master_key: "os.environ/LITELLM_MASTER_KEY"
  database_url: "os.environ/DATABASE_URL"
  custom_auth: "custom_auth.auth_handler"    # per-clinic API key validation

Deployment

LiteLLM runs as a separate ECS Fargate service alongside the webhook server:

ECS Cluster
├── Service: vitara-admin-api (webhook server)
│   └── Task: 2-10 instances (auto-scaled)
├── Service: litellm-proxy (LLM gateway)
│   └── Task: 2 instances (always-on, internal ALB)
└── Service: otel-collector (observability)
    └── Task: 1 instance (or run as sidecars alongside each service)

Internal routing: webhook server calls http://litellm-proxy.internal:4000/v1/chat/completions. No public exposure. All traffic stays within VPC.