Phased Infrastructure Plan

Enterprise Migration Roadmap with Rationale

Date: February 18, 2026

Why This Order Matters

Each phase is a prerequisite for the next. The sequence is not arbitrary — it follows dependency chains. Attempting a later phase before completing its dependencies creates instability.

Phase 0 ──▶ Phase 1 ──▶ Phase 2 ──▶ Phase 3 ──▶ Phase 4 ──▶ Phase 5
Security     Redis       Managed      Container    LLM         Voice
Fixes        (shared     DB + Obs     + Auto-      Control     Pipeline
             state)                    Scale        Plane       Migration

FOUNDATION ─────────────────────────────── OPTIMIZATION ──── TRANSFORMATION
(Fixing gaps)                              (Revenue)          (Platform shift)

Phase 0: Security Critical Items

Timeline: Week 1-2
Cost: $0 (code changes only)
Risk if skipped: Authentication bypass, PIPEDA violation, cross-clinic data access

Why Phase 0 Is First

The security advisory identified 6 critical findings. Scaling infrastructure before fixing these scales the vulnerabilities. A load balancer distributing traffic to a server with forgeable JWTs just means the attacker gets HA too.

Deliverables

| Item | Finding | Fix |
|---|---|---|
| Rotate production secrets | JWT_SECRET = `vitara-dev-secret`, DB password = `vitara_dev_password` | Generate 256-bit random secrets, update .env |
| Enforce ENCRYPTION_KEY | Empty default in dev, no startup check | `.parse()` not `.safeParse()`, fail-fast in production |
| Webhook audit logging | Webhook operations NOT audited (Finding #19) | Log toolCallId, clinicId, action, patient affected |
| Idempotency tracking | No dedup on tool calls (Finding #10) | Track `toolCallId` in DB with 24h TTL, return cached result on retry |
| Validate metadata.clinicId | Untrusted metadata used directly (Finding #4) | Remove metadata override or restrict to admin with strong auth |
| Enforce webhook auth everywhere | Auth skipped in dev mode (Finding #2) | Require HMAC even in dev, use separate dev secret |
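
Two of the fixes above (256-bit random secrets and HMAC on every webhook) can be sketched with Node's built-in `crypto` module. This is a minimal sketch, not the project's actual code: the function names are hypothetical, and it assumes the signature arrives as a hex-encoded HMAC-SHA256 of the raw request body. The header that carries the signature depends on the webhook provider and is not shown.

```typescript
import { createHmac, randomBytes, timingSafeEqual } from 'node:crypto';

// 256-bit random secret, hex-encoded (64 characters).
function generateSecret(): string {
  return randomBytes(32).toString('hex');
}

// Hex HMAC-SHA256 of the raw request body (assumed signature scheme).
function signBody(secret: string, rawBody: string): string {
  return createHmac('sha256', secret).update(rawBody, 'utf8').digest('hex');
}

// Verifies in constant time. There is deliberately no dev-mode bypass:
// dev environments get their own secret instead of skipping the check.
function verifyWebhook(secret: string, rawBody: string, signatureHex: string): boolean {
  const expected = Buffer.from(signBody(secret, rawBody), 'hex');
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(expected, received);
}
```

The design choice worth noting is that the comparison runs over the raw body bytes, so the signature must be checked before any JSON parsing or re-serialization.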

Dependency Chain

Phase 0 completion ──▶ unlocks safe horizontal scaling (Phase 1+)
                  ──▶ unlocks compliance certification (any phase)

Phase 1: Add Redis

Timeline: Week 3-4
Cost: ~$12-25/month (ElastiCache cache.t4g.micro, HIPAA eligible)
Risk if skipped: Cannot scale horizontally. Every subsequent phase is blocked.

Why Redis Is Second

Every subsequent phase depends on shared state. Running two Fargate tasks behind an ALB, each with its own in-process Maps, means:

Call A hits Instance 1 → patient lookup cached there
Next webhook for same call hits Instance 2 → cache miss → extra 4s SOAP call
Circuit breaker on Instance 1 is OPEN (OSCAR down) → Instance 2 doesn't know → sends request anyway
Advisory lock on Instance 1 → Instance 2 books the same slot → double booking

Redis is the prerequisite for horizontal scaling.
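
The circuit-breaker scenario above can be reproduced in a few lines. This is an illustrative sketch only: the `CircuitBreaker` class, its threshold of 3, and the use of a plain `Map` as a stand-in for both the per-instance and the shared store are assumptions made for the example, not the project's implementation.

```typescript
// Breaker state lives in a store; whether instances share that store decides
// whether they agree on it. A Map stands in for both cases here.
type BreakerRecord = { state: 'CLOSED' | 'OPEN'; failCount: number };
type Store = Map<string, BreakerRecord>;

class CircuitBreaker {
  private readonly store: Store;
  private readonly service: string;
  private readonly threshold: number;

  constructor(store: Store, service: string, threshold = 3) {
    this.store = store;
    this.service = service;
    this.threshold = threshold;
  }

  // Count a failure and open the breaker once the threshold is reached.
  recordFailure(): void {
    const failCount = (this.store.get(this.service)?.failCount ?? 0) + 1;
    this.store.set(this.service, {
      state: failCount >= this.threshold ? 'OPEN' : 'CLOSED',
      failCount,
    });
  }

  allowsRequest(): boolean {
    return this.store.get(this.service)?.state !== 'OPEN';
  }
}

// Today: each instance has its own Map, so Instance 2 never sees the failures.
const instance1 = new CircuitBreaker(new Map(), 'oscar');
const instance2 = new CircuitBreaker(new Map(), 'oscar');
for (let i = 0; i < 3; i++) instance1.recordFailure();
console.log(instance1.allowsRequest(), instance2.allowsRequest()); // false true

// With a shared store (the role Redis plays), both instances agree.
const shared: Store = new Map();
const sharedA = new CircuitBreaker(shared, 'oscar');
const sharedB = new CircuitBreaker(shared, 'oscar');
for (let i = 0; i < 3; i++) sharedA.recordFailure();
console.log(sharedA.allowsRequest(), sharedB.allowsRequest()); // false false
```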

Architecture

┌──────────────────────────────────────────────────┐
│              ElastiCache (Redis 7+)               │
│              ca-central-1, encryption at rest      │
│                                                    │
│  ┌───────────────────────────────────────────────┐│
│  │  Key Namespace          │  TTL    │  Pattern  ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  call:{callId}          │  1 hour │  Hash     ││
│  │    agentId, patientId,  │         │           ││
│  │    clinicId, language,  │         │           ││
│  │    intent, turnCount    │         │           ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  cache:schedule:{provId}│  5 min  │  String   ││
│  │  :{date}                │         │  (JSON)   ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  cache:patient:{demoId} │  15 min │  Hash     ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  cache:providers:{cId}  │  1 hour │  String   ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  phone:{phoneNumberId}  │  1 hour │  String   ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  adapter:{clinicId}     │  5 min  │  String   ││
│  │  (serialized config)    │         │           ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  lock:slot:{provId}:    │  10 sec │  String   ││
│  │  {date}:{time}          │         │  (NX)     ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  circuit:{serviceName}  │  60 sec │  Hash     ││
│  │    state, failCount,    │         │           ││
│  │    lastFailure          │         │           ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  ratelimit:{scope}:     │  window │  String   ││
│  │  {identifier}:{window}  │         │  (INCR)   ││
│  ├─────────────────────────┼─────────┼───────────┤│
│  │  idempotent:{toolCallId}│  24 hr  │  String   ││
│  │  (cached tool result)   │         │  (JSON)   ││
│  └─────────────────────────┴─────────┴───────────┘│
└──────────────────────────────────────────────────┘
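
As a sketch of how the namespace table might translate into code, the block below defines key builders for a few of the rows and models the `lock:slot` row, which relies on set-if-absent with a 10-second expiry. The `InMemoryRedis` class is a hypothetical stand-in so the example runs without a server. Against a real Redis the equivalent command is `SET key value NX EX 10`, which ioredis exposes as `redis.set(key, value, 'EX', 10, 'NX')`.

```typescript
// Key builders for a subset of the namespace table.
const keys = {
  call: (callId: string) => `call:${callId}`,
  schedule: (provId: string, date: string) => `cache:schedule:${provId}:${date}`,
  slotLock: (provId: string, date: string, time: string) => `lock:slot:${provId}:${date}:${time}`,
  idempotent: (toolCallId: string) => `idempotent:${toolCallId}`,
};

// TTLs in seconds, copied from the same table.
const ttlSeconds = { call: 3600, schedule: 300, slotLock: 10, idempotent: 86400 };

// Hypothetical in-memory model of Redis `SET key value NX EX ttl`.
class InMemoryRedis {
  private readonly data = new Map<string, { value: string; expiresAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Returns true if the key was set, false if a live key already exists.
  setNx(key: string, value: string, ttl: number): boolean {
    const existing = this.data.get(key);
    if (existing && existing.expiresAt > this.now()) return false;
    this.data.set(key, { value, expiresAt: this.now() + ttl * 1000 });
    return true;
  }
}

// Only one caller can hold the lock for a given provider, date, and time.
function tryLockSlot(
  redis: InMemoryRedis,
  owner: string,
  provId: string,
  date: string,
  time: string,
): boolean {
  return redis.setNx(keys.slotLock(provId, date, time), owner, ttlSeconds.slotLock);
}
```

Because the lock expires on its own, a task that crashes mid-booking blocks the slot for at most 10 seconds rather than indefinitely.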

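Migration Strategy

Mechanical refactor: replace each in-process Map with ioredis calls. Keep the same TTLs. Use ioredis's cluster-compatible client from day one.

// Before: per-process cache, invisible to other instances
const vapiPhoneCache = new Map<string, { number: string; expiresAt: number }>();

// After: shared cache in Redis; the 'EX' TTL replaces the manual expiresAt field
import Redis from 'ioredis';

// Fail fast if the connection string is missing
const redisUrl = process.env.REDIS_URL;
if (!redisUrl) throw new Error('REDIS_URL is not set');
const redis = new Redis(redisUrl);

// phoneNumberId and number come from the surrounding handler
await redis.set(`phone:${phoneNumberId}`, number, 'EX', 3600); // 1 hour, as in the key table
const cached = await redis.get(`phone:${phoneNumberId}`); // string, or null on a miss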

Dependency Chain

Phase 1 completion ──▶ unlocks multi-instance deployment (Phase 3)
                  ──▶ unlocks distributed rate limiting (Phase 3)
                  ──▶ unlocks shared circuit breakers (immediate)
                  ──▶ unlocks idempotency dedup (immediate, supports Phase 0)
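
The idempotency dedup unlocked here could look roughly like the following. This is a sketch under assumptions: `ResultCache`, `InMemoryCache`, and `runOnce` are hypothetical names, and the in-memory cache stands in for Redis `GET` and `SET ... EX` on the `idempotent:{toolCallId}` key so the example runs without a server.

```typescript
// Minimal cache contract; a Redis-backed version would use GET and SET ... EX.
interface ResultCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class InMemoryCache implements ResultCache {
  private readonly data = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.data.get(key);
    return entry && entry.expiresAt > Date.now() ? entry.value : null;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.data.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

const IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60; // 24 hr, as in the key table

// Runs the handler at most once per toolCallId within the TTL; a retry with
// the same ID gets the stored result instead of a second side effect.
async function runOnce<T>(
  cache: ResultCache,
  toolCallId: string,
  handler: () => Promise<T>,
): Promise<T> {
  const key = `idempotent:${toolCallId}`;
  const cached = await cache.get(key);
  if (cached !== null) return JSON.parse(cached) as T;
  const result = await handler();
  await cache.set(key, JSON.stringify(result), IDEMPOTENCY_TTL_SECONDS);
  return result;
}
```

Note that the get-then-set sequence is not atomic: two retries arriving at the same instant could both run the handler. A production version would likely reserve the key first with a set-if-absent write.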

Phase 2: Managed Database + Observability

Timeline: Month 2
Cost: ~$150-250/month (RDS db.t4g.medium Multi-AZ + observability tier)
Risk if skipped: Database is single point of failure. Debugging multi-instance deployments is impossible without distributed tracing.

Why Phase 2 Is Third

Once you have Redis and multiple instances, you need:

A database that doesn't die when the OCI instance reboots. RDS Multi-AZ gives automatic failover, automated backups, point-in-time recovery. This addresses every DR gap in the infrastructure advisory.
The ability to see what's happening across instances. You cannot debug a 2-instance deployment by SSH'ing to each box and tailing PM2 logs. Distributed tracing shows you the full lifecycle of a tool call across the ALB → Instance → Redis → OSCAR SOAP chain.

Database Migration

Current:
  PostgreSQL 16 (local, single instance, no replication)
  Backup: pg_dump daily, 14-day retention, local disk
  Password: vitara_dev_password (!)

Target:
  RDS PostgreSQL 16 (Multi-AZ, ca-central-1)
  ├── Automated backups: 35-day retention
  ├── Point-in-time recovery: latest restorable time typically within the last 5 minutes
  ├── Encryption at rest: AWS KMS
  ├── Connection pooling: RDS Proxy
  ├── Read replica: for admin dashboard queries
  └── Monitoring: Enhanced Monitoring + Performance Insights

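Migration path: pg_dump from local → pg_restore to RDS. Update DATABASE_URL in Secrets Manager. Zero data loss, provided writes to the local database are paused from the start of the dump until the cutover, since pg_dump captures a snapshot and later writes are not carried over.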

Observability Stack

┌─────────────────────────────────────────────────────────┐
│                    Observability Layer                    │
│                                                          │
│  Application Code                                        │
│  ┌─────────────────────────────────────────────────────┐│
│  │  OpenTelemetry SDK (Node.js)                        ││
│  │                                                      ││
│  │  Traces:                                             ││
│  │    span: vapi.tool_call                              ││
│  │      attributes:                                     ││
│  │        clinicId, callId, toolName, agentName          ││
│  │        gen_ai.request.model (gpt-4o)                  ││
│  │        gen_ai.usage.input_tokens                      ││
│  │        gen_ai.usage.output_tokens                     ││
│  │                                                      ││
│  │  Metrics:                                             ││
│  │    tool_call_duration_ms (histogram)                   ││
│  │    booking_success_total (counter)                     ││
│  │    circuit_breaker_state (gauge)                       ││
│  │    oscar_soap_latency_p95 (summary)                    ││
│  │                                                      ││
│  │  Logs:                                                ││
│  │    Pino → OTel log bridge → same pipeline             ││
│  └────────────────────┬────────────────────────────────┘│
│                        │                                  │
│                        ▼                                  │
│  ┌─────────────────────────────────────────────────────┐│
│  │  OTel Collector (sidecar per Fargate task)          ││
│  │    - PHI field scrubbing (before export)            ││
│  │    - Sampling: 100% errors, 10% success             ││
│  │    - Batch export (reduce cost)                     ││
│  └────────────────────┬────────────────────────────────┘│
│                        │                                  │
│              ┌─────────┴──────────┐                      │
│              ▼                    ▼                       │
│  ┌──────────────────┐  ┌──────────────────┐             │
│  │  Option A:        │  │  Option B:        │             │
│  │  Datadog          │  │  Self-Hosted      │             │
│  │  (HIPAA tier)     │  │  Grafana Stack    │             │
│  │                   │  │                   │             │
│  │  - BAA available  │  │  - Loki (logs)    │             │
│  │  - Sensitive Data │  │  - Tempo (traces) │             │
│  │    Scanner (PHI)  │  │  - Prometheus     │             │
│  │  - ~$15/host/mo   │  │    (metrics)     │             │
│  └──────────────────┘  │  - Full control   │             │
│                         │  - ~$50/mo infra   │             │
│                         └──────────────────┘             │
└─────────────────────────────────────────────────────────┘
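
The two Collector policies in the diagram (PHI scrubbing before export, and keeping 100% of errors but 10% of successes) are configuration in a real OTel Collector, not application code. The sketch below only illustrates the logic. The attribute names in `PHI_KEYS` and both function names are hypothetical, and the trace ID is assumed to be a hex string as in W3C Trace Context.

```typescript
type Attributes = Record<string, string | number | boolean>;

// Hypothetical list of attribute keys treated as PHI.
const PHI_KEYS = new Set(['patientName', 'healthCardNumber', 'dateOfBirth', 'phoneNumber']);

// Replace PHI values with a placeholder; other attributes pass through.
function scrubPhi(attrs: Attributes): Attributes {
  const clean: Attributes = {};
  for (const [key, value] of Object.entries(attrs)) {
    clean[key] = PHI_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return clean;
}

// Keep every error trace; keep roughly `successRate` of the rest. The decision
// is derived from the trace ID so every span of a trace gets the same answer.
function shouldKeep(traceId: string, isError: boolean, successRate = 0.1): boolean {
  if (isError) return true;
  // Map the last 8 hex digits of the trace ID to a number in [0, 1).
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < successRate;
}
```

Since a trace is only known to contain an error once it has finished, the "100% errors" rule implies a tail-based sampling decision in the Collector, not head sampling in the SDK.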

Dependency Chain

Phase 2 completion ──▶ unlocks auto-scaling with visibility (Phase 3)
                  ──▶ unlocks database failover (immediate)
                  ──▶ unlocks PIPEDA audit compliance (immediate)
                  ──▶ unlocks per-tool latency optimization (ongoing)

Phase 3: Containerize + Auto-Scale

Timeline: Month 2-3
Cost: ~$100-300/month (Fargate tasks + ALB, scales with traffic)
Risk if skipped: Single point of failure remains. No zero-downtime deployments. Cannot handle clinic growth.

Why Phase 3 Is Fourth

This is where VitaraVox actually becomes highly available, and where compute migrates from OCI ARM (Toronto) to AWS ECS Fargate (ca-central-1). The current OCI instance is adequate for the pilot but lacks the managed services ecosystem (ElastiCache, Secrets Manager, ALB auto-scaling, X-Ray) needed at enterprise scale. The dev OSCAR instance is already on AWS EC2 (ca-central-1), so the network path to OSCAR stays within AWS.

Phase 3 only works if:

Phase 1 (Redis) provides shared state across instances
Phase 2 (RDS) provides a database that survives instance failures
Phase 2 (Observability) provides visibility into distributed behavior

Architecture

┌──────────────────────────────────────────────────────────────┐
│                     ECS Fargate Cluster                       │
│                     (ca-central-1)                            │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Application Load Balancer (ALB)                        │  │
│  │    - HTTPS termination (ACM certificate)                │  │
│  │    - Health check: GET /health every 30s                │  │
│  │    - Deregistration delay: 30s (drain in-flight calls)  │  │
│  │    - Sticky sessions: OFF (state is in Redis)           │  │
│  └───────────────────────┬────────────────────────────────┘  │
│                           │                                   │
│          ┌────────────────┼────────────────┐                 │
│          ▼                ▼                ▼                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐         │
│  │  Task 1       │ │  Task 2       │ │  Task N       │         │
│  │               │ │               │ │  (auto-scaled)│         │
│  │  vitara-api   │ │  vitara-api   │ │               │         │
│  │  container    │ │  container    │ │  Scaling rules:│        │
│  │               │ │               │ │  min: 2        │        │
│  │  256 CPU      │ │  256 CPU      │ │  max: 10       │        │
│  │  512 MB RAM   │ │  512 MB RAM   │ │  target: 70%   │        │
│  │               │ │               │ │  CPU            │        │
│  │  OTel sidecar │ │  OTel sidecar │ │                │        │
│  └──────────────┘ └──────────────┘ └──────────────┘         │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Deployment Strategy: Rolling Update                    │  │
│  │    - minimumHealthyPercent: 100%                        │  │
│  │    - maximumPercent: 200%                               │  │
│  │    - New tasks start → pass health check → old drain    │  │
│  │    - Zero downtime (no more PM2 restart interruptions)  │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  Secrets: AWS Secrets Manager                           │  │
│  │    - JWT_SECRET, ENCRYPTION_KEY, VAPI_API_KEY           │  │
│  │    - OSCAR credentials (per-clinic, from RDS)           │  │
│  │    - Automatic rotation support                         │  │
│  │    - IAM task role access (no .env files)               │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
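
The numbers in the diagram can be worked through with a little arithmetic. The sketch below is a simplification: real target tracking also involves alarm evaluation periods and cooldowns, so `desiredTasks` is only a proportional estimate, and both function names are hypothetical. The rounding in `rollingUpdateBounds` follows the ECS convention of rounding the minimum up and the maximum down.

```typescript
// Scaling rules from the diagram.
const scaling = { min: 2, max: 10, targetCpuPercent: 70 };

// Proportional estimate of the task count that brings average CPU back to the
// target, clamped to the configured min and max.
function desiredTasks(currentTasks: number, avgCpuPercent: number): number {
  const raw = Math.ceil(currentTasks * (avgCpuPercent / scaling.targetCpuPercent));
  return Math.min(scaling.max, Math.max(scaling.min, raw));
}

// Task-count bounds during a rolling update for a given desired count.
function rollingUpdateBounds(
  desired: number,
  minimumHealthyPercent = 100,
  maximumPercent = 200,
): { minRunning: number; maxRunning: number } {
  return {
    minRunning: Math.ceil((desired * minimumHealthyPercent) / 100),
    maxRunning: Math.floor((desired * maximumPercent) / 100),
  };
}

console.log(desiredTasks(2, 95)); // 3
console.log(rollingUpdateBounds(2)); // { minRunning: 2, maxRunning: 4 }
```

With 100% and 200%, a two-task service never drops below two healthy tasks during a deploy and briefly runs up to four, which is what makes the rollout zero-downtime.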

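Dockerfile

# Multi-stage build: compile TypeScript in the builder, ship only runtime artifacts
# Node 18 reached end-of-life in April 2025; use a currently supported LTS
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Install dev dependencies too, since tsc is needed for the build
RUN npm ci --include=dev
COPY . .
RUN npx tsc
# Drop dev dependencies so they are not copied into the runtime image
RUN npm prune --omit=dev

FROM node:22-alpine AS runtime
WORKDIR /app
# Run as an unprivileged user
RUN addgroup -g 1001 vitara && adduser -u 1001 -G vitara -s /bin/sh -D vitara
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER vitara
EXPOSE 3002
# Used by Docker only; ECS reads container health checks from the task definition
HEALTHCHECK CMD wget -q --spider http://localhost:3002/health || exit 1
CMD ["node", "dist/index.js"]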
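
Both the ALB and the container health check probe `/health` on port 3002. A minimal sketch of such an endpoint using Node's built-in `http` module follows. The function names and the idea of passing named dependency checks are assumptions for illustration, since the source does not describe what the real endpoint verifies.

```typescript
import { createServer } from 'node:http';

type Check = () => boolean;

// Pure function: 200 when every dependency check passes, 503 otherwise.
function healthResponse(checks: Record<string, Check>): { status: number; body: string } {
  const results: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) results[name] = check();
  const healthy = Object.values(results).every(Boolean);
  return { status: healthy ? 200 : 503, body: JSON.stringify({ healthy, checks: results }) };
}

// Serves /health on the port the Dockerfile exposes. HEAD is accepted as well
// as GET because some health checkers probe without requesting a body.
function startHealthServer(checks: Record<string, Check>, port = 3002) {
  return createServer((req, res) => {
    const isProbe = req.method === 'GET' || req.method === 'HEAD';
    if (isProbe && req.url === '/health') {
      const { status, body } = healthResponse(checks);
      res.writeHead(status, { 'Content-Type': 'application/json' }).end(body);
      return;
    }
    res.writeHead(404).end();
  }).listen(port);
}
```

Returning 503 when a dependency is down lets the load balancer take the task out of rotation, but a check that is too strict (for example failing whenever OSCAR is slow) can also remove every task at once, so the set of checks is a design decision.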

What This Solves

| Problem (from advisory) | Solution |
|---|---|
| 6666 PM2 restarts in 24h | Container health checks auto-replace unhealthy tasks |
| Single point of failure | Minimum 2 tasks, Multi-AZ |
| No zero-downtime deploy | Rolling update strategy |
| .env secrets on disk | Secrets Manager with IAM roles |
| Manual deployment process | ECR push → ECS deploy (CI/CD) |
| PM2 not version-controlled | Fargate task definition in IaC |
| OCI lacks managed AI/voice services | AWS ecosystem: ElastiCache, Secrets Mgr, X-Ray, HealthLake |

Dependency Chain

Phase 3 completion ──▶ unlocks multi-clinic scaling (immediate)
                  ──▶ unlocks auto-scaling under load (immediate)
                  ──▶ unlocks CI/CD pipeline (immediate)
                  ──▶ unlocks LLM proxy deployment (Phase 4)

Phase 4: LLM Control Plane

Timeline: Month 3
Cost: ~$50-100/month (Fargate task for proxy + LLM API costs pass-through)
Risk if skipped: No cost visibility per clinic. No model failover. No A/B testing capability. Pricing model is guesswork.

Why Phase 4 Is Fifth

By Phase 4, VitaraVox is running at scale with multiple clinics, auto-scaling infrastructure, and full observability. Now the business questions become:

"What does Clinic A cost us per call vs. Clinic B?"
"Can we route simple confirmations to GPT-4o-mini and save 85%?"
"If OpenAI has an outage, do all 50 clinics go dark?"
"Is Claude Sonnet better than GPT-4o for our Chinese track?"

This is the revenue optimization layer. See the LLM Control Plane document for full specifications.
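
To make the four questions concrete, here is a rough sketch of the three mechanisms they imply: routing by intent, provider failover, and per-clinic cost accounting. Everything in it is an assumption for illustration, including the intent names, the routing rule, and the price fields. The full specification lives in the LLM Control Plane document. Calls are synchronous to keep the sketch short, where a real proxy would be asynchronous.

```typescript
type Intent = 'confirmation' | 'booking' | 'triage';

// Hypothetical routing rule: simple confirmations go to the cheaper model.
function pickModel(intent: Intent): string {
  return intent === 'confirmation' ? 'gpt-4o-mini' : 'gpt-4o';
}

type Provider = { name: string; complete: (prompt: string) => string };

// Try providers in order; the first one that does not throw wins.
function completeWithFailover(
  providers: Provider[],
  prompt: string,
): { provider: string; text: string } {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, text: p.complete(prompt) };
    } catch (err) {
      errors.push(`${p.name}: ${String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join('; ')}`);
}

// Prices in dollars per million tokens (placeholder values supplied by the caller).
type Price = { inputPerMTok: number; outputPerMTok: number };

// Accumulates LLM spend per clinic so per-call cost can be compared.
class CostLedger {
  private readonly totals = new Map<string, number>();

  record(clinicId: string, inputTokens: number, outputTokens: number, price: Price): void {
    const cost =
      (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000;
    this.totals.set(clinicId, (this.totals.get(clinicId) ?? 0) + cost);
  }

  costFor(clinicId: string): number {
    return this.totals.get(clinicId) ?? 0;
  }
}
```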

Dependency Chain

Phase 4 completion ──▶ unlocks per-clinic SaaS pricing (business)
                  ──▶ unlocks model A/B testing (optimization)
                  ──▶ unlocks provider failover (resilience)
                  ──▶ unlocks voice pipeline migration (Phase 5)

Phase 5: Voice Pipeline Migration

Timeline: Month 4-6
Cost: ~$500-1500/month infrastructure (eliminates ~$5000/month Vapi cost at scale)
Risk if skipped: Vapi vendor lock-in, 50-100ms platform tax, cannot use best-in-class providers for underserved languages

Why Phase 5 Is Last

This is the riskiest and highest-effort change. Replacing the entire voice transport layer means rebuilding:

PSTN telephony (Telnyx SIP integration)
Real-time STT streaming with interruption handling
WebRTC transport for web/mobile
Voice activity detection and endpointing
Multi-agent handoff without audible gaps

Everything else must be rock-solid before attempting this. A LiveKit migration on fragile infrastructure is building on sand.

Migration Path

Per the multilingual voice stack advisory:

| Phase | Languages | Timeline |
|---|---|---|
| 5a | English only (validate against Vapi baseline) | Month 4 |
| 5b | + Mandarin (Deepgram zh + Qwen3-TTS) | Month 4-5 |
| 5c | + Cantonese (Google Chirp 3 + Qwen3-TTS), hardest track | Month 5 |
| 5d | + Punjabi (Sarvam) + French-CA (Deepgram + Azure HD) | Month 5-6 |
| 5e | Full regression, PHIPA audit, production cutover | Month 6 |

Keep Vapi v3.0 running in parallel throughout as production fallback.
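
One way to picture the staged rollout with Vapi as fallback is a per-language routing decision. The sketch below is hypothetical: the language codes, the numeric encoding of sub-phases 5a to 5d as 1 to 4, and the health flag are all assumptions for illustration.

```typescript
type Pipeline = 'livekit' | 'vapi';
type Language = 'en' | 'zh' | 'yue' | 'pa' | 'fr-CA';

// Sub-phase at which each language track moves to the new pipeline
// (5a = 1, 5b = 2, 5c = 3, 5d = 4), following the migration table.
const cutoverSubPhase: Record<Language, number> = {
  en: 1,
  zh: 2,
  yue: 3,
  pa: 4,
  'fr-CA': 4,
};

// Route a call to the new pipeline only if its language track has been cut
// over and the new pipeline is healthy; otherwise fall back to Vapi.
function selectPipeline(
  language: Language,
  completedSubPhase: number,
  newPipelineHealthy: boolean,
): Pipeline {
  if (!newPipelineHealthy) return 'vapi';
  return completedSubPhase >= cutoverSubPhase[language] ? 'livekit' : 'vapi';
}
```

For example, after sub-phase 5b completes, English and Mandarin calls would go to the new pipeline while Cantonese, Punjabi, and French-CA calls stay on Vapi.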

Phase Summary

| Phase | Timeline | Monthly Cost | What It Solves |
|---|---|---|---|
| 0: Security | Week 1-2 | $0 | Auth bypass, PIPEDA violation, cross-clinic access |
| 1: Redis | Week 3-4 | $12-25 | Horizontal scaling prerequisite, distributed state |
| 2: Managed DB + Obs | Month 2 | $150-250 | Database HA, disaster recovery, visibility |
| 3: Containers | Month 2-3 | $100-300 | Auto-scaling, zero-downtime, HA |
| 4: LLM Proxy | Month 3 | $50-100 | Cost tracking, failover, A/B testing |
| 5: Voice Migration | Month 4-6 | $500-1500 | Eliminate Vapi lock-in, best-in-class providers |
| Total at steady state | | ~$800-2,200 | Full enterprise stack (saves roughly $2.8-4.2K/mo vs. ~$5,000/mo Vapi at scale) |
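
The steady-state total can be checked by summing the per-phase ranges. The sketch below does that arithmetic and compares it against the ~$5000/month Vapi figure quoted in Phase 5. It assumes all six phases run concurrently at steady state and that the Vapi cost is fully eliminated.

```typescript
// Monthly cost ranges in dollars, copied from the summary table.
const phases = [
  { name: '0: Security', low: 0, high: 0 },
  { name: '1: Redis', low: 12, high: 25 },
  { name: '2: Managed DB + Obs', low: 150, high: 250 },
  { name: '3: Containers', low: 100, high: 300 },
  { name: '4: LLM Proxy', low: 50, high: 100 },
  { name: '5: Voice Migration', low: 500, high: 1500 },
];

const total = phases.reduce(
  (acc, p) => ({ low: acc.low + p.low, high: acc.high + p.high }),
  { low: 0, high: 0 },
);

// Vapi cost at scale, from the Phase 5 cost line.
const vapiMonthly = 5000;
const savings = { low: vapiMonthly - total.high, high: vapiMonthly - total.low };

console.log(total); // { low: 812, high: 2175 }
console.log(savings); // { low: 2825, high: 4188 }
```

The sum comes to about $812 to $2,175 per month, so the rounded total and savings figures in the table should be read as approximations of these bounds.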