Phased Infrastructure Plan
Enterprise Migration Roadmap with Rationale
Date: February 18, 2026
Why This Order Matters
Each phase is a prerequisite for the next. The sequence is not arbitrary — it follows dependency chains. Attempting a later phase before completing its dependencies creates instability.
Phase 0 ────▶ Phase 1 ────▶ Phase 2 ────▶ Phase 3 ────▶ Phase 4 ────▶ Phase 5
Security      Redis         Managed       Container     LLM           Voice
Fixes         (shared       DB + Obs      + Auto-       Control       Pipeline
              state)                      Scale         Plane         Migration
FOUNDATION ──────────────────────────────────────────── OPTIMIZATION  TRANSFORMATION
(Fixing gaps)                                           (Revenue)     (Platform shift)
Phase 0: Security Critical Items
Timeline: Week 1-2
Cost: $0 (code changes only)
Risk if skipped: Authentication bypass, PIPEDA violation, cross-clinic data access
Why Phase 0 Is First
The security advisory identified 6 critical findings. Scaling infrastructure before fixing these scales the vulnerabilities. A load balancer distributing traffic to a server with forgeable JWTs just means the attacker gets HA too.
Deliverables
| Item | Finding | Fix |
|---|---|---|
| Rotate production secrets | JWT_SECRET = vitara-dev-secret, DB password = vitara_dev_password | Generate 256-bit random secrets, update .env |
| Enforce ENCRYPTION_KEY | Empty default in dev, no startup check | Validate with .parse() instead of .safeParse() so a missing key fails startup in production |
| Webhook audit logging | Webhook operations NOT audited (Finding #19) | Log toolCallId, clinicId, action, patient affected |
| Idempotency tracking | No dedup on tool calls (Finding #10) | Track toolCallId in DB with 24h TTL, return cached result on retry |
| Validate metadata.clinicId | Untrusted metadata used directly (Finding #4) | Remove metadata override or restrict to admin with strong auth |
| Enforce webhook auth everywhere | Auth skipped in dev mode (Finding #2) | Require HMAC even in dev, use separate dev secret |
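The "require HMAC even in dev" fix can be sketched as follows. This is a minimal illustration, assuming the sender signs the raw request body with HMAC-SHA256 and sends the hex digest in a header; the function names and the signing scheme are assumptions for illustration, not the project's or Vapi's actual contract.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: hex HMAC-SHA256 of the raw request body.
// Throws when no secret is configured, so dev mode cannot silently skip auth.
function signBody(rawBody: string, secret: string): string {
  if (!secret) throw new Error("webhook secret is not configured");
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Verifies a hex signature in constant time. Returns false on any mismatch,
// including malformed or wrong-length signatures.
function verifyWebhookSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = Buffer.from(signBody(rawBody, secret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(expected, received);
}
```

Using a separate dev secret, as the table suggests, keeps this code path identical in every environment; only the secret value differs.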
Dependency Chain
Phase 0 completion ──▶ unlocks safe horizontal scaling (Phase 1+)
                   ──▶ unlocks compliance certification (any phase)
Phase 1: Add Redis
Timeline: Week 3-4
Cost: ~$12-25/month (ElastiCache cache.t4g.micro, HIPAA eligible)
Risk if skipped: Cannot scale horizontally. Every subsequent phase is blocked.
Why Redis Is Second
Every subsequent phase depends on shared state. Two Fargate tasks behind an ALB with separate in-process Maps means:
- Call A hits Instance 1 → patient lookup cached there
- Next webhook for same call hits Instance 2 → cache miss → extra 4s SOAP call
- Circuit breaker on Instance 1 is OPEN (OSCAR down) → Instance 2 doesn't know → sends request anyway
- Advisory lock on Instance 1 → Instance 2 books the same slot → double booking
Redis is the prerequisite for horizontal scaling.
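The double-booking case above comes down to an atomic set-if-absent with a short expiry that every instance can see. The sketch below assumes a hypothetical LockStore interface with an in-memory stand-in so it runs without a Redis server; with ioredis, the acquire step would be SET key token EX 10 NX and the release a small compare-and-delete Lua script. The key shape and 10-second TTL follow this plan's slot-lock design; withSlotLock and the store classes are illustrative names.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical minimal store interface (not an ioredis type).
interface LockStore {
  setIfAbsent(key: string, value: string, ttlSeconds: number): Promise<boolean>;
  deleteIfEquals(key: string, value: string): Promise<boolean>;
}

// In-memory stand-in for demonstration only; it is NOT shared across processes.
class InMemoryLockStore implements LockStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async setIfAbsent(key: string, value: string, ttlSeconds: number): Promise<boolean> {
    const existing = this.entries.get(key);
    if (existing && existing.expiresAt > Date.now()) return false;
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
    return true;
  }

  async deleteIfEquals(key: string, value: string): Promise<boolean> {
    const existing = this.entries.get(key);
    if (!existing || existing.value !== value) return false;
    this.entries.delete(key);
    return true;
  }
}

// Runs `book` only if this caller wins the lock; resolves to null when
// another caller already holds the slot.
async function withSlotLock<T>(
  store: LockStore,
  provId: string,
  date: string,
  time: string,
  book: () => Promise<T>,
): Promise<T | null> {
  const key = `lock:slot:${provId}:${date}:${time}`;
  const token = randomUUID(); // identifies this holder
  const acquired = await store.setIfAbsent(key, token, 10);
  if (!acquired) return null;
  try {
    return await book();
  } finally {
    // Release only our own lock, never one acquired by someone else after expiry.
    await store.deleteIfEquals(key, token);
  }
}
```

The per-holder token matters because the lock can expire while a slow booking is still in flight; deleting by key alone could release a lock that a second instance has since acquired.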
Architecture
ElastiCache (Redis 7+), ca-central-1, encryption at rest
| Key Namespace | TTL | Pattern |
|---|---|---|
| call:{callId} (fields: agentId, patientId, clinicId, language, intent, turnCount) | 1 hour | Hash |
| cache:schedule:{provId}:{date} | 5 min | String (JSON) |
| cache:patient:{demoId} | 15 min | Hash |
| cache:providers:{cId} | 1 hour | String |
| phone:{phoneNumberId} | 1 hour | String |
| adapter:{clinicId} (serialized config) | 5 min | String |
| lock:slot:{provId}:{date}:{time} | 10 sec | String (NX) |
| circuit:{serviceName} (fields: state, failCount, lastFailure) | 60 sec | Hash |
| ratelimit:{scope}:{identifier}:{window} | window | String (INCR) |
| idempotent:{toolCallId} (cached tool result) | 24 hr | String (JSON) |
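One way to keep these key shapes and TTLs from drifting across call sites is to centralize them. The sketch below mirrors the namespaces and TTLs listed above; the object names TTL_SECONDS and redisKeys are illustrative, and the rate-limit window is left to the caller because the plan does not fix its length.

```typescript
// TTLs in seconds, copied from the key namespace layout above.
const TTL_SECONDS = {
  call: 3600,
  schedule: 300,
  patient: 900,
  providers: 3600,
  phone: 3600,
  adapter: 300,
  slotLock: 10,
  circuit: 60,
  idempotent: 86400,
} as const;

// One builder per namespace so each key shape is defined exactly once.
const redisKeys = {
  call: (callId: string) => `call:${callId}`,
  schedule: (provId: string, date: string) => `cache:schedule:${provId}:${date}`,
  patient: (demoId: string) => `cache:patient:${demoId}`,
  providers: (clinicId: string) => `cache:providers:${clinicId}`,
  phone: (phoneNumberId: string) => `phone:${phoneNumberId}`,
  adapter: (clinicId: string) => `adapter:${clinicId}`,
  slotLock: (provId: string, date: string, time: string) =>
    `lock:slot:${provId}:${date}:${time}`,
  circuit: (serviceName: string) => `circuit:${serviceName}`,
  rateLimit: (scope: string, identifier: string, windowId: number) =>
    `ratelimit:${scope}:${identifier}:${windowId}`,
  idempotent: (toolCallId: string) => `idempotent:${toolCallId}`,
};
```

A call site would then read, for example, redisKeys.schedule(provId, date) with TTL_SECONDS.schedule rather than repeating the string template and the number.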
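Migration Strategy
Mechanical refactor: replace each in-process Map with ioredis calls. Keep the same TTLs. Use the ioredis cluster-compatible client from day one.
// Before: per-process cache, invisible to other instances
const vapiPhoneCache = new Map<string, { number: string; expiresAt: number }>();
// After: shared cache in Redis
import Redis from 'ioredis';
const redisUrl = process.env.REDIS_URL;
if (!redisUrl) throw new Error('REDIS_URL is not set'); // fail fast on missing config
const redis = new Redis(redisUrl);
// Write with a 1-hour TTL (EX is in seconds), matching the phone:{phoneNumberId} namespace
await redis.set(`phone:${phoneNumberId}`, number, 'EX', 3600);
// Read back; resolves to null on a miss or after expiry
const cached = await redis.get(`phone:${phoneNumberId}`);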
Dependency Chain
Phase 1 completion ──▶ unlocks multi-instance deployment (Phase 3)
                   ──▶ unlocks distributed rate limiting (Phase 3)
                   ──▶ unlocks shared circuit breakers (immediate)
                   ──▶ unlocks idempotency dedup (immediate, supports Phase 0)
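The idempotency dedup unlocked here can be sketched as a read-through wrapper around a tool handler. The example assumes a hypothetical ResultStore interface with an in-memory stand-in so it runs without Redis; with ioredis, get(key) and set(key, value, 'EX', ttlSeconds) would sit behind the two methods. The idempotent:{toolCallId} key and 24-hour TTL come from this plan; handleToolCallOnce is an illustrative name.

```typescript
// Hypothetical store interface (not an ioredis type).
interface ResultStore {
  get(key: string): Promise<string | null>;
  setWithTtl(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in for demonstration only.
class InMemoryResultStore implements ResultStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= Date.now()) return null;
    return entry.value;
  }

  async setWithTtl(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

const IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60; // 24 hr, as in the plan

// Runs the handler once per toolCallId for sequential retries and replays
// the cached JSON result on later attempts.
async function handleToolCallOnce<T>(
  store: ResultStore,
  toolCallId: string,
  handler: () => Promise<T>,
): Promise<T> {
  const key = `idempotent:${toolCallId}`;
  const cached = await store.get(key);
  if (cached !== null) return JSON.parse(cached) as T;
  const result = await handler();
  await store.setWithTtl(key, JSON.stringify(result), IDEMPOTENCY_TTL_SECONDS);
  return result;
}
```

This sketch only covers retries that arrive after the first attempt has finished. Two retries racing each other can both miss the cache; closing that gap would need an additional set-if-absent "in progress" marker, which is omitted here.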
Phase 2: Managed Database + Observability
Timeline: Month 2
Cost: ~$150-250/month (RDS db.t4g.medium Multi-AZ + observability tier)
Risk if skipped: Database is single point of failure. Debugging multi-instance deployments is impossible without distributed tracing.
Why Phase 2 Is Third
Once you have Redis and multiple instances, you need:
- A database that doesn't die when the OCI instance reboots. RDS Multi-AZ gives automatic failover, automated backups, and point-in-time recovery. This addresses every DR gap in the infrastructure advisory.
- The ability to see what's happening across instances. You cannot debug a 2-instance deployment by SSH'ing to each box and tailing PM2 logs. Distributed tracing shows you the full lifecycle of a tool call across the ALB → Instance → Redis → OSCAR SOAP chain.
Database Migration
Current:
PostgreSQL 16 (local, single instance, no replication)
Backup: pg_dump daily, 14-day retention, local disk
Password: vitara_dev_password (!)
Target:
RDS PostgreSQL 16 (Multi-AZ, ca-central-1)
├── Automated backups: 35-day retention
├── Point-in-time recovery: latest restorable time typically within the last 5 minutes
├── Encryption at rest: AWS KMS
├── Connection pooling: RDS Proxy
├── Read replica: for admin dashboard queries
└── Monitoring: Enhanced Monitoring + Performance Insights
Migration path: pg_dump from local → pg_restore to RDS, then update DATABASE_URL in Secrets Manager. Zero data loss, provided writes are stopped for the duration of the dump and restore.
Observability Stack
┌─────────────────────────────────────────────────────────┐
│ Observability Layer │
│ │
│ Application Code │
│ ┌─────────────────────────────────────────────────────┐│
│ │ OpenTelemetry SDK (Node.js) ││
│ │ ││
│ │ Traces: ││
│ │ span: vapi.tool_call ││
│ │ attributes: ││
│ │ clinicId, callId, toolName, agentName ││
│ │ gen_ai.request.model (gpt-4o) ││
│ │ gen_ai.usage.input_tokens ││
│ │ gen_ai.usage.output_tokens ││
│ │ ││
│ │ Metrics: ││
│ │ tool_call_duration_ms (histogram) ││
│ │ booking_success_total (counter) ││
│ │ circuit_breaker_state (gauge) ││
│ │ oscar_soap_latency_p95 (summary) ││
│ │ ││
│ │ Logs: ││
│ │ Pino → OTel log bridge → same pipeline ││
│ └────────────────────┬────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐│
│ │ OTel Collector (sidecar per Fargate task) ││
│ │ - PHI field scrubbing (before export) ││
│ │ - Sampling: 100% errors, 10% success ││
│ │ - Batch export (reduce cost) ││
│ └────────────────────┬────────────────────────────────┘│
│ │ │
│ ┌─────────┴──────────┐ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Option A: │ │ Option B: │ │
│ │ Datadog │ │ Self-Hosted │ │
│ │ (HIPAA tier) │ │ Grafana Stack │ │
│ │ │ │ │ │
│ │ - BAA available │ │ - Loki (logs) │ │
│ │ - Sensitive Data │ │ - Tempo (traces) │ │
│ │ Scanner (PHI) │ │ - Prometheus │ │
│ │ - ~$15/host/mo │ │ (metrics) │ │
│ └──────────────────┘ │ - Full control │ │
│ │ - ~$50/mo infra │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────┘
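The two Collector responsibilities above, PHI scrubbing before export and "100% errors, 10% success" sampling, can be illustrated in plain TypeScript. In a real deployment this logic would more likely live in Collector processor configuration than in application code; the sketch only shows the decision logic. The PHI field list is a hypothetical placeholder, and scrubPhi and shouldExport are illustrative names.

```typescript
// Hypothetical list of attribute names treated as PHI; the real list would
// come from the project's own data classification.
const PHI_FIELDS = new Set(["patientName", "healthCardNumber", "dateOfBirth", "phoneNumber"]);

type SpanAttributes = Record<string, string | number | boolean>;

// Returns a copy with PHI values replaced; other attributes are untouched.
function scrubPhi(attributes: SpanAttributes): SpanAttributes {
  const scrubbed: SpanAttributes = {};
  for (const [name, value] of Object.entries(attributes)) {
    scrubbed[name] = PHI_FIELDS.has(name) ? "[REDACTED]" : value;
  }
  return scrubbed;
}

// Keeps every error trace and roughly `successRate` of successful ones.
// The decision is derived from the trace ID so all spans of a trace agree.
function shouldExport(traceIdHex: string, isError: boolean, successRate = 0.1): boolean {
  if (isError) return true;
  // Map the last 8 hex digits to a number in [0, 1).
  const bucket = parseInt(traceIdHex.slice(-8), 16) / 0x100000000;
  return bucket < successRate;
}
```

Deriving the sampling decision from the trace ID, rather than from a random draw per span, avoids exporting half a trace when a tool call crosses several components.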
Dependency Chain
Phase 2 completion ──▶ unlocks auto-scaling with visibility (Phase 3)
                   ──▶ unlocks database failover (immediate)
                   ──▶ unlocks PIPEDA audit compliance (immediate)
                   ──▶ unlocks per-tool latency optimization (ongoing)
Phase 3: Containerize + Auto-Scale
Timeline: Month 2-3
Cost: ~$100-300/month (Fargate tasks + ALB, scales with traffic)
Risk if skipped: Single point of failure remains. No zero-downtime deployments. Cannot handle clinic growth.
Why Phase 3 Is Fourth
This is where VitaraVox actually becomes highly available — and where compute migrates from OCI ARM (Toronto) to AWS ECS Fargate (ca-central-1). The current OCI instance is adequate for pilot but lacks the managed services ecosystem (ElastiCache, Secrets Manager, ALB auto-scaling, X-Ray) needed at enterprise scale. The dev OSCAR instance is already on AWS EC2 (ca-central-1), so the network path to OSCAR stays within AWS.
Phase 3 only works if:
- Phase 1 (Redis) provides shared state across instances
- Phase 2 (RDS) provides a database that survives instance failures
- Phase 2 (Observability) provides visibility into distributed behavior
Architecture
┌──────────────────────────────────────────────────────────────┐
│ ECS Fargate Cluster │
│ (ca-central-1) │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Application Load Balancer (ALB) │ │
│ │ - HTTPS termination (ACM certificate) │ │
│ │ - Health check: GET /health every 30s │ │
│ │ - Deregistration delay: 30s (drain in-flight calls) │ │
│ │ - Sticky sessions: OFF (state is in Redis) │ │
│ └───────────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌────────────────┼────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Task 1 │ │ Task 2 │ │ Task N │ │
│ │ │ │ │ │ (auto-scaled)│ │
│ │ vitara-api │ │ vitara-api │ │ │ │
│ │ container │ │ container │ │ Scaling rules:│ │
│ │ │ │ │ │ min: 2 │ │
│ │ 256 CPU │ │ 256 CPU │ │ max: 10 │ │
│ │ 512 MB RAM │ │ 512 MB RAM │ │ target: 70% │ │
│ │ │ │ │ │ CPU │ │
│ │ OTel sidecar │ │ OTel sidecar │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Deployment Strategy: Rolling Update │ │
│ │ - minimumHealthyPercent: 100% │ │
│ │ - maximumPercent: 200% │ │
│ │ - New tasks start → pass health check → old drain │ │
│ │ - Zero downtime (no more PM2 restart interruptions) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Secrets: AWS Secrets Manager │ │
│ │ - JWT_SECRET, ENCRYPTION_KEY, VAPI_API_KEY │ │
│ │ - OSCAR credentials (per-clinic, from RDS) │ │
│ │ - Automatic rotation support │ │
│ │ - IAM task role access (no .env files) │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
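To make the scaling rules and rolling-update percentages above concrete, the sketch below works through the arithmetic. The scaling function is a simplified proportional model (scale task count by observed CPU over target CPU, then clamp); it is an approximation for reasoning about capacity, not the algorithm AWS target tracking actually runs. The rolling-update bounds follow the ECS documentation's rounding, with the lower bound rounded up and the upper bound rounded down. All names are illustrative.

```typescript
interface ScalingPolicy {
  minTasks: number;
  maxTasks: number;
  targetCpuPercent: number;
}

// Values from the architecture above: min 2, max 10, target 70% CPU.
const scalingPolicy: ScalingPolicy = { minTasks: 2, maxTasks: 10, targetCpuPercent: 70 };

// Simplified proportional model, clamped to the policy's task limits.
function desiredTaskCount(
  currentTasks: number,
  avgCpuPercent: number,
  policy: ScalingPolicy = scalingPolicy,
): number {
  const proportional = Math.ceil(currentTasks * (avgCpuPercent / policy.targetCpuPercent));
  return Math.min(policy.maxTasks, Math.max(policy.minTasks, proportional));
}

// How many tasks may run during a rolling update for a given desired count.
function rollingUpdateBounds(
  desiredCount: number,
  minimumHealthyPercent = 100,
  maximumPercent = 200,
): { minRunning: number; maxRunning: number } {
  return {
    minRunning: Math.ceil((desiredCount * minimumHealthyPercent) / 100),
    maxRunning: Math.floor((desiredCount * maximumPercent) / 100),
  };
}
```

With two tasks at 95% average CPU the model asks for three tasks, and a deployment at the two-task minimum may briefly run four (two old, two new) while never dropping below two healthy tasks.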
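Dockerfile
# Multi-stage build: compile TypeScript in the builder, ship only runtime files
# Note: Node 18 reached end-of-life in April 2025; move to a supported LTS image when the app allows
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Install all dependencies (dev included) so tsc is available
RUN npm ci --include=dev
COPY . .
RUN npx tsc
# Drop devDependencies before node_modules is copied into the runtime image
RUN npm prune --omit=dev
FROM node:18-alpine AS runtime
WORKDIR /app
ENV NODE_ENV=production
# Run as a non-root user
RUN addgroup -g 1001 vitara && adduser -u 1001 -G vitara -s /bin/sh -D vitara
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER vitara
EXPOSE 3002
# BusyBox wget ships with Alpine; a non-zero exit marks the container unhealthy
HEALTHCHECK CMD wget -q --spider http://localhost:3002/health || exit 1
CMD ["node", "dist/index.js"]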
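Both the Dockerfile HEALTHCHECK and the ALB probe depend on a /health route on port 3002. A minimal sketch of such a route using only node:http follows; the dependency checks passed in (for example a Redis or database ping) are hypothetical, and whether a failing dependency should mark the task unhealthy is a design decision this plan does not settle.

```typescript
import { createServer, type Server } from "node:http";

type HealthCheck = () => Promise<boolean>;

// Runs every check; a false result or a thrown error counts as unhealthy.
async function evaluateHealth(
  checks: Record<string, HealthCheck>,
): Promise<{ status: number; body: Record<string, boolean> }> {
  const body: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) {
    try {
      body[name] = await check();
    } catch {
      body[name] = false;
    }
  }
  const healthy = Object.values(body).every(Boolean);
  return { status: healthy ? 200 : 503, body };
}

// Builds the HTTP server without starting it. HEAD is accepted alongside
// GET because some probe tools use it.
function createHealthServer(checks: Record<string, HealthCheck>): Server {
  return createServer(async (req, res) => {
    const isProbe = req.method === "GET" || req.method === "HEAD";
    if (!isProbe || req.url !== "/health") {
      res.writeHead(404).end();
      return;
    }
    const { status, body } = await evaluateHealth(checks);
    res.writeHead(status, { "Content-Type": "application/json" });
    res.end(JSON.stringify(body));
  });
}

// Usage (not started here): createHealthServer({ redis: pingRedis }).listen(3002);
```

Keeping evaluateHealth separate from the server makes the pass/fail logic testable without opening a port.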
What This Solves
| Problem (from advisory) | Solution |
|---|---|
| 6666 PM2 restarts in 24h | Container health checks auto-replace unhealthy tasks |
| Single point of failure | Minimum 2 tasks, Multi-AZ |
| No zero-downtime deploy | Rolling update strategy |
| .env secrets on disk | Secrets Manager with IAM roles |
| Manual deployment process | ECR push → ECS deploy (CI/CD) |
| PM2 not version-controlled | Fargate task definition in IaC |
| OCI lacks managed AI/voice services | AWS ecosystem: ElastiCache, Secrets Mgr, X-Ray, HealthLake |
Dependency Chain
Phase 3 completion ──▶ unlocks multi-clinic scaling (immediate)
                   ──▶ unlocks auto-scaling under load (immediate)
                   ──▶ unlocks CI/CD pipeline (immediate)
                   ──▶ unlocks LLM proxy deployment (Phase 4)
Phase 4: LLM Control Plane
Timeline: Month 3
Cost: ~$50-100/month (Fargate task for proxy + LLM API costs pass-through)
Risk if skipped: No cost visibility per clinic. No model failover. No A/B testing capability. Pricing model is guesswork.
Why Phase 4 Is Fifth
By Phase 4, VitaraVox is running at scale with multiple clinics, auto-scaling infrastructure, and full observability. Now the business questions become:
- "What does Clinic A cost us per call vs. Clinic B?"
- "Can we route simple confirmations to GPT-4o-mini and save 85%?"
- "If OpenAI has an outage, do all 50 clinics go dark?"
- "Is Claude Sonnet better than GPT-4o for our Chinese track?"
This is the revenue optimization layer. See the LLM Control Plane document for full specifications.
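Two of the questions above, cost per call by clinic and provider failover, reduce to small computations once token usage is recorded per call. The sketch below is an illustration under stated assumptions: the LlmUsage record shape, the model names, and every price are placeholders invented for the example, not real provider pricing or the control plane's actual schema.

```typescript
interface LlmUsage {
  clinicId: string;
  callId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Placeholder prices in USD per million tokens. Illustrative only.
const PRICE_PER_MILLION: Record<string, { input: number; output: number }> = {
  "model-large": { input: 2.0, output: 8.0 },
  "model-small": { input: 0.2, output: 0.8 },
};

function usageCostUsd(usage: LlmUsage): number {
  const price = PRICE_PER_MILLION[usage.model];
  if (!price) throw new Error(`no price configured for model ${usage.model}`);
  return (usage.inputTokens * price.input + usage.outputTokens * price.output) / 1e6;
}

// Average LLM cost per call, grouped by clinic.
function costPerCallByClinic(records: LlmUsage[]): Map<string, number> {
  const totals = new Map<string, { cost: number; calls: Set<string> }>();
  for (const record of records) {
    const entry = totals.get(record.clinicId) ?? { cost: 0, calls: new Set<string>() };
    entry.cost += usageCostUsd(record);
    entry.calls.add(record.callId);
    totals.set(record.clinicId, entry);
  }
  const averages = new Map<string, number>();
  totals.forEach((entry, clinicId) => averages.set(clinicId, entry.cost / entry.calls.size));
  return averages;
}

// Returns the first provider in priority order that is currently healthy,
// or null when none is.
function pickProvider(priority: string[], isHealthy: (provider: string) => boolean): string | null {
  return priority.find((provider) => isHealthy(provider)) ?? null;
}
```

Grouping by callId rather than by record means a call that makes several LLM requests still counts as one call in the average.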
Dependency Chain
Phase 4 completion ──▶ unlocks per-clinic SaaS pricing (business)
                   ──▶ unlocks model A/B testing (optimization)
                   ──▶ unlocks provider failover (resilience)
                   ──▶ unlocks voice pipeline migration (Phase 5)
Phase 5: Voice Pipeline Migration
Timeline: Month 4-6
Cost: ~$500-1500/month infrastructure (eliminates ~$5000/month Vapi cost at scale)
Risk if skipped: Vapi vendor lock-in, 50-100ms platform tax, cannot use best-in-class providers for underserved languages
Why Phase 5 Is Last
This is the riskiest and highest-effort change. Replacing the entire voice transport layer means rebuilding:
- PSTN telephony (Telnyx SIP integration)
- Real-time STT streaming with interruption handling
- WebRTC transport for web/mobile
- Voice activity detection and endpointing
- Multi-agent handoff without audible gaps
Everything else must be rock-solid before attempting this. A LiveKit migration on a fragile infrastructure is building on sand.
Migration Path
Per the multilingual voice stack advisory:
| Phase | Languages | Timeline |
|---|---|---|
| 5a | English only (validate against Vapi baseline) | Month 4 |
| 5b | + Mandarin (Deepgram zh + Qwen3-TTS) | Month 4-5 |
| 5c | + Cantonese (Google Chirp 3 + Qwen3-TTS) — hardest track | Month 5 |
| 5d | + Punjabi (Sarvam) + French-CA (Deepgram + Azure HD) | Month 5-6 |
| 5e | Full regression, PHIPA audit, production cutover | Month 6 |
Keep Vapi v3.0 running in parallel throughout as production fallback.
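The staged rollout with Vapi as a standing fallback can be expressed as a per-call routing decision. This is a sketch under assumptions: the language codes, the SubPhase type, and pipelineFor are illustrative names, and the health signal for the new pipeline is left abstract. The mapping of languages to sub-phases follows the table above.

```typescript
type Language = "en" | "zh" | "yue" | "pa" | "fr-CA";
type SubPhase = "5a" | "5b" | "5c" | "5d" | "5e";

// Sub-phase in which each language first moves to the new pipeline.
// Language codes are assumed labels for English, Mandarin, Cantonese,
// Punjabi, and Canadian French.
const INTRODUCED_IN: Record<Language, SubPhase> = {
  en: "5a",
  zh: "5b",
  yue: "5c",
  pa: "5d",
  "fr-CA": "5d",
};

const SUB_PHASE_ORDER: SubPhase[] = ["5a", "5b", "5c", "5d", "5e"];

// Routes a call to the new pipeline only if its language has been migrated
// by the current sub-phase and the new pipeline is healthy; otherwise Vapi.
function pipelineFor(
  language: Language,
  currentSubPhase: SubPhase | null,
  newPipelineHealthy: boolean,
): "new" | "vapi" {
  if (currentSubPhase === null || !newPipelineHealthy) return "vapi";
  const migrated =
    SUB_PHASE_ORDER.indexOf(INTRODUCED_IN[language]) <= SUB_PHASE_ORDER.indexOf(currentSubPhase);
  return migrated ? "new" : "vapi";
}
```

Because the fallback is the default branch, a language that has not yet been validated, or any call placed while the new pipeline is degraded, stays on the existing Vapi path.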
Phase Summary
| Phase | Timeline | Monthly Cost | What It Solves |
|---|---|---|---|
| 0: Security | Week 1-2 | $0 | Auth bypass, PIPEDA violation, cross-clinic access |
| 1: Redis | Week 3-4 | $12-25 | Horizontal scaling prerequisite, distributed state |
| 2: Managed DB + Obs | Month 2 | $150-250 | Database HA, disaster recovery, visibility |
| 3: Containers | Month 2-3 | $100-300 | Auto-scaling, zero-downtime, HA |
| 4: LLM Proxy | Month 3 | $50-100 | Cost tracking, failover, A/B testing |
| 5: Voice Migration | Month 4-6 | $500-1500 | Eliminate Vapi lock-in, best-in-class providers |
| Total at steady state | | ~$800-2200 | Full enterprise stack (saves roughly $3-4K/mo vs. Vapi at scale) |
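The steady-state total can be checked by summing the per-phase monthly ranges from the table. The sketch below does that arithmetic and compares the result against the ~$5000/month Vapi figure quoted for Phase 5; the variable names are illustrative, and the figures are the plan's own estimates.

```typescript
// Monthly cost ranges in USD, copied from the Phase Summary table.
const phaseCosts = [
  { phase: "0: Security", low: 0, high: 0 },
  { phase: "1: Redis", low: 12, high: 25 },
  { phase: "2: Managed DB + Obs", low: 150, high: 250 },
  { phase: "3: Containers", low: 100, high: 300 },
  { phase: "4: LLM Proxy", low: 50, high: 100 },
  { phase: "5: Voice Migration", low: 500, high: 1500 },
];

// Sum the low and high ends independently.
const steadyState = phaseCosts.reduce(
  (sum, p) => ({ low: sum.low + p.low, high: sum.high + p.high }),
  { low: 0, high: 0 },
);

// Vapi cost at scale, per the Phase 5 estimate.
const vapiMonthlyAtScale = 5000;

// Savings range: the best case pairs with the lowest infrastructure cost.
const monthlySavings = {
  low: vapiMonthlyAtScale - steadyState.high,
  high: vapiMonthlyAtScale - steadyState.low,
};

console.log(steadyState); // { low: 812, high: 2175 }
console.log(monthlySavings); // { low: 2825, high: 4188 }
```

The sums come to $812-2175 per month, with savings of roughly $2.8-4.2K per month against the Vapi estimate, which is where the rounded figures in the total row come from.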