PROD Monorepo Cutover + DID routing_environment Rollout

Status: ✅ COMPLETED 2026-04-17
Owner: Hari
Backup retained at: /root/cutover-backup-1776420747 on prod VPS (keep until 2026-05-01)
Outcome: Clean cutover. Zero customer downtime. All four prod services (api, editor, workflow-engine, pipecat-flow) migrated to monorepo deploys. Staging forwarding via WireGuard live for AstraPrivate DIDs.
Rollback time estimate if needed: ~2 min (revert /opt/ from backup, pm2 restart)

The runbook below documents the plan as it was executed. It is kept as the reference template for the next platform-level cutover.

This runbook covers the first real deploy through the monorepo CI/CD on production — alongside the schema change required by the DID routing-environment feature merged in PR #8.

Why this is high-stakes

  1. First real monorepo deploy on prod. Per-repo CI/CD on prod was disabled when we set up the monorepo runner. This will be the first push that actually rsyncs the new monorepo into /opt/<app> on prod.
  2. Schema migration required. PR #8's generator queries did_numbers.routing_environment. Without the ALTER TABLE, every gateway-routing regeneration on prod will throw Unknown column.
  3. Auto-trigger on merge. All four deploy-*-prod.yml workflows fire on push: branches: [main] with paths: ['<app>/**']. The moment the merge lands, the prod runner picks it up and rsyncs every app whose path changed.

Therefore the schema must be in place BEFORE the merge, and we back up before merging.

Pre-flight checks (run before anything destructive)

# 1. Prod monorepo runner is online + listening
ssh root@89.116.31.109 'systemctl status actions.runner.astradial-astradial-platform.* --no-pager | head -5'

# 2. Old per-repo runners are NOT running (would also pick up workflows)
ssh root@89.116.31.109 'systemctl list-units --type=service --state=running | grep actions.runner'
# Expect: only the astradial-platform runner; if old per-repo runners are running, stop them first.

# 3. Prod self-hosted runner has 'production' label
# Check via: GitHub repo → Settings → Actions → Runners → astradial-platform runner → labels include "production"

# 4. Prod app services healthy under pm2 (this checks the pm2 processes, not Asterisk itself)
ssh root@89.116.31.109 'pm2 status | grep -E "astrapbx|workflow-engine|editor|pipecat"'
# All must be online

# 5. cloud-endpoint-stage PJSIP endpoint exists on prod (so staging-flagged DIDs can forward)
ssh root@89.116.31.109 'asterisk -rx "pjsip show endpoint cloud-endpoint-stage"'
# Must show Status: Avail (qualify reachable)

# 6. WireGuard tunnel prod ↔ staging is up
ssh root@89.116.31.109 'wg show wg0 | grep -A 3 "10.10.10.3"'
# Last handshake must be recent (within minutes)

If any pre-flight fails — STOP. Don't merge.
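
Check 6 relies on reading the handshake time by eye. A minimal sketch of making "recent" explicit follows. The function name handshake_is_fresh and the 180-second default are assumptions for illustration, not part of the executed plan, and the commented wiring assumes the staging peer is the only peer on wg0.

```shell
# Hypothetical helper: decide whether a WireGuard handshake is recent enough.
# $1 = last handshake (Unix epoch seconds), $2 = current time, $3 = max age in seconds.
handshake_is_fresh() {
  last="$1"; now="$2"; max_age="${3:-180}"
  # wg reports 0 when no handshake has ever completed, so treat 0 as stale
  [ "$last" -gt 0 ] && [ $(( now - last )) -le "$max_age" ]
}

# Example wiring (assumption: a single peer on wg0, so one output line):
#   last=$(ssh root@89.116.31.109 'wg show wg0 latest-handshakes' | awk '{print $2}')
#   handshake_is_fresh "$last" "$(date +%s)" 180 || echo "STOP: tunnel looks stale"
```

`wg show wg0 latest-handshakes` prints one "public-key, epoch" pair per peer, which is easier to script against than the human-readable `wg show` output.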

Backup (mandatory)

ssh root@89.116.31.109 << 'EOF'
TS=$(date +%s)
mkdir -p /root/cutover-backup-$TS

# Application code backup
for app in astrapbx pipecat-flow pipecat-flow-editor workflow-engine; do
  tar -czf /root/cutover-backup-$TS/$app.tar.gz \
    --exclude='node_modules' --exclude='.next' --exclude='.venv' \
    -C /opt $app
done

# DB dump
mysqldump -u root pbx_api_db > /root/cutover-backup-$TS/pbx_api_db.sql

# Asterisk configs
tar -czf /root/cutover-backup-$TS/asterisk-etc.tar.gz -C /etc asterisk

ls -lh /root/cutover-backup-$TS/
echo "Backup at: /root/cutover-backup-$TS"
EOF

Note the backup directory path — you'll need it for rollback.
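
A backup that cannot be read back is not a backup, so it may be worth checking the archives before moving on. The sketch below is an assumption-level helper (verify_backup is a hypothetical name) that lists each archive and checks that the SQL dump is non-empty. It does not prove the dump is restorable.

```shell
# verify_backup DIR APP...: fail if any archive is unreadable or the SQL dump is empty.
verify_backup() {
  dir="$1"; shift
  for app in "$@"; do
    # tar -tzf lists the archive; a missing, truncated or corrupt file exits non-zero
    tar -tzf "$dir/$app.tar.gz" >/dev/null 2>&1 || { echo "bad archive: $app"; return 1; }
  done
  # -s is true only when the file exists and is larger than zero bytes
  [ -s "$dir/pbx_api_db.sql" ] || { echo "empty or missing DB dump"; return 1; }
  echo "backup OK"
}

# Example, run on the prod VPS with the directory printed by the backup step:
#   verify_backup /root/cutover-backup-<TS> astrapbx pipecat-flow pipecat-flow-editor workflow-engine
```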

Step 1 — Apply the schema migration on prod

This MUST run before the merge so the new code finds the column when it boots.

ssh root@89.116.31.109 << 'EOF'
mysql -u root pbx_api_db <<SQL
ALTER TABLE did_numbers
  ADD COLUMN routing_environment ENUM('prod','staging','oss')
  NOT NULL DEFAULT 'prod';
SQL

# Verify
mysql -u root pbx_api_db -e "DESCRIBE did_numbers" | grep routing_env
EOF

Expected output:

routing_environment   enum('prod','staging','oss')   NO   prod

All existing rows default to prod → no behaviour change yet.
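
Because the merge must not happen without the column, the grep in the verify step can be turned into a pass/fail gate. This is a sketch only: has_column is a hypothetical helper, and it assumes the tab-separated batch output that mysql produces when its stdout is not a terminal.

```shell
# has_column NAME: reads `DESCRIBE <table>` output (tab-separated) on stdin and
# succeeds only if a field with exactly that name is present.
has_column() {
  awk -v col="$1" -F '\t' '$1 == col { found = 1 } END { exit !found }'
}

# Example gate before Step 2:
#   ssh root@89.116.31.109 'mysql -u root pbx_api_db -e "DESCRIBE did_numbers"' \
#     | has_column routing_environment || echo "STOP: apply the migration first"
```

Matching the whole first field avoids a false positive from any other column whose name merely starts with routing_env.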

Step 2 — Merge staging → main

After pre-flight + backup + schema migration are all GREEN:

cd /Users/hari/AstradialDevelopment/astradial-platform
gh pr create --base main --head staging \
  --title "Promote: monorepo cutover + did routing_environment + caller-ID + recording" \
  --body "First real prod deploy of the monorepo. Includes:
- Caller ID validation API + dialplan
- Recording defaults to ON
- CDR dedupe
- Admin impersonation + sidebar refresh + DID admin proxy
- Sidebar refresh fixes
- did_numbers.routing_environment column (schema already applied to prod)
- New /api/v1/admin/regenerate-gateway endpoint
- dev-deploy.sh + workflow docs

Schema migration was already applied to prod's pbx_api_db.
Backup at: /root/cutover-backup-<TS> on prod VPS."

# Confirm with the maintainer who owns prod, then:
gh pr merge --merge  # use merge-commit, NOT squash, to preserve history
# Do not pass --delete-branch: the head branch is staging, and the flag would delete it after the merge.

The 4 deploy-*-prod.yml workflows will auto-trigger.

Step 3 — Watch the deploys

# Open in another terminal
cd /Users/hari/AstradialDevelopment/astradial-platform
gh run list --branch main --limit 6  # no --workflow filter, so all four prod deploys are listed
gh run watch  # prompts for which run to follow; pass a run ID from the list to skip the picker

Expected sequence (each ~30-60s unless noted):

  • Deploy API to production ✅
  • Deploy Workflow to production ✅
  • Deploy Pipecat to production ✅
  • Deploy Editor to production ✅ (slowest, Next.js build ~3-5 min)

If any fails — see rollback below.
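
Step 4 must not start until all four deploys have succeeded. One way to check that without eyeballing the list is sketched below. all_green is a hypothetical helper, and the commented gh invocation assumes the four most recent runs on main are the four prod deploys.

```shell
# all_green: reads "workflow<TAB>status<TAB>conclusion" lines on stdin and succeeds
# only if there is at least one line and every run completed with conclusion "success".
all_green() {
  awk -F '\t' '
    { seen++ }
    $2 != "completed" || $3 != "success" { print "not green: " $1; bad = 1 }
    END { exit (bad || seen == 0) }
  '
}

# Example wiring (assumption: --limit 4 covers exactly the four deploy runs):
#   gh run list --branch main --limit 4 --json workflowName,status,conclusion \
#     --jq '.[] | [.workflowName, .status, .conclusion] | @tsv' | all_green
```

Empty input is treated as a failure so that a mistyped branch or filter cannot pass the gate by returning nothing.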

Step 4 — Smoke tests on prod

After ALL four deploys succeed:

# 1. Services up
ssh root@89.116.31.109 'pm2 status'  # all online, recent uptime

# 2. API responds + new endpoint exists
ssh root@89.116.31.109 'curl -s -X POST -H "X-Internal-Key: $(grep INTERNAL_API_KEY /opt/astrapbx/.env | cut -d= -f2)" http://localhost:8000/api/v1/admin/regenerate-gateway | head -c 200'
# Expect: {"success":true,"didCount":N,"orgCount":M}

# 3. Asterisk dispatcher regenerated correctly
ssh root@89.116.31.109 'grep -A 3 "AstraPrivate" /etc/asterisk/ext_tata_gateway.conf'
# Expect: 78001 + 78003 still local Goto (because they're routing_environment='prod' until Step 5)

# 4. Make a test inbound call to a non-affected DID (e.g., GrandEstancia 08065978002) — should still work as before

# 5. Logs are clean
ssh root@89.116.31.109 'pm2 logs astrapbx --lines 30 --nostream | grep -iE "error|warn"'
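
Two details of smoke test 2 are easy to get wrong by hand: extracting the key from .env and judging the JSON response. The helpers below are hypothetical names offered as a sketch. They assume an unquoted KEY=value line in .env and the response shape shown in the expected output above.

```shell
# env_value KEY: print the value of KEY from a dotenv-style file on stdin.
# Unlike `grep KEY | cut -d= -f2`, this keeps any "=" inside the value and
# ignores other keys that merely contain KEY as a substring.
env_value() {
  sed -n "s/^$1=//p" | head -n 1
}

# regen_ok: succeed only if the JSON response on stdin reports "success":true.
regen_ok() {
  grep -q '"success"[[:space:]]*:[[:space:]]*true'
}

# Example, run on the prod VPS:
#   key=$(env_value INTERNAL_API_KEY < /opt/astrapbx/.env)
#   curl -s -X POST -H "X-Internal-Key: $key" \
#     http://localhost:8000/api/v1/admin/regenerate-gateway | regen_ok || echo "regen failed"
```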

Step 5 — Move 78001 + 78003 to staging routing

ONLY after Step 4 smoke tests pass:

ssh root@89.116.31.109 << 'EOF'
mysql -u root pbx_api_db -e "
  UPDATE did_numbers
  SET routing_environment='staging'
  WHERE number IN ('+918065978001','+918065978003');
"

mysql -u root pbx_api_db -e "
  SELECT number, org_id, pool_status, routing_environment
  FROM did_numbers
  WHERE number LIKE '%978001%' OR number LIKE '%978003%';
"
EOF

Then regenerate the dispatcher:

ssh root@89.116.31.109 \
  'curl -s -X POST -H "X-Internal-Key: $(grep INTERNAL_API_KEY /opt/astrapbx/.env | cut -d= -f2)" http://localhost:8000/api/v1/admin/regenerate-gateway'

Verify:

ssh root@89.116.31.109 'grep -A 5 "AstraPrivate" /etc/asterisk/ext_tata_gateway.conf'

Expected:

; === AstraPrivate (org_mna9x47k__) ===
exten => 918065978001,1,NoOp(DID 918065978001 -> staging cloud)
exten => 918065978001,n,Dial(PJSIP/918065978001@cloud-endpoint-stage,120)
exten => 918065978001,n,Hangup()
exten => 918065978003,1,NoOp(DID 918065978003 -> staging cloud)
exten => 918065978003,n,Dial(PJSIP/918065978003@cloud-endpoint-stage,120)
exten => 918065978003,n,Hangup()
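
The comparison against the expected dialplan can also be done mechanically. The sketch below uses a hypothetical helper, routes_to_stage, and assumes the generator emits the Dial line in exactly the form shown above.

```shell
# routes_to_stage DID: reads dialplan text on stdin and succeeds only if DID has a
# Dial() line that targets the cloud-endpoint-stage PJSIP endpoint.
routes_to_stage() {
  # -F matches the string literally, so "(" and "@" need no escaping
  grep -Fq "exten => $1,n,Dial(PJSIP/$1@cloud-endpoint-stage,"
}

# Example:
#   for did in 918065978001 918065978003; do
#     ssh root@89.116.31.109 'cat /etc/asterisk/ext_tata_gateway.conf' \
#       | routes_to_stage "$did" || echo "DID $did is NOT forwarding to staging"
#   done
```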

Step 6 — Verify staging inbound works

From your mobile, dial +91 80659 78001. Expected:

  • macOS Telephone.app (registered as ext 1001 on stagesip.astradial.com) rings
  • You answer
  • Two-way audio works
  • Call appears in stagepbx call history

Rollback (if any step fails)

Rollback application code

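ssh root@89.116.31.109 << 'EOF'
TS_DIR=/root/cutover-backup-<TS>  # replace <TS> with the timestamp printed in the Backup step

for app in astrapbx pipecat-flow pipecat-flow-editor workflow-engine; do
  # Never delete the live tree unless the archive to restore from is actually there
  [ -f "$TS_DIR/$app.tar.gz" ] || { echo "Missing $TS_DIR/$app.tar.gz, aborting"; exit 1; }
  rm -rf "/opt/$app"
  mkdir -p "/opt/$app"
  tar -xzf "$TS_DIR/$app.tar.gz" -C /opt
  # The archives exclude node_modules, .next and .venv, so dependencies and build
  # output for $app must be reinstalled or rebuilt before the restart can succeed.
  pm2 restart "$app" --update-env
done

pm2 status
EOF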

Rollback schema (only if absolutely necessary — column is no-op for old code)

ssh root@89.116.31.109 'mysql -u root pbx_api_db -e "ALTER TABLE did_numbers DROP COLUMN routing_environment;"'

Rollback dispatcher (revert 78001/78003 to local routing)

ssh root@89.116.31.109 << 'EOF'
mysql -u root pbx_api_db -e "
  UPDATE did_numbers SET routing_environment='prod'
  WHERE number IN ('+918065978001','+918065978003');
"
curl -s -X POST -H "X-Internal-Key: $(grep INTERNAL_API_KEY /opt/astrapbx/.env | cut -d= -f2)" \
  http://localhost:8000/api/v1/admin/regenerate-gateway
EOF

Rollback the merge (last resort)

cd /Users/hari/AstradialDevelopment/astradial-platform
git revert -m 1 <merge-commit-sha>
git push origin main
# This re-fires the prod deploys with the OLD code.

Post-cutover verification

After 24-48h of stable operation:

  • [ ] No new errors in pm2 logs astrapbx --lines 200 --nostream | grep -iE "error|warn" (without --nostream the command keeps tailing and never returns)
  • [ ] No customer-reported call failures
  • [ ] Outbound recording files showing up in /var/spool/asterisk/monitor/
  • [ ] CDR shows no duplicate rows per linkedid
  • [ ] Inbound to 78001/78003 still routing to staging
  • [ ] Inbound to non-staging prod DIDs (e.g. 08065978002 GrandEstancia) still routing locally
  • [ ] Backup /root/cutover-backup-<TS> retained for at least 14 days
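
The CDR item in the checklist can be checked with a small filter. This is a sketch under loud assumptions: dup_linkedids is a hypothetical helper, and the commented query assumes a table named cdr in pbx_api_db with linkedid and calldate columns, which the runbook does not confirm.

```shell
# dup_linkedids: reads one linkedid per line on stdin, prints each value that
# occurs more than once, and returns non-zero if any duplicate was found.
dup_linkedids() {
  dups=$(sort | uniq -d)
  [ -z "$dups" ] && return 0
  printf '%s\n' "$dups"
  return 1
}

# Example (table and column names are assumptions, adjust to the real schema):
#   ssh root@89.116.31.109 'mysql -N -u root pbx_api_db -e \
#     "SELECT linkedid FROM cdr WHERE calldate > NOW() - INTERVAL 2 DAY"' | dup_linkedids
```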

Future work after cutover stabilises

  • PR #9: admin UI dropdown on /admin/dids for routing_environment toggle (no more SQL)
  • PR #10: org Phone Numbers UI for default DID selection (per design discussed)
  • PR #11: per-user outbound DID assignment + softphone-path dialplan branches