PROD Monorepo Cutover + DID routing_environment Rollout
Status: ✅ COMPLETED 2026-04-17
Owner: Hari
Backup retained at: /root/cutover-backup-1776420747 on prod VPS (keep until 2026-05-01)
Outcome: Clean cutover. Zero customer downtime. All four prod services (api, editor, workflow-engine, pipecat-flow) migrated to monorepo deploys. Staging forwarding via WireGuard live for AstraPrivate DIDs.
Rollback time estimate if needed: ~2 min (revert /opt/
The runbook below documents the plan that was executed. It is kept as the reference template for the next platform-level cutover.
It covers the first real deploy through the monorepo CI/CD on production, together with the schema change required by the DID routing-environment feature merged in PR #8.
Why this is high-stakes
- First real monorepo deploy on prod. Per-repo CI/CD on prod was disabled when we set up the monorepo runner. This will be the first push that actually rsyncs the new monorepo into /opt/<app> on prod.
- Schema migration required. PR #8's generator queries did_numbers.routing_environment. Without the ALTER TABLE, every gateway-routing regeneration on prod will throw Unknown column.
- Auto-trigger on merge. All four deploy-*-prod.yml workflows fire on push: branches: [main] with paths: ['<app>/**']. The moment the merge lands, the prod runner picks it up and rsyncs.
Therefore the schema must be in place BEFORE the merge, and we take a backup before merging.
Pre-flight checks (run before anything destructive)
# 1. Prod monorepo runner is online + listening
ssh root@89.116.31.109 'systemctl status actions.runner.astradial-astradial-platform.* --no-pager | head -5'
# 2. Old per-repo runners are NOT running (would also pick up workflows)
ssh root@89.116.31.109 'systemctl list-units --type=service --state=running | grep actions.runner'
# Expect: only the astradial-platform runner; if old per-repo runners are running, stop them first.
# 3. Prod self-hosted runner has 'production' label
# Check via: GitHub repo → Settings → Actions → Runners → astradial-platform runner → labels include "production"
# 4. Prod Asterisk + AMI healthy
ssh root@89.116.31.109 'pm2 status | grep -E "astrapbx|workflow-engine|editor|pipecat"'
# All must be online
# 5. cloud-endpoint-stage PJSIP endpoint exists on prod (so staging-flagged DIDs can forward)
ssh root@89.116.31.109 'asterisk -rx "pjsip show endpoint cloud-endpoint-stage"'
# Must show Status: Avail (qualify reachable)
# 6. WireGuard tunnel prod ↔ staging is up
ssh root@89.116.31.109 'wg show wg0 | grep -A 3 "10.10.10.3"'
# Last handshake must be recent (within minutes)
If any pre-flight fails — STOP. Don't merge.
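Check 6 asks for a "recent" handshake, which is easy to misjudge by eye. The sketch below shows one way to turn it into a pass/fail test. The function name handshake_is_recent and the 180-second default threshold are assumptions for illustration, not values taken from this runbook.

```shell
# Sketch: decide whether a WireGuard handshake counts as "recent".
# The 180 s default is an assumed threshold, not a value from the runbook.

# handshake_is_recent <handshake_epoch> <now_epoch> [max_age_seconds]
# Returns 0 when the handshake happened within max_age seconds of now.
handshake_is_recent() {
  local handshake=$1 now=$2 max_age=${3:-180}
  # wg reports 0 for a peer that has never completed a handshake.
  [ "$handshake" -gt 0 ] && [ $(( now - handshake )) -le "$max_age" ]
}

# Possible usage on the prod host. "wg show wg0 latest-handshakes" prints
# one "<public-key><TAB><epoch>" line per peer:
#   wg show wg0 latest-handshakes | while read -r peer ts; do
#     handshake_is_recent "$ts" "$(date +%s)" || echo "STALE: $peer"
#   done
```

Any STALE line would be a reason to stop before merging, in the same way as a failed manual check.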
Backup (mandatory)
ssh root@89.116.31.109 << 'EOF'
TS=$(date +%s)
mkdir -p /root/cutover-backup-$TS
# Application code backup
for app in astrapbx pipecat-flow pipecat-flow-editor workflow-engine; do
  tar -czf /root/cutover-backup-$TS/$app.tar.gz \
    --exclude='node_modules' --exclude='.next' --exclude='.venv' \
    -C /opt $app
done
# DB dump
mysqldump -u root pbx_api_db > /root/cutover-backup-$TS/pbx_api_db.sql
# Asterisk configs
tar -czf /root/cutover-backup-$TS/asterisk-etc.tar.gz -C /etc asterisk
ls -lh /root/cutover-backup-$TS/
echo "Backup at: /root/cutover-backup-$TS"
EOF
Note the backup directory path — you'll need it for rollback.
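Because rollback depends entirely on these files, it may be worth confirming they are usable before moving on. The following is a minimal sketch that checks each expected file is present, non-empty and (for tarballs) listable. The function name verify_backup is hypothetical. The file names are the ones produced by the backup script above.

```shell
# verify_backup <backup_dir>
# Returns non-zero if any expected backup file is missing, empty,
# or (for .tar.gz archives) cannot be listed by tar.
verify_backup() {
  local dir=$1 f status=0
  for f in astrapbx.tar.gz pipecat-flow.tar.gz pipecat-flow-editor.tar.gz \
           workflow-engine.tar.gz asterisk-etc.tar.gz pbx_api_db.sql; do
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING or empty: $f"; status=1
    elif [ "${f##*.}" = gz ] && ! tar -tzf "$dir/$f" >/dev/null 2>&1; then
      echo "CORRUPT archive: $f"; status=1
    fi
  done
  return "$status"
}

# Possible usage on the prod host:
#   verify_backup /root/cutover-backup-<TS> && echo "backup looks usable"
```

This only proves the archives are readable. It does not prove the SQL dump is complete.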
Step 1: Apply the schema migration on prod
This MUST run before the merge so the new code finds the column when it boots.
ssh root@89.116.31.109 << 'EOF'
mysql -u root pbx_api_db <<SQL
ALTER TABLE did_numbers
ADD COLUMN routing_environment ENUM('prod','staging','oss')
NOT NULL DEFAULT 'prod';
SQL
# Verify
mysql -u root pbx_api_db -e "DESCRIBE did_numbers" | grep routing_env
EOF
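Expected output: a single row for routing_environment with type enum('prod','staging','oss'), Null = NO and Default = prod.
All existing rows default to prod → no behaviour change yet.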
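The ALTER TABLE above fails if the column already exists, for example when this step is re-run after an aborted attempt. The sketch below shows one way to check for the column first. The helper name has_column is hypothetical, and it assumes the tab-separated batch output that the mysql client produces when its output is piped.

```shell
# has_column <column_name>
# Reads `mysql -e "DESCRIBE <table>"` batch output (tab-separated) on stdin
# and succeeds when the first field of any row equals the column name exactly.
has_column() {
  awk -F'\t' -v col="$1" '$1 == col { found = 1 } END { exit !found }'
}

# Possible usage on the prod host:
#   if mysql -u root pbx_api_db -e "DESCRIBE did_numbers" | has_column routing_environment; then
#     echo "routing_environment already present, skipping ALTER TABLE"
#   fi
```

Matching on the whole first field avoids a false positive from a similarly named column, which the plain grep routing_env in the verify step would not distinguish.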
Step 2: Merge staging → main
After pre-flight + backup + schema migration are all GREEN:
cd /Users/hari/AstradialDevelopment/astradial-platform
gh pr create --base main --head staging \
--title "Promote: monorepo cutover + did routing_environment + caller-ID + recording" \
--body "First real prod deploy of the monorepo. Includes:
- Caller ID validation API + dialplan
- Recording defaults to ON
- CDR dedupe
- Admin impersonation + sidebar refresh + DID admin proxy
- Sidebar refresh fixes
- did_numbers.routing_environment column (schema already applied to prod)
- New /api/v1/admin/regenerate-gateway endpoint
- dev-deploy.sh + workflow docs
Schema migration was already applied to prod's pbx_api_db.
Backup at: /root/cutover-backup-<TS> on prod VPS."
# Confirm with the maintainer who owns prod, then:
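# Use a merge commit, NOT squash, to preserve history.
# Note: --delete-branch deletes the head branch (staging) after the merge; drop the flag if staging must survive.
gh pr merge --merge --delete-branch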
The 4 deploy-*-prod.yml workflows will auto-trigger.
Step 3: Watch the deploys
# Open in another terminal
cd /Users/hari/AstradialDevelopment/astradial-platform
gh run list --branch main --limit 6 --workflow='Deploy API to production'
gh run watch # with no run ID, prompts you to pick a run to watch
Expected sequence (each ~30-60s):
- Deploy API to production ✅
- Deploy Workflow to production ✅
- Deploy Pipecat to production ✅
- Deploy Editor to production ✅ (slowest, Next.js build ~3-5 min)
If any fails — see rollback below.
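Watching four runs by hand makes it easy to miss one. A minimal sketch of a scripted check follows. The function name all_deploys_green is hypothetical, and the input format (one "<workflow name><TAB><conclusion>" line per run, containing only the runs for the merge commit) is an assumption. The workflow names are the four listed in the expected sequence.

```shell
# all_deploys_green
# Reads "<workflow name><TAB><conclusion>" lines on stdin and succeeds only
# if each of the four expected workflows has a line with conclusion "success".
all_deploys_green() {
  local input name status=0
  input=$(cat)
  for name in "Deploy API to production" "Deploy Workflow to production" \
              "Deploy Pipecat to production" "Deploy Editor to production"; do
    # -x matches the whole line, -F treats the pattern as a fixed string.
    if ! printf '%s\n' "$input" | grep -xF "$(printf '%s\tsuccess' "$name")" >/dev/null; then
      echo "NOT GREEN: $name"; status=1
    fi
  done
  return "$status"
}

# Possible usage, assuming a gh version that supports --commit, --json and --jq:
#   gh run list --commit <merge-commit-sha> --json workflowName,conclusion \
#     --jq '.[] | [.workflowName, .conclusion] | @tsv' | all_deploys_green
```

A run that is still in progress has an empty conclusion, so it is reported as not green until it finishes.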
Step 4: Smoke tests on prod
After ALL four deploys succeed:
# 1. Services up
ssh root@89.116.31.109 'pm2 status' # all online, recent uptime
# 2. API responds + new endpoint exists
ssh root@89.116.31.109 'curl -s -X POST -H "X-Internal-Key: $(grep INTERNAL_API_KEY /opt/astrapbx/.env | cut -d= -f2)" http://localhost:8000/api/v1/admin/regenerate-gateway | head -c 200'
# Expect: {"success":true,"didCount":N,"orgCount":M}
# 3. Asterisk dispatcher regenerated correctly
ssh root@89.116.31.109 'grep -A 3 "AstraPrivate" /etc/asterisk/ext_tata_gateway.conf'
# Expect: 78001 + 78003 still local Goto (because they're routing_environment='prod' until Step 5)
# 4. Make a test inbound call to a non-affected DID (e.g., GrandEstancia 08065978002) — should still work as before
# 5. Logs are clean
ssh root@89.116.31.109 'pm2 logs astrapbx --lines 30 --nostream | grep -iE "error|warn"'
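Check 2 relies on reading the JSON response by eye. As a rough sketch, the two helpers below turn it into an exit status and pull out the DID count. The names regen_ok and regen_did_count are hypothetical, and the parsing assumes the compact response shape shown in the expected output of check 2.

```shell
# regen_ok: reads the regenerate-gateway response body on stdin and
# succeeds only if it contains "success":true.
regen_ok() {
  grep '"success":[[:space:]]*true' >/dev/null
}

# regen_did_count: reads the same body on stdin and prints the didCount
# value, or nothing if the field is absent.
regen_did_count() {
  sed -n 's/.*"didCount":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Possible usage on the prod host, with the curl command from check 2
# stored in a hypothetical variable "body":
#   printf '%s' "$body" | regen_ok || echo "regenerate-gateway did not report success"
```

A text match like this is deliberately crude. A JSON-aware tool would be more robust if one is installed on the host.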
Step 5: Move 78001 + 78003 to staging routing
ONLY after Step 4 smoke tests pass:
ssh root@89.116.31.109 << 'EOF'
mysql -u root pbx_api_db -e "
UPDATE did_numbers
SET routing_environment='staging'
WHERE number IN ('+918065978001','+918065978003');
"
mysql -u root pbx_api_db -e "
SELECT number, org_id, pool_status, routing_environment
FROM did_numbers
WHERE number LIKE '%978001%' OR number LIKE '%978003%';
"
EOF
Then regenerate the dispatcher:
ssh root@89.116.31.109 \
'curl -s -X POST -H "X-Internal-Key: $(grep INTERNAL_API_KEY /opt/astrapbx/.env | cut -d= -f2)" http://localhost:8000/api/v1/admin/regenerate-gateway'
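Verify by grepping the AstraPrivate block in /etc/asterisk/ext_tata_gateway.conf, as in Step 4 (use -A 6 instead of -A 3 to see both DIDs).
Expected: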
; === AstraPrivate (org_mna9x47k__) ===
exten => 918065978001,1,NoOp(DID 918065978001 -> staging cloud)
exten => 918065978001,n,Dial(PJSIP/918065978001@cloud-endpoint-stage,120)
exten => 918065978001,n,Hangup()
exten => 918065978003,1,NoOp(DID 918065978003 -> staging cloud)
exten => 918065978003,n,Dial(PJSIP/918065978003@cloud-endpoint-stage,120)
exten => 918065978003,n,Hangup()
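The expected dialplan above can also be asserted per DID rather than compared by eye. This is a minimal sketch. The function name did_routes_to_staging is hypothetical, and it matches only the Dial line format shown in the expected output.

```shell
# did_routes_to_staging <did> <conf_file>
# Succeeds when the dialplan file contains a Dial line that sends the DID
# to the cloud-endpoint-stage PJSIP endpoint.
did_routes_to_staging() {
  local did=$1 conf=$2
  grep -F "exten => $did,n,Dial(PJSIP/$did@cloud-endpoint-stage," "$conf" >/dev/null
}

# Possible usage on the prod host:
#   for did in 918065978001 918065978003; do
#     did_routes_to_staging "$did" /etc/asterisk/ext_tata_gateway.conf \
#       || echo "NOT routed to staging: $did"
#   done
```

The same function, negated, can confirm that a DID has returned to local routing after a dispatcher rollback.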
Step 6: Verify staging inbound works
From your mobile dial +91 80659 78001. Expected:
- macOS Telephone.app (registered as ext 1001 on stagesip.astradial.com) rings
- You answer
- Two-way audio works
- Call appears in stagepbx call history
Rollback (if any step fails)
Rollback application code
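ssh root@89.116.31.109 << 'EOF'
TS_DIR=/root/cutover-backup-<TS> # the backup dir from the Backup step
# Note: the backup tarballs exclude node_modules, .next and .venv, so
# dependencies and build output must be restored after extraction before
# the restarted services can come up.
for app in astrapbx pipecat-flow pipecat-flow-editor workflow-engine; do
  rm -rf /opt/$app
  mkdir -p /opt/$app
  tar -xzf $TS_DIR/$app.tar.gz -C /opt
  pm2 restart $app --update-env
done
pm2 status
EOF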
Rollback schema (only if absolutely necessary; the column is a no-op for old code)
ssh root@89.116.31.109 'mysql -u root pbx_api_db -e "ALTER TABLE did_numbers DROP COLUMN routing_environment;"'
Rollback dispatcher (revert 78001/78003 to local routing)
ssh root@89.116.31.109 << 'EOF'
mysql -u root pbx_api_db -e "
UPDATE did_numbers SET routing_environment='prod'
WHERE number IN ('+918065978001','+918065978003');
"
curl -s -X POST -H "X-Internal-Key: $(grep INTERNAL_API_KEY /opt/astrapbx/.env | cut -d= -f2)" \
http://localhost:8000/api/v1/admin/regenerate-gateway
EOF
Rollback the merge (last resort)
cd /Users/hari/AstradialDevelopment/astradial-platform
git revert -m 1 <merge-commit-sha>
git push origin main
# This re-fires the prod deploys with the OLD code.
Post-cutover verification
After 24-48h of stable operation:
- [ ] No new errors in pm2 logs astrapbx --lines 200 | grep -iE "error|warn"
- [ ] No customer-reported call failures
- [ ] Outbound recording files showing up in /var/spool/asterisk/monitor/
- [ ] CDR shows no duplicate rows per linkedid
- [ ] Inbound to 78001/78003 still routing to staging
- [ ] Inbound to non-staging prod DIDs (e.g. 08065978002 GrandEstancia) still routing locally
- [ ] Backup /root/cutover-backup-<TS> retained for at least 14 days
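The backup directory name embeds the Unix timestamp taken at backup time, so the earliest safe deletion date can be derived from it. Here is a small sketch, assuming GNU date (as found on a typical Linux VPS). The function name retention_date is hypothetical.

```shell
# retention_date <backup_epoch> [days]
# Prints the UTC date (YYYY-MM-DD) that falls <days> days after the backup
# timestamp; days defaults to the 14-day retention from the checklist.
# Assumes GNU date, which accepts "-d @<epoch>".
retention_date() {
  local ts=$1 days=${2:-14}
  date -u -d "@$(( ts + days * 86400 ))" +%F
}

# Example with the timestamp from /root/cutover-backup-1776420747:
#   retention_date 1776420747    # prints 2026-05-01
```

For the backup taken during this cutover the result matches the "keep until 2026-05-01" date recorded in the status block.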
Future work after cutover stabilises
- PR #9: admin UI dropdown on /admin/dids for routing_environment toggle (no more SQL)
- PR #10: org Phone Numbers UI for default DID selection (per design discussed)
- PR #11: per-user outbound DID assignment + softphone-path dialplan branches