Incident: Apr 14 - Monorepo Migration & Prod Outbound Breakage
Timeline
| Time (IST) | Event |
|---|---|
| 22:00 | Started monorepo CI/CD setup for astradial-platform |
| 22:15 | API staging deploy ran npm ci --omit=dev — missing firebase-admin broke Firebase auth |
| 22:20 | rsync --delete removed firebase-sa-key.json from staging API |
| 22:30 | Editor deployed with stale code (monorepo had 40-commit-old editor copy) |
| 22:45 | handleUnauthorized kicked admin out on 401 during API restart |
| 23:00 | Fixed editor code, Firebase SA key, firebase-admin dep |
| 23:15 | Attempted staging outbound calls — failed, no path to NUC |
| 23:30 | MISTAKE: Restarted prod Asterisk to add staging trunk endpoint |
| 23:35 | Prod outbound calling broke — tata_gateway Dial returns busy/congested |
| 23:45 | Reverted prod pjsip config, restarted again — still broken |
| 23:50 | Discovered AMI disconnected after restart — reloaded API, AMI reconnected |
| 00:00 | Prod outbound restored via AMI reconnect |
| 00:15 | Staged outbound through prod from-cloud context — NUC returned 403 (bad CallerID) |
| 00:30 | Multiple NUC dialplan changes trying different number formats |
| 00:45 | Discovered NUC tata-endpoint returns CHANUNAVAIL after restart |
| 01:00 | Root cause: NUC Asterisk 22 needs qualify_frequency>0 and max_contacts=1 on tata-aor for outbound to work after restart |
| 01:10 | Set up direct WireGuard routing: staging → prod (forwarding) → NUC |
| 01:25 | Fixed NUC from-cloud: CallerID +918065978001, dial format 0${EXTEN} |
| 01:30 | Staging outbound working |
What Went Wrong
1. Monorepo editor was 40 commits behind
The local copy of the astradial-editor repo was 40 commits behind the remote. The monorepo was built from that local working directory rather than from a fresh git pull origin main, so the deployed editor was missing CRM, DID marketplace, admin pages, mobile-responsive layout, and RBAC.
Lesson
Always git pull before copying code into another repo. Verify with git log that the local branch matches remote.
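The check in this lesson can be automated before any copy step. The following is a minimal sketch, assuming the remote is named origin and the branch is main; the function name assert_synced_with_remote is hypothetical.

```shell
# Hypothetical pre-copy guard: fail unless local HEAD equals origin/<branch>.
# Run from inside the repository that is about to be copied.
assert_synced_with_remote() {
    branch=${1:-main}
    # Refresh remote-tracking refs so origin/<branch> is current.
    git fetch origin || return 1
    local_sha=$(git rev-parse HEAD)
    remote_sha=$(git rev-parse "origin/$branch")
    if [ "$local_sha" != "$remote_sha" ]; then
        # Count commits reachable from the remote branch but not from HEAD.
        behind=$(git rev-list --count "HEAD..origin/$branch")
        echo "local HEAD is $behind commit(s) behind origin/$branch" >&2
        return 1
    fi
}
```

Calling `assert_synced_with_remote main || exit 1` at the top of a copy script would stop the copy whenever the working directory does not match the remote branch.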
2. firebase-admin not in package.json
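The firebase-admin npm package was installed manually on prod/staging but never added to package.json. When the monorepo CI/CD ran npm ci --omit=dev, which installs only what package.json and the lockfile declare, firebase-admin was not installed and Firebase auth broke.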
Lesson
Every production dependency must be in package.json. Run npm ls firebase-admin to verify before deploying.
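A declared-dependency check can also run in CI before the install step. This is a rough sketch, not a JSON parser: it assumes package.json uses npm's standard multi-line formatting, and the function name dep_declared is hypothetical.

```shell
# dep_declared PKG [FILE]: succeed only if PKG is a key inside the
# "dependencies" object of FILE (default: ./package.json).
# Assumption: npm-style formatting, one key per line, "key": "version".
dep_declared() {
    pkg=$1
    file=${2:-package.json}
    # Print from the "dependencies" line to the next closing brace,
    # then look for the quoted package name followed by a colon.
    sed -n '/"dependencies"[[:space:]]*:/,/}/p' "$file" | grep -qF "\"$pkg\":"
}
```

For example, `dep_declared firebase-admin || { echo "firebase-admin missing from dependencies"; exit 1; }` fails the pipeline when the package is only listed under devDependencies or not listed at all.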
3. rsync --delete removed server-side files
The CI/CD rsync -a --delete removed firebase-sa-key.json and .env files that only existed on the server.
Fix applied
All deploy workflows now exclude: .env, .env.local, firebase-sa-key.json, recordings/
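The exclusions can be expressed directly as rsync options. A minimal sketch follows; the function name deploy_sync and its source/destination arguments are hypothetical, and the actual workflow files may pass the patterns differently.

```shell
# Hypothetical deploy step: mirror SRC to DEST but leave server-only files alone.
# With --delete, rsync does not remove excluded files on the receiving side
# unless --delete-excluded is also given.
deploy_sync() {
    rsync -a --delete \
        --exclude='.env' \
        --exclude='.env.local' \
        --exclude='firebase-sa-key.json' \
        --exclude='recordings/' \
        "$1" "$2"
}
```

Running the same command with `--dry-run` added first shows what would be deleted without touching the server.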
4. Prod Asterisk restart broke outbound
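Restarting Asterisk on prod caused the tata_gateway trunk to return busy/congested on all outbound Dial attempts. The PJSIP endpoint with a static contact appeared to lose its ability to create outbound channels. Reverting the pjsip config and restarting again did not help; outbound recovered only after the API was reloaded and its AMI connection re-established.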
NEVER DO THIS
Never restart prod Asterisk without explicit instruction. The AMI connection breaks, PJSIP endpoints may lose channel capability, and all active calls drop.
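If a restart does happen, the timeline shows that the API's AMI session did not come back until the API was reloaded. A minimal sketch of a post-restart check follows. It assumes the AMI username appears in the first column of the manager show connected output; the username astradial_api and the function name ami_connected are hypothetical.

```shell
# ami_connected USER: succeed if USER has a live AMI session on this Asterisk.
# Assumption: "manager show connected" lists one session per row with the
# username in the first column.
ami_connected() {
    asterisk -rx "manager show connected" |
        awk -v user="$1" '$1 == user { found = 1 } END { exit !found }'
}
```

A failing `ami_connected astradial_api` after any Asterisk restart would be the cue to reload the API before debugging the trunk itself.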
5. NUC tata-endpoint CHANUNAVAIL after restart
On Asterisk 22, a PJSIP AOR with qualify_frequency=0 and max_contacts=0 results in CHANUNAVAIL for outbound Dial after restart. The contact exists but is NonQual, and Asterisk 22 requires Available status.
Fix
Set qualify_frequency=30 and max_contacts=1 on tata-aor in NUC pjsip.conf.
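To confirm the running configuration matches this fix, the AOR parameters can be read back from the Asterisk CLI. The sketch below assumes the "name : value" parameter listing printed by pjsip show aor; the function names aor_param and aor_outbound_ready are hypothetical.

```shell
# aor_param AOR NAME: print the value of one AOR parameter as Asterisk sees it.
# Assumption: "pjsip show aor <name>" prints rows of the form "name : value".
aor_param() {
    asterisk -rx "pjsip show aor $1" |
        awk -v key="$2" '$1 == key && $2 == ":" { print $3 }'
}

# aor_outbound_ready AOR: succeed if the AOR has the settings this report
# found necessary (qualify_frequency > 0 and max_contacts = 1).
aor_outbound_ready() {
    qf=$(aor_param "$1" qualify_frequency)
    mc=$(aor_param "$1" max_contacts)
    [ "${qf:-0}" -gt 0 ] && [ "${mc:-0}" -eq 1 ]
}
```

On the NUC, `aor_outbound_ready tata-aor` returning non-zero after a restart would point back at this misconfiguration rather than at the dialplan.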
6. NUC from-cloud wrong number format
The original from-cloud context used Dial(PJSIP/0${EXTEN}@tata-endpoint) and Set(CALLERID(num)=+91${CALLERID(num):1}). This produced:
- CallerID: +91918065978001 (double country code; the original CallerID already had +91)
- Dial: 09944421125 (worked before the qualify fix but broke after)
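The doubled country code follows from the substring expression: ${CALLERID(num):1} drops only the leading character (the +), and +91 is then prepended to a number that still starts with 91. The snippet below reproduces that arithmetic in shell as an illustration only; it is not dialplan code, and the function names are hypothetical.

```shell
# Shell emulation of the dialplan expression +91${CALLERID(num):1}:
# strip the first character, then prepend +91.
old_callerid() {
    printf '+91%s\n' "${1#?}"
}

# Shell emulation of the dial string 0${EXTEN}.
dial_string() {
    printf '0%s\n' "$1"
}

# old_callerid +918065978001   prints +91918065978001
# dial_string 9944421125       prints 09944421125
```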
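Working format
CallerID fixed to +918065978001 and dial string 0${EXTEN}, as in the [from-cloud] context listed under NUC Asterisk changes.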
What Was Fixed
Staging outbound call path
Staging Asterisk (94.136.188.221)
→ tata_gateway endpoint → sip:10.10.10.2:5060
→ WireGuard wg0 (10.10.10.3)
→ Prod VPS (10.10.10.1) — IP forwarding only, no Asterisk involvement
→ WireGuard → NUC (10.10.10.2)
→ NUC Asterisk → from-cloud context
→ Dial(PJSIP/0${EXTEN}@tata-endpoint)
→ Tata SBC (10.79.215.102) → PSTN
WireGuard changes
Prod (89.116.31.109):
- Enabled net.ipv4.ip_forward=1 in /etc/sysctl.conf
- Added iptables FORWARD rules for wg0: 10.10.10.3 ↔ 10.10.10.2
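The report does not list the exact rules that were added on prod. The following is one plausible sketch, assuming both peers arrive on wg0 and that plain ACCEPT rules appended to the FORWARD chain are sufficient; the function name enable_wg_forwarding is hypothetical.

```shell
# Hypothetical reconstruction of the prod-side forwarding setup (run as root).
enable_wg_forwarding() {
    # Runtime switch; the persistent setting lives in /etc/sysctl.conf.
    sysctl -w net.ipv4.ip_forward=1 || return 1
    # Allow staging (10.10.10.3) and NUC (10.10.10.2) to reach each other via wg0.
    iptables -A FORWARD -i wg0 -o wg0 -s 10.10.10.3 -d 10.10.10.2 -j ACCEPT || return 1
    iptables -A FORWARD -i wg0 -o wg0 -s 10.10.10.2 -d 10.10.10.3 -j ACCEPT
}
```

Rules added this way do not survive a reboot on their own, so they would also need to be saved by whatever mechanism the host uses to persist iptables state.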
Staging (94.136.188.221):
- Added 10.10.10.2/32 to AllowedIPs for prod peer in /etc/wireguard/wg0.conf
NUC (nuc.astradial.com):
- Added 10.10.10.3/32 to AllowedIPs for prod peer in /etc/wireguard/wg0.conf
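AllowedIPs decides both which destinations WireGuard will send to a peer and which source addresses it will accept from that peer, so each end needs the other end's address listed on its prod peer entry. A minimal sketch for verifying the live state follows, assuming the interface is wg0; the function name peer_allows is hypothetical.

```shell
# peer_allows CIDR [IFACE]: succeed if any peer on IFACE (default wg0)
# currently has CIDR in its allowed-ips. Needs root to query the interface.
# "wg show <iface> allowed-ips" prints one line per peer:
#   <public key><TAB><cidr> <cidr> ...
peer_allows() {
    wg show "${2:-wg0}" allowed-ips |
        awk -v want="$1" '{ for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
                          END { exit !found }'
}
```

On staging the check would be `peer_allows 10.10.10.2/32`, and on the NUC `peer_allows 10.10.10.3/32`.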
NUC Asterisk changes
pjsip.conf:
- cloud-identify: Added match=10.10.10.3 (accept staging as cloud endpoint)
- tata-aor: Added qualify_frequency=30 and max_contacts=1
extensions.conf — [from-cloud]:
[from-cloud]
exten => _X.,1,NoOp(Cloud Outbound via Tata: ${EXTEN} CID: ${CALLERID(all)})
same => n,Set(CALLERID(all)=+918065978001)
same => n,Dial(PJSIP/0${EXTEN}@tata-endpoint,60)
same => n,NoOp(Tata dial status: ${DIALSTATUS})
same => n,Hangup()
Staging Asterisk changes
pjsip_tata_gateway.conf:
- Changed contact=sip:10.10.10.2:5060 (direct to NUC; was 10.10.10.1, pointing to prod)
- Changed match=10.10.10.2 (identify NUC directly)
Recovery Checklist
If staging outbound breaks again:
- Check NUC Tata contact status:
- Check WireGuard connectivity:
- Check prod IP forwarding:
- Test NUC outbound directly:
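The first three checks could be scripted roughly as follows. This is a sketch under stated assumptions, not the commands originally used: it assumes the usual column layout of pjsip show contacts (Contact:, Aor/ContactUri, Hash, Status, RTT), and all function names are hypothetical. The fourth check is left out because it places a real call.

```shell
# Hypothetical helpers; each check runs on the host named in its comment.

# On the NUC: succeed only if a tata-aor contact reports status "Avail".
check_tata_contact() {
    asterisk -rx "pjsip show contacts" |
        awk '$1 == "Contact:" && index($2, "tata-aor/") == 1 && $4 == "Avail" { ok = 1 }
             END { exit !ok }'
}

# On staging: is the NUC's WireGuard address reachable through prod?
check_wg_path() {
    ping -c 3 -W 2 10.10.10.2 >/dev/null
}

# On prod: is kernel IP forwarding still enabled?
check_ip_forward() {
    [ "$(sysctl -n net.ipv4.ip_forward)" = "1" ]
}
```

Working through them in this order narrows the failure to the NUC trunk, the tunnel, or prod forwarding before anyone touches an Asterisk config.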
Backups
All configs backed up at /root/asterisk-backup-20260415/ on:
- Prod (89.116.31.109)
- Staging (94.136.188.221)
- NUC (nuc.astradial.com)