Incident: Apr 14 — Monorepo Migration & Prod Outbound Breakage

Timeline

Time (IST) Event
22:00 Started monorepo CI/CD setup for astradial-platform
22:15 API staging deploy ran npm ci --omit=dev — missing firebase-admin broke Firebase auth
22:20 rsync --delete removed firebase-sa-key.json from staging API
22:30 Editor deployed with stale code (monorepo had 40-commit-old editor copy)
22:45 handleUnauthorized kicked admin out on 401 during API restart
23:00 Fixed editor code, Firebase SA key, firebase-admin dep
23:15 Attempted staging outbound calls — failed, no path to NUC
23:30 MISTAKE: Restarted prod Asterisk to add staging trunk endpoint
23:35 Prod outbound calling broke — tata_gateway Dial returns busy/congested
23:45 Reverted prod pjsip config, restarted again — still broken
23:50 Discovered AMI disconnected after restart — reloaded API, AMI reconnected
00:00 Prod outbound restored via AMI reconnect
00:15 Routed staging outbound through prod from-cloud context; NUC returned 403 (bad CallerID)
00:30 Multiple NUC dialplan changes trying different number formats
00:45 Discovered NUC tata-endpoint returns CHANUNAVAIL after restart
01:00 Root cause: NUC Asterisk 22 needs qualify_frequency>0 and max_contacts=1 on tata-aor for outbound to work after restart
01:10 Set up direct WireGuard routing: staging → prod (forwarding) → NUC
01:25 Fixed NUC from-cloud: CallerID +918065978001, dial format 0${EXTEN}
01:30 Staging outbound working

What Went Wrong

1. Monorepo editor was 40 commits behind

The local copy of the astradial-editor repo was at an old commit, and the monorepo was built from that local working directory rather than from a fresh git pull origin main. Missing features: CRM, DID marketplace, admin pages, mobile-responsive layout, RBAC.

Lesson

Always git pull before copying code into another repo. Verify with git log that the local branch matches remote.
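
A minimal shell sketch of that check, assuming the remote is named origin and the branch is main as above. The helper name commits_behind and the checkout path in the usage comment are illustrative, not existing tooling.

```shell
#!/usr/bin/env bash
# Print how many commits the checkout in $1 is behind remote branch $2.
# Assumes the remote is called "origin"; commits_behind is a hypothetical helper.
commits_behind() {
  local repo_dir=$1 branch=${2:-main}
  # Fetch first so the comparison uses the remote's current tip (stored in FETCH_HEAD).
  git -C "$repo_dir" fetch --quiet origin "$branch" || return 2
  # Count commits reachable from the fetched tip but not from local HEAD.
  git -C "$repo_dir" rev-list --count HEAD..FETCH_HEAD
}

# Usage (hypothetical path): refuse to copy from a stale checkout.
#   behind=$(commits_behind ~/src/astradial-editor main) || exit 1
#   [ "$behind" -eq 0 ] || { echo "editor is $behind commits behind" >&2; exit 1; }
```

A non-zero count means the working directory should be updated before anything is copied out of it.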

2. firebase-admin not in package.json

The firebase-admin npm package was installed manually on prod/staging but never added to package.json. Because npm ci installs only what package.json and the lockfile declare, the monorepo CI/CD run of npm ci --omit=dev did not install it.

Lesson

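Every production dependency must be in package.json. Run npm ls firebase-admin after a clean npm ci to verify before deploying; on a machine where the package was installed by hand, npm ls can still list it (marked extraneous).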
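
A rough pre-deploy guard could look like the sketch below. It is text-based and assumes the multi-line package.json layout that npm itself writes; a JSON-aware tool would be more robust. The helper name declared_in_dependencies is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed only if package $2 is listed under "dependencies" in the package.json at $1.
# Entries under "devDependencies" do not count, since --omit=dev skips them.
declared_in_dependencies() {
  local pkg_json=$1 name=$2 block
  # Capture the lines from the "dependencies" key up to the first closing brace.
  block=$(awk '/"dependencies"[[:space:]]*:/,/}/' "$pkg_json")
  # Look for the quoted package name used as a key inside that block.
  printf '%s\n' "$block" | grep -qF "\"$name\":"
}

# Usage in a CI step:
#   declared_in_dependencies package.json firebase-admin \
#     || { echo "firebase-admin missing from dependencies" >&2; exit 1; }
```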

3. rsync --delete removed server-side files

The CI/CD step rsync -a --delete deletes anything on the destination that is absent from the source, so it removed firebase-sa-key.json and the .env files, which existed only on the server.

Fix applied

All deploy workflows now exclude: .env, .env.local, firebase-sa-key.json, recordings/
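
As a sketch of what such a workflow step might look like: with --delete, rsync leaves excluded paths on the destination alone unless --delete-excluded is also given. The function names and the way the exclude list is stored are assumptions; only the four paths come from the fix above.

```shell
#!/usr/bin/env bash
# Server-only paths from the fix above; deploys must never overwrite or delete them.
DEPLOY_EXCLUDES=(.env .env.local firebase-sa-key.json recordings/)

# Print one --exclude flag per protected path.
rsync_exclude_flags() {
  local path
  for path in "${DEPLOY_EXCLUDES[@]}"; do
    printf -- '--exclude=%s\n' "$path"
  done
}

# Hypothetical deploy step: sync $1 to $2, keeping excluded files on the destination.
deploy_sync() {
  local src=$1 dest=$2 flags
  mapfile -t flags < <(rsync_exclude_flags)
  rsync -a --delete "${flags[@]}" "$src" "$dest"
}

# Usage (placeholder paths):
#   deploy_sync ./api/ deploy@staging:/srv/api/
```

Keeping the list in one array means every workflow that calls deploy_sync picks up a newly protected path at the same time.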

4. Prod Asterisk restart broke outbound

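Restarting Asterisk on prod caused the tata_gateway trunk to return busy/congested on all outbound Dial attempts. The PJSIP endpoint with static contact lost its ability to create outbound channels. Reverting the pjsip config and restarting again did not help; outbound recovered only after the API was reloaded and its AMI connection re-established (timeline, 23:50 to 00:00).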

NEVER DO THIS

Never restart prod Asterisk without explicit instruction. The AMI connection breaks, PJSIP endpoints may lose channel capability, and all active calls drop.
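
Since the timeline shows outbound coming back only once AMI reconnected, a post-restart check for the API's AMI session may be useful. The sketch below parses the output of the Asterisk CLI command manager show connected, where the first column is the AMI username. The helper name and the username astradial_api are placeholders, not values from this incident.

```shell
#!/usr/bin/env bash
# Read `manager show connected` output on stdin and succeed if AMI user $1 is listed.
# ami_user_connected is a hypothetical helper.
ami_user_connected() {
  local user=$1
  # Match on the first whitespace-separated field of each row.
  awk -v user="$user" '$1 == user { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage on the prod host (the AMI username is a placeholder):
#   sudo asterisk -x "manager show connected" | ami_user_connected astradial_api \
#     || echo "AMI client not connected: reload the API" >&2
```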

5. NUC tata-endpoint CHANUNAVAIL after restart

On Asterisk 22, a PJSIP AOR with qualify_frequency=0 and max_contacts=0 results in CHANUNAVAIL for outbound Dial after restart. The contact exists but is NonQual, and Asterisk 22 requires Available status.

Fix

Set qualify_frequency=30 and max_contacts=1 on tata-aor in NUC pjsip.conf.
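
To catch a regression before the next restart, the two settings can be checked directly in the config file. This is a sketch only: the helper name is hypothetical, and the path /etc/asterisk/pjsip.conf in the usage comment is the conventional location, assumed rather than confirmed for the NUC.

```shell
#!/usr/bin/env bash
# Succeed if section [$2] of the pjsip config file $1 sets qualify_frequency > 0
# and max_contacts >= 1. aor_survives_restart is a hypothetical helper.
aor_survives_restart() {
  local conf=$1 section=$2
  awk -v want="[$section]" '
    /^\[/ { in_section = (index($0, want) == 1) }   # track the current section
    in_section {
      line = $0
      sub(/;.*/, "", line)                          # drop trailing comments
      gsub(/[[:space:]]/, "", line)                 # tolerate spaces around "="
      split(line, kv, "=")
      if (kv[1] == "qualify_frequency") qualify = kv[2] + 0
      if (kv[1] == "max_contacts")      contacts = kv[2] + 0
    }
    END { exit (qualify > 0 && contacts >= 1) ? 0 : 1 }
  ' "$conf"
}

# Usage on the NUC (assumed path):
#   aor_survives_restart /etc/asterisk/pjsip.conf tata-aor \
#     || echo "tata-aor needs qualify_frequency>0 and max_contacts>=1" >&2
```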

6. NUC from-cloud wrong number format

The original from-cloud context used Dial(PJSIP/0${EXTEN}@tata-endpoint) and Set(CALLERID(num)=+91${CALLERID(num):1}). This produced:

  • CallerID: +91918065978001 (double country code — original CallerID already had +91)
  • Dial: 09944421125 — worked before qualify fix but broke after

Working format

Set(CALLERID(all)=+918065978001)   ; Use CALLERID(all) not CALLERID(num)
Dial(PJSIP/0${EXTEN}@tata-endpoint,60)  ; 0-prefix = local dialing format
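
The doubled country code is easy to reproduce in a shell, because bash ${var:1} skips the first character the same way ${CALLERID(num):1} does in the dialplan. The normalize_cid function is a hypothetical alternative shown for illustration only; the deployed fix simply pins a fixed CallerID.

```shell
#!/usr/bin/env bash
# The CallerID arriving from the cloud side already carries +91.
cid="+918065978001"
echo "+91${cid:1}"             # prints +91918065978001 (the doubled country code)

# Hypothetical normaliser: add +91 only when the number lacks it.
normalize_cid() {
  local num=${1#+}             # strip a leading "+", if any
  case $num in
    91??????????) printf '+%s\n' "$num" ;;        # country code 91 plus 10 digits
    0??????????)  printf '+91%s\n' "${num#0}" ;;  # 0-prefixed national number
    *)            printf '+91%s\n' "$num" ;;      # bare national number
  esac
}
normalize_cid "+918065978001"  # prints +918065978001
normalize_cid "8065978001"     # prints +918065978001
```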

What Was Fixed

Staging outbound call path

Staging Asterisk (94.136.188.221)
  → tata_gateway endpoint → sip:10.10.10.2:5060
  → WireGuard wg0 (10.10.10.3)
  → Prod VPS (10.10.10.1) — IP forwarding only, no Asterisk involvement
  → WireGuard → NUC (10.10.10.2)
  → NUC Asterisk → from-cloud context
  → Dial(PJSIP/0${EXTEN}@tata-endpoint)
  → Tata SBC (10.79.215.102) → PSTN

WireGuard changes

Prod (89.116.31.109):

  • Enabled net.ipv4.ip_forward=1 in /etc/sysctl.conf
  • Added iptables FORWARD rules for wg0: 10.10.10.3 ↔ 10.10.10.2
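
For reference, the prod-side change might be scripted roughly as follows. The exact iptables rules that were applied are not recorded here, so the two rules below are an assumption, as are the helper names. sysctl -w only changes the running kernel; the persistent setting is the /etc/sysctl.conf entry noted above.

```shell
#!/usr/bin/env bash
# Echo each command when DRY_RUN=1, otherwise execute it.
run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Sketch of the prod forwarding setup (rule set is an assumption).
enable_wg_forwarding() {
  local staging=10.10.10.3 nuc=10.10.10.2
  run sysctl -w net.ipv4.ip_forward=1
  # Allow traffic in both directions between staging and the NUC across wg0.
  run iptables -A FORWARD -i wg0 -o wg0 -s "$staging" -d "$nuc" -j ACCEPT
  run iptables -A FORWARD -i wg0 -o wg0 -s "$nuc" -d "$staging" -j ACCEPT
}

# Preview without touching the host:
#   DRY_RUN=1 enable_wg_forwarding
```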

Staging (94.136.188.221):

  • Added 10.10.10.2/32 to AllowedIPs for prod peer in /etc/wireguard/wg0.conf

NUC (nuc.astradial.com):

  • Added 10.10.10.3/32 to AllowedIPs for prod peer in /etc/wireguard/wg0.conf

NUC Asterisk changes

pjsip.conf:

  • cloud-identify: Added match=10.10.10.3 (accept staging as cloud endpoint)
  • tata-aor: Added qualify_frequency=30 and max_contacts=1

extensions.conf — [from-cloud]:

[from-cloud]
exten => _X.,1,NoOp(Cloud Outbound via Tata: ${EXTEN} CID: ${CALLERID(all)})
 same => n,Set(CALLERID(all)=+918065978001)
 same => n,Dial(PJSIP/0${EXTEN}@tata-endpoint,60)
 same => n,NoOp(Tata dial status: ${DIALSTATUS})
 same => n,Hangup()

Staging Asterisk changes

pjsip_tata_gateway.conf:

  • Changed contact=sip:10.10.10.2:5060 (direct to NUC, was 10.10.10.1 pointing to prod)
  • Changed match=10.10.10.2 (identify NUC directly)

Recovery Checklist

If staging outbound breaks again:

  1. Check NUC Tata contact status:

    ssh user@nuc.astradial.com
    sudo asterisk -x "pjsip show contacts" | grep tata
    # Must show "Avail" — if "NonQual" or "Unavail", restart NUC Asterisk
    

  2. Check WireGuard connectivity:

    # From staging:
    ping -c 2 10.10.10.2   # Must reach NUC
    # From NUC:
    ping -c 2 10.10.10.3   # Must reach staging
    

  3. Check prod IP forwarding:

    ssh root@89.116.31.109
    sysctl net.ipv4.ip_forward          # Must be 1
    iptables -L FORWARD -v -n | grep wg0   # Must have ACCEPT rules (-v shows the interface columns)
    

  4. Test NUC outbound directly:

    sudo asterisk -x "channel originate Local/919944421125@testonly-outbound application Wait 30"
    # If this works, NUC → Tata is fine; issue is upstream
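
Step 1 of the checklist can be turned into a pass/fail check for monitoring or a pre-deploy gate. A minimal sketch, assuming the status words are the ones listed in step 1; the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Read `pjsip show contacts` output on stdin; succeed only if a Tata contact is "Avail".
# tata_contact_available is a hypothetical helper name.
tata_contact_available() {
  # -w matches "Avail" as a whole word, so "Unavail" and "NonQual" do not count.
  grep tata | grep -qw Avail
}

# Usage on the NUC:
#   sudo asterisk -x "pjsip show contacts" | tata_contact_available \
#     || echo "Tata contact not Avail: see step 1" >&2
```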
    

Backups

All configs backed up at /root/asterisk-backup-20260415/ on:

  • Prod (89.116.31.109)
  • Staging (94.136.188.221)
  • NUC (nuc.astradial.com)