Troubleshooting¶

This page documents all known errors encountered in the Astradial infrastructure, their root causes, and fixes.

Error 1: Asterisk Won't Start -- Bind to NNI IP Fails¶

Symptom¶

Asterisk fails to start with an error about being unable to bind to the NNI interface IP address.

Root Cause¶

The PJSIP transport was configured to bind to the NNI interface IP (e.g., 172.16.x.x). When the NNI interface (enp86s0) is DOWN -- for example after a reboot before the interface comes up -- Asterisk cannot bind to that IP and refuses to start.

Diagnosis¶

systemctl status asterisk
# Look for bind errors in the log
journalctl -u asterisk | grep -i bind

Fix¶

Change the PJSIP transport bind address to 0.0.0.0 instead of the specific NNI IP:

[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:5060

Prevention Rule¶

Never bind PJSIP transports to a specific interface IP. Always use 0.0.0.0 to bind on all interfaces.

Error 2: Tata Calls "Number Incorrect"¶

Symptom¶

Outbound or inbound calls via Tata fail with "number incorrect" or similar rejection. Calls may partially work but SIP responses are not reaching Tata.

Root Cause¶

Two issues combined:

1. Routes for the Tata network were missing from the NUC routing table.
2. SIP responses were going out the wrong network interface (default route instead of the NNI interface).

Diagnosis¶

# Check routing table
ip route show

# Check which interface SIP traffic uses
tcpdump -i enp86s0 port 5060
tcpdump -i any port 5060

Fix¶

Add explicit routes to ensure Tata SIP traffic goes through the NNI gateway:

ip route add 10.10.10.0/24 via <nni-gateway> dev enp86s0

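Make the route persistent by adding it to /etc/network/interfaces.d/enp86s0 (for example as a post-up line), so it is re-created whenever the interface comes up.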

Prevention Rule¶

Always verify routing after network changes. SIP responses must leave through the same interface the request arrived on.

Error 3: tata_gateway Endpoint Shows "Unavailable"¶

Symptom¶

pjsip show endpoints shows the tata_gateway endpoint status as "Unavailable".

Root Cause¶

Asterisk sends SIP OPTIONS to qualify endpoints. Tata's SBC ignores OPTIONS requests and never responds, so Asterisk marks the endpoint as unavailable.

Diagnosis¶

asterisk -rx "pjsip show endpoint tata_gateway"
asterisk -rx "pjsip show aor tata_gateway"

Fix¶

Disable qualify by setting qualify_frequency=0 in the AOR:

[tata_gateway]
type=aor
contact=sip:10.10.10.2:5060
qualify_frequency=0

Prevention Rule¶

With qualify enabled, "Unavailable" is expected behavior -- Tata's SBC never answers OPTIONS, so the status says nothing about trunk health. Do not use endpoint status to determine if the Tata trunk is working. Test with actual calls instead.

Error 4: PJSIP_EFAILEDCREDENTIAL¶

Symptom¶

SIP authentication fails with PJSIP_EFAILEDCREDENTIAL in Asterisk logs when the NUC tries to authenticate with the cloud Asterisk.

Root Cause¶

There is an authentication incompatibility between Asterisk 22 and Asterisk 20 (or between certain PJSIP versions). Username/password-based SIP authentication fails across these versions.

Diagnosis¶

# On cloud Asterisk: attach to the running console with verbose output
asterisk -rvvvvv
# At the Asterisk CLI prompt, enable SIP logging, then look for EFAILEDCREDENTIAL
pjsip set logger on

Fix¶

Replace username/password authentication with IP-based identification:

[tata_gateway]
type=endpoint
identify_by=ip

[tata_gateway]
type=identify
endpoint=tata_gateway
match=10.10.10.2

Prevention Rule¶

Use identify_by=ip for machine-to-machine SIP connections where both sides have known, fixed IPs (e.g., over WireGuard).

Error 5: systemctl Shows "active (exited)" but Nothing Running¶

Symptom¶

systemctl status asterisk shows active (exited) but Asterisk is not actually running. No process is found.

Root Cause¶

The default Asterisk package installs an init.d script that systemd wraps. The init.d script has a bug where it reports success (exits 0) even when Asterisk fails to start or is not running.

Diagnosis¶

systemctl status asterisk
# Shows "active (exited)" instead of "active (running)"
pgrep -a asterisk
# No output means no Asterisk process is running

Fix¶

Create a proper systemd service file that manages Asterisk directly instead of relying on the init.d wrapper:

[Unit]
Description=Asterisk PBX
After=network.target

[Service]
Type=simple
ExecStart=/usr/sbin/asterisk -f -C /etc/asterisk/asterisk.conf
ExecReload=/usr/sbin/asterisk -rx "core reload"
Restart=on-failure

[Install]
WantedBy=multi-user.target

Prevention Rule¶

Always use native systemd service files. Do not rely on init.d compatibility wrappers.

Error 6: enp86s0 DOWN After Reboot¶

Symptom¶

After a NUC reboot, the NNI interface (enp86s0) is DOWN. The Tata trunk has no network connectivity.

Root Cause¶

The NNI interface was configured manually (via ip commands) but not persisted in network configuration files. After reboot, the interface remains unconfigured and DOWN.

Diagnosis¶

ip link show enp86s0
# State: DOWN
ip addr show enp86s0
# No IP address assigned

Fix¶

Create a persistent configuration in /etc/network/interfaces.d/enp86s0:

auto enp86s0
iface enp86s0 inet static
    address <nni-ip>
    netmask <netmask>
    gateway <nni-gateway>

Prevention Rule¶

Never configure network interfaces with transient ip commands alone. Always persist configuration in /etc/network/interfaces.d/.

Error 7: First Call After Restart Fails¶

Symptom¶

The first inbound or outbound call after an Asterisk restart fails. Subsequent calls work fine.

Root Cause¶

After Asterisk restarts, SIP registrations and endpoint discovery take time to complete. If a call arrives before the Tata gateway endpoint is fully initialized (30-60 seconds), it will fail.

Diagnosis¶

# Check how long ago Asterisk started
asterisk -rx "core show uptime"

# Check endpoint status
asterisk -rx "pjsip show endpoints"

Fix¶

Wait 30-60 seconds after restarting Asterisk before routing live calls. There is no configuration fix -- this is inherent to the SIP registration process.

Prevention Rule¶

After any Asterisk restart, verify the trunk is ready by placing a test call before routing production traffic. Avoid restarting Asterisk during business hours when possible.

Error 8: NUC Public IP Changed -- Cloud Rejected¶

Symptom¶

The NUC's SIP connection to the cloud Asterisk stops working. The cloud Asterisk rejects packets from the NUC because its public IP has changed (ISP dynamic IP).

Root Cause¶

The NUC's ISP assigns a dynamic public IP. When it changes, the cloud Asterisk's IP-based identification no longer matches, and all traffic from the NUC is rejected.

Diagnosis¶

# On NUC: check current public IP
curl ifconfig.me

# On cloud: check what IP the identify section expects
asterisk -rx "pjsip show identify tata_gateway"

Fix¶

Use a WireGuard VPN tunnel between the NUC and cloud server. WireGuard assigns fixed tunnel IPs (10.10.10.1 for cloud, 10.10.10.2 for NUC) regardless of the underlying public IP.

; Cloud Asterisk uses WireGuard IP for identification
[tata_gateway]
type=identify
endpoint=tata_gateway
match=10.10.10.2

Prevention Rule¶

Never rely on public IPs for SIP endpoint identification when either side has a dynamic IP. Always use a VPN tunnel with fixed IPs.

Error 9: Zoiper 401 Unauthorized¶

Symptom¶

Zoiper softphone on the NUC's local network receives 401 Unauthorized when trying to register with the cloud Asterisk, even though credentials are correct.

Root Cause¶

The NUC and Zoiper share the same public IP (both are behind the same NAT). The cloud Asterisk's identify_by=ip matches the source IP to the tata_gateway endpoint before evaluating Zoiper's credentials, causing an identification conflict.

Diagnosis¶

# On cloud Asterisk
pjsip set logger on
# Observe that Zoiper REGISTER is matched to tata_gateway endpoint

Fix¶

The WireGuard tunnel solves this. The NUC's Asterisk traffic goes through the tunnel (source IP 10.10.10.2), while Zoiper's traffic goes through the public IP. These are now different source IPs, so identification works correctly.

Prevention Rule¶

When using identify_by=ip, ensure each endpoint has a unique source IP. Use VPN tunnels to separate traffic from hosts sharing a public IP.

Error 10: ISP Blocks UDP 5060¶

Symptom¶

SIP softphones on certain networks cannot register or make calls. Error messages include "SIP UDP not found" or connection timeouts.

Root Cause¶

Some ISPs block outbound UDP traffic on port 5060 as a measure against SIP abuse or toll fraud.

Diagnosis¶

# From the affected network
nc -zuv 82.180.146.80 5060
# Timeout or connection refused

# Try alternate port
nc -zuv 82.180.146.80 5080
# Success

Fix¶

Configure an alternate SIP transport on port 5080:

[transport-udp-alt]
type=transport
protocol=udp
bind=0.0.0.0:5080
external_media_address=82.180.146.80
external_signaling_address=82.180.146.80

Clients behind restrictive ISPs connect to port 5080 instead of 5060.

Prevention Rule¶

Always provide an alternate SIP port (5080) as a standard part of the deployment. Document it for end users experiencing connectivity issues.

Error 11: NUC Crashed During Netdata Build¶

Symptom¶

The NUC became unresponsive and powered off during Netdata compilation. After power-on, no Netdata installation was present.

Root Cause¶

Compiling Netdata from source is CPU-intensive. The NUC's passive cooling was insufficient, causing the CPU to overheat and trigger a thermal shutdown.

Diagnosis¶

# After reboot, check thermal events
journalctl -b -1 | grep -i thermal
dmesg | grep -i thermal

Fix¶

Install Netdata using the --static-only flag, which downloads a pre-built static binary instead of compiling from source:

bash <(curl -Ss https://my-netdata.io/kickstart.sh) --static-only

Prevention Rule¶

Never compile large software from source on passively-cooled or low-power hardware. Always use pre-built binaries or static builds.

Error 12: Upptime 404¶

Symptom¶

The Upptime status page at status.astradial.com returns a 404 error. The GitHub Actions workflows run successfully and the site appears to build, but the page is not accessible.

Root Cause¶

The Upptime workflows build the static site but do not deploy it to GitHub Pages. Without a deployment step, the gh-pages branch is never updated and GitHub Pages has nothing to serve.

Diagnosis¶

# Check if gh-pages branch exists and has recent commits
gh api repos/astradial/upptime/branches/gh-pages

Fix¶

Add the peaceiris/actions-gh-pages action to the workflow to deploy the built site to the gh-pages branch:

- uses: peaceiris/actions-gh-pages@v3
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    publish_dir: ./build

Prevention Rule¶

When using GitHub Pages with a build step, always include an explicit deployment action. Verify the gh-pages branch is being updated after workflow runs.

Error 13: POST /api/v1/users Returns 500 "Validation error"¶

Symptom¶

Creating a user via POST /api/v1/users returns HTTP 500 with {"error":"Validation error"}, even though all required fields are provided and valid.

Root Cause¶

The username column in the users table had a global unique constraint. In a multi-tenant system, this meant no two organizations could have a user with the same username. MySQL/MariaDB's default case-insensitive collation made this worse -- "Hari" and "hari" were treated as duplicates.

The catch block in the endpoint returned error.message directly, and Sequelize's SequelizeUniqueConstraintError has the message "Validation error" -- making the real cause invisible.
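To make the collation issue concrete, here is a minimal sketch of how usernames that differ only by case collide once they are compared case-insensitively. The helper name and the input shape (objects with username and org_id) are assumptions for illustration and are not part of the AstraPBX codebase.

```javascript
// Hypothetical helper: group users by lower-cased username and report any
// name that appears more than once. This approximates what a global unique
// index under a case-insensitive collation would reject.
function findUsernameCollisions(users) {
  const byName = new Map();
  for (const u of users) {
    const key = String(u.username).toLowerCase();
    if (!byName.has(key)) byName.set(key, []);
    byName.get(key).push(u);
  }
  // Keep only names used by more than one row.
  return [...byName.entries()]
    .filter(([, list]) => list.length > 1)
    .map(([name, list]) => ({ name, orgs: list.map((u) => u.org_id) }));
}

// Illustrative data only.
const sample = [
  { username: 'Hari', org_id: 'org_a' },
  { username: 'hari', org_id: 'org_b' },
  { username: 'reception', org_id: 'org_a' },
];
console.log(findUsernameCollisions(sample));
```

In this sample the two rows belong to different organizations, which is exactly the case the old global constraint rejected.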

Diagnosis¶

# Check PM2 logs for the 500
pm2 logs astrapbx --lines 50 --nostream

# Check existing users for duplicate usernames
cd /opt/astrapbx && node -e "
const db = require('./src/models');
db.sequelize.authenticate().then(() =>
  db.User.findAll({attributes: ['username','org_id']})
    .then(u => { console.log(JSON.stringify(u, null, 2)); process.exit(); })
);
"

# Check indexes on users table
cd /opt/astrapbx && node -e "
const db = require('./src/models');
db.sequelize.getQueryInterface().showIndex('users').then(indexes => {
  console.log(JSON.stringify(indexes.filter(i => i.unique), null, 2));
  process.exit();
});
"

Fix¶

1. Changed the unique constraint from global to per-org (src/models/User.js):

Removed unique: true from the username field and added a composite unique index:

// Before
username: { type: DataTypes.STRING(50), allowNull: false, unique: true }

// After
username: { type: DataTypes.STRING(50), allowNull: false }
// With composite index in model options:
indexes: [{ unique: true, fields: ['org_id', 'username'], name: 'unique_org_username' }]

2. Added proper error handling (src/server.js -- POST /users catch block):

} catch (error) {
  if (error.name === 'SequelizeUniqueConstraintError') {
    return res.status(409).json({ error: 'Username already exists' });
  }
  if (error.name === 'SequelizeValidationError') {
    return res.status(400).json({ error: error.errors.map(e => e.message).join(', ') });
  }
  res.status(500).json({ error: error.message });
}

3. Added app-level username uniqueness check (before User.create):

const existingUsername = await User.findOne({ where: { org_id: req.orgId, username } });
if (existingUsername) {
  return res.status(409).json({ error: 'Username already exists' });
}

4. Database migration (run once on the server):

cd /opt/astrapbx && node -e "
const db = require('./src/models');
const qi = db.sequelize.getQueryInterface();
(async () => {
  await qi.removeIndex('users', 'username');
  await qi.addIndex('users', ['org_id', 'username'], { unique: true, name: 'unique_org_username' });
  process.exit();
})();
"

Prevention Rule¶

In multi-tenant systems, unique constraints on user-facing fields (username, extension, etc.) must always be scoped to the organization (org_id), never global. Always handle SequelizeUniqueConstraintError and SequelizeValidationError explicitly in catch blocks -- never let them fall through as generic 500 errors.

Error 14: Outbound Calls 403 Forbidden -- Caller ID Format¶

Symptom¶

Outbound calls via the Tata trunk return 403 Forbidden - 6034. The call connects to the NUC gateway successfully, the NUC forwards to Tata's SBC, but Tata rejects immediately.

Root Cause¶

Two issues combined:

1. No caller ID set on the outbound route: the outbound route's caller_id_override was null, so Asterisk sent the internal extension number (e.g., 1002) or anonymous as the caller ID. Tata rejects calls without a valid DID as caller ID.
2. NUC double-prefixed the caller ID: the NUC's from-cloud context did Set(CALLERID(num)=+91${CALLERID(num)}). When the cloud sent 08065978002 (with leading 0), the NUC produced +9108065978002, an invalid E.164 number.

Diagnosis¶

# On the cloud server - check outbound route caller ID
curl -s -H 'X-API-Key: org_XXXX' \
  http://localhost:8000/api/v1/outbound-routes | python3 -m json.tool

# On the NUC - enable SIP logger and check the INVITE sent to Tata
asterisk -rx 'pjsip set logger on'
# Look for the From: header in the INVITE to 10.79.215.102

SIP trace showing the problem:

INVITE sip:07400464659@10.79.215.102:5060
From: "GrandEstancia" <sip:+9108065978002@192.168.0.14>
                            ^^^^^^^^^^^^ double-prefixed

SIP/2.0 403 Forbidden - 6034

Fix¶

1. Set caller ID on the outbound route (cloud API):

curl -X PUT -H 'X-API-Key: org_XXXX' \
  -H 'Content-Type: application/json' \
  -d '{"caller_id_override": "08065978002", "caller_id_name_override": "CustomerName"}' \
  http://localhost:8000/api/v1/outbound-routes/{routeId}

Then redeploy: POST /api/v1/config/deploy with {"reload": true}.

2. Fix NUC caller ID transformation (/etc/asterisk/extensions.conf on NUC):

; Before (wrong - double-prefixes numbers starting with 0)
same => n,Set(CALLERID(num)=+91${CALLERID(num)})

; After (correct - skips the first character, i.e. the leading 0, then adds +91)
same => n,Set(CALLERID(num)=+91${CALLERID(num):1})

Reload NUC dialplan: asterisk -rx 'dialplan reload'

Prevention Rule¶

Every outbound route must have caller_id_override set to the org's DID number. The NUC's from-cloud context must strip the leading 0 before adding +91: use ${CALLERID(num):1} (Asterisk substring notation: skip first character). Because :1 drops the first character unconditionally, the cloud must always send the caller ID with a leading 0.
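As a complement to the dialplan fix, the caller ID could also be validated before it is stored on the outbound route. The following is only a sketch: the function name is hypothetical, and it assumes 10-digit Indian national numbers, which is an assumption not stated in the source.

```javascript
// Hypothetical normalizer: accepts "08065978002", "8065978002" or
// "+918065978002" and returns E.164, or null if the result would not be
// a 10-digit national number (assumed length for Indian numbers).
function toE164India(raw) {
  const digits = String(raw).replace(/\D/g, ''); // drop "+", spaces, dashes
  let national = digits.replace(/^0+/, '');      // strip the leading trunk 0
  // Already carries the country code: remove it before re-adding.
  if (national.length === 12 && national.startsWith('91')) {
    national = national.slice(2);
  }
  return national.length === 10 ? '+91' + national : null;
}

console.log(toE164India('08065978002'));
console.log(toE164India('+9108065978002')); // the double-prefixed value
console.log(toE164India('1002'));           // an internal extension
```

Returning null for the double-prefixed value and for an internal extension makes both failure modes from this section visible before the call reaches Tata.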

Error 15: Codec Translation Error ulaw to opus¶

Symptom¶

Internal calls between extensions fail. Asterisk logs show:

Unable to find a codec translation path: (ulaw) -> (opus)

The call goes straight to "offline" playback without ringing.

Root Cause¶

Asterisk has res_format_attr_opus.so loaded (format description module) but not codec_opus.so (actual transcoder). When a user endpoint allows opus and the other side uses ulaw (e.g., Local channels for outbound routing, or trunk calls), Asterisk cannot transcode between them.

Diagnosis¶

# Check if opus codec module exists
asterisk -rx 'module show like opus'
# Shows res_format_attr_opus but NOT codec_opus

# Check translation paths
asterisk -rx 'core show translation paths ulaw' | grep opus
# Shows "No Translation Path"

Fix¶

Remove opus from user endpoint allow lists. The endpoints still get HD audio via g722:

// In userProvisioningService.js
// Before
config += `allow=ulaw,alaw,g722,opus\n`;

// After
config += `allow=ulaw,alaw,g722\n`;

Redeploy config and reload PJSIP: asterisk -rx 'pjsip reload'

Users must re-register their softphones (disconnect/reconnect in Zoiper) to renegotiate codecs.

Prevention Rule¶

Only include codecs in endpoint allow lists if the corresponding codec translator module is installed. Use core show translation paths to verify translation paths exist before adding a codec.
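One way to enforce this rule in the provisioning code is to filter the requested codecs against a list of codecs known to be transcodable on the server. This is a sketch under assumptions: the helper name is hypothetical, and the set of transcodable codecs is supplied by hand here rather than read from Asterisk.

```javascript
// Hypothetical helper: build the allow= line from only those codecs that
// the server is known to be able to transcode.
function buildAllowLine(requested, transcodable) {
  const usable = requested.filter((codec) => transcodable.has(codec));
  if (usable.length === 0) {
    throw new Error('no usable codecs in allow list');
  }
  return `allow=${usable.join(',')}\n`;
}

// Assumed server state: codec_opus is not installed.
const transcodable = new Set(['ulaw', 'alaw', 'g722']);
console.log(buildAllowLine(['ulaw', 'alaw', 'g722', 'opus'], transcodable));
```

If codec_opus is installed later, adding 'opus' to the set is the only change needed.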

Error 16: Phone Number Forwarding Not Working (ring_target=phone)¶

Symptom¶

User extension has ring_target: "phone" and phone_number set, but calling the extension plays "is not available" instead of ringing the external phone.

Root Cause¶

Multiple issues can cause this:

No outbound route configured — ring_target: "phone" uses Dial(Local/number@outbound_context) which requires a working outbound route
Caller ID not set on outbound route — Tata returns 403 (see Error 14)
Opus codec error — Local channel cannot transcode (see Error 15)
Config not redeployed — Database updated but Asterisk config not regenerated

Diagnosis¶

# Check the user's settings
curl -s -H 'X-API-Key: org_XXXX' \
  http://localhost:8000/api/v1/users/{userId} | python3 -m json.tool | grep -E 'ring_target|phone_number'

# Check the generated dialplan
asterisk -rx 'dialplan show {extension}@{context_prefix}__internal'
# Should show: Dial(Local/{phone_number}@{context_prefix}__outbound/n,30,tT)

# Enable verbose logging and make a test call
asterisk -rx 'core set verbose 5'
# Check full.log for DIALSTATUS=CHANUNAVAIL

Fix¶

Ensure all prerequisites are in place:

Outbound route exists with caller_id_override set to the DID
Opus removed from endpoint allow lists (or codec_opus installed)
NUC from-cloud context correctly transforms caller ID
Config deployed: POST /api/v1/config/deploy with {"reload": true}

Prevention Rule¶

ring_target: "phone" depends on the full outbound call chain working: cloud outbound context → trunk → NUC → Tata. Test outbound calling from Zoiper first before enabling phone forwarding.

Error 17: Queue Rings Zoiper Instead of Phone Number¶

Symptom¶

User has ring_target=phone and phone_number set, but when called through a queue (e.g., reception 5001), their Zoiper SIP client rings instead of their phone number. If Zoiper is not registered, the queue skips the member entirely (shows "Unavailable").

Root Cause¶

Queue members in queues.conf were generated as PJSIP/endpoint which rings the SIP endpoint directly, bypassing the extension dialplan where ring_target=phone logic lives.

Additionally, the state_interface was set to PJSIP/endpoint — when Zoiper wasn't registered, Asterisk marked the member as "Unavailable" and skipped them.

Fix¶

Modified queueService.js → generateQueueMemberString():

For users with ring_target=phone: use Local/extension@internal_context/n (no state_interface)
For users with routing_type=ai_agent: use Local/extension@internal_context/n (no state_interface)
For regular SIP users: keep PJSIP/endpoint (unchanged)

This routes queue calls through the extension dialplan where ring_target routing occurs.

; Before (broken)
member => PJSIP/org_mnd5khym_1001,0,"Hari",PJSIP/org_mnd5khym_1001

; After (fixed: phone user, no state_interface)
member => Local/1001@org_mnd5khym__internal/n,0,"Hari"
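The branching described above can be sketched roughly as follows. This is not the actual generateQueueMemberString() implementation: the field names follow the text, while the function name, the name field, and the endpoint and context naming scheme are assumptions inferred from the example lines.

```javascript
// Sketch only. Users who must be reached through the extension dialplan
// get a Local/ member without a state_interface; regular SIP users keep
// the PJSIP/ member with the endpoint as state_interface.
function queueMemberLine(user, orgPrefix, penalty = 0) {
  const viaDialplan =
    user.ring_target === 'phone' || user.routing_type === 'ai_agent';
  if (viaDialplan) {
    return `member => Local/${user.extension}@${orgPrefix}__internal/n,${penalty},"${user.name}"`;
  }
  const endpoint = `PJSIP/${orgPrefix}_${user.extension}`;
  return `member => ${endpoint},${penalty},"${user.name}",${endpoint}`;
}

const phoneUser = { extension: '1001', name: 'Hari', ring_target: 'phone' };
const sipUser = { extension: '1001', name: 'Hari', ring_target: 'sip' };
console.log(queueMemberLine(phoneUser, 'org_mnd5khym'));
console.log(queueMemberLine(sipUser, 'org_mnd5khym'));
```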

Verification¶

# Check queue member interfaces
asterisk -rx "queue show {queue_name}"
# Should show Local/... for phone users, PJSIP/... for SIP users
# All phone users should show "Not in use", not "Unavailable"

Error 18: Queue Timeout Not Working — Calls Ring Forever¶

Symptom¶

Queue max_wait_time set to 45 seconds via API, but calls keep ringing members indefinitely (5+ minutes) without routing to the timeout destination (e.g., AI agent on ext 1003).

Root Cause¶

The Queue() dialplan application had an extra comma, placing the timeout value in the AGI parameter slot (position 6) instead of the timeout slot (position 5):

; Bug: 45 is in position 6 (AGI), not position 5 (timeout)
Queue(org_mnd5khym_5001,cCtr,,,,45)

; Correct syntax
Queue(queuename,options,URL,announceoverride,timeout,AGI,...)

Fix¶

Fixed in dialplanGenerator.js → generateQueueExtension():

// Before (extra comma)
Queue(${name},${options},,,,${maxWait})

// After (correct)
Queue(${name},${options},,,${maxWait})
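Counting commas by hand in a template string is easy to get wrong. A possible alternative, shown here only as a sketch with a hypothetical function name, is to build the argument list by position and join it, so the timeout always lands in slot 5.

```javascript
// Argument order: Queue(queuename,options,URL,announceoverride,timeout,AGI,...)
// Hypothetical builder: each argument has a fixed array index, so the
// timeout cannot drift into the AGI slot.
function buildQueueApp({ name, options = '', url = '', announce = '', timeout = '' }) {
  const args = [name, options, url, announce, String(timeout)];
  // Drop trailing empty arguments so no stray commas are emitted.
  while (args.length > 1 && args[args.length - 1] === '') {
    args.pop();
  }
  return `Queue(${args.join(',')})`;
}

console.log(buildQueueApp({ name: 'org_mnd5khym_5001', options: 'cCtr', timeout: 45 }));
```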

Verification¶

# Check generated dialplan
asterisk -rx "dialplan show 5001@org_mnd5khym__queue"
# Queue() should have exactly 4 commas before the timeout value
# e.g., Queue(org_mnd5khym_5001,cCtr,,,45)

# Test: make a call and verify it exits queue after timeout
# Check logs for: QUEUESTATUS=TIMEOUT and Goto to failover

Error 19: Queue Linear Strategy Rings Same Member Repeatedly¶

Symptom¶

With strategy=linear, the queue always rings the same member (lowest penalty) and never cycles to the next member, even after the first member doesn't answer.

Root Cause¶

linear strategy always starts from the first available member. With Local/ channel queue members (used for phone routing), there's no persistent PJSIP device state — all members appear as "Not in use" after each ring attempt. The queue picks the first member every time.

Fix¶

Changed queue strategy to rrmemory (round-robin with memory) via API:

curl -X PUT "https://devpbx.astradial.com/api/v1/queues/{id}" \
  -H "X-API-Key: org_xxx" \
  -H "Content-Type: application/json" \
  -d '{"strategy": "rrmemory"}'

Set equal penalties for all members so rrmemory cycles through them evenly.
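The difference between the two strategies, as described in this section, can be illustrated with a toy model. This is a deliberate simplification and not how app_queue is implemented: it assumes every member always looks available (as with Local/ members) and that nobody answers.

```javascript
// Toy model of member selection across ring attempts.
// "linear" restarts from the first member every time; "rrmemory"
// continues after the member it rang last.
function simulate(strategy, members, attempts) {
  const rung = [];
  let last = -1; // index of the most recently rung member
  for (let i = 0; i < attempts; i++) {
    const idx = strategy === 'rrmemory' ? (last + 1) % members.length : 0;
    rung.push(members[idx]);
    last = idx;
  }
  return rung;
}

console.log(simulate('linear', ['1001', '1002', '1003'], 4));
console.log(simulate('rrmemory', ['1001', '1002', '1003'], 4));
```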

Strategy Reference¶

Strategy	Behavior	Best For
`ringall`	Ring all members simultaneously	Small teams, fastest answer
`rrmemory`	Round-robin, remembers last called	Equal distribution
`linear`	Always starts from first member	Priority-based (use with PJSIP only)
`leastrecent`	Rings member who was called least recently	Fair distribution
`fewestcalls`	Rings member with fewest completed calls	Load balancing

Error 20: Auto-Tickets Created for Unanswered Outbound AI Bot Calls¶

Symptom¶

Tickets with category missed_call appear in the editor's Tickets tab for phone numbers the system itself just dialled out to via the AI bot. The ticket creation timestamp lines up exactly with a workflow-engine scheduled job firing that originated an outbound AI welcome call.

Root Cause¶

For every AI bot outbound call, two CDR rows end up in asterisk_cdr:

1. The manual row AstraPBX inserts at server.js:3741 (dcontext='ai-outbound', disposition='ANSWERED' hardcoded). This row is easy to filter out.
2. The Asterisk-auto-generated row from cdr_adaptive_odbc for the actual PJSIP outbound dial. This second row looks identical to a real inbound call: src=<customer phone>, dst='s', dcontext='org_xxx__incoming', disposition='NO ANSWER', channel PJSIP/org_xxx_trunk_xxx-000000XX.

The CDR poller's old direction logic at /opt/astrapbx/src/server.js:4297 was:

if (ch.includes('trunk') && (r.src || '').length >= 7) direction = 'inbound';

That matched both real inbound calls and the auto-generated outbound bot ringback row. The poller forwarded the row to events.astradial.com/auto-ticket/{org_id} with direction: "inbound". LogsUpdate's classifier then saw disposition=NO ANSWER with a non-queue dcontext and created a missed_call ticket.

Diagnosis¶

Query the asterisk_cdr table for the suspicious phone number:

ssh root@82.180.146.80 \
  "mysql -upbx_api -ppbx_secure_password pbx_api_db -e \"
   SELECT id, calldate, src, dst, dcontext, channel, disposition
   FROM asterisk_cdr
   WHERE (src='<phone>' OR dst='<phone>')
     AND calldate > '<today>'
   ORDER BY calldate DESC LIMIT 10\\G\""

You should see two rows: one with dcontext='ai-outbound', disposition='ANSWERED', dst=<customer> (the real outbound) and one with dcontext='*_incoming', disposition='NO ANSWER', src=<customer> (the bug-triggering ringback row).

Fix¶

In the CDR poller pollCdr() function, build a set of phone numbers that received an ai-outbound row in the same poll batch, and skip any inbound-looking row whose src matches one. The set is built from allRows (pre-dedup) so the manual ai-outbound row isn't eaten by the linkedid dedup step.

const outboundBotDests = new Set();
for (const r of allRows) {
  if ((r.dcontext || '') === 'ai-outbound' && r.dst) {
    outboundBotDests.add(String(r.dst).replace(/\D/g, ''));
  }
}

// in the per-row classification loop:
const srcDigits = String(r.src || '').replace(/\D/g, '');
if (srcDigits && outboundBotDests.has(srcDigits)) {
  console.log('CDR poll: skip ' + r.id + ' — matches outbound bot dest ' + srcDigits);
  continue;
}

Live in /opt/astrapbx/src/server.js:4275-4317 since 2026-04-10.

Prevention Rule¶

Whenever you add a new outbound-call code path that goes through the trunk (AI bots, click-to-dial, originate-from-API, etc.), think about whether Asterisk's cdr_adaptive_odbc will write a ringback row that looks inbound, and add a same-batch correlation guard.

Error 21: Queue Edits in Editor Don't Save Most Fields¶

Symptom¶

Open Edit Queue in the editor → set Greeting, Ring Sound, Periodic Announce Frequency, Service Level, etc. → click Save → success toast → reload the queue → those fields are blank again. Only Name, Strategy, Timeout, Retry, Music On Hold, Recording Enabled, and Active stick.

Root Cause¶

PUT /api/v1/queues/:id in /opt/astrapbx/src/server.js had a hand-curated allowedFields whitelist with only 7 entries. Everything else was silently filtered out before reaching queue.update(updateData). Affected dropped fields included greeting_id, periodic_announce, ring_sound, announce_frequency, announce_position, all queue_* audio prompts, autopause, service_level, timeout_destination, and ~25 others.

Diagnosis¶

ssh root@82.180.146.80 \
  "mysql -upbx_api -ppbx_secure_password pbx_api_db -e \"
   SELECT name, periodic_announce, greeting_id, ring_sound, announce_frequency
   FROM queues WHERE id='<queue-id>'\\G\""

If the editor shows the field set but the DB row has it as NULL or default, the PUT endpoint dropped it.

Fix¶

Expand allowedFields to cover every editable column. Live in /opt/astrapbx/src/server.js:1876-1890 since 2026-04-10:

const allowedFields = [
  'name', 'strategy', 'timeout', 'retry', 'music_on_hold', 'recording_enabled', 'active', 'status',
  'max_wait_time', 'wrap_up_time', 'weight', 'max_callers', 'max_len',
  'greeting_id', 'periodic_announce', 'periodic_announce_frequency',
  'min_announce_frequency', 'relative_periodic_announce',
  'ring_sound', 'announce_frequency', 'announce_holdtime',
  'announce_position', 'announce_position_limit', 'announce_round_seconds',
  'autopause', 'autopausedelay', 'autopausebusy', 'autopauseunavail',
  'service_level', 'timeoutpriority', 'memberdelay',
  'join_empty', 'leave_when_empty', 'ring_inuse', 'ringinuse', 'reportholdtime',
  'queue_youarenext', 'queue_thereare', 'queue_callswaiting', 'queue_holdtime',
  'queue_minutes', 'queue_seconds', 'queue_thankyou', 'queue_reporthold',
  'timeout_destination', 'timeout_destination_type'
];

There's also a separate greeting_id → periodic_announce resolution step in the same endpoint: when greeting_id is provided, look up greetings.audio_file and write greetings/<basename> (without extension) to periodic_announce. This is what the Asterisk dialplan generator actually reads.

Prevention Rule¶

Don't use a Pick<>-style allowlist for any "edit this DB row" endpoint unless you generate it from the model definition. Hand-maintained whitelists silently rot as schemas grow. Future fix: switch to Object.keys(Queue.rawAttributes).filter(...) or similar, with a small explicit blocklist (id, org_id, created_at, updated_at).
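A rough sketch of that future fix is shown below. The helper name is hypothetical, and the attributes object is a stand-in with the same shape as Queue.rawAttributes (an object keyed by column name), so the example runs without Sequelize.

```javascript
// Columns that must never be set from a request body.
const BLOCKED = new Set(['id', 'org_id', 'created_at', 'updated_at']);

// Hypothetical helper: derive the editable fields from the model's
// attribute keys instead of maintaining a list by hand.
function pickEditable(rawAttributes, body) {
  const update = {};
  for (const field of Object.keys(rawAttributes)) {
    if (BLOCKED.has(field)) continue;
    if (Object.prototype.hasOwnProperty.call(body, field)) {
      update[field] = body[field];
    }
  }
  return update;
}

// Stand-in for Queue.rawAttributes (illustrative subset of columns).
const fakeQueueAttributes = { id: {}, org_id: {}, name: {}, ring_sound: {}, service_level: {} };
console.log(pickEditable(fakeQueueAttributes, { id: 'x', ring_sound: 'bell', bogus: 1 }));
```

New columns become editable as soon as they are added to the model, while unknown keys and blocked columns are still dropped.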

Error 22: Queue Changes Don't Take Effect Until astrapbx Restart¶

Symptom¶

Save a queue change in the editor → success toast → on-disk /etc/asterisk/{ext,queues}_<org>.conf files are updated → but the live Asterisk dialplan and queue state still show the old configuration. New greetings don't play, new members don't ring, removed members are still in the live queue. Running pm2 restart astrapbx "fixes" it temporarily because Asterisk reloads on its own when configs are written during init.

Root Cause — Two Separate Bugs¶

1. deployOrganizationConfiguration() writes files but never tells Asterisk to reload them. The function regenerates pjsip_<org>.conf, ext_<org>.conf, and queues_<org>.conf, then returns. No dialplan reload, no queue reload all. Asterisk's in-memory config keeps the old version forever.

2. Even after adding core reload, queue static members didn't update. asterisk -rx "core reload" reloads dialplan, MOH, PJSIP, etc., but it does NOT cause app_queue.so to pick up new member => lines from queues.conf. You need module reload app_queue.so (equivalent: queue reload all) explicitly.

persistentmembers=yes in [general] makes this worse: dynamic members added via queue add member get persisted in astdb and survive across reloads, so the queue looks like it has members (the stale ones) and you don't notice the static members weren't loaded.

Diagnosis¶

# 1. Compare on-disk file to live dialplan — if file has changes the live doesn't, reload is broken
ssh root@82.180.146.80 'grep -A20 "5002 - Front Office" /etc/asterisk/ext_grandestancia.conf'
ssh root@82.180.146.80 'asterisk -rx "dialplan show 5002@org_mnd5khym__queue"'

# 2. Compare DB queue members to live queue members
ssh root@82.180.146.80 "mysql ... -e 'SELECT * FROM queue_members WHERE queue_id=\"<id>\"'"
ssh root@82.180.146.80 'asterisk -rx "queue show org_mnd5khym_5002"'

# 3. If they diverge, force-reload
ssh root@82.180.146.80 'asterisk -rx "module reload app_queue.so"'
ssh root@82.180.146.80 'asterisk -rx "queue show org_mnd5khym_5002"'   # should now match DB

Fix¶

Two changes in /opt/astrapbx/src/:

(a) services/asterisk/configDeploymentService.js:524 — reloadAsteriskConfiguration() now does both reloads:

await execAsync('asterisk -rx "core reload"');
await execAsync('asterisk -rx "module reload app_queue.so"');

(b) server.js — every queue PUT/POST/DELETE-member endpoint now calls await configDeploymentService.reloadAsteriskConfiguration() immediately after deployOrganizationConfiguration(). Three call sites: PUT queue (line 1942), POST member (line 2047), DELETE member (line 2105).

Prevention Rule¶

Whenever you add a new endpoint that modifies any DB row whose value ends up in an Asterisk config file, the chain MUST be: DB write → regenerate config files → reload Asterisk modules. The reload step is non-optional. Document and code-review for it.

For queue-specific changes, always explicitly reload app_queue.so in addition to core reload. Verified 2026-04-10 — core reload alone does NOT cover queues.conf static members.

Error 23: Queue Member Add/Remove Shows "Failed" Toast Despite Succeeding¶

Symptom¶

Click the X next to a queue member in the editor → toast says "Failed: Unexpected end of JSON input" or just "Failed" → but if you refresh the page, the member is actually gone. Same with "+ Add member..." in some flows: success in DB, error in UI. Users think the editor is broken.

Root Cause — Two Bugs Stacked¶

1. pbx/client.ts request helper called res.json() on every response. The DELETE member endpoint at /api/v1/queues/:queueId/members returns 204 No Content (res.status(204).send()). Calling .json() on a 204 response throws SyntaxError: Unexpected end of JSON input. The frontend's catch block fires showToast("Failed", "error") even though the DB delete succeeded.

2. The POST member endpoint had a body-shape mismatch. Frontend sent { user_ids: ["uuid"] } (plural array), backend destructured const { user_id, penalty } = req.body (singular). Backend read user_id = undefined, the User lookup failed, returned 400 {"error":"Invalid user"}. Frontend showed the toast but the row was never created — actually a real failure, not a 204 false positive.

Diagnosis¶

For Bug 1 (false positive on 204): open the browser DevTools Network tab, repeat the failing action, look for a DELETE request that returns 204 followed by a frontend-only error in the Console.

For Bug 2 (real POST failure): hit the endpoint directly with curl and the frontend-style payload to confirm the 400:

curl -X POST 'https://devpbx.astradial.com/api/v1/queues/<id>/members' \
  -H 'X-API-Key: org_xxx' -H 'Content-Type: application/json' \
  -d '{"user_ids":["<uuid>"]}'
# Old behavior: 400 {"error":"Invalid user"}
# New behavior: 201 {"created":[...]}

Fix¶

(a) Frontend — /opt/pipecat-flow-editor/lib/pbx/client.ts:37:

async function request<T>(path: string, opts: RequestInit = {}): Promise<T> {
  const res = await fetch(`${BASE}${path}`, { ...opts, headers: headers() });
  if (!res.ok) { /* throw */ }
  if (res.status === 204) return undefined as unknown as T;   // ← NEW
  return res.json();
}

(b) Backend — /opt/astrapbx/src/server.js:1986 (POST member). Accept both user_id (single, legacy) and user_ids (array, batch). Validate all in one query, skip already-existing members instead of erroring out the whole batch, deploy + reload config exactly once at the end. Returns the single member object for legacy single calls (back-compat) or {created, skipped} for batch calls.
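The dual-shape handling in (b) can be sketched roughly as below. This is a minimal illustration, not the code in server.js: the names AddMembersBody, normalizeMemberIds and planBatch are hypothetical, and the real handler also validates users against the DB and deploys config.

```typescript
// Hypothetical request-body shape: legacy single id or batch array.
interface AddMembersBody {
  user_id?: string;
  user_ids?: string[];
  penalty?: number;
}

interface NormalizedAdd {
  ids: string[];   // de-duplicated, in first-seen order
  legacy: boolean; // true when the caller used the single user_id form
}

// Accept both shapes and reduce them to one list of ids.
function normalizeMemberIds(body: AddMembersBody): NormalizedAdd {
  const legacy = typeof body.user_id === "string" && body.user_ids === undefined;
  const raw: string[] = legacy ? [body.user_id as string] : body.user_ids ?? [];
  const ids = Array.from(new Set(raw.filter((id) => id.length > 0)));
  return { ids, legacy };
}

// Split requested ids into rows to create and rows that already exist,
// so one duplicate does not fail the whole batch.
function planBatch(ids: string[], existing: Set<string>) {
  return {
    toCreate: ids.filter((id) => !existing.has(id)),
    skipped: ids.filter((id) => existing.has(id)),
  };
}
```

The legacy flag is what would let a handler decide between returning the single member object and the {created, skipped} shape.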

(c) UI dropdown reset — /opt/pipecat-flow-editor/app/dashboard/[orgId]/queues/page.tsx:599 — added a key prop on the "+ Add member..." <Select> so it remounts after each successful add and the placeholder returns:

<Select key={`add-member-${editingQueue?.members?.length ?? 0}`} onValueChange={...}>

Without that, the uncontrolled Select held the just-picked value internally and refused to fire onValueChange on the next click, blocking sequential adds.

Prevention Rule¶

Any HTTP client wrapper that handles arbitrary endpoints must branch on Content-Length: 0 or 204 No Content before calling .json(). Same applies to error responses with no body. Recommended pattern:

if (res.status === 204 || res.headers.get('content-length') === '0') {
  return undefined as unknown as T;
}

For request/response shape mismatches, write a one-liner curl test against the deployed endpoint as part of any frontend client change. The toast saying "Failed" is not enough information — always check the actual HTTP status and body in DevTools Network tab.
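Because a Content-Length header can be absent (for example on chunked responses), one possible variant is to read the body as text first and only parse when it is non-empty. The helper below is a sketch; parseBody is a hypothetical name and is not part of pbx/client.ts.

```typescript
// Interpret a response body without assuming it carries JSON.
// Returns undefined for 204 responses and for any empty body.
function parseBody<T>(status: number, bodyText: string): T | undefined {
  if (status === 204 || bodyText.trim() === "") return undefined;
  return JSON.parse(bodyText) as T;
}
```

With fetch this would be called as parseBody(res.status, await res.text()), which also covers error responses that carry no body.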

Error 24: VPS Path is `/opt/astrapbx` (lowercase), not `/opt/AstraPBX`¶

Symptom¶

Following docs/guides/deploy-apps.md, ssh root@82.180.146.80 'ls /opt/AstraPBX/src/' returns "No such file or directory". Searching for code with grep -r "..." /opt/AstraPBX/ yields nothing.

Root Cause¶

The directory on the VPS is /opt/astrapbx (all lowercase). The doc had it as /opt/AstraPBX (mixed case). PM2 confirms via pm2 info astrapbx | grep "script path" → /opt/astrapbx/src/server.js.

Fix¶

Always use /opt/astrapbx. The deploy guide is being updated to reflect this. Until then, treat the case in any doc as untrustworthy and verify with:

ssh root@82.180.146.80 'pm2 info astrapbx | grep -E "script path|exec cwd"'

Prevention Rule¶

Don't trust path documentation — always verify against the running process. Add a CI check or doc-test that reads pm2 info and asserts the documented paths exist.

Error 25: NUC tata-endpoint CHANUNAVAIL After Asterisk Restart¶

Symptom¶

After restarting NUC Asterisk, all outbound calls via Dial(PJSIP/...@tata-endpoint) return CHANUNAVAIL. Inbound calls from Tata still work. Contact shows NonQual.

Root Cause¶

Asterisk 22 with qualify_frequency=0 and max_contacts=0 on the tata-aor AOR results in the static contact being NonQual forever after restart. The Dial() application requires the contact to be Available to create an outbound channel. Inbound works because Tata initiates the SIP dialog.

Diagnosis¶

sudo asterisk -x "pjsip show contacts" | grep tata
# Shows: tata-aor/sip:10.79.215.102:5060 ... NonQual  -nan
# Must show "Avail" for outbound to work

sudo asterisk -x "channel originate PJSIP/919944421125@tata-endpoint application Wait 10"
# Returns immediately with 0 active channels = CHANUNAVAIL

Fix¶

Enable qualify and set max_contacts on the NUC's tata-aor in /etc/asterisk/pjsip.conf:

[tata-aor]
type=aor
contact=sip:10.79.215.102:5060
qualify_frequency=30        ; was 0 — must be >0 for Avail status
max_contacts=1              ; was 0 — required for outbound channel creation

Reload: sudo asterisk -x "module reload res_pjsip.so" and wait 30 seconds for qualify.

Prevention Rule¶

Any PJSIP AOR that needs outbound dialing MUST have qualify_frequency > 0 and max_contacts >= 1. Without these, outbound Dial() silently fails with CHANUNAVAIL after any Asterisk restart. This is an Asterisk 22 regression — older versions were more lenient.

Error 26: Staging Outbound Calls — Wrong Number Format for Tata NNI¶

Symptom¶

Staging outbound calls reach NUC from-cloud context but Tata returns 403 Forbidden or the call makes progress then drops after 20 seconds.

Root Cause¶

Tata NNI expects specific number and CallerID formats:

| What | Wrong format | Correct format |
| --- | --- | --- |
| CallerID | `CALLERID(num)=+91${CALLERID(num):1}` → `+91918065978001` (double country code) | `CALLERID(all)=+918065978001` |
| Dial number | `PJSIP/919944421125@tata-endpoint` (with 91 prefix) | `PJSIP/09944421125@tata-endpoint` (with 0 prefix = local format) |

Using 91 prefix: Tata SBC accepts the INVITE (100 Trying) but can't route it → drops after 20s. Using 0 prefix: Tata routes successfully → phone rings.

Fix¶

NUC /etc/asterisk/extensions.conf [from-cloud]:

[from-cloud]
exten => _X.,1,NoOp(Cloud Outbound via Tata: ${EXTEN} CID: ${CALLERID(all)})
 same => n,Set(CALLERID(all)=+918065978001)
 same => n,Dial(PJSIP/0${EXTEN}@tata-endpoint,60)
 same => n,NoOp(Tata dial status: ${DIALSTATUS})
 same => n,Hangup()

Key points:

Use CALLERID(all), not CALLERID(num): it sets both name and number atomically
Hardcode a valid Tata DID as CallerID
Use the 0${EXTEN} prefix for local dialing format (what Tata NNI expects for outbound)

Prevention Rule¶

Always test outbound number formats using NUC's testonly-outbound context first. Check DIALSTATUS in the log — CHANUNAVAIL means AOR/qualify issue, busy/congested means format rejected.

Error 27: rsync --delete Removes Server-Side Files on Deploy¶

Symptom¶

After CI/CD deploy, firebase-sa-key.json disappears from the API server. Firebase auth breaks with "Invalid Firebase token". Editor .env files also removed.

Root Cause¶

The CI/CD workflow uses rsync -a --delete which removes files on the destination that don't exist in the source. Server-only files (firebase-sa-key.json, .env, .env.local, recordings) are not in git and get deleted.

Fix¶

Exclude all server-side files in rsync:

# API deploy
rsync -a --delete \
  --exclude=node_modules \
  --exclude=.git \
  --exclude=.env \
  --exclude=.env.local \
  --exclude='*.log' \
  --exclude=firebase-sa-key.json \
  --exclude=recordings/ \
  api/ /opt/astrapbx/

# Editor deploy
rsync -a --delete \
  --exclude=node_modules \
  --exclude=.git \
  --exclude=.next \
  --exclude=.env \
  --exclude=.env.local \
  editor/ /opt/pipecat-flow-editor/

Prevention Rule¶

Every rsync --delete in a CI/CD workflow must have an --exclude for every file that exists only on the server. Keep a checklist: .env, .env.local, firebase-sa-key.json, recordings/, any uploaded media. Test deploys on staging first.
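The checklist could be enforced mechanically before a deploy runs. The sketch below assumes the rsync arguments are available as an array and only does exact string matching on --exclude= values; SERVER_ONLY and missingExcludes are hypothetical names.

```typescript
// Files that exist only on the server (taken from the checklist above).
const SERVER_ONLY = [".env", ".env.local", "firebase-sa-key.json", "recordings/"];

// Return every server-only path that has no matching --exclude= argument.
function missingExcludes(rsyncArgs: string[], serverOnly: string[] = SERVER_ONLY): string[] {
  const excluded = new Set(
    rsyncArgs
      .filter((arg) => arg.indexOf("--exclude=") === 0)
      // Drop the flag name and any surrounding shell quotes.
      .map((arg) => arg.slice("--exclude=".length).replace(/^['"]|['"]$/g, "")),
  );
  return serverOnly.filter((path) => !excluded.has(path));
}
```

A CI step could fail the workflow whenever the returned list is non-empty.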

Error 28: `request-org` Self-Serve Crashing with notNull Violation¶

Date: 2026-04-17
Severity: High — blocked every client self-serve org signup

Symptoms¶

POST /api/v1/auth/request-org returns 500. pm2 logs show:

request-org error: notNull Violation: Organization.context_prefix cannot be null,
notNull Violation: Organization.api_key cannot be null

50+ of these over an hour on prod (at least 50 failed signups).

Root Cause¶

The self-serve handler's Organization.create() was passing only name, status, api_secret, contact_info — missing context_prefix and api_key which are allowNull: false on the model.

Fix¶

Added the missing fields:

context_prefix: generateContextPrefix(),
api_key: `org_${uuidv4().replace(/-/g, '')}`,
domain: `${org_name.toLowerCase().replace(/[^a-z0-9]/g, '')}.local`,

Prevention¶

Every Organization.create() path must include all notNull fields. Add a unit test that hits both auth/request-org and POST /organizations with minimal bodies.

Error 29: Admin create-org Leaves Org Unusable (no org_users owner)¶

Date: 2026-04-17

Symptoms¶

Admin creates an org via UI → org appears in admin list → client can't log in with their email → Firebase auth returns "User not found".

Root Cause¶

POST /api/v1/organizations (admin flow) created the organisation + attempted a SIP users row but never inserted an org_users row. The org_users table is what the editor queries for Firebase login (via /api/v1/auth/user-login). No owner row → no login possible.

The SIP user creation also failed silently with "password_hash cannot be null" because the handler set password instead of password_hash and missed asterisk_endpoint.

Fix¶

After creating the organisation, insert an org_users row with role='owner', email=contact_info.email, status='active'. Also fixed the SIP user creation to hash the password and set asterisk_endpoint=<context_prefix>1001.

Prevention¶

Pair admin-created orgs with at least one owner login. Optionally send the owner an invite email with the Firebase sign-up link.

Error 30: Duplicate DID Records — Same Number, Different Formats/Orgs¶

Date: 2026-04-17

Symptoms¶

Inbound calls to a DID land on the wrong org. For +91 80659 78002, the call went to TechStart instead of GrandEstancia.

Root Cause¶

did_numbers had two rows for the same physical number:

08065978002 → GrandEstancia (Indian local format, created later)
+918065978002 → TechStart (E.164, created earlier by a test setup)

Tata sends inbound as 918065978002, so the dispatcher matched the TechStart record first.

Fix¶

Deleted the duplicate row (the TechStart one). Extended the dispatcher generator to emit format aliases (08... AND 91...) for every Indian DID so customers dialing either way reach the same org.

Prevention¶

Enforce a canonical format when inserting DIDs. Suggested: strip all non-digit characters on insert, then replace a leading 0 with +91 so every row is stored in E.164. The generator handles both formats for display compatibility.
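A minimal sketch of such a canonicalisation step, assuming Indian numbers only and E.164 as the stored form. canonicalDid and didAliases are hypothetical names, not functions from the codebase.

```typescript
// Reduce any accepted input format to +91 followed by the 10-digit national number.
function canonicalDid(input: string): string {
  const digits = input.replace(/\D/g, "");
  let national: string;
  if (digits.length === 11 && digits.charAt(0) === "0") {
    national = digits.slice(1);        // 08065978002
  } else if (digits.length === 12 && digits.slice(0, 2) === "91") {
    national = digits.slice(2);        // 918065978002 or +91 80659 78002
  } else if (digits.length === 10) {
    national = digits;                 // bare national number
  } else {
    throw new Error(`Unrecognised DID format: ${input}`);
  }
  return `+91${national}`;
}

// Dialplan aliases for one canonical DID: the local (0...) and 91... forms.
function didAliases(canonical: string): string[] {
  const national = canonical.slice(3);
  return [`0${national}`, `91${national}`];
}
```

Combined with a unique index on the canonical column, the two rows from this incident would have collided at insert time instead of at call time.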

Error 31: Queue Calls Fall Through to AI Agent Without Ringing Members¶

Date: 2026-04-17
Severity: High — real customer inbound calls to GE's reservation queue going to AI instead of agents

Symptoms¶

Customer dials +91 80659 78002 → hears silence briefly → routed to AI voice agent (ext 1003) instead of the human reservation queue (ext 5003). No MOH during wait, no ring to agents.

Root Cause¶

The generated ext_grandestancia.conf queue 5003 section had no Answer() before Queue(). Queue 5001/5002 worked because they had greetings and therefore got Answer(). Queues without greetings (5003, 5004, 5005, 5006) skipped it.

Without Answer(), Queue() runs on an unanswered channel:

Caller hears upstream ringback, no MOH
Queue announces "your hold time is X" to the member when they pick up (the leg isn't properly bridged)
Member hangs up because they just hear the announce
Queue fails → fallback routes to the configured timeout destination (ext 1003 = AI Agent)

Fix¶

Dialplan generator now emits Answer() + Wait(0.5) unconditionally before Queue():

exten => 5003,n,Set(CHANNEL(musicclass)=org_mnd5khym__default)
exten => 5003,n,Answer()            ; always
exten => 5003,n,Wait(0.5)            ; settle before Queue() runs
exten => 5003,n,Queue(org_mnd5khym_5003,ct,,,45)

Hot-fixed prod live, then committed to main so future config regenerations preserve it.

Prevention¶

Always Answer() before Queue() regardless of greeting. If a greeting exists, play it after the answer. Asterisk will not play MOH or ring members properly on an unanswered channel.
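The rule can be expressed in a generator roughly as follows. This is a sketch, not the actual dialplanGenerator.js code: QueueSpec and queueDialplan are hypothetical, and Playback() is used here only as one way to play a greeting after answering.

```typescript
interface QueueSpec {
  extension: string;
  queueName: string;
  timeoutSeconds: number;
  greeting?: string; // sound file path without extension, if configured
}

// Emit Answer() + Wait(0.5) unconditionally; the greeting is optional and
// always comes after the answer, never instead of it.
function queueDialplan(q: QueueSpec): string[] {
  const lines = [
    `exten => ${q.extension},1,NoOp(Queue ${q.queueName})`,
    `exten => ${q.extension},n,Answer()`,
    `exten => ${q.extension},n,Wait(0.5)`,
  ];
  if (q.greeting) {
    lines.push(`exten => ${q.extension},n,Playback(${q.greeting})`);
  }
  lines.push(`exten => ${q.extension},n,Queue(${q.queueName},ct,,,${q.timeoutSeconds})`);
  return lines;
}
```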

Error 32: Staging Inbound Forwarding Failed -- PJSIP "No Auth IDs Available"¶

Date: 2026-04-17

Symptoms¶

After the monorepo cutover, calls to +91 80659 78001 (routing_environment='staging') were reaching prod's dispatcher and getting Dial(PJSIP/918065978001@cloud-endpoint-stage) — but ringing then failing with:

ERROR[...] res_pjsip_outbound_authenticator_digest.c:
  cloud-endpoint-stage:10.10.10.3: There were no auth ids available

Root Cause¶

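Staging's tata_gateway_identify PJSIP rule only matched 10.10.10.2 (NUC). When prod started forwarding from 10.10.10.1 (its WireGuard IP), staging didn't recognise the source and treated the INVITE as unauthenticated, so it answered with an authentication challenge. Prod had no outbound credentials to answer that challenge with, which is what the "no auth ids available" message on cloud-endpoint-stage indicates.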

Fix¶

Added match=10.10.10.1 to staging's tata_gateway_identify in /etc/asterisk/pjsip_tata_gateway.conf:

[tata_gateway_identify]
type=identify
endpoint=tata_gateway
match=10.10.10.2
match=10.10.10.1      ; prod cloud can now forward calls in

Reload with pjsip reload.

Prevention¶

Any new source IP that forwards calls into a staging/prod Asterisk must be added to the relevant identify rule. Document source IPs per tunnel.

Error 33: NUC Clobbers Outbound Caller ID to 78001¶

Date: 2026-04-17
Severity: High — customer-visible (recipients saw wrong caller ID for GE outbound)

Symptoms¶

Customers receiving calls from GrandEstancia's agents saw +91 80659 78001 (Hari Surya's number) instead of GE's 08065978002. This had been happening since the monorepo cutover.

Root Cause¶

The NUC's from-cloud context validates the incoming caller ID against our owned Tata range 918065978000-029. If the CID is outside that range, it falls back to +918065978001 (the platform default).

GE's dialplan set CALLERID(num)=08065978002 (Indian local format, 11 digits starting 0). NUC's range check was:

IN_CID:0:8 must equal "91806597"   ← "08065978" doesn't match

→ fell through to default +918065978001.

Fix¶

Added a format normalisation step BEFORE the range check in NUC's from-cloud:

exten => _X.,n,GotoIf($["${LEN(${IN_CID})}" = "11" & "${IN_CID:0:1}" = "0"]?cid_normalize:cid_skip_norm)
exten => _X.,n(cid_normalize),Set(IN_CID=91${IN_CID:1})
exten => _X.,n(cid_skip_norm),GotoIf(...)  ; existing range check

Now 08065978002 → 918065978002 → in range → passes as +918065978002. Recipients see the right number.

Prevention¶

Outbound CALLERID should ideally always be in E.164 format before it reaches the NUC. The NUC is defence-in-depth. Long-term: normalise CID in the per-user dialplan generator so NUC never has to translate.
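The normalise-then-range-check logic can be sketched outside the dialplan as below, for example in the per-user dialplan generator. outboundCallerId is a hypothetical name, and stripping a leading + before the check is an assumption that is not stated in the NUC dialplan.

```typescript
const DEFAULT_CID = "+918065978001"; // platform default from the text

function outboundCallerId(inCid: string): string {
  // Assumption: a leading + is tolerated and removed before checking.
  let cid = inCid.replace(/^\+/, "");
  // 11 digits with a leading 0 is Indian local format: swap the 0 for 91.
  if (cid.length === 11 && cid.charAt(0) === "0") {
    cid = `91${cid.slice(1)}`;
  }
  // Owned range from the text: 918065978000 to 918065978029.
  const inRange = /^9180659780[0-2][0-9]$/.test(cid);
  return inRange ? `+${cid}` : DEFAULT_CID;
}
```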

Error 34: Admin UI "Environment" Dropdown Silently Ignored¶

Date: 2026-04-17

Symptoms¶

Admin selects Prod / Staging / OSS in the DID Management /admin/dids dropdown → UI shows success → nothing changes in the dispatcher → reload the page and the value reverts.

Root Cause¶

PUT /api/v1/did-pool/admin/:id has an explicit allowed-fields list:

const allowed = ['description', 'region', 'provider', 'monthly_price', 'status', 'trunk_id'];

routing_environment was missing → the PUT returned 200 but never persisted the field.

Fix¶

Added routing_environment to the allowed list. Pattern note: every new column that admin UIs can edit must be explicitly allowed in this endpoint's filter.

Prevention¶

When adding a new editable column, grep for the model name in route files and add the column to all allowed-field filters. Consider deriving the allowed list from the model schema automatically.
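One way to stop this class of bug from being silent is to have the filter report what it dropped, so the handler can return a 400 instead of a 200. The sketch below is illustrative; splitUpdate is a hypothetical helper and the list mirrors the allowed fields after the fix.

```typescript
const ALLOWED_DID_FIELDS: readonly string[] = [
  "description", "region", "provider", "monthly_price",
  "status", "trunk_id", "routing_environment",
];

// Separate the fields that will be persisted from the ones that will not.
function splitUpdate(
  body: Record<string, unknown>,
  allowed: readonly string[] = ALLOWED_DID_FIELDS,
): { accepted: Record<string, unknown>; rejected: string[] } {
  const accepted: Record<string, unknown> = {};
  const rejected: string[] = [];
  for (const key of Object.keys(body)) {
    if (allowed.indexOf(key) !== -1) accepted[key] = body[key];
    else rejected.push(key);
  }
  return { accepted, rejected };
}
```

Had the original endpoint rejected unknown keys, the Environment dropdown would have failed loudly on the first save.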

Error 35: Dispatcher Generator Required an Org for Every DID¶

Date: 2026-04-17

Symptoms¶

To route a staging DID (e.g., +91 80659 78001) from prod to staging, prod's DB needed an org that owned the DID. This forced us to keep shell/test orgs on prod just to preserve inbound flow to staging — couldn't clean up prod's DB.

Root Cause¶

Generator loop skipped DIDs without an organization include:

if (!did.organization) continue;

Fix¶

Collect orphan DIDs separately and emit Dial(PJSIP/<did>@cloud-endpoint-stage) for those marked routing_environment='staging'. Non-staging orphans are logged as a warning and skipped.

Prevention¶

Keep the generator tolerant of partial data. The alternative — requiring a complete org graph — couples prod's tenant list to staging's forwarding needs, which is the wrong direction of dependency.
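The orphan handling described in the Fix can be sketched as a single pass over the DID rows. The Did shape and planDispatch name are hypothetical; only the staging forward target and the skip-with-warning behaviour come from the text above.

```typescript
interface Did {
  number: string;
  routing_environment: string;
  organization?: { id: string } | null;
}

// Partition DIDs into org-owned rows, orphan rows forwarded to staging,
// and orphan rows that are skipped with a warning.
function planDispatch(dids: Did[]) {
  const owned: Did[] = [];
  const forwardToStaging: string[] = [];
  const warnings: string[] = [];
  for (const did of dids) {
    if (did.organization) {
      owned.push(did);
    } else if (did.routing_environment === "staging") {
      forwardToStaging.push(`Dial(PJSIP/${did.number}@cloud-endpoint-stage)`);
    } else {
      warnings.push(`Orphan DID ${did.number} skipped (${did.routing_environment})`);
    }
  }
  return { owned, forwardToStaging, warnings };
}
```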

Error 36: Cross-Org Call Data Leak via Unscoped `ai-outbound` Clause¶

Date: 2026-04-18
Severity: Critical (cross-tenant data leak)
Exposure: Platform-authenticated admin users of one tenant could see another tenant's AI-outbound call metadata (src, dst, time, duration, disposition). No external exposure.

Symptoms¶

Within minutes of approving a brand-new org ("Zauto AI", org_id 728e57ec-…, zero call activity), the owner's dashboard Recent Calls panel listed GrandEstancia's AI outbound calls (from 08065978002 to various 9xxxxxxxxx numbers, timestamps matching GE's actual activity).

Root Cause¶

GET /api/v1/calls and GET /api/v1/calls/history used this WHERE clause for org ownership:

(t.accountcode = ? OR t.peeraccount = ? OR t.channel LIKE ? OR t.dcontext = 'ai-outbound')

The OR t.dcontext = 'ai-outbound' was unscoped — it matched AI-outbound rows regardless of which org originated them. Any authenticated org admin querying /calls received every AI-outbound CDR platform-wide.

Verified on the leaked rows: accountcode = ba50c665-… (GE's id), but the unconditional dcontext clause let them match Zauto's query.

Fix¶

Dropped the unscoped clause. AI-outbound rows are still matched via accountcode, which the workflow engine always sets on Originate:

(t.accountcode = ? OR t.peeraccount = ? OR t.channel LIKE ?)

PRs #54 (staging) + #55 (prod) — deployed 2026-04-18 10:53 UTC.

Verification¶

curl /api/v1/calls?org_id=<zauto-id> → total: 0 (correct — brand-new org, no calls)
curl /api/v1/calls?org_id=<ge-id> → total: 457 (GE's own calls, unchanged)

Prevention¶

Any future CDR query must org-scope every OR branch in the WHERE clause. Never add a fallback that widens org ownership to "everyone for this dcontext/route/endpoint type".
Follow-up audit required for: /api/v1/calls/:linkedId/journey (currently unscoped), /api/v1/calls/:callId/recording (scoped via CDR row lookup, but worth re-verifying), and any future endpoint touching asterisk_cdr.
Consider a helper orgScopedCdrWhere(orgId, prefix) to centralise the predicate so it can't drift per-endpoint.
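A possible shape for that helper is sketched below. The SqlFragment type is hypothetical, and the LIKE pattern is an assumption: the original query's third parameter is not shown, so PJSIP/<prefix>% stands in for it here.

```typescript
interface SqlFragment {
  sql: string;
  params: string[];
}

// Build the org-ownership predicate in one place. Every OR branch takes a
// bound parameter derived from the org, so no branch can match other tenants.
function orgScopedCdrWhere(orgId: string, prefix: string): SqlFragment {
  return {
    sql: "(t.accountcode = ? OR t.peeraccount = ? OR t.channel LIKE ?)",
    // Assumed pattern: channels for this org start with PJSIP/<prefix>.
    params: [orgId, orgId, `PJSIP/${prefix}%`],
  };
}
```

A simple guard for code review is that the number of placeholders equals the number of params, which makes an unparameterised branch such as dcontext = 'ai-outbound' stand out.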

Error 37: Queue Save Returns 500 — `timeout_destination_type: "phone"` Not in DB ENUM¶

Date: 2026-04-21

Symptoms¶

PUT /api/v1/queues/:id returns 500. The queue form in the editor has "Phone Number" selected as the Timeout Destination type. Other queues save fine.

Root Cause¶

The editor's queue form offers "phone" as a timeout_destination_type option. The queues table ENUM only allowed ('extension','queue','ivr','external','hangup'). When the editor sent timeout_destination_type: "phone", Sequelize's queue.update() threw a DB ENUM constraint error, which the catch block converted to a 500 response.

The dialplan generator (dialplanGenerator.js:748) already handled "phone" correctly — only the DB schema was missing it.

Fix¶

ALTER TABLE queues MODIFY timeout_destination_type
  ENUM('extension','queue','ivr','external','hangup','phone') NULL DEFAULT NULL;

Applied directly on prod MariaDB. No restart needed.

Prevention Rule¶

Whenever the editor adds a new ENUM value to any form dropdown, check that the corresponding DB ENUM column includes that value before shipping. The dialplan generator, the API allowedFields, and the DB schema must all agree.
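This check could be automated by comparing the form's options against the column type string that MariaDB reports (for example from SHOW COLUMNS). The sketch below uses hypothetical helper names and a deliberately simple parser that does not handle ENUM values containing commas or quotes.

```typescript
// Parse "enum('a','b')" into ["a", "b"].
function parseEnumType(columnType: string): string[] {
  const m = /^enum\((.*)\)$/i.exec(columnType.trim());
  if (!m) throw new Error(`Not an ENUM column type: ${columnType}`);
  return m[1].split(",").map((v) => v.trim().replace(/^'|'$/g, ""));
}

// Return the form options the DB column would reject.
function missingFromEnum(formOptions: string[], columnType: string): string[] {
  const allowed = parseEnumType(columnType);
  return formOptions.filter((option) => allowed.indexOf(option) === -1);
}
```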

Error 38: MOH Upload Returns 413 Request Entity Too Large¶

Date: 2026-04-21

Symptoms¶

POST /api/pbx/moh/upload returns 413. The MOH upload dialog in the editor fails immediately when a user selects an audio file.

Root Cause¶

The nginx client_max_body_size defaults to 1 MB when unset. Audio files for MOH (.wav, .mp3) are typically 2–10 MB. Nginx rejects the upload before the request reaches astrapbx.

Fix¶

Added client_max_body_size 50M; to the /api/pbx/ location block in /etc/nginx/sites-enabled/editor.astradial.com:

location ~ ^/api/pbx/(.*) {
    client_max_body_size 50M;
    set $upstream_uri /api/v1/$1;
    proxy_pass http://127.0.0.1:8000$upstream_uri$is_args$args;
    ...
}

Reload: systemctl reload nginx

Prevention Rule¶

Any nginx location block that proxies file-upload endpoints must have an explicit client_max_body_size. Default 1 MB is never appropriate for audio, image, or document uploads.

Error 39: Greetings TTS Play Button Missing — `audio_file` is NULL¶

Date: 2026-04-21

Symptoms¶

A greeting is created successfully (shown in the list), but no play button appears. The greeting's audio_file column is NULL in the DB.

Root Cause¶

The POST /api/v1/greetings endpoint only created the DB record; it never called TTSService.saveGreetingAudio(). The TTSService existed at src/services/ttsService.js but was not wired into the creation endpoint.

The editor hides the play button with {g.audio_file && <Button ...>}, so it simply doesn't render for greetings without audio.

Fix¶

Updated POST /api/v1/greetings in server.js to call TTSService.saveGreetingAudio() immediately after creating the DB record:

audio_file = await tts.saveGreetingAudio(id, text, language, voice);
greeting = await Greeting.create({ id, ..., audio_file });

TTS failure is caught silently so the greeting is still created (without audio) if Google TTS is unavailable.

Also added:

PUT /api/v1/greetings/:id updates metadata and, if text/voice/language changes, regenerates TTS audio
DELETE /api/v1/greetings/:id deletes the DB record and audio file

Existing greetings with audio_file: NULL can be fixed by calling PUT with the same text to trigger a regeneration.

Prevention Rule¶

Any endpoint that creates a resource with an associated file must generate that file in the same request. Don't create the DB record first and rely on a follow-up step — if the follow-up fails, the record is permanently orphaned without an obvious error.

Error 40: IVR Greeting Silently Fails — Call Hangs Up with No Audio¶

Symptom¶

Inbound call to a DID routed to an IVR connects, but no greeting plays and the call drops after a few seconds.

Diagnosis¶

grep Background /etc/asterisk/ext_<org>.conf | head
# If you see: Background(greeting_ivr_<uuid>)    ← WRONG (bare filename)
# Should be:  Background(greetings/greeting_ivr_<uuid>)

ls /var/lib/asterisk/sounds/greetings/greeting_ivr_<uuid>.wav
# File must exist.

Root Cause¶

TTSService.saveGreetingAudio() writes to /var/lib/asterisk/sounds/greetings/<prompt>.wav, but Asterisk's Background() only searches the language subdir under astsoundsdir (e.g. /var/lib/asterisk/sounds/en/) — NOT the greetings/ subdir. A bare filename fails silently; Asterisk logs no error, just moves past.

Compound cause: if the user set greeting_text via the UI but "Generate greeting" failed or wasn't clicked, the DB has greeting_prompt set but no .wav file exists.

Fix¶

Fixed in api/src/services/asterisk/dialplanGenerator.js:569:

extension += `exten => ${ivr.extension},n(start),Background(greetings/${ivr.greeting_prompt})\n`;

After deploying the code fix, regenerate each org's dialplan via configDeploymentService.deployOrganizationConfiguration(orgId, name).

To regenerate a missing .wav, run TTS directly or have the user click "Generate greeting" in the IVR UI:

const tts = new TTSService();
await tts.saveGreetingAudio(`ivr_${ivrId}`, text, language, voice);

Error 41: SIP Phone → IVR Extension Returns SIP 404¶

Symptom¶

Zoiper or other softphone registered against the PBX dials an IVR extension (e.g. 7002) and gets Not Found (code: 404). The IVR works for external callers via DID routing but not from internal SIP.

Diagnosis¶

asterisk -rx "dialplan show 7002@<prefix>__internal"
# If: "There is no existence of context 'X'" or only matches _X. wildcard,
# the _ivr context is not included.

grep -A 5 "^\[<prefix>__internal\]" /etc/asterisk/ext_<org>.conf
# Should show: include => <prefix>__ivr

Root Cause¶

generateInternalContext historically only included _outbound and _queue. IVR extensions only exist in _ivr, so internal SIP callers had no way to reach them.

Fix¶

Fixed in dialplanGenerator.js:114-120 — added include => <prefix>_ivr AND reordered includes so exact-match contexts come first (see Error 47). Regenerate org dialplan after deploy.

Error 42: pjsip Endpoint Rejected — `Could not find option suitable for category`¶

Symptom¶

Outbound calls fail with pjsip error. Asterisk log shows:

ERROR[...] config_options.c: Could not find option suitable for category
  '<endpoint-name>' named 'system_trunk' at line N of pjsip_<org>.conf
ERROR[...] res_sorcery_config.c: Could not create an object of type
  'endpoint' with id '<endpoint-name>' from configuration file 'pjsip.conf'

asterisk -rx "pjsip show endpoint <endpoint-name>" returns "Unable to find object". Any Dial(PJSIP/num@<endpoint-name>) dies.

Diagnosis¶

mariadb ... -e "SELECT name, configuration FROM sip_trunks WHERE asterisk_peer_name='<endpoint-name>'"
# If configuration contains non-pjsip keys like system_trunk, nuc_gateway,
# channels, max_channels, routing_environment, notes — that's the bug.

Root Cause¶

sipTrunkService.js previously splatted EVERY key from the sip_trunks.configuration JSON column verbatim as a pjsip endpoint option. When ops added metadata fields to that column (intended for admin UI display), pjsip rejected the whole endpoint because system_trunk=true is not a valid option.

Fix¶

Fixed in api/src/services/asterisk/sipTrunkService.js:109-131 — added a METADATA_KEYS deny list:

const METADATA_KEYS = new Set([
  'system_trunk', 'nuc_gateway', 'channels', 'max_channels',
  'routing_environment', 'notes',
]);
if (trunk.configuration && typeof trunk.configuration === 'object') {
  Object.entries(trunk.configuration).forEach(([key, value]) => {
    if (METADATA_KEYS.has(key)) return;
    config += `${key}=${value}\n`;
  });
}

After code deploy, regenerate the affected org's pjsip config and asterisk -rx "pjsip reload".

Error 43: Admin Impersonation — Users Empty, No Auto-Logout After 24h¶

Symptom¶

Admin impersonates an org, comes back 24h+ later. Dashboard loads but shows empty users list, zero call stats, no "Session expired" redirect. API calls return 401 silently.

Diagnosis¶

In browser devtools on editor.astradial.com:

localStorage.getItem('pbx_org_token_exp')          // compare to Date.now() — expired?
localStorage.getItem('gateway_admin_key')          // truthy = admin session active
JSON.parse(localStorage.getItem('org_access')).impersonating  // true = admin impersonating

If all three signal impersonation-with-expired-JWT but no redirect fires, the watcher isn't handling this case.

Root Cause¶

AuthExpiryWatcher and handleUnauthorized both bailed when any admin key was present, ignoring the fact that an impersonating admin ALSO has a PBX JWT that expires in 24h. The impersonation JWT silently expired, 401s were swallowed by the catch handlers, and the UI showed empty data.

Fix¶

Fixed in PR #62:

AuthExpiryWatcher.scheduleFromStorage() now schedules a timer whenever pbx_org_token exists, regardless of admin-key state.
handleUnauthorized distinguishes three session types:
Normal org user → full logout + Firebase signOut.
Admin impersonating → clear ONLY impersonation state (pbx_org_token*, org_access, user_role, user_permissions), redirect to /dashboard. Keep Firebase + gateway_admin_key intact.
Pure admin (no JWT) → swallow 401 as before (admin auth uses a different mechanism).

Also added admin_session_start stamped at admin login + handleAdminSessionExpiry for a separate 24h admin-session auto-logout that DOES sign out of Firebase.
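The three-way distinction can be sketched as a pure function over the stored auth state. This is an illustration only: the type names and the rule "admin key plus PBX token means impersonation" are assumptions, and the real handler in PR #62 may also consult org_access.impersonating.

```typescript
type SessionKind = "org-user" | "admin-impersonating" | "pure-admin";
type UnauthorizedAction = "full-logout" | "clear-impersonation" | "ignore";

interface StoredAuth {
  pbxOrgToken: string | null;     // localStorage pbx_org_token
  gatewayAdminKey: string | null; // localStorage gateway_admin_key
}

function classifySession(auth: StoredAuth): SessionKind {
  if (auth.gatewayAdminKey && auth.pbxOrgToken) return "admin-impersonating";
  if (auth.gatewayAdminKey) return "pure-admin";
  return "org-user";
}

// Map each session type to the 401 handling described above.
function actionOn401(kind: SessionKind): UnauthorizedAction {
  switch (kind) {
    case "org-user":
      return "full-logout";          // includes Firebase signOut
    case "admin-impersonating":
      return "clear-impersonation";  // keep Firebase + gateway_admin_key
    case "pure-admin":
      return "ignore";               // admin auth uses a different mechanism
  }
}
```

Keeping the classification separate from the side effects makes the previously missed case (admin key present and JWT present) directly testable.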

Error 44: Outbound E.164 Number with `+` Prefix — "Extension Not Found"¶

Symptom¶

Softphone (Zoiper, Bria, etc.) dials +919944421125. Asterisk logs:

Call (UDP:.../...) to extension '+919944421125' rejected because
  extension not found in context '<prefix>__internal'.

Dialling the same number without + (919944421125) works.

Root Cause¶

Asterisk pattern matching treats + as a literal character, not a digit. _X. in outbound context matches 9944421125 (starts with digit) but does NOT match +919944421125 (starts with +). No rule matched → 404.

Fix¶

Fixed in dialplanGenerator.js generateOutboundContext — added a catch-all pattern at the top of the outbound context:

exten => _+X.,1,NoOp(Stripping leading + from ${EXTEN})
exten => _+X.,n,Goto(${EXTEN:1},1)

This matches any +<digits>, strips the +, and re-enters the dialplan at the same context with the digits-only form. Internal context includes _outbound, so SIP phones get it for free.

Error 45: Staging Outbound Calls Get Congestion / 403 Forbidden¶

Symptom¶

SIP phone on staging dials a PSTN number. Staging logs show the call reaches Dial(PJSIP/num@<trunk>,60), sends INVITE to prod (10.10.10.1:5060) over WireGuard, then Asterisk reports "Everyone is busy/congested at this time (1:0/0/1)" and returns 403 to the caller. Prod logs show cloud-endpoint-stage matched the incoming INVITE but the call dies in the from-cloud context.

Diagnosis¶

# On prod:
cat /etc/asterisk/ext_from_cloud.conf
# Check the Goto target — must be a context that exists.

asterisk -rx "dialplan show <target-context>"
# If "There is no existence of context" — that's the bug.

Root Cause¶

Prod's ext_from_cloud.conf Goto'd a per-org outbound context (org_mna9x47k__outbound) that was never generated on prod because the corresponding org was never provisioned there. Asterisk raised no configuration error: the call simply failed at the Goto, and staging's Dial() reported that as congestion.

Fix¶

Prod hand-edit (documented here because ext_from_cloud.conf is not in the monorepo — see Prod Direct-Edit):

 [from-cloud]
 exten => _X.,1,NoOp(Staging Cloud Outbound: ${EXTEN} from ${CALLERID(all)})
 same => n,Set(CALLERID(num)=+918065978001)
 same => n,Set(CALLERID(name)=AstraPrivate)
-same => n,Goto(org_mna9x47k__outbound,${EXTEN},1)
+same => n,Goto(staging-outbound,${EXTEN},1)

staging-outbound (in ext_staging_outbound.conf) correctly dials PJSIP/${EXTEN}@tata_gateway. Hot-reload via asterisk -rx "dialplan reload".

Prevention Rule¶

Any Goto target in a hand-maintained Asterisk conf must be verified via dialplan show <target> before reload. Asterisk does NOT validate Goto targets at load time — bad targets only surface at call time as congestion.
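A rough pre-reload lint for this rule is sketched below. It is a simplification with a hypothetical function name: it only recognises the three-argument Goto(context,exten,priority) form, ignores arguments containing parentheses, and needs the list of contexts defined in other files to be passed in (for example collected from dialplan show).

```typescript
// Return Goto target contexts that are neither defined in this conf text
// nor listed in knownContexts.
function undefinedGotoTargets(conf: string, knownContexts: string[] = []): string[] {
  const defined = new Set(knownContexts);
  const targets: string[] = [];
  for (const rawLine of conf.split("\n")) {
    const line = rawLine.split(";")[0].trim(); // drop trailing comments
    const ctx = /^\[([^\]]+)\]/.exec(line);
    if (ctx) {
      defined.add(ctx[1]);
      continue;
    }
    // Only the three-argument form names a context explicitly.
    const go = /Goto\(([^,()]+),[^,()]+,[^,()]+\)/.exec(line);
    if (go) targets.push(go[1].trim());
  }
  return targets.filter((t, i) => !defined.has(t) && targets.indexOf(t) === i);
}
```

It does not replace dialplan show on the live system, but it would catch a target like org_mna9x47k__outbound before the file is reloaded.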

Error 46: Dialplan Regen Fails — `Unknown column 'ivrs.greeting_text'`¶

Symptom¶

configDeploymentService.deployOrganizationConfiguration throws when called on prod or a target env:

SQLState: 42S22) Unknown column 'ivrs.greeting_text' in 'SELECT'

Staging works fine for the same code; only this env fails.

Root Cause¶

Schema drift. The Sequelize model's SELECT includes columns that were added to staging via migration-like ALTER commands but never propagated to the target env. Idempotent ALTERs in migration files (ADD COLUMN IF NOT EXISTS) are safe to re-run.

Diagnosis¶

# Diff schemas between healthy env and broken env:
ssh root@healthy 'mariadb ... -e "DESCRIBE ivrs"' > /tmp/healthy.txt
ssh root@broken  'mariadb ... -e "DESCRIBE ivrs"' > /tmp/broken.txt
diff /tmp/healthy.txt /tmp/broken.txt

Fix¶

Re-run the relevant migration file (all IVR migrations are idempotent):

scp api/database/migrations/<migration>.sql root@<vps>:/tmp/
ssh root@<vps> 'mariadb -u<user> -p<pw> pbx_api_db < /tmp/<migration>.sql'

Prevention Rule¶

Before deploying code that depends on a schema change, diff schema between the healthy env and the target env and re-run any missing migrations first. CI/CD does NOT run migrations — it's deliberately a human step, precisely because auto-migration of an out-of-date prod is a footgun.
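The schema diff can also be reduced to a list of missing column names. The sketch below assumes the DESCRIBE output was captured non-interactively, where the mariadb client prints tab-separated rows with a Field header; both function names are hypothetical.

```typescript
// Extract column names from tab-separated DESCRIBE output.
function columnsFromDescribe(output: string): string[] {
  return output
    .split("\n")
    .map((line) => line.split("\t")[0].trim())
    .filter((name) => name !== "" && name !== "Field");
}

// Columns present in the healthy env but absent from the target env.
function missingColumns(healthy: string, target: string): string[] {
  const have = new Set(columnsFromDescribe(target));
  return columnsFromDescribe(healthy).filter((col) => !have.has(col));
}
```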

Error 47: Queue/IVR Dialed from Internal SIP Silently Goes to PSTN¶

Symptom¶

A SIP phone in the <org>_internal context dials a known queue number or IVR extension (e.g. 5001 or 7002). Instead of reaching the queue/IVR, the call goes OUT to PSTN with the dialed number as destination. Asterisk raises no error, and the call is billed against the trunk.

Diagnosis¶

asterisk -rx "dialplan show <number>@<prefix>__internal"
# If the match shown is '_X.' from _outbound context, the wildcard is
# winning over the exact match in _queue or _ivr.

grep "include => " /etc/asterisk/ext_<org>.conf
# If order is: _outbound then _queue then _ivr — that's the bug.

Root Cause¶

Asterisk searches includes in declaration order and returns on the first include that has a match. Specificity (exact vs pattern) works WITHIN a single include but does NOT cross include boundaries. If _outbound is first, its _X. catches any digit sequence before _queue/_ivr get searched — even though those have exact matches.
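
The first-match-wins behaviour across includes can be reproduced with a small model. This is a simplified Python sketch under stated assumptions: the matcher supports only literal extensions and the X, Z, N and '.' pattern elements, it does not model specificity ordering within a single context, and the context names are hypothetical.

```python
def matches(pattern: str, dialed: str) -> bool:
    """Simplified Asterisk matching: literal, or _-pattern with X, Z, N and '.'."""
    if not pattern.startswith("_"):
        return pattern == dialed
    classes = {"X": "0123456789", "Z": "123456789", "N": "23456789"}
    body = pattern[1:]
    for i, ch in enumerate(body):
        if ch == ".":
            # '.' matches one or more remaining characters.
            return len(dialed) > i
        if i >= len(dialed):
            return False
        if dialed[i] not in classes.get(ch, ch):
            return False
    return len(dialed) == len(body)


def first_matching_include(includes, dialed):
    """Walk includes in declaration order; stop at the first context with a match."""
    for context, patterns in includes:
        for pattern in patterns:
            if matches(pattern, dialed):
                return context, pattern
    return None


buggy = [("org_outbound", ["_X."]), ("org_queue", ["5001"]), ("org_ivr", ["7002"])]
fixed = [("org_ivr", ["7002"]), ("org_queue", ["5001"]), ("org_outbound", ["_X."])]

print(first_matching_include(buggy, "5001"))  # ('org_outbound', '_X.')
print(first_matching_include(fixed, "5001"))  # ('org_queue', '5001')
```

With the wildcard context first, the exact match in the queue context is never consulted. Reordering the includes is enough to change the outcome; no pattern needs to change.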

Fix¶

Fixed in dialplanGenerator.js:114-120 — reordered to put exact-match contexts first:

include => <prefix>_ivr       ; exact IVR extensions
include => <prefix>_queue     ; exact queue numbers
include => <prefix>_outbound  ; _X. wildcard, last

Historical detail: prod orgs that had queues (Om Chamber 5001, etc.) had been silently broken for internal-dial-to-queue since launch. Nobody reported it because users normally reach queues via DID routing, not internal dial. The fix corrects the behaviour; no config change is needed on the call-flow side.

Prevention Rule¶

In every include chain, wildcard-match contexts must come LAST. Put exact-match (queue/IVR) before _X.-style patterns. Same rule applies to any new context types added later.

Error 48: DID Edit Dialog — IVR Dropdown Missing / Free-Text Input Deselects Per Keystroke¶

Symptom¶

Two tightly linked UI bugs in the DIDs edit dialog on the editor:

Selecting routing_type=ivr shows a free-text Input instead of a dropdown listing IVRs. Users type the IVR's extension number, save, then calls to the DID return "number not in service".
Typing in any free-text Destination input (external / ai_agent) drops focus from the input after every single keystroke.

Root Cause¶

The two bugs sit in the same component but have separate causes.

Focus loss: DestinationField was declared INSIDE the DidsPage component function. Every render of DidsPage creates a new function reference; React treats each render's DestinationField as a new component type, unmounts the old subtree, and mounts a fresh <Input> / <Select>, dropping focus.

Missing dropdown: the conditional had no ivr branch, so users fell through to the free-text Input and typed the extension. The dialplan generator looks up IVRs by UUID (org.ivrs.find(i => i.id === did.routing_destination)), so extension-number strings never match and calls fail with number-not-in-service.

Fix¶

Fixed in editor/app/dashboard/[orgId]/dids/page.tsx:

Hoisted DestinationField out of DidsPage to a top-level component with userList / queueList / ivrList passed as props (not closures).

Added ivr branch:

if (routingType === "ivr") {
  return (
    <Select value={value} onValueChange={onChange}>
      <SelectTrigger><SelectValue placeholder="Select IVR" /></SelectTrigger>
      <SelectContent>
        {ivrList.filter(i => i.status === "active").map((i) => (
          <SelectItem key={i.id} value={i.id}>{i.extension} — {i.name}</SelectItem>
        ))}
      </SelectContent>
    </Select>
  );
}

Added displayDestination helper so the DIDs table shows "<ext> — <name>" instead of the raw UUID.

Prevention Rule¶

Never declare a sub-component inside another component function. Either (a) hoist it outside, (b) wrap it in useCallback/useMemo with correct deps, or (c) accept that it'll remount on every parent render — only acceptable for stateless read-only presentations.

Error 49: Upptime CI — `Cannot read properties of undefined (reading 'tag_name')`¶

Symptom¶

Roughly half of the scheduled Uptime CI, Response Time CI, and Graphs CI runs in astradial/upptime fail within ~15 seconds with:

ERROR TypeError: Cannot read properties of undefined (reading 'tag_name')
    at getUptimeMonitorVersion (.../uptime-monitor/v1.41.0/webpack:/@upptime/uptime-monitor/dist/helpers/workflows.js:17)

The other half succeed. status.astradial.com updates intermittently, with gaps in coverage.

Root Cause¶

upptime/uptime-monitor@v1.41.0 calls octokit.repos.listReleases({ owner: "upptime", repo: "uptime-monitor", per_page: 1 }) and accesses releases.data[0].tag_name with no null check. The GitHub API endpoint GET /repos/upptime/uptime-monitor/releases flaps: sometimes it returns the release list, sometimes []. Hand-verified: same auth, two consecutive calls, one full payload and one empty array. When the action lands on the empty response, releases.data[0] is undefined and reading tag_name throws.

Upstream upptime/uptime-monitor has issues disabled and last released 2025-09-04; the project is effectively unmaintained.

Diagnosis¶

Reproduce the flap directly:

for i in 1 2 3 4 5; do
  echo "Try $i:"
  gh api 'repos/upptime/uptime-monitor/releases?per_page=1' --jq '.[0].tag_name // "EMPTY"'
done

Inspect the bundled helper that crashes:

gh api repos/upptime/uptime-monitor/contents/dist/helpers/workflows.js \
  --jq '.content' | base64 -d | head -25

Fix¶

Forked the action to astradial/uptime-monitor and patched the bundled dist/index.js directly (single line, surgical edit — no rebundle):

// Before (around line 1214):
release = releases.data[0].tag_name;

// After:
try {
  release = releases.data[0]?.tag_name ?? "v1.41.0";
} catch {
  release = "v1.41.0";
}

Shipped as v1.41.3-astradial. All 5 workflows in astradial/upptime pin:

- uses: astradial/uptime-monitor@v1.41.3-astradial

We also keep the matching patch in src/helpers/workflows.ts so a future rebundle picks it up — but the bundle is the source of truth at runtime.

Tag history (read this before re-touching the fork)¶

| Tag | What it shipped | State |
| --- | --- | --- |
| `v1.41.1-astradial` | Patched `src/helpers/workflows.ts` only; never rebundled. `dist/index.js` still ran the unpatched code. | Broken: kept crashing on empty listReleases. |
| `v1.41.2-astradial` | Bumped `action.yml` runtime `node20` → `node24` to clear the deprecation warning. Did not rebundle. | Broken: `node_libcurl.node` mismatch (NODE_MODULE_VERSION 115 vs 137). Reverted via astradial/upptime#7. |
| `v1.41.3-astradial` (first attempt) | Re-bundled `dist/index.js` from scratch with ncc on macOS + `npm install --ignore-scripts`. ncc embedded a `@@notfound.js` stub for `node_libcurl.node` because the post-install never compiled the binary on macOS. | Broken: `Cannot find module '../lib/binding/node_libcurl.node'`. Reverted same hour. |
| `v1.41.3-astradial` (current) | Surgical patch. Reset to `v1.41.1` commit, edited the single offending line in the existing `dist/index.js` to add `?.` + `??` + try/catch. No rebundle, no rebuild; original `node_libcurl.node` stays untouched. | Working. |

Prevention Rules¶

1. GitHub Actions runs the bundle, not the source. A src/-only patch is silently a no-op until someone rebuilds. Either patch dist/<entry>.js directly (surgical edit), or do a full rebundle; if rebundling, you must include the prebuilt native modules (see rule 2).
2. Don't rebundle this action on macOS without first compiling node-libcurl for Linux. The bundle will only work on the GitHub Actions Linux runner if node_modules/node-libcurl/lib/binding/node_libcurl.node exists at bundle time; otherwise ncc inserts a @@notfound.js stub and the action throws Cannot find module at runtime. npm install --ignore-scripts skips the binary, so a clean rebundle from macOS is broken-by-default. For small fixes, prefer a surgical edit of the existing dist/index.js (we did this for v1.41.3-astradial).
3. Don't rev action.yml runs.using without recompiling node_libcurl. The .node file in dist/lib/binding/ is ABI-locked to a specific Node major. Bumping node20 → node24 without rebuilding the binary against node24 yields the NODE_MODULE_VERSION error above. The Node 20 deprecation warning is fine to live with until someone gets a node24-compiled prebuilt.
4. Treat upstream actions with issues-disabled + dormant repos as unmaintained. Fork before adopting for production-impacting work; pin to your fork's tag.

Verification¶

After bumping the pin, manually trigger a run and watch the logs:

gh workflow run "Uptime CI" --repo astradial/upptime
gh run watch --repo astradial/upptime

A green run with the 🔼 Upptime @v1.41.0 banner in the generated workflow header confirms the fallback path is working (since getUptimeMonitorVersion now returns the literal "v1.41.0" when the API call returns empty).

Error 50: Editor "Add Trunk" Form Has No Password Field — Inbound Trunk Save Silently 400s¶

Symptom¶

In editor.astradial.com → org → Trunks → + Add Trunk, you fill in Name, Host, Username, pick Inbound as the trunk type, click Create, and:

The dialog closes
A toast briefly flashes "Failed to create" (easy to miss)
The trunk does NOT appear in the list
Some users report being "logged out" — actually a navigation glitch where the editor lands on a different org's overview page after the failure

Root Cause¶

The editor's Create Trunk dialog (editor/app/dashboard/[orgId]/trunks/page.tsx) was missing the Password input entirely:

The form's React state declared password: "" but no <Input> rendered it.
handleCreate() did not pass password in the API call.

The API requires it — api/src/server.js:1357:

if (trunk_type === 'inbound' && (!username || !password)) {
  return res.status(400).json({ error: 'Username and password are required ...' });
}

So inbound trunk creation from the editor always 400'd. peer2peer trunks (the only existing type at the time, e.g. the Tata trunk per org) didn't need a password, so the bug went unnoticed until VSEVEN HOTELS became the first customer to use an inbound trunk to connect their on-prem UCM6301 to the cloud.

Diagnosis¶

Cloud access log will show repeated POST /api/v1/trunks returning 400:

ssh root@82.180.146.80 'tail -200 /root/.pm2/logs/astrapbx-out.log | \
  grep -E "POST /api/v1/trunks.*400"'

DB will have no new row for the trunk:

SELECT id, name, trunk_type FROM sip_trunks WHERE org_id='<org-id>'
ORDER BY created_at DESC LIMIT 5;

Fix¶

Add a Password input to the Create Trunk dialog with show/hide toggle and a Generate button (32-char hex via WebCrypto). Send password in the API call. After successful create, open a Credentials dialog showing server, port, transport, username, password (masked) with copy buttons. Add a "View credentials" entry in the row dropdown that GETs /trunks/:id and re-opens the same dialog.

Files: editor/lib/pbx/client.ts (add password?: string to PbxTrunk type), editor/app/dashboard/[orgId]/trunks/page.tsx (form + dialog).

Branch: fix/trunks-form-password-field (commit a0be1d0).

Prevention Rule¶

When adding new trunk types or other multi-shape API endpoints, always smoke-test the editor flow against ALL trunk types before shipping, not just the default. The bug existed since the trunks page was first written but was invisible until a customer hit the non-default branch.

For new endpoints whose form-required fields differ by trunk_type (or any other discriminator), prefer one of:

Conditional rendering based on the discriminator.
A single shared form that always sends every field, with the API ignoring irrelevant ones. This is much harder to silently break.
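
One way to keep a form and its API in agreement is to drive both from a single table of required fields per discriminator value. The Python sketch below is illustrative only. The table and function names are hypothetical, and only the inbound rule (username and password required) comes from the API check quoted in this entry; the name and host requirements and the sample values are assumptions.

```python
# Hypothetical table: only the inbound username/password rule is taken from
# the API check in server.js; the remaining entries are illustrative.
REQUIRED_BY_TRUNK_TYPE = {
    "inbound": ("name", "host", "username", "password"),
    "peer2peer": ("name", "host"),
}


def missing_fields(payload: dict) -> list:
    """Return required fields that are absent or empty for this trunk_type."""
    trunk_type = payload.get("trunk_type")
    if trunk_type not in REQUIRED_BY_TRUNK_TYPE:
        return ["trunk_type"]
    return [f for f in REQUIRED_BY_TRUNK_TYPE[trunk_type] if not payload.get(f)]


form = {
    "trunk_type": "inbound",
    "name": "customer-ucm",
    "host": "203.0.113.10",
    "username": "ucm6301",
}
print(missing_fields(form))  # ['password']
```

Running the same check in the form before submit and in a smoke test over every key of the table would surface a missing input for a non-default type instead of leaving it to a 400 at save time.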

Error 51: Editor User Routing — `ring_target=phone` / `routing_type=ai_agent` Silently Reverts to Defaults¶

Symptom¶

In the editor's org → Users page, you create a user with Routing → Phone + a phone number, or edit an existing user to switch routing. After Save:

The user appears in the list (or the existing user looks updated for a moment)
After the list refreshes, the routing column shows SIP again
The phone number field appears empty
For the edit flow, a toast briefly shows "Failed to update"

Root Cause¶

Two independent bugs that together produced one symptom. The CREATE flow had one bug, the EDIT flow had a different one.

Bug A — POST /api/v1/users dropped routing fields. The handler destructured only:

const { extension, username, password, full_name, email, role = 'agent' } = req.body;

phone_number, ring_target, routing_type, routing_destination were never read from the request body, never passed to User.create(), so the model defaults applied (ring_target='ext', routing_type='sip').

Bug B — PUT /api/v1/users/:id/routing did not exist on the API. The editor's handleEdit calls two endpoints in sequence:

await users.update(editUser.id, { full_name, email, extension, role, outbound_did, password });
await users.updateRouting(editUser.id, { routing_type, routing_destination, ring_target, phone_number });

The first PUT was fine (didn't touch routing fields). The second hit PUT /users/:id/routing — an endpoint the API never had — so it returned 404. The 404 caused handleEdit to throw and loadUsers() to never run, leaving the UI showing the pre-edit state.

Why this lay dormant for so long¶

The only customer using ring_target=phone before VSEVEN HOTELS was GrandEstancia. Per docs/guides/grandestancia-setup.md, all GrandEstancia routing changes were made via direct curl to PUT /api/v1/users/{userId} (the main update endpoint, which was correctly mapping ring_target and phone_number from allowedFields), not via the editor UI. So the editor's edit-routing path had never been exercised by a real customer until VSEVEN.

Diagnosis¶

Access log shows the dual-call pattern with the second 404'ing:

PUT /api/v1/users/<uuid>           → 200
PUT /api/v1/users/<uuid>/routing   → 404

DB row stays unchanged:

SELECT id, extension, routing_type, ring_target, phone_number, updated_at
FROM users WHERE id='<uuid>';

For Bug A specifically: a user just created appears in the response with routing_type='sip' and ring_target='ext' regardless of what the client sent.

Fix¶

Bug A — POST /api/v1/users: destructure the routing fields from the body and pass them to User.create() only when the client explicitly sent them (so omission preserves model defaults rather than overwriting with undefined):

const {
  extension, username, password, full_name, email, role = 'agent',
  phone_number, ring_target, routing_type, routing_destination,
} = req.body;
// ...
const user = await User.create({
  /* ...existing fields... */
  ...(phone_number !== undefined && { phone_number }),
  ...(ring_target !== undefined && { ring_target }),
  ...(routing_type !== undefined && { routing_type }),
  ...(routing_destination !== undefined && { routing_destination }),
});

Also extend PUT /api/v1/users/:id allowedFields with routing_type, routing_destination (it already had ring_target, phone_number).

Bug B — Add the missing PUT /api/v1/users/:id/routing endpoint, mirroring the existing PUT /api/v1/dids/:id/routing pattern:

app.put('/api/v1/users/:id/routing', authenticateOrg, async (req, res) => {
  const user = await User.findOne({ where: { id: req.params.id, org_id: req.orgId } });
  if (!user) return res.status(404).json({ error: 'User not found' });

  const { routing_type, routing_destination, ring_target, phone_number } = req.body;
  const updateData = {};
  if (routing_type !== undefined) updateData.routing_type = routing_type;
  if (routing_destination !== undefined) updateData.routing_destination = routing_destination || null;
  if (ring_target !== undefined) updateData.ring_target = ring_target;
  if (phone_number !== undefined) updateData.phone_number = phone_number || null;

  // ring_target='phone' without phone_number is unreachable — fail loudly.
  const finalRingTarget = updateData.ring_target ?? user.ring_target;
  const finalPhoneNumber = updateData.phone_number ?? user.phone_number;
  if (finalRingTarget === 'phone' && !finalPhoneNumber) {
    return res.status(400).json({ error: 'phone_number is required when ring_target is "phone"' });
  }

  await user.update(updateData);
  const { password_hash, sip_password, ...userData } = user.toJSON();
  res.json(userData);
});

Branch: fix/users-routing-fields-not-saved (commits 8794837 for Bug A, 802f233 for Bug B).

Prevention Rules¶

When the editor calls a new API endpoint, grep the API to confirm it actually exists. A 404 in the access log is the only evidence this kind of bug ever happened — easy to miss if you're not looking.
req.body destructuring with hardcoded fields silently drops everything else. Whenever a model gains optional fields, every endpoint that creates/updates that model needs to be audited for whether it forwards those fields. Better still: use a single allowlist constant and have both POST and PUT iterate it.
Smoke-test create AND edit flows before declaring a feature done. Bug A would have been caught by trying to create a phone-routed user, Bug B by trying to edit one to phone-routed. We had docs that documented this feature working (GrandEstancia) but only via curl to the main PUT — so the docs themselves masked the editor bug.
For paired endpoints like PUT /resource/:id and PUT /resource/:id/routing: keep them next to each other in server.js and document that they're paired so the next person who removes one notices the other.
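
The single-allowlist idea from the rules above can be sketched in a few lines. This is a hypothetical Python illustration, not the server.js implementation: the constant and function names are invented, and the field list is the set of routing fields named in this entry.

```python
# Routing fields named in this entry; the constant name is hypothetical.
ROUTING_FIELDS = ("phone_number", "ring_target", "routing_type", "routing_destination")


def pick_allowed(body: dict, allowed=ROUTING_FIELDS) -> dict:
    """Keep only allowlisted keys the client actually sent.

    Absent keys are skipped, so model defaults still apply on create and
    untouched columns stay unchanged on update.
    """
    return {key: body[key] for key in allowed if key in body}


request_body = {
    "extension": "1001",
    "ring_target": "phone",
    "phone_number": "9944421125",
    "is_admin": True,  # not allowlisted, so it is dropped
}
print(pick_allowed(request_body))
# {'phone_number': '9944421125', 'ring_target': 'phone'}
```

If both the create and update handlers call the same helper with the same constant, adding a field to the model means changing one tuple rather than auditing every destructuring statement.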

Error 52: SIP REGISTERs silently dropped after PJSIP reload — fail2ban storm masks it¶

Symptom¶

After a PROD config push, multiple customer softphones across multiple orgs report SIP 408 Request Timeout. tcpdump on the PROD VPS confirms the REGISTERs arrive at eth0, but Asterisk sends no response and /var/log/asterisk/full.log has no entry for the attempts. The PROD VPS appears completely unresponsive to new registrations while existing long-lived ones (e.g., a customer PBX over WireGuard) keep working.

Root Cause¶

Two stacked failures producing one symptom:

1. PJSIP module left in a stuck state by back-to-back reloads. The dialplan generator triggered module reload res_pjsip.so twice within ~1 second (19:14:39 then 19:14:40). PJSIP transports are not fully reloadable — Asterisk logs:

NOTICE: Transport 'transport-udp' is not fully reloadable, not reloading:
  protocol, bind, TLS, TCP, ToS, or CoS options.

The second reload arrived while the first was still processing. The end state was a sorcery cache where new endpoints/auths weren't fully wired into the distributor, so PJSIP silently failed to dispatch incoming REGISTER messages to any matched endpoint. Existing in-memory contacts continued working; new registrations went into a black hole.

2. fail2ban then banned the customers retrying. With registrations failing, every softphone on the customer side started retrying every 4 seconds. Each first-REGISTER (sent without auth, expected to receive a 401 challenge) is logged by Asterisk as:

NOTICE: Request 'REGISTER' from '<sip:...>' failed for '<HOST>' (callid: ...)
   - No matching endpoint found

This is the regular handshake log line — fail2ban matches it against the asterisk filter's failregex and counts it as a "failed registration". With maxretry=3 / findtime=600 / bantime=86400, three retries inside 10 minutes triggered a 24-hour ban. So even after the PJSIP state was unstuck, banned IPs kept getting silently dropped at the iptables layer with no further log entry — making it look like the underlying bug was still present.
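
The arithmetic behind the ban is worth making explicit. The Python sketch below uses a deliberately simplified model (one matching log line per retry, evenly spaced retries, a ban as soon as maxretry matches fall inside one findtime window); the function name is hypothetical and the inputs are the values quoted above.

```python
def seconds_until_ban(retry_interval: float, maxretry: int, findtime: float):
    """Seconds from the first failed attempt until the ban-triggering one.

    Returns None if the retries are too sparse to ever trigger a ban.
    """
    elapsed = (maxretry - 1) * retry_interval
    return elapsed if elapsed <= findtime else None


# One softphone retrying every 4 seconds against maxretry=3 / findtime=600:
print(seconds_until_ban(4, 3, 600))   # 8

# The same softphone against maxretry=10:
print(seconds_until_ban(4, 10, 600))  # 36
```

Under this model a single softphone is banned roughly 8 seconds after registrations start failing, and raising maxretry to 10 only moves that to about 36 seconds. A higher threshold delays the ban during a retry storm but does not prevent it.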

Diagnosis¶

# 1) Confirm packets reach eth0 (rules out network / GeoIP firewall)
tcpdump -ni eth0 'host <CUSTOMER_IP> and udp port 5080'

# 2) Confirm Asterisk doesn't see them (no log entries)
grep '<CUSTOMER_IP>' /var/log/asterisk/full.log | tail -10

# 3) Check fail2ban bans
fail2ban-client status asterisk
# Banned IP list: <CUSTOMER_IP> ...

# 4) Check pjsip state freshness — uptime vs last reload vs disk mtime
asterisk -rx 'core show uptime'        # 'Last reload: ...'
ls -la /etc/asterisk/pjsip*.conf       # mtime of disk configs
# If disk mtime > last reload time → reload didn't happen
# If disk mtime ~= last reload time → reload happened, may have been double-fire

Fix¶

# 1) Clear PJSIP stuck state
ssh root@82.180.146.80
asterisk -rx 'module reload res_pjsip.so'
# Verify: tail /var/log/asterisk/full.log for any ERROR/WARNING

# 2) Unban affected customer IPs (one at a time per the chosen scope)
fail2ban-client set asterisk unbanip <CUSTOMER_IP>
# Or unban everything (note: this clears ALL jails, not only asterisk):
fail2ban-client unban --all

# 3) Tune fail2ban so legitimate retry-flood doesn't keep re-banning
sudo cp /etc/fail2ban/jail.local /etc/fail2ban/jail.local.bak-$(date +%F)
sudo sed -i '/^\[asterisk\]/,/^\[/{ s/^maxretry = 3$/maxretry = 10/ }' /etc/fail2ban/jail.local
fail2ban-client reload asterisk
fail2ban-client get asterisk maxretry   # → 10

Prevention Rules¶

Avoid back-to-back PJSIP reloads. If the API/generator triggers a reload, debounce so two writes within N seconds collapse to one reload. Specifically: when multiple dialplan files are written in sequence (e.g., during a multi-org regeneration), do all writes first, then reload once.
Prefer asterisk -rx 'pjsip reload' over module reload res_pjsip.so. The former reloads only PJSIP config; the latter restarts the whole module which can race with active requests.
fail2ban filter regex matches No matching endpoint found — which fires on every legitimate first-REGISTER. With default maxretry=3, normal users on flaky networks trip the ban during a retry storm. PROD ran maxretry=10 as an interim measure from 2026-05-06. As of 2026-05-11 the long-term fix is in place: split into asterisk-auth (strict, maxretry=3) and asterisk-scan (lenient, maxretry=50/hour) + customer-PBX ignoreip whitelist. See Fail2Ban Runbook and Error 55.
Diagnose silent SIP drops with tcpdump on eth0 first. If packets arrive but Asterisk has no log entry, check fail2ban before chasing PJSIP — iptables drops happen before PJSIP sees them.
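
The debounce described in the first rule can be modelled without any timers. The following Python sketch is an illustration under assumptions: the function name and the 5-second quiet period are hypothetical, and write times are plain seconds rather than real clock readings.

```python
def collapse_reloads(write_times, quiet_period: float):
    """Trailing-edge debounce: one reload per burst of config writes.

    A reload is scheduled `quiet_period` seconds after a write and is pushed
    back whenever another write lands before it fires.
    """
    reloads = []
    pending = None  # time at which the currently scheduled reload would fire
    for t in sorted(write_times):
        if pending is not None and t >= pending:
            reloads.append(pending)  # the quiet period elapsed, so it fired
        pending = t + quiet_period
    if pending is not None:
        reloads.append(pending)
    return reloads


# Two writes one second apart, as in the 19:14:39 / 19:14:40 incident:
print(collapse_reloads([0.0, 1.0], quiet_period=5.0))        # [6.0]

# A later, unrelated write gets its own reload:
print(collapse_reloads([0.0, 1.0, 20.0], quiet_period=5.0))  # [6.0, 25.0]
```

The two back-to-back writes collapse into a single reload that fires after the last one settles, which is the behaviour the rule asks the generator to adopt.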

Error 53: Auto-ticket classifier tags human calls as `bot_dropped` for any org reusing extensions 1003/1012/1013¶

Symptom¶

A customer's Tickets page shows bot_dropped urgent-priority tickets ("Call ended without ticket (40s) from ") for ANSWERED calls, even though the customer has no bot configured. The same ext numbers (1003, 1012, 1013) are used across many orgs for plain SIP human extensions, and all of them get tagged.

Root Cause¶

The auto-ticket classifier in the LogsUpdate Cloud Run service had this constant:

BOT_EXTENSIONS = {"1003", "1012", "1013"}

Any ANSWERED inbound where the destination channel's extension matched this set was treated as a bot call → 8-second wait for the bot to file its own ticket → if no ticket appeared, classifier created a bot_dropped ticket.

This was a global set with no org awareness. At inception only one org used these numbers for bots. Later orgs onboarded humans on the same numbers (e.g. Thangam Communication's 1003 = GokulRaj, 1004 = Raju) and all got false-positive bot_dropped tickets every time the human extension's mobile-forward Dial timed out.

Diagnosis¶

# 1) Confirm the answered ext is human, not bot
ssh root@82.180.146.80
grep -A20 '^exten => 1003,1' /etc/asterisk/ext_<org>.conf
# If you see Dial(PJSIP/<mobile>@<trunk>,...) — it's a human/mobile-forward, NOT a bot

# 2) In the editor's Users page, the target user's routing_type
#    should be 'ai_agent' for true bots; 'sip' means human/mobile.

# 3) Cloud Run logs show the misclassification path
gcloud run services logs read logsupdate --region us-central1 --limit 50 \
  | grep -E 'AUTO-TICKET|Bot call detected'

Fix¶

Make bot-detection per-org by querying User.routing_type='ai_agent' instead of a hardcoded list.

API side (api/src/server.js) — new internal endpoint:

// POST /api/v1/users/internal/bot-extensions
// Returns the org's extensions where routing_type='ai_agent'.
app.post('/api/v1/users/internal/bot-extensions', async (req, res) => {
  if (req.headers['x-internal-key'] !== process.env.INTERNAL_API_KEY) {
    return res.status(401).json({ error: 'Unauthorized' });
  }
  const users = await User.findAll({
    where: { org_id: req.body.org_id, routing_type: 'ai_agent', status: 'active' },
    attributes: ['extension'],
  });
  res.json({ extensions: users.map(u => u.extension).filter(Boolean) });
});

Note: the User model's foreign-key column is org_id, not organization_id — easy to miss; staging surfaces it as Unknown column 'User.organization_id'.

LogsUpdate side (server.py) — replace the constant with a 5-minute-cached lookup:

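# Imports this excerpt relies on. logger, PBX_BASE_URL and PBX_INTERNAL_KEY
# are assumed to be defined elsewhere in server.py.
import time

import httpx

_bot_ext_cache: dict[str, tuple[float, set[str]]] = {}
_BOT_EXT_TTL_SECS = 300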

async def get_bot_extensions(org_id: str) -> set[str]:
    cached = _bot_ext_cache.get(org_id)
    if cached and (time.time() - cached[0]) < _BOT_EXT_TTL_SECS:
        return cached[1]
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            resp = await client.post(
                f"{PBX_BASE_URL}/api/v1/users/internal/bot-extensions",
                headers={"X-Internal-Key": PBX_INTERNAL_KEY,
                         "Content-Type": "application/json"},
                json={"org_id": org_id},
            )
        exts = ({str(e) for e in resp.json().get("extensions", []) if e}
                if resp.status_code == 200 else set())
    except Exception as e:
        # Fail closed — empty set means we don't tag anything as bot_dropped
        # for this org during the failure window. Better than tagging humans.
        logger.warning(f"[AUTO-TICKET] bot-extensions lookup failed for org {org_id}: {e}")
        exts = set()
    _bot_ext_cache[org_id] = (time.time(), exts)
    return exts

# At the call site:
bot_extensions = await get_bot_extensions(org_id)
if answered_ext in bot_extensions:
    ...

Cleanup of existing false-positive tickets — LogsUpdate/scripts/cleanup_bot_dropped_tickets.py lists then optionally deletes bot_dropped tickets for a given org. Dry-run by default; requires re-typing the org UUID at the prompt to actually execute.

Prevention Rules¶

Classifier rules that branch on extension number must be org-scoped. Extension 1003 means different things across orgs.
Hardcoded sets that depend on customer behaviour are short-lived. When a feature has 1 customer the assumption "they're all the same" silently breaks at customer 2. Introduce the per-org config at the same time as the feature, not later.
Fail closed on classifier-side lookup errors. If the bot-detection API call fails, default to an empty set (no calls are bots) rather than falling back to the old global list. False-positive tickets create user trust issues; missing a real bot_dropped ticket for a few minutes during an outage is acceptable.
For Cloud Run + AstraPBX dual-deploy fixes: deploy the API endpoint first (LogsUpdate's old code still works), then deploy LogsUpdate. If LogsUpdate ships first it'd 404 on every call.

Error 54: Outbound dial fails for 12-digit Indian numbers (`919XXXXXXXXX`) but works for 10-digit¶

Symptom¶

Agent dials 919944421125 (12-digit, with India country code) from a softphone — call doesn't connect or fails fast. The same destination dialled as 9944421125 (10-digit) works fine.

This is a real production issue because Zoiper / most softphones render incoming CallerID with the 91 prefix (E.164-style), so when the agent uses tap-to-call from a missed-call notification or call history, the number includes the 91 and the call fails.

Root Cause¶

Tata's outbound termination expects the bare 10-digit subscriber number, not the 12-digit form with country code. Both formats reach Asterisk and match the catch-all _X. outbound pattern, but only the 10-digit one connects through Tata's network. There was no normalization step in the dialplan to strip the 91 before dialling the trunk.

Diagnosis¶

ssh root@82.180.146.80

# Look up which dialplan match Asterisk picked for the dialed number
asterisk -rx "dialplan show 919944421125@org_<prefix>__outbound"
# Before the fix: only `_X.` matched, EXTEN passed unchanged to Dial(...)
# After the fix:  `_91XXXXXXXXXX` is most-specific; Goto strips 91, re-enters
#                 as 9944421125, then `_X.` Dials the trunk

# Confirm the catch-all is dialing the raw number
grep "Dial(PJSIP" /etc/asterisk/ext_<org>.conf | head -3

Fix¶

Dialplan generator (api/src/services/asterisk/dialplanGenerator.js) emits a strip rule in every outbound context, mirroring the existing E.164 leading-+ strip:

; Normalize Indian country code '91' (Tata trunk expects 10-digit)
exten => _91XXXXXXXXXX,1,NoOp(Stripping 91 country code from ${EXTEN})
exten => _91XXXXXXXXXX,n,Goto(${EXTEN:2},1)

Pattern is intentionally narrow — _91XXXXXXXXXX matches exactly 12 digits starting with 91, so it cannot affect:

10-digit dials (still match _X. directly)
Internal 4-digit extension dials
Conference / queue / call-forward / speed-dial patterns
Inbound calls (different context)
Other trunks for the same org (route ID matching, not EXTEN)
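
The narrowness of the pattern is easy to check outside Asterisk. This Python sketch mirrors the _91XXXXXXXXXX rule only; the function name is hypothetical, and the separate leading-+ strip is not modelled.

```python
import re

# Exactly 12 ASCII digits, the first two being the country code 91.
_INDIA_12_DIGIT = re.compile(r"91[0-9]{10}")


def normalize_for_tata(dialed: str) -> str:
    """Strip the leading 91 only from 12-digit numbers; pass everything else through."""
    if _INDIA_12_DIGIT.fullmatch(dialed):
        return dialed[2:]  # same effect as ${EXTEN:2}
    return dialed


for number in ("919944421125", "9944421125", "9101", "91994442112"):
    print(number, "->", normalize_for_tata(number))
# 919944421125 -> 9944421125
# 9944421125 -> 9944421125
# 9101 -> 9101
# 91994442112 -> 91994442112
```

Only the 12-digit form is rewritten. A 4-digit extension that happens to start with 91, or an 11-digit string, passes through unchanged, which is the property the list above relies on.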

After deploying the API change, regenerate all org dialplans + reload Asterisk:

ssh root@82.180.146.80
cd /opt/astrapbx
node -e '
(async () => {
  const ConfigDeploymentService = require("./src/services/asterisk/configDeploymentService");
  const inst = new ConfigDeploymentService();
  const { Organization } = require("./src/models");
  const orgs = await Organization.findAll({ where: { status: "active" } });
  for (const o of orgs) {
    process.stdout.write("  - " + o.name + " ... ");
    try { await inst.deployOrganizationConfiguration(o.id, o.name); console.log("ok"); }
    catch (e) { console.log("FAIL: " + e.message); }
  }
  await inst.reloadAsteriskConfiguration();
  process.exit(0);
})();
'

See Outbound Dialplan Normalization for the full reference of all normalization rules.

Prevention Rules¶

Trunk-format assumptions belong in the dialplan, not in user habit. Don't tell agents "always dial 10-digit" — bake it into the generator so any input format reaches the trunk in the format the trunk expects.
When introducing a new trunk that needs a different format, gate the strip rules per-trunk. Today all customer trunks share Tata via the NUC tunnel (same upstream); a future trunk that requires 91 preserved would need a per-trunk flag on OutboundRoute that the generator reads.
Use narrow pattern matching, not wildcards, for normalization rules. _91XXXXXXXXXX (exactly 12 digits) is safe; _91X. (any number starting with 91) would swallow extensions starting with 91 if they ever existed.
After a generator change, regenerate ALL org configs and reload Asterisk. A change in the generator only affects future runs; existing files on disk keep the old behaviour until re-deployed. Run the loop above.

Error 55: Customer PBX IP change → fail2ban storm → all of that customer's endpoints go Unavail¶

Symptom¶

A customer reports "no SIP user is registering for the last N minutes" — was working until recently. pjsip show contacts for the affected org shows the customer's extensions as Unavail RTT=-nan against a specific public IP. Other orgs are fine.

Root Cause¶

The customer's ISP-assigned public IP changed (common for SMB-grade internet in India). When the change happens, all of the customer's PBX-connected phones / their on-prem UCM re-register from the new IP simultaneously — many REGISTER attempts in a few seconds.

Each first REGISTER (before the 401 challenge) emits the log line:

NOTICE: Request 'REGISTER' from '<sip:...>' failed for '<NEW_IP>:port' - No matching endpoint found

This line is also matched by the legacy fail2ban asterisk filter as a "failed registration". With the old monolithic jail's maxretry=10, a customer with 7+ phones exceeded the threshold in seconds and got banned for 24 h. Asterisk's OPTIONS qualify probes to the customer's old NAT'd ports then arrived at dead pinholes, marking every endpoint Unavail.

Diagnosis¶

ssh root@82.180.146.80

# 1) Is the customer's IP banned?
fail2ban-client status asterisk-auth   # current jails (post-2026-05-11)
fail2ban-client status asterisk-scan
# Look for the customer's IP in "Banned IP list"

# 2) Find the IP — current contacts for the org
asterisk -rx "pjsip show contacts" | grep <org_prefix>
# Unavail rows show the IP in the contact URI

# 3) Confirm by tcpdump — packets arrive but Asterisk doesn't reply
tcpdump -ni eth0 'host <CUSTOMER_IP> and udp port 5080'

Fix¶

Immediate: unban the IP.

fail2ban-client set asterisk-auth unbanip <CUSTOMER_IP>
fail2ban-client set asterisk-scan unbanip <CUSTOMER_IP>
# Existing stale contacts auto-clear within ~5 min as the customer's phones
# send fresh REGISTERs and create new NAT bindings.

Recurrence prevention: add the customer's PBX subnet to fail2ban's ignoreip whitelist + ensure the two-jail split (auth vs scan) is in place. See Fail2Ban Runbook — full operational reference.

Prevention Rules¶

Whitelist known customer PBX subnets. Use /24 (256 IPs) — narrow enough to avoid hiding real attackers, broad enough to absorb ISP renumbering within a subnet. Add via [DEFAULT] ignoreip in /etc/fail2ban/jail.local. The runbook documents the format.
Keep the asterisk filter split (asterisk-auth strict, asterisk-scan lenient). The single-jail design pre-2026-05-11 conflated handshake-noise with attack-noise; never go back without a replacement.
/24, not /16. A /16 whitelists 65k IPs across the same ISP block — too broad. /24 keeps the surface narrow while covering ISP renumbering within a sub-block.
When a customer's IP changes legitimately and falls outside their whitelisted /24, update the whitelist promptly. Don't expand to /16 as a shortcut — keep the discipline.
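
The /24-versus-/16 trade-off is plain prefix arithmetic. A minimal sketch, assuming IPv4 only and using documentation addresses (203.0.113.0/24) rather than any real customer subnet; `inCidr` and `addressesIn` are hypothetical helpers for checking whether a customer's new IP still falls inside the whitelisted block:

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  const [a, b, c, d] = parts.map(Number);
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// True when `ip` falls inside `cidr`, e.g. "203.0.113.0/24".
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`bad prefix length in ${cidr}`);
  }
  // A shift by 32 is a no-op in JavaScript, so /0 is handled explicitly.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Number of addresses covered by a prefix length.
const addressesIn = (bits) => 2 ** (32 - bits);

console.log(addressesIn(24), addressesIn(16));
console.log(inCidr('203.0.113.77', '203.0.113.0/24'));
console.log(inCidr('203.0.114.5', '203.0.113.0/24'));
```

When the second check returns false for a customer's new address, that is the "falls outside their whitelisted /24" case: update the whitelist entry rather than widening the prefix.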

Error 56: SIP phone shows "unregistered" with zero packets reaching cloud — transport TCP/UDP mismatch¶

Symptom¶

A new IP phone is configured with the correct SIP server (10.20.0.1:5080), correct username + auth ID + password, account is marked Active, and the phone is on a network whose WireGuard tunnel to cloud is healthy — but the phone reports "Not registered" / "Failed" in its account status.

On the cloud side:

pjsip show endpoint <endpoint_name> returns the endpoint with state Unavailable and no Contact row.
pjsip show contacts | grep <org_prefix> shows zero contacts for that extension.
wg show wg1 shows the customer's peer connected with a recent handshake and bidirectional byte counters — tunnel is fine.
Another extension on the same customer LAN is registering normally (e.g. ext 09 from 192.168.0.76 Avail, ext 03 from 192.168.0.62 Unavailable).
tcpdump -i wg1 -n 'udp port 5080' captures zero packets in a 5–10 second window — REGISTER attempts from the broken phone never reach the cloud at all.

Root Cause¶

The phone's account is configured with TCP transport while the Astradial cloud Asterisk PJSIP endpoint is configured with transport-udp (UDP only, bound on 0.0.0.0:5080). The phone opens a TCP connection toward 10.20.0.1:5080, but Asterisk has no TCP listener on that port, so the SYN is either dropped by a firewall or answered with a RST by the kernel. Either way no TCP connection is established and the REGISTER is never sent.

UDP REGISTERs would arrive and be visible in tcpdump and Asterisk logs. TCP attempts against a UDP-only listener never reach Asterisk and do not match a 'udp port 5080' capture filter, so they leave no trace in either place, which is why this diagnosis is initially confusing.

This is a v7-style edge case but applies to any customer onboarding a new phone after the initial install. The first phone gets configured correctly (often by Astradial directly), then the hotel's IT person duplicates the config for a second account and the transport setting gets toggled (often by accident in vendor web UIs where the dropdown defaults to TCP/TLS for "security").

Diagnosis¶

ssh root@82.180.146.80

# 1) Endpoint exists + auth password matches DB?
asterisk -rx "pjsip show endpoint <endpoint_name>"
# Look for State + Aor + Contact rows. Unavailable + zero contacts = phone never registered.

asterisk -rx "pjsip show auth <endpoint_name>_auth"
# Compare the `password` field with the users.sip_password column in the DB.

# 2) Tunnel health
wg show wg1 | grep -A2 "<customer's wg public key>"
# 'latest handshake' should be < 3 min ago.

# 3) Is the phone even sending packets to us?
timeout 10 tcpdump -i wg1 -n 'udp port 5080 and net <customer-LAN-CIDR>'
# Zero packets in 10s = phone isn't sending UDP, OR isn't sending at all.
# (tcpdump needs `net`, not `host`, for a CIDR. Swap `udp` for `tcp` to look for TCP attempts.)

# 4) Confirm Asterisk's transport binding (sanity check)
asterisk -rx "pjsip show transports"
# Expect transport-udp on 0.0.0.0:5080. We do NOT run a TCP listener on 5080.

If steps 1–3 all check out (endpoint right, auth right, tunnel up, zero packets), the phone's transport is almost certainly TCP/TLS pointing at a UDP-only port.

Fix¶

On the phone's web UI, under the account's SIP Settings → Basic / Network, change the SIP Transport from TCP (or TLS) to UDP. Click Save and Apply (not just Save — Save alone persists config but doesn't restart the SIP stack).

For Grandstream phones (GHP6xx, GRP26xx) specifically: the Transport setting lives under Account N → SIP Settings → Basic Settings → SIP Transport in some firmware versions, and under Account N → Network Settings in others. Search the page for "Transport".

The registration attempt should land within seconds — confirm by re-running step 3 (tcpdump) and asterisk -rx "pjsip show endpoint <endpoint>".

Prevention Rules¶

When pre-provisioning a new IP phone for an Astradial customer, set SIP Transport = UDP in the master template before handing the device over. Don't trust the vendor's default.
For Grandstream-family phones, the recommended firmware default is UDP — confirm in the auto-provisioning template if one is used.
The cloud Asterisk PBX is UDP-only on the SIP port (0.0.0.0:5080). TLS+TCP support is on the roadmap but not currently enabled — don't tell customers "use TLS for security" until the listener actually exists.
The "zero packets in tcpdump" diagnostic is the cheapest single test for this class of failure. If a phone says unregistered and tcpdump -i wg1 udp port 5080 is silent, suspect transport mismatch first.

Error 57: Greeting/IVR audio silent after TTS upgrade — Asterisk `format_wav.so` rate mismatch¶

Symptom¶

A queue greeting or IVR prompt was working before; an operator regenerates it (or creates a new one); now callers hear dead silence where the greeting should play. The queue's MOH starts immediately as if no greeting was configured. Reception from editor.astradial.com → Departments → greeting → Preview plays fine in the browser, which makes the diagnosis confusing.

On the cloud side:

The greeting row exists, has a non-NULL audio_file (e.g. greeting_<uuid>.wav).
The WAV exists on disk under /var/lib/asterisk/sounds/greetings/.
asterisk -rx "dialplan show <queue-num>@<context>_queue" shows the expected Playback(/var/lib/asterisk/sounds/greetings/greeting_<uuid>) priority.
No error in /var/log/asterisk/full mentioning the greeting file.
The .wav was generated AFTER the most recent TTS-service deploy (e.g. PR #156 / PR #157).

Root Cause¶

Asterisk's .wav format module (format_wav.so) only handles 8 kHz and 16 kHz mono LINEAR16 PCM. From asterisk -rx "module show like wav":

format_wav.so  Microsoft WAV/WAV16 format (8kHz/16kHz S16LE)

Higher rates (24 / 32 / 44.1 / 48 kHz) are supported only via the format_slin* family with the matching .sln24 / .sln32 / .sln44 / .sln48 extensions — NOT .wav. Putting a 24 kHz file in .wav loads the format module, the module fails to decode, and Playback() silently no-ops. There's no error in the Asterisk log because the file load itself succeeds; the failure is inside the codec's frame loop after the RIFF header parse.
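
The same check that `file` performs can be done in Node before a greeting is written to disk. A minimal sketch, assuming the buffer is a standard RIFF/WAVE file; `readWavFormat` and `isPlayableAsWav` are hypothetical names, and the acceptance rule simply mirrors the "8kHz/16kHz S16LE" description quoted above:

```javascript
// Read the "fmt " chunk of a RIFF/WAVE buffer and report its basic parameters.
function readWavFormat(buf) {
  const isWave = buf.length >= 12 &&
    buf.toString('ascii', 0, 4) === 'RIFF' &&
    buf.toString('ascii', 8, 12) === 'WAVE';
  if (!isWave) throw new Error('not a RIFF/WAVE file');

  let offset = 12;
  while (offset + 8 <= buf.length) {
    const id = buf.toString('ascii', offset, offset + 4);
    const size = buf.readUInt32LE(offset + 4);
    if (id === 'fmt ') {
      if (offset + 24 > buf.length) throw new Error('truncated fmt chunk');
      return {
        audioFormat: buf.readUInt16LE(offset + 8),    // 1 = PCM, 7 = mu-law
        channels: buf.readUInt16LE(offset + 10),
        sampleRate: buf.readUInt32LE(offset + 12),
        bitsPerSample: buf.readUInt16LE(offset + 22),
      };
    }
    offset += 8 + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error('no fmt chunk found');
}

// Mirrors the module description: 16-bit mono PCM at 8 kHz or 16 kHz only.
function isPlayableAsWav(fmt) {
  return fmt.audioFormat === 1 &&
    fmt.channels === 1 &&
    fmt.bitsPerSample === 16 &&
    (fmt.sampleRate === 8000 || fmt.sampleRate === 16000);
}
```

A synthesis step could call `isPlayableAsWav(readWavFormat(audio))` and refuse to save a `.wav` that fails the check, turning a silent playback failure into an error at generation time.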

This regression shipped in the Chirp 3 HD TTS upgrade (PR #156 / promoted in #157), where the synthesis rate was changed from 8 kHz to 24 kHz under the incorrect belief that Asterisk's WAV reader handles arbitrary rates via the RIFF header. That's true for format_slin* (.sln*); it's false for format_wav (.wav). Google's Text-to-Speech API does accept and honor sampleRateHertz: 16000, downsampling its 24 kHz native output server-side, so a 16 kHz .wav gives us wideband quality AND Asterisk compatibility.

V7's queue 5001 was the first prod casualty (2026-05-13). Other orgs were spared only because they hadn't yet regenerated any greeting since the upgrade.

Diagnosis¶

ssh root@82.180.146.80

# 1) What sample rate does Asterisk's WAV module actually support?
asterisk -rx "module show like wav"
# Look for "8kHz/16kHz" in the description. If the description doesn't list
# your file's rate, that's the bug.

# 2) Confirm the actual rate of the file on disk
file /var/lib/asterisk/sounds/greetings/greeting_<uuid>.wav
# Expect output like ... mono <RATE> Hz. If RATE is 24000+ AND module supports
# only 8k/16k, you're hit.

# 3) Cross-check the TTS service synthesizes at a supported rate
grep -n SAMPLE_RATE_HZ /opt/astrapbx/src/services/ttsService.js
# Should be 8000 or 16000. If 24000+, this is the bug source.

Fix¶

Two parts: the broken file on disk and the source code that creates new broken files.

a) Restore audio for an already-affected greeting (one file):

ssh root@82.180.146.80
cd /var/lib/asterisk/sounds/greetings/
F=greeting_<uuid>.wav
cp "$F" "$F.bak-24khz"
ffmpeg -y -loglevel error -i "$F.bak-24khz" -ar 16000 -ac 1 -c:a pcm_s16le "$F"
file "$F"   # confirm "16000 Hz"
# No Asterisk reload needed — Playback re-opens the file on each call.

Greeting should play correctly on the next call. The backup stays on disk so you can roll back if needed.

b) Fix the source so new greetings come out right:

In /opt/astrapbx/src/services/ttsService.js, find the SAMPLE_RATE_HZ constant and change 24000 to 16000, then save and restart astrapbx:

ssh root@82.180.146.80
cd /opt/astrapbx
cp src/services/ttsService.js src/services/ttsService.js.bak-24khz
sed -i 's/^const SAMPLE_RATE_HZ = 24000;$/const SAMPLE_RATE_HZ = 16000;/' src/services/ttsService.js
pm2 restart astrapbx

Then sync the same change back to git on a hotfix branch (otherwise the next CI deploy rsyncs the broken constant back over your hotfix — see the Prod Direct-Edit runbook).

c) (Optional) Re-render every other greeting on prod that was generated at the bad rate:

# Identify any 24kHz files still on disk
for f in /var/lib/asterisk/sounds/greetings/*.wav; do
  rate=$(file "$f" | grep -oE '[0-9]+ Hz' | head -1)
  [ "$rate" = "24000 Hz" ] && echo "BROKEN: $f"
done

For each, either re-run ffmpeg ... -ar 16000 like (a), or have the operator click "Generate greeting" in the editor (it'll re-synthesize against the patched ttsService and produce a correct file).

Update (post-mortem) — the actual final fix is `.ulaw`-only, not `.wav` at 16 kHz¶

The 16 kHz .wav fix above looked correct on paper but failed in practice on the V7 incident the same day because of a second bug under it: this Asterisk build's Playback() doesn't fall back to .wav when a .ulaw sibling is missing on a G.711 mu-law channel. Log signature:

WARNING file.c: Unable to open .../greeting_<id> (format (ulaw)): No such file or directory
WARNING app_playback.c: Playback failed on PJSIP/... for .../greeting_<id>

Asterisk tried .ulaw first (cheapest format for a mu-law channel), failed, and gave up instead of trying the .wav we'd just produced. The PSTN-bound greeting was silent again.

The final answer: synthesize the greeting as MULAW 8 kHz via Google's audioEncoding=MULAW and save as .ulaw (one file, no .wav sibling). Softphone channels here are virtually all G.711 mu-law, so Asterisk reads our bytes and writes them directly with zero transcoding. (PSTN calls arriving over the Tata trunk negotiate a-law instead; see Error 58.) Wideband softphones (Opus/AMR-WB) get a clean mu-law → slin8 → opus transcode with imperceptible degradation on spoken audio.

Quirk handled by the code: Google's audioEncoding=MULAW response is wrapped in a RIFF/WAVE container (verified empirically — file reports WAVE audio, ITU G.711 mu-law). Asterisk's format_g711.c reader needs RAW bytes. The TTS service strips the WAVE header by locating the data chunk marker and slicing past its 8-byte ID + size prefix before writing.

Final ttsService.js shape:

const AUDIO_ENCODING = 'MULAW';
const SAMPLE_RATE_HZ = 8000;
// ...
async saveGreetingAudio(greetingId, text, language, voice, opts = {}) {
  const wrapped = await this.generateAudio(text, language, voice, opts);
  const raw = TTSService._stripWavHeader(wrapped);
  await fs.writeFile(`greeting_${greetingId}.ulaw`, raw);
}

The dialplan generator continues to emit Playback(/var/lib/asterisk/sounds/greetings/greeting_<id>) with no extension — Asterisk's filename-extension lookup finds the .ulaw automatically.
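
The header-stripping quirk described above can be sketched as follows. This is an illustration, not the actual `TTSService._stripWavHeader`: it walks the RIFF chunks until it finds `data` and returns only that chunk's payload, and it assumes a buffer that does not start with RIFF/WAVE is already raw.

```javascript
// Return the payload of the "data" chunk from a RIFF/WAVE buffer.
// Buffers that are not RIFF/WAVE are assumed to be raw already.
function stripWavHeader(buf) {
  const isWave = buf.length >= 12 &&
    buf.toString('ascii', 0, 4) === 'RIFF' &&
    buf.toString('ascii', 8, 12) === 'WAVE';
  if (!isWave) return buf;

  let offset = 12; // first chunk starts after "RIFF" + size + "WAVE"
  while (offset + 8 <= buf.length) {
    const id = buf.toString('ascii', offset, offset + 4);
    const size = buf.readUInt32LE(offset + 4);
    const start = offset + 8; // skip the 4-byte ID and 4-byte size
    if (id === 'data') {
      // Clamp in case the declared size overruns the buffer.
      return buf.subarray(start, Math.min(start + size, buf.length));
    }
    offset = start + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error('WAVE container has no data chunk');
}
```

Walking the chunk list, rather than searching for the byte sequence `data`, avoids a false match if those four bytes happen to occur inside an earlier chunk.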

Prevention Rules¶

TTS audio for Asterisk playback should ship as .ulaw (G.711 mu-law) by default. PSTN itself is mu-law; saving as mu-law guarantees bit-perfect preservation of the synthesis output through the dominant call path. Wideband (.wav 16 kHz, .sln16, etc.) is only worth the complexity if you have a documented use case where the call leg is HD-codec end-to-end.
Don't trust Asterisk's file-format auto-fallback to do what you'd want. On a mu-law channel, Playback("greeting_x") tries .ulaw first; if missing, fall-back behavior is build-dependent and on our deployed Asterisk it just fails. Always provide a file in the format the channel will actually use.
Google's audioEncoding=MULAW returns a WAV-wrapped payload, NOT raw mu-law. Always strip the RIFF header before writing as .ulaw (or save as .wav if you don't care about Asterisk's no-transcode path — but see rule 2).
The Asterisk module's description string IS the contract. When module show like <name> says "8kHz/16kHz" in the description, that's the literal range. Don't infer support from related modules.
Verify the file format AFTER the first TTS synth on prod every time the TTS service changes. file /var/lib/asterisk/sounds/greetings/greeting_* should report either raw data (good — raw mu-law) OR WAVE audio, Microsoft PCM, 16 bit, mono 8000/16000 Hz (also good — legacy 8k/16k .wav). Anything else (24 kHz, mu-law inside WAV, etc.) is broken.

Error 58: PSTN inbound silent on Indian Tata trunk — Asterisk needs `.alaw` sibling for G.711 a-law channels¶

Symptom¶

After fixing Error 57 (switching greetings to .ulaw-only), inbound calls from a softphone (e.g. extension 1009 dialing 5001) work fine and play the new Chirp 3 HD voice. But inbound calls from the PSTN via the Tata trunk (e.g. v7's external DID dialed from a personal mobile) still play either silence or the OLD Allison Smith voice for system prompts ("This call may be recorded for quality and training") even though we regenerated all 44 system prompts in Chirp 3 HD.

Symptoms specifically:

Softphone-originated calls → new voice. ✅
PSTN inbound via Tata trunk → old voice OR silence. ❌
/var/lib/asterisk/sounds/en/<prompt>.ulaw exists with the new Chirp 3 HD audio.
/var/lib/asterisk/sounds/en/<prompt>.gsm (or .wav) still exists with the OLD voice — leftover from the legacy stock Asterisk install.

Asterisk full.log shows lines like:

format_wav.c: Not a supported wav file format (7). Only PCM encoded, 16 bit, mono 8kHz/16kHz files are supported

AND

Unable to open .../greeting_<uuid> (format (alaw)): No such file or directory

Root Cause¶

Indian PSTN signals G.711 a-law, not mu-law. Tata's NNI trunk negotiates PCMA (a-law) in SDP by convention — this is the European/Indian convention vs. the North American mu-law (PCMU). On a Tata-bound channel, Asterisk's Playback() lookup order is:

<file>.alaw (channel-native — zero transcoding)
<file>.sln or <file>.sln* (linear, will transcode)
<file>.gsm (legacy — will transcode)
<file>.wav (only if 8 kHz/16 kHz LINEAR16 — see Error 57)
<file>.ulaw (transcodes to a-law)

Because we only wrote .ulaw files after the TTS migration, Asterisk on an a-law channel tried .alaw (missing) → .sln/.gsm → found the OLD stock-Allison-Smith .gsm in /var/lib/asterisk/sounds/en/ → played the old voice. For operator-created greetings that don't have an old .gsm sibling, it fell through to the broken .wav (Error 57 leftovers) or to nothing at all → silence.

This is invisible during softphone testing because softphones negotiate mu-law and the .ulaw lookup succeeds on path 1 for them — .alaw is never tried.

Diagnosis¶

Confirm Tata trunk is negotiating a-law. During an active inbound call:

asterisk -rx "pjsip show channels"     # find the inbound channel name
asterisk -rx "core show channel <name>" | grep -iE "format|codec"
# → "ReadFormat: alaw" / "WriteFormat: alaw" confirms it

Compare file presence by codec extension:

ls /var/lib/asterisk/sounds/en/queue-thankyou.{ulaw,alaw,gsm,sln} 2>/dev/null
ls /var/lib/asterisk/sounds/greetings/greeting_<uuid>.{ulaw,alaw} 2>/dev/null

If .ulaw exists but .alaw is missing, this error applies.

Tail /var/log/asterisk/full during a test call and watch for Unable to open ... (format (alaw)): No such file or directory. That's the smoking gun.

Fix¶

Two scopes, two fixes:

Scope 1 — system prompts (/var/lib/asterisk/sounds/en/): one-shot regen script writes both .ulaw and .alaw from the same Google TTS call. The script (api/scripts/regen-system-prompts.js) uses Google's audioEncoding=MULAW, strips the RIFF header, and converts mu-law → a-law via ffmpeg locally.

Scope 2 — operator-created greetings (/var/lib/asterisk/sounds/greetings/): ttsService.saveGreetingAudio() writes both .ulaw and .alaw from a single Google TTS call, using a pure-JS ITU-T G.711 mu-law → a-law byte-table converter (no ffmpeg subprocess; ~1 ms overhead). See PR #163.

The byte-table converter was verified byte-for-byte against ffmpeg -c:a pcm_alaw: all 256 mu-law input values produce identical a-law output, and a real 29.5 KB greeting file matches the ffmpeg-generated equivalent exactly.
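
For illustration, a byte table of this kind can be built by decoding each mu-law value to linear PCM and re-encoding it as a-law. This is a minimal sketch using the widely circulated G.711 reference routines; it is not the code from PR #163, and whether it matches that table or ffmpeg byte-for-byte has not been verified here. All function names are hypothetical.

```javascript
// Decode one G.711 mu-law byte to a 16-bit linear sample.
function muLawToLinear(u) {
  const v = ~u & 0xff; // mu-law bytes are stored inverted
  const t = (((v & 0x0f) << 3) + 0x84) << ((v & 0x70) >> 4);
  return (v & 0x80) ? 0x84 - t : t - 0x84;
}

// Encode a 16-bit linear sample as one G.711 a-law byte.
function linearToALaw(sample) {
  const SEGMENT_END = [0x1f, 0x3f, 0x7f, 0xff, 0x1ff, 0x3ff, 0x7ff, 0xfff];
  let pcm = sample >> 3; // a-law works on a 13-bit magnitude
  let mask;
  if (pcm >= 0) {
    mask = 0xd5; // sign bit set plus even-bit inversion
  } else {
    mask = 0x55; // even-bit inversion only
    pcm = -pcm - 1;
  }
  const seg = SEGMENT_END.findIndex((end) => pcm <= end);
  if (seg === -1) return 0x7f ^ mask; // out of range: clip to maximum
  const shift = seg < 2 ? 1 : seg;
  return ((seg << 4) | ((pcm >> shift) & 0x0f)) ^ mask;
}

// Build the 256-entry lookup table once.
const MULAW_TO_ALAW = Uint8Array.from(
  { length: 256 },
  (_, u) => linearToALaw(muLawToLinear(u)),
);

// Convert a whole buffer of raw mu-law bytes to raw a-law bytes.
function muLawBufferToALaw(input) {
  const out = Buffer.alloc(input.length);
  for (let i = 0; i < input.length; i++) out[i] = MULAW_TO_ALAW[input[i]];
  return out;
}
```

With the table built once at load time, converting a greeting is a single pass over the bytes, which is why no ffmpeg subprocess is needed at synthesis time.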

Backfill for existing operator greetings:

cd /var/lib/asterisk/sounds/greetings/
for f in *.ulaw; do
  base="${f%.ulaw}"
  if [ ! -f "${base}.alaw" ]; then
    ffmpeg -y -loglevel error -f mulaw -ar 8000 -ac 1 -i "$f" \
      -ar 8000 -ac 1 -c:a pcm_alaw -f alaw "${base}.alaw"
  fi
done
chown asterisk:asterisk *.alaw

Prevention Rules¶

For any new TTS or sound-file work on this stack, generate BOTH .ulaw AND .alaw. Don't assume mu-law-only is enough — every Indian PSTN customer routes through an a-law trunk. The cost is a 1 ms in-process byte conversion; the benefit is correctness on every codec path.
Channel codec governs file-lookup order, not the dialplan. Playback(/path/to/greeting_x) with no extension is the only correct form — Asterisk picks the codec that matches the channel. Hard-coding .wav or .ulaw in the dialplan breaks the lookup chain.
Test the PSTN path explicitly after any TTS change. Calling from a softphone is not sufficient — softphone codec ≠ PSTN codec. Place a real call from a mobile through Tata to the affected DID before declaring done.
When you see Unable to open ... (format (alaw)): No such file or directory in full.log, the next prompt Asterisk WILL play is whatever lower-priority file it finds. This is the source of "wrong voice played" reports — old .gsm files from the legacy Asterisk install act as silent overrides for missing .alaw siblings.
The ITU-T G.711 mu-law → a-law byte translation is well-defined and lossless within G.711 quantization. A 256-byte lookup table generated once at startup is enough. Don't shell out to ffmpeg in the hot path.

Error 59: Queue "Timeout Destination" silently ignored on save¶

Date: 2026-05-20 (Om Chambers prod, queue 5003 "test")
Severity: P1 (operator-visible misroute on queue timeout)
Fix: PR #251 (allow-list + validator), PR #252 (picker UX), PR #253 (picker state fix)

Symptom¶

Operator configures Queue → Edit → Timeout Destination in the editor, saves, and re-opens the dialog: the type dropdown reverts to whatever was there before. Live calls that time out on Max Wait either:

Play "this number is incorrect" (when stale type=phone, destination=5004 got dialled out the Tata trunk and rejected), OR
Fall through to (unavail) and play "all agents are busy"

depending on the stale data.

Root cause¶

api/src/routes/queues.js PUT handler's allowedFields allow-list was missing both timeout_destination and timeout_destination_type (and greeting_id, found alongside). The editor sent all three fields on every save; the API filtered them out before calling Queue.update(...). Whatever combo was last persisted by some other path (admin SQL, an earlier code version, a manual DB poke) stayed in the DB forever, the editor showed the stale data, and the dialplan generator used those stale values.

When the stale combo was {type:'phone', destination:'5004'} (Om Chambers' supervisors queue extension), the generator emitted Dial(PJSIP/5004@<trunk>, 30, tT). Tata's SBC rejected the call (5004 is not a valid PSTN destination), the dial returned CHANUNAVAIL, and the timeout routing silently failed.

Diagnostic commands¶

# 1. Read the DB row directly (this is what the dialplan actually uses)
ssh root@82.180.146.80 'cd /opt/astrapbx && node -e "
  const {Queue} = require(\"./src/models\");
  Queue.findOne({where:{number:\"5003\"}}).then(q =>
    console.log(JSON.stringify({
      number:q.number, name:q.name,
      timeout_destination:q.timeout_destination,
      timeout_destination_type:q.timeout_destination_type
    }, null, 2))
  );"'

# 2. Check what the generated dialplan emits
ssh root@82.180.146.80 'asterisk -rx "dialplan show 5003@org_<ctx>__queue" | grep -A2 "n(timeout)"'

# 3. Confirm the type-dialled context exists. For type=queue:
#    Goto(org_<ctx>__queue, <dest>, 1)
#    For type=extension:
#    Goto(org_<ctx>__internal, <dest>, 1)
#    For type=phone:
#    Dial(PJSIP/<digits>@<trunk>, 30, tT)   ← only valid for real phone numbers

Fix¶

Three PRs landed the full fix:

PR #251 added timeout_destination + timeout_destination_type + greeting_id to the PUT allowedFields array AND added a server-side validator (queues-helpers.validateTimeoutDestination) that rejects (type, destination) pairs the dialplan generator would misroute. 12 unit tests cover the rule set including the exact regression (4-digit "phone number" → 400).
PR #252 replaced the type+destination two-field combo in the editor with a smart picker: kind buttons [ No routing | User | Queue | Phone ] + a contextual SearchableSelect per kind. Operators no longer have to know which Asterisk context their destination lives in.
PR #253 fixed a derived-state bug in #252 where clicking the kind button visually toggled but the dropdown below didn't appear (mode was being derived from form values; clicking a button cleared the destination, which made the derived mode fall back to "none").
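
To make the validator idea concrete, here is a sketch with a hypothetical rule set. It is not `queues-helpers.validateTimeoutDestination`: the real rules are not reproduced on this page, and the 10-to-15-digit phone rule and the `known` lookup object are assumptions chosen only so that the documented regression (a 4-digit value saved as type phone) is rejected.

```javascript
// Hypothetical rule set; the real validator's rules may differ.
// `known` lists the extensions and queues that exist for the org.
function checkTimeoutDestination(type, destination, known = { extensions: [], queues: [] }) {
  const dest = destination == null ? '' : String(destination).trim();

  if (type == null || type === '') {
    // "No routing": a destination without a type cannot be routed.
    return dest === ''
      ? { ok: true }
      : { ok: false, reason: 'destination given without a type' };
  }
  if (dest === '') return { ok: false, reason: `type "${type}" needs a destination` };

  switch (type) {
    case 'extension':
      return known.extensions.includes(dest)
        ? { ok: true }
        : { ok: false, reason: `unknown extension ${dest}` };
    case 'queue':
      return known.queues.includes(dest)
        ? { ok: true }
        : { ok: false, reason: `unknown queue ${dest}` };
    case 'phone':
      // Assumption: a dialable PSTN number has 10 to 15 digits.
      return /^\+?\d{10,15}$/.test(dest)
        ? { ok: true }
        : { ok: false, reason: `"${dest}" is not a dialable phone number` };
    default:
      return { ok: false, reason: `unknown type "${type}"` };
  }
}

console.log(checkTimeoutDestination('phone', '5004'));
console.log(checkTimeoutDestination('queue', '5004', { extensions: [], queues: ['5004'] }));
```

A PUT handler would map `ok: false` to a 400 response carrying `reason`, so the operator sees why the save was refused.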

Cleanup for existing stale data¶

The code fix doesn't repair existing bad rows. After deploy, open the queue in the editor — the picker shows the stale state clearly (e.g. Phone: 5004) so the operator can correct it and save. The validator now rejects bad combos at save time, so it can't be re-broken.

Rule for future allow-lists¶

When adding columns the editor sends, add them to the PUT handler's allowedFields array in the same PR. The silent-drop pattern is hard to detect — the API returns 200, the editor shows a success toast, but nothing persisted. Test by saving + reopening, not by checking the response status.
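
One way to make the silent-drop pattern visible is to have the allow-list filter report what it discarded. A minimal sketch; `pickAllowed` is a hypothetical helper, and the `name` and `strategy` field names in the sample are placeholders rather than the actual queue schema:

```javascript
// Split a request body into fields that may be persisted and fields
// that the allow-list would otherwise drop without a trace.
function pickAllowed(body, allowedFields) {
  const updates = {};
  const dropped = [];
  for (const [key, value] of Object.entries(body)) {
    if (allowedFields.includes(key)) updates[key] = value;
    else dropped.push(key);
  }
  return { updates, dropped };
}

// Sample allow-list that is missing the timeout fields, as in the incident.
const allowedFields = ['name', 'strategy'];
const body = {
  name: 'test',
  timeout_destination: '5004',
  timeout_destination_type: 'queue',
};

const { updates, dropped } = pickAllowed(body, allowedFields);
if (dropped.length > 0) {
  // Surfacing this in logs (or as a 400) makes the gap visible immediately.
  console.warn(`ignored fields not in allow-list: ${dropped.join(', ')}`);
}
console.log(updates);
```

Whether dropped fields should only be logged or should fail the request is a design choice; failing loudly is the stricter option, but it breaks clients that legitimately send extra fields.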

Error 60: PJSIP reload deadlock after concurrent `reloadAsteriskConfiguration()` calls¶

Dates: 2026-05-19 (Tata path silent-drop), 2026-05-20 (Kolathur DID approve)
Severity: P0 (all inbound DIDs play "number not in service")
Fix: PR #255 (serialize reload calls + targeted module reloads)

Symptom¶

After an admin action that triggers autoDeploy() (DID approve, queue save, user update, IVR save), every inbound Tata call plays "this number is incorrect" / "number not in service." The Asterisk dialplan looks correct on inspection; the dispatcher has the right Goto for every DID. But:

tcpdump shows OPTIONS from NUC (10.10.10.2:5060) arriving on prod's wg0 interface, with no reply transmitted.
pjsip show aor cloud-aor on NUC reports Unavailable (no response to qualify probes).
asterisk -rx 'pjsip reload' on prod returns "A module reload request is already in progress; please be patient" — for any command involving a reload, indefinitely.
pm2 logs astrapbx may show one or more 🔄 Reloading Asterisk configuration... lines without the matching ✅ Asterisk configuration reloaded line.
core show channels shows 0 active channels.

The only recovery is to SIGKILL the asterisk process and start it fresh.

Root cause¶

reloadAsteriskConfiguration() in api/src/services/asterisk/configDeploymentService.js was fired from ~18 call sites in server.js with no serialization. Two admin actions in close succession (e.g. a DID approve immediately after a queue save) launched two concurrent asterisk -rx "core reload" shell calls. Asterisk's loader.c rejected the overlap with:

[<ts>] VERBOSE loader.c: The previous reload command didn't finish yet

Normally harmless. But when the first reload wedged inside res_pjsip (observed in both incidents while it was loading a brand-new endpoint file, pjsip_<new_org>.conf, for the first time after a DID approve), the queue piled up. Every subsequent reload returned "previous reload didn't finish yet" forever, the CLI mutex never released, and SIP processing broke. tcpdump showed receives but no transmits because the reply path couldn't acquire the wedged mutex.

The smoking gun in /var/log/asterisk/full.log at the 2026-05-20 incident:

[May 20 17:17:37] VERBOSE[484509] loader.c: The previous reload command didn't finish yet
[May 20 17:17:37] VERBOSE[484512] loader.c: The previous reload command didn't finish yet
[May 20 17:17:37] VERBOSE[484515] loader.c: The previous reload command didn't finish yet

Three concurrent rejections, all 11 minutes after the 17:06 reload fired and stuck.

Diagnostic commands¶

# 1. Confirm the symptom: tcpdump shows receives but no transmits from NUC
ssh root@82.180.146.80 'timeout 6 tcpdump -i any -n -s 0 -nn -l "udp and host 10.10.10.2"'
# Expected (broken): only `wg0 In` lines, no `wg0 Out` from 10.10.10.1
# Expected (healthy): both In and Out, with `200 OK` replies

# 2. Probe the reload queue (this hangs when deadlocked)
ssh root@82.180.146.80 'timeout 6 asterisk -rx "pjsip reload"'
# Broken: "A module reload request is already in progress; please be patient" repeating
# Healthy: short response, then prompt returns

# 3. Smoking gun in full.log
ssh root@82.180.146.80 'grep "previous reload command" /var/log/asterisk/full.log | tail -5'
# Multiple lines at the same timestamp = concurrent reload pile-up

Immediate recovery (when deadlocked)¶

This procedure worked both times. It costs ~15 s of downtime; 0 active calls is the expected state when deadlocked.

ssh root@82.180.146.80
PID=$(pgrep -x asterisk)
kill -9 $PID
sleep 3
rm -f /var/run/asterisk/*           # stale control socket
/usr/sbin/asterisk                  # daemon start
sleep 10
asterisk -rx "core show uptime"     # confirm CLI responsive
asterisk -rx "pjsip show aor tata_gateway" | grep -i avail
# tata_gateway should be Avail with RTT ~150ms within 30s

Permanent fix (PR #255, deployed 2026-05-20)¶

reloadAsteriskConfiguration() now serializes via an instance promise chain:

class ConfigDeploymentService {
  constructor() {
    // …
    this._reloadLock = Promise.resolve();
  }
  async reloadAsteriskConfiguration() {
    const previous = this._reloadLock.catch(() => {});
    this._reloadLock = previous.then(() => this._doReload());
    return this._reloadLock;
  }
  async _doReload() { /* the actual reload work */ }
}

Concurrent callers queue in JS — exactly one asterisk -rx shell call is in flight at a time.

Also replaced the heavyweight core reload (which reloaded every module including res_pjsip even when only ext_*.conf changed) with targeted module reloads matching exactly what the service rewrites:

dialplan reload                  → ext_*.conf
module reload res_pjsip.so       → pjsip_*.conf
module reload app_queue.so       → queues_*.conf

Plus a 750 ms settle delay before seedQueueMemberDevstates() so the per-member asterisk -rx "devstate change …" CLI calls don't race the tail-end of the reload sequence on Asterisk's CLI mutex.
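
The file-to-command mapping above can be expressed as a small lookup. A minimal sketch: `reloadCommandsFor` is a hypothetical helper, not the service's actual code, and it deliberately throws on an unrecognised file name so that a new file type cannot quietly fall back to a blanket reload.

```javascript
// Which CLI command refreshes which family of generated files.
const RELOAD_RULES = [
  { match: /^ext_.*\.conf$/, command: 'dialplan reload' },
  { match: /^pjsip_.*\.conf$/, command: 'module reload res_pjsip.so' },
  { match: /^queues_.*\.conf$/, command: 'module reload app_queue.so' },
];

// Return the de-duplicated reload commands needed for a set of changed files.
function reloadCommandsFor(changedFiles) {
  const commands = [];
  for (const file of changedFiles) {
    const rule = RELOAD_RULES.find((r) => r.match.test(file));
    if (!rule) throw new Error(`no targeted reload known for ${file}`);
    if (!commands.includes(rule.command)) commands.push(rule.command);
  }
  return commands;
}

// Two dialplan files and one PJSIP file need only two commands.
console.log(reloadCommandsFor(['ext_orgA.conf', 'ext_orgB.conf', 'pjsip_orgA.conf']));
```

If only ext_*.conf files changed, res_pjsip is never touched, which keeps the reload away from the module where the wedge was observed.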

Verification post-fix¶

# Fire two concurrent regen-gateway calls
ssh root@82.180.146.80
KEY=$(grep ^INTERNAL_API_KEY /opt/astrapbx/.env | cut -d= -f2-)
curl -s -X POST -H "X-Internal-Key: $KEY" http://localhost:8000/api/v1/admin/regenerate-gateway &
sleep 0.3
curl -s -X POST -H "X-Internal-Key: $KEY" http://localhost:8000/api/v1/admin/regenerate-gateway &
wait

# Confirm BOTH logged as separate reloads (serial), no collision errors
pm2 logs astrapbx --lines 60 --nostream | grep -E 'Reloading|reloaded \(dialplan'
# Expect 2 "🔄 Reloading…" + 2 "✅ … reloaded (dialplan + res_pjsip + app_queue + devstate seed)"

grep "previous reload command" /var/log/asterisk/full.log | tail -5
# Expect empty (no collisions)

Rule for future reload paths¶

Never call core reload from the API. Targeted module reloads only. Each new file type you generate needs its own targeted reload command, not a blanket "reload everything."
Any new call site that fires reload-affecting CLI commands MUST go through reloadAsteriskConfiguration() so the serialization applies. Don't add raw asterisk -rx "module reload …" calls elsewhere.
The 30-second grace window in pollCdr is unrelated — that's the CDR ingest classifier, not the reload path. Don't conflate them.

Updated 2026-05-22: PR #255 doesn't fully eliminate the wedge¶

The 2026-05-22 ~10:23 IST incident on prod showed that even with PR #255's _reloadLock promise-chain serialization, a single reload can still wedge inside res_pjsip. The wedge happens when the reload thread walks PJSIP sessions and hits one in a transient state (mid-call cleanup, mid-hangup-handler, etc.) — Asterisk's session-walk has a known race that occasionally hangs there.

Once one reload wedges, the reload mutex is held forever:

asterisk -rx "module reload res_pjsip.so"
# returns: "A module reload request is already in progress; please be patient"
# (forever, until the asterisk process is killed)

Calls that are mid-hangup at the moment of wedge get permanently stuck in the hangup handler. Their channel structs survive in core show channels listing, accumulate in the Editor's Live Calls UI as "zombies", and won't release until Asterisk is restarted.

PR #255 reduces the frequency of wedges (no concurrent reloads piling up to amplify the bug surface) but does not eliminate it. The recovery procedure is unchanged: SIGKILL + restart fresh.

Detection: zombie-channel watchdog (2026-05-22)¶

The API now runs a defensive periodic check (every 15 min) via api/src/services/zombieChannelWatchdog.js. Per tick:

Reads core show channels concise and classifies stuck channels:
context = '*__hangup' AND exten = 'h' AND age > 5 min → hangup-handler stuck
state = 'Down' AND age > 2 min → Down state stuck
Tries channel request hangup on each. Works for simple zombies.
If most/all stuck channels survive the hangup → confirms Error 60 signature (channel request hangup is a no-op because the reload mutex is wedged).
On confirmed signature: opens a GitHub issue with the auto-zombie-alert label, @-mentions the on-call operator (env: GH_OPS_MENTION, default @harisuryaa). GitHub Mobile push notification fires. Issue body contains the exact SIGKILL+restart command sequence.
Auto-closes the issue after 2 consecutive clean ticks (signals successful recovery).

Watchdog never auto-restarts Asterisk — operator authorization required per CLAUDE.md Rule 3.

Zombie-channel safe characteristics¶

The watchdog targets only patterns impossible-by-design for real calls. Documented for posterity so the rules can be re-derived if the code is lost:

A channel in h@*__hangup exten only enters that state AFTER the caller has disconnected. The hangup handler exists to stamp CDR fields + fire webhooks; it must finish in milliseconds. Anything >5 min there is a zombie.
A channel in Down state has no audio path. Asterisk's reaper normally cleans these in seconds. Anything >2 min there is a zombie.

The watchdog deliberately ignores long-running Up/Ring/Dial channels — those can legitimately last 30+ min on hospital queue lines, agent-on-call sessions, etc.
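
The two rules can be re-derived as a pure function. This is a sketch, not the code in zombieChannelWatchdog.js: it assumes the channel list has already been parsed into objects with hypothetical fields `context`, `exten`, `state` and `ageSeconds`, and it reads "most/all survive" as "more than half", which is an assumption.

```javascript
const HANGUP_STUCK_SECONDS = 5 * 60; // hangup handler should finish in milliseconds
const DOWN_STUCK_SECONDS = 2 * 60;   // Down channels are normally reaped in seconds

// Return a label for a stuck channel, or null for anything that may be a real call.
function classifyChannel(ch) {
  if (ch.context.endsWith('__hangup') && ch.exten === 'h' &&
      ch.ageSeconds > HANGUP_STUCK_SECONDS) {
    return 'hangup-handler-stuck';
  }
  if (ch.state === 'Down' && ch.ageSeconds > DOWN_STUCK_SECONDS) {
    return 'down-stuck';
  }
  return null; // long-running Up/Ring/Dial channels are deliberately ignored
}

// After a hangup request was issued to every stuck channel, decide whether
// enough of them survived to match the Error 60 signature.
// Assumption: "most" means more than half.
function matchesWedgeSignature(stuckBefore, survivorsAfter) {
  if (stuckBefore === 0) return false;
  return survivorsAfter / stuckBefore > 0.5;
}

const sample = [
  { context: 'org_x__hangup', exten: 'h', state: 'Up', ageSeconds: 900 },
  { context: 'org_x__queue', exten: '5001', state: 'Up', ageSeconds: 2400 },
  { context: 'org_x__internal', exten: '1009', state: 'Down', ageSeconds: 300 },
];
console.log(sample.map(classifyChannel));
```

The 40-minute queue call in the sample is left alone, which is the property that makes the hangup attempt safe to automate.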

Updated 2026-05-23: root cause chain fully identified (PR #293 + #295)¶

The 2026-05-22 prod incident (Fintax, 6 zombie channels, 40+ min wedge) and the deterministic staging reproduction on 2026-05-22 evening during sticky-agent UAT exposed two distinct root causes that PR #255's serialization didn't cover. Both required fixes.

Root cause 1 — Unmutex'd reload paths (PR #293)

The editor's Users page called pbxConfig.reload() on every save (debounced via scheduleDeploy). That hit POST /api/v1/config/reload which called asteriskManager.coreReload() — an AMI core reload action that bypassed the _reloadLock mutex from PR #255 entirely. Meanwhile the server PUT handler fired its own mutex'd configDeploymentService.reloadAsteriskConfiguration() via the targeted asterisk -rx "module reload …" path. Both fired on the same status toggle, ~750 ms apart, landing in Asterisk's reload thread concurrently.

PR #293 refactored POST /api/v1/config/reload to call configDeploymentService.reloadAsteriskConfiguration() (the mutex'd targeted-reloads path) instead of the AMI bypass. Every reload from any source now serializes through the same _reloadLock. Also removed the redundant client-side scheduleDeploy since the server's autoDeploy on PUT already handled it.
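
The idea of routing every reload through one lock can be illustrated with a promise-chain mutex. This is a generic sketch of the technique, not the actual _reloadLock code; createSerialQueue and runExclusive are hypothetical names.

```javascript
// Minimal promise-chain mutex: tasks run strictly one after another.
// Sketch only; the real _reloadLock may be implemented differently.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function runExclusive(task) {
    const result = tail.then(() => task());
    // Swallow rejections on the chain so one failed reload does not block the next.
    tail = result.catch(() => {});
    return result; // the caller still sees the task's own success or failure
  };
}
```

A lock like this only serializes callers inside one Node process, and only helps if every reload entry point goes through the same instance, which is the gap PR #293 closed.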

Root cause 2 — Time-domain race inside Asterisk's reload thread (PR #295)

Even with all reloads routed through the same Node-level mutex, the wedge could still happen because asterisk -rx "module reload res_pjsip.so" returns from the CLI when the request is accepted, not when the reload actually completes inside Asterisk's reload thread. On a busy org the reload takes 2-5 seconds; the existing 750 ms post-reload sleep was empirically too short. If a second mutex'd reload fired ~750 ms after the first, it could land while the first was still processing → wedge.

PR #295 replaced the fixed sleep with active completion polling:

Capture Last reload age from core show uptime BEFORE firing the reload
Fire the reload via asterisk -rx
Poll core show uptime every 200 ms until the sampled age drops below the captured value (proves a new reload completed)
10 s timeout per reload step — on timeout, log a warning and proceed

Implementation in configDeploymentService._waitForReloadComplete(), helpers parseUptimeToSeconds() + _readLastReloadAgeSec(). Unit tests in api/tests/config-reload-uptime-parser.test.js (PUS1-PUS10) lock the parser shape.
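
The polling steps above can be sketched as follows. This is not the code in configDeploymentService: the helper names are hypothetical, the reader and the reload trigger are injected as functions so no CLI call is shown, and the parser assumes the uptime line looks like "Last reload: 1 minute, 5 seconds".

```javascript
// Hypothetical helpers illustrating the capture / fire / poll sequence.
const UNIT_SECONDS = { year: 31536000, week: 604800, day: 86400, hour: 3600, minute: 60, second: 1 };

// Parses the "Last reload: ..." line of `core show uptime` output into seconds.
// ASSUMPTION: units are spelled year/week/day/hour/minute/second (optionally plural).
function lastReloadAgeSec(uptimeOutput) {
  const line = uptimeOutput.split('\n').find((l) => /^Last reload:/i.test(l.trim()));
  if (!line) return null;
  let total = 0;
  for (const [, n, unit] of line.matchAll(/(\d+)\s+(year|week|day|hour|minute|second)s?/g)) {
    total += Number(n) * UNIT_SECONDS[unit];
  }
  return total;
}

// readAge: async () => seconds since last reload (or null if unreadable)
// fireReload: async () => void, triggers the reload
async function reloadAndWait(readAge, fireReload, { pollMs = 200, timeoutMs = 10000 } = {}) {
  const before = await readAge();          // 1. capture the age BEFORE the reload
  await fireReload();                      // 2. fire the reload
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {          // 3. poll until the age drops
    await new Promise((r) => setTimeout(r, pollMs));
    const now = await readAge();
    if (before !== null && now !== null && now < before) return true;
  }
  return false;                            // 4. timed out: caller warns and proceeds
}
```

One property of this sketch worth noting: if the captured age is already 0, the sampled age can never drop below it, so the wait runs to the timeout.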

Result: each save now takes 6-15 s (the actual reload duration is now visible — was previously hidden behind the 750 ms sleep). The floating progress pill was shipped alongside to give the operator real-time feedback during the wait.

Verification post-fix

After PR #295 reached prod (PR #300 promotion, 2026-05-23), the stress test was repeated on staging:

# Toggle a user's status 5 times in rapid succession
for i in 1 2 3 4 5; do
  curl -s -X PUT -H "X-API-Key: $KEY" \
    -H "Content-Type: application/json" \
    -d "{\"status\":\"$( [ $((i%2)) = 0 ] && echo inactive || echo active)\"}" \
    https://stageeditor.astradial.com/api/v1/users/<user_id>
done

# After completion, verify endpoint state is healthy (no Invalid):
ssh root@94.136.188.221 'asterisk -rx "pjsip show endpoint <endpoint>" | grep Endpoint:'
# Should show: Unavailable / Not in use — NEVER Invalid

# Verify module reload still functions (not wedged):
ssh root@94.136.188.221 'asterisk -rx "module reload res_pjsip.so"'
# Should succeed within ~3-5 s — NEVER "already in progress"

Both passed cleanly across 50+ rapid toggles across DIDs, queues, users, IVR saves. The two root-cause patches together close the dominant production trigger. Watchdog from PR #289 stays in place as the safety net for any unknown trigger that may surface.

Open: what wasn't fixed

Hypothetical remaining triggers (not observed in production, but possible):

PM2 cluster-mode running multiple Node processes: each would have its own _reloadLock, so they could still race against each other. Not applicable today (PM2 fork mode, 1 instance), but worth noting for future scale-out.
Direct shell access by an operator firing asterisk -rx "module reload res_pjsip.so" while a deploy is in flight. Not currently gated.

If a wedge ever recurs, capture pm2 logs astrapbx --lines 60 --nostream | grep -E "🔄|✅|reload" to confirm — the active wait should now print each reload's actual duration, making any time-domain anomaly visible.

Error 61: Trunk `max_channels` stored in DB but never enforced — UI lies about concurrent-call cap¶

Dates: 2026-05-20 Severity: P2 — Trunk "Channels" field on the editor is a knob with no effect; org's limits.concurrent_calls is the only real cap Fix: PR #260 (dialplan enforcement) + PR #262 (effective-cap UI) + PR #263 (CDR cap_rejected userfield + call-logs badge)

Symptom¶

In the editor Trunks page, every trunk shows a Channels field (e.g. Tata SIP Trunk = 50). Operator sets it lower (e.g. 10) and saves successfully. But the trunk continues to accept 11+ concurrent outbound calls — the cap appears to do nothing.

Simultaneously, the admin Organization page shows a different "Concurrent Calls" number (e.g. 10) under organizations.limits.concurrent_calls, with no indication of how it relates to the trunk's "Channels" field.

Root cause¶

Two issues stacked:

POST /api/v1/trunks allowedFields did not include max_channels. The editor sent the value; the API silently dropped it; the DB stored the model default (50). This is the silent-drop pattern (see also Error 50, Error 51, Error 59) — API returns 200, editor shows success toast, but the field never persists.
The dialplan never read sip_trunks.max_channels at all. dialplanGenerator.generateOutboundContext() only counted concurrent calls against the org-level limits.concurrent_calls via a single GROUP(orgCap)/GROUP_COUNT(orgCap@app) pair. There was no per-trunk counter, so the trunk "Channels" knob had nothing reading it on the call path.

So the editor showed two caps (org + trunk) but only one (org) was ever enforced.

Diagnostic commands¶

# Confirm DB value matches what UI showed (per-org)
ssh root@82.180.146.80 'mysql -u root pbx_api_db -e "
  SELECT name, max_channels FROM sip_trunks WHERE organization_id = '\''<org-uuid>'\'';"'

# Confirm dialplan does NOT have a per-trunk GROUP() block (broken state)
ssh root@82.180.146.80 'grep -A 2 "GROUP(.*trunkCap" /etc/asterisk/ext_<org>.conf'
# Empty output = bug present

Fix (PR #260)¶

generateOutboundContext() now emits a per-trunk concurrency check before the actual Dial:

exten => _X.,n,Set(GROUP(trunkCap_<trunkId>)=trunkCap_<trunkId>)
exten => _X.,n,GotoIf($[${GROUP_COUNT(trunkCap_<trunkId>@trunkCap)} > ${TRUNK_MAX}]?trunk_limit_reached,1)

Both org and trunk caps fire before the Dial. The effective cap a caller actually experiences is min(org_cap, trunk_cap).

The earlier allowedFields silent-drop was fixed in the same PR by adding max_channels to the POST handler's allowed-fields array.

How operators tell which cap blocked a call¶

PR #263 stamps the CDR userfield column at the rejection point:

exten => trunk_limit_reached,1,Set(CDR(userfield)=trunk_cap_rejected)
exten => org_limit_reached,1,Set(CDR(userfield)=org_cap_rejected)

The /api/v1/calls SQL pulls userfield into cap_rejected ('org' or 'trunk') on each row. The call-logs page shows a destructive-variant badge "Org cap" or "Trunk cap" instead of the normal Completed/Missed status (PR #263, editor/app/dashboard/[orgId]/calls/page.tsx).

How operators see the effective cap (PR #262)¶

Trunks page shows the effective cap on the "Channels" column (e.g. 10 / 50 = trunk allows 50 but org caps at 10, so 10 wins). Edit and Create dialogs include a helper line "Effective cap: min(trunk, org)". Admin org page also shows the same effective number.
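
The "10 / 50" display follows directly from the min() rule. A minimal sketch, with hypothetical helper names and no special handling for any "unlimited" value:

```javascript
// Hypothetical helpers for the effective-cap display; min() is the rule stated above.
function effectiveCap(trunkCap, orgCap) {
  return Math.min(trunkCap, orgCap);
}

// Renders "<effective> / <trunk>", e.g. trunk 50 with org cap 10 gives "10 / 50".
function channelsLabel(trunkCap, orgCap) {
  return `${effectiveCap(trunkCap, orgCap)} / ${trunkCap}`;
}
```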

Prevention rule¶

When adding a UI knob whose effect requires dialplan enforcement:

Read the dialplan generator. If there's no code that consumes the field at call time, the knob is decorative.
Test by exceeding the cap on real traffic (place N+1 concurrent calls), not by checking that the field persisted.
Stamp CDR userfield at rejection points so operators can tell which cap blocked the call. Stats endpoints + call-logs UI should surface this distinction.
See features/concurrent-call-cap.md for the full architecture.

Error 62: Call-logs shows "Internal" for softphone-originated PSTN outbound calls¶

Dates: 2026-05-20 Severity: P2 — ~19% of outbound calls miscategorised in call-logs + dashboard stats Fix: PR #265 (staging) → PR #266 (production)

Symptom¶

Agent picks up a softphone, dials a PSTN number (e.g. customer mobile), call connects and completes normally. The corresponding row in /dashboard/<orgId>/calls shows Direction = Internal instead of Outbound.

The dashboard's per-day call-volume chart undercounts outbound by the same magnitude.

Root cause¶

Asterisk's CDR records dcontext as the originating context, not whichever included context contained the matched extension pattern.

Softphone PJSIP endpoints have context=<org>__internal. The outbound dial patterns (_X., etc.) live in <org>__outbound and are pulled into __internal via include => <org>__outbound. When the caller dials a PSTN number, Asterisk:

Matches the pattern in __outbound (via the include)
Executes the Dial against the trunk
Writes CDR with dcontext='<org>__internal' (the originating context)

The previous direction CASE in api/src/server.js was:

WHEN t.dcontext = 'ai-outbound' THEN 'outbound'
WHEN t.dcontext LIKE '%incoming%' THEN 'inbound'
WHEN t.dcontext LIKE '%outbound%' THEN 'outbound'
ELSE 'internal'

So <org>__internal fell to the ELSE branch and got labeled "internal" — even when the call went out to PSTN via a trunk.

Fix (PR #265/#266)¶

Added a fourth CASE branch that catches the softphone→trunk pattern:

WHEN t.dcontext LIKE '%internal'
 AND t.lastapp = 'Dial'
 AND t.lastdata LIKE '%@%trunk%' THEN 'outbound'
ELSE 'internal'

All three signals are required:

dcontext LIKE '%internal': the call originated in a softphone's home context
lastapp = 'Dial': the last app was a Dial (rules out Playback, queue announce, etc.)
lastdata LIKE '%@%trunk%': the Dial argument referenced a trunk endpoint (rules out internal extension-to-extension Dial)

Applied in 3 places in api/src/server.js:

Row-level CASE in main GET /api/v1/calls query
Row-level CASE in secondary single-line query
Stats SUMs (SELECT … FROM asterisk_cdr weekly + totals) so dashboard counters agree with row labels
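
For unit-testing the rule outside the database, the CASE can be mirrored in JavaScript. This is a sketch only (the production logic is the SQL shown above), and the function name and sample values are hypothetical. The string checks here are case-sensitive, whereas SQL LIKE may not be, depending on the column collation.

```javascript
// JavaScript mirror of the direction CASE, in the same branch order as the SQL.
function classifyDirection({ dcontext = '', lastapp = '', lastdata = '' }) {
  if (dcontext === 'ai-outbound') return 'outbound';
  if (dcontext.includes('incoming')) return 'inbound';
  if (dcontext.includes('outbound')) return 'outbound';
  // Softphone -> trunk: all three signals must match.
  // /@.*trunk/s is the equivalent of LIKE '%@%trunk%'.
  if (dcontext.endsWith('internal') && lastapp === 'Dial' && /@.*trunk/s.test(lastdata)) {
    return 'outbound';
  }
  return 'internal';
}
```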

Verification on real prod data before merge¶

Ran the old vs new CASE side-by-side over the last 7 days (2257 rows):

Old direction	New direction	Count
inbound	inbound	1202
internal	internal	974
internal	outbound	429 ← bug fix
outbound	outbound	52

Zero false flips. Spot-checked 5 of 429 — all genuinely outbound (org's outbound DID as src, 10-digit customer number as dst, channel from softphone endpoint, dstchannel to trunk endpoint).

Prevention rule¶

dcontext alone is unreliable for direction inference. When context A pulls in context B via include =>, Asterisk records A as the dcontext even when the executing logic came from B. Future direction classifiers must look at lastapp and lastdata together, not just dcontext. For new outbound patterns, prefer either:

A dedicated outbound context that's the originating context for the channel (set via Goto before the Dial), so dcontext alone is sufficient, OR
Multi-signal CASE branches like the one above, with verification on real CDR data before merging.

Error 63: Editor deploy crash-loops on `EADDRINUSE` / binds wrong port — orphaned `next-server` + dropped `PORT`¶

Dates: 2026-06-05 (surfaced on the first post-Mumbai prod promotion) Severity: P2 — editor.astradial.com / stageeditor left crash-looping or bound to the wrong port after a deploy Fix: PR #364 (orphan-proof restart) + PR #365 (PORT) → promoted to prod via PR #366

Symptom¶

After an editor deploy (Deploy Editor to {staging,production}), one of:

The workflow reports success, but minutes later pm2 jlist shows editor = errored, pid=0, restart count climbing — while editor.astradial.com still serves (307) via an orphaned process. pm2 has lost control: no auto-recovery, and the next deploy crash-loops the same way. In the browser this can present as dead buttons / ChunkLoadError (stale chunks served by the orphan).
Or the editor comes up online but on port :3000 instead of :3001 → the reverse proxy (expecting 3001) can't reach it → editor effectively down. The deploy's verify step (curl :3001) fails.

Root cause¶

Three compounding issues:

pm2 restart orphans the server. pm2 only kills the PID it tracks (the npm parent). npm start → next start → next-server (grandchild) outlives it, keeps holding *:3001, and the new instance crash-loops on EADDRINUSE until pm2 gives up (errored). The orphan keeps serving the old build.
The pre-empt kill missed the orphan. The old "kill orphan" step used lsof -ti:3001, which can miss the IPv6 *:3001 bind next-server uses — so the orphan survived its own kill step.
PORT is not read from .env by next start. When the fix replaced pm2 restart --update-env (which preserved the original PORT) with pm2 delete + a fresh pm2 start, the stored PORT was dropped → editor defaulted to :3000. (Caught on the first staging validation of the fix — exactly what staging-first is for.)

Fix (PR #364 + #365, both editor deploy workflows)¶

The deploy's restart sequence is now orphan-proof and port-explicit:

pm2 delete editor 2>/dev/null || true
# free :3001 by NUMERIC PID via ss (catches the IPv6 bind lsof misses); retry loop
ss -ltnHp 'sport = :3001' | grep -oE 'pid=[0-9]+' | cut -d= -f2 | xargs -r kill -9
PORT=3001 pm2 start npm --name editor -- start   # PORT must be a real env var
pm2 save

The verify step now retries for ~45s (next-server bind is slow) instead of a single sleep 5.
⚠️ Never pkill -f "next…" to clear orphans in a deploy/SSH step — that pattern also matches the running shell's own command line and kills the job/session (hit this mid-fix; exit 255). Kill by numeric PID only.

Durable root fix (PR #377) — single-process standalone build¶

The deploy-hardening above only covered the deploy restart. The orphan recurred on a post-deploy crash-restart (during the On Call Features prod promotion): after a successful deploy the editor crashed, pm2 auto-restarted npm start, and that restart — which doesn't go through the hardened deploy step — re-orphaned the next-server child. The real fix is to remove the child entirely:

editor/next.config.ts: output: 'standalone' → the build emits .next/standalone/server.js, a self-contained Node HTTP server.
pm2 runs node server.js directly — a single process it fully owns, with no npm/next-server child to orphan on ANY restart (deploy or crash).
The .env wrinkle (standalone server.js does NOT load .env, unlike next start) is solved with Node 20.6+ --env-file (the boxes run v20.20) — no dotenv dependency:
```
PORT=3001 pm2 start .next/standalone/server.js --name editor \
  --node-args="--env-file=/opt/pipecat-flow-editor/.env"
```
The build must copy .next/static + public/ into .next/standalone/ (standalone omits them).
Validated: 3× pm2 restart editor → single listener every time, zero EADDRINUSE; /api/pbx proxy (rewrites) works (--env-file loads NEXT_PUBLIC_PBX_URL + the server-side ADMIN_*/*_API_KEY vars).

After #377 the deploy-hardening (kill-by-PID, ss) is still in place as belt-and-suspenders, but the orphan can no longer form. No more manual reconciles on prod promotions.

⚠️ One-time transition gotcha (first standalone deploy on a box only). Migrating a box from the old npm start to standalone: the old npm start process — reparented to init (ppid 1) by an earlier manual reconcile — keeps RESPAWNING next-server on :3001, so the new node server.js can't bind and pm2 goes errored. The deploy's ss-kill frees the listener but the npm parent respawns it, so the automated deploy can't displace it. Hit on the prod On Call Features promotion (2026-06-06). One-time fix:
pm2 delete editor
# kill the OLD npm tree by PID: next-server -> sh -c next start -> npm start (all three)
ps -eo pid,cmd | grep -iE 'next-server|sh -c next start|npm start' | grep -v grep   # get PIDs
kill -9 <those-pids>
# confirm :3001 free, then fresh standalone start
cd /opt/pipecat-flow-editor
PORT=3001 pm2 start .next/standalone/server.js --name editor \
  --node-args="--env-file=/opt/pipecat-flow-editor/.env"
pm2 save
Verify pm2 actually OWNS the listener: the pm2 editor pid MUST equal the :3001 listener pid (mismatch = orphan still present). Then restart-stress (pm2 restart editor ×3-4): a healthy standalone gives a new pid each time, pm2-owned, zero EADDRINUSE. After this one-time cleanup, all subsequent deploys/restarts are clean. (Never pkill -f next… — it kills your own SSH shell.)
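
The ownership check (pm2 pid equals the :3001 listener pid) can be automated with a small parser. This is a sketch under assumptions: the inputs are the stdout of ss -ltnHp 'sport = :3001' and the parsed JSON of pm2 jlist, each jlist entry is assumed to expose top-level name and pid fields, and the function names are hypothetical.

```javascript
// Hypothetical check: does pm2 own the only listener on the port?
function listenerPids(ssOutput) {
  // A process may hold several sockets (IPv4 + IPv6), so de-duplicate the pids.
  return [...new Set([...ssOutput.matchAll(/pid=(\d+)/g)].map((m) => Number(m[1])))];
}

// ASSUMPTION: pm2List entries look like { name, pid, ... }.
function pm2OwnsListener(pm2List, ssOutput, name = 'editor') {
  const proc = pm2List.find((p) => p.name === name);
  const pids = listenerPids(ssOutput);
  // Healthy: exactly one listening process, and it is the pid pm2 tracks.
  return Boolean(proc) && pids.length === 1 && pids[0] === proc.pid;
}
```

A false result with a non-empty pid list corresponds to the "mismatch = orphan still present" case described above.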

Manual recovery (if a deploy leaves the editor wedged)¶

ssh root@<vps>            # prod 82.180.146.80 / staging 94.136.188.221
pm2 stop editor            # halt the auto-restart loop FIRST (bare `pm2 restart` just re-crashes)
for p in 3000 3001; do ss -ltnHp "sport = :$p" | grep -oE 'pid=[0-9]+' | cut -d= -f2 | xargs -r kill -9; done
cd /opt/pipecat-flow-editor && PORT=3001 pm2 start npm --name editor -- start
pm2 save
# verify: pm2 jlist (status online, restarts steady), single listener on :3001, curl localhost:3001 = 307

Prevention rule¶

Replacing pm2 restart --update-env with pm2 delete + fresh pm2 start drops the stored env (notably PORT) — always set required runtime env (PORT=3001) explicitly on the fresh start.
Use ss, not lsof, to find port holders (IPv6 reliability).
Validate editor-deploy workflow changes on staging before promoting — dispatch deploy-editor-staging.yml via workflow_dispatch --ref <branch> against the staging box. This caught both the orphan and the PORT bug before they reached prod.
Ports for reference: editor 3001, astrapbx API 8000, workflow-engine 3002. Prod/staging deploys run on self-hosted runners (mumbai-prod-* / staging-vps-platform).

Error 64: Phone-forward calls lose the customer caller ID (shows the DID) — Live Calls + CDR¶

Dates: 2026-06-06 Severity: P1 for hospitals — agents (and call logs) couldn't see who was calling Fix: PR #374 (staging) → promoted to prod in the On Call Features batch

Symptom¶

For a phone-forward user (ring_target=phone, e.g. a "Personal" extension that rings a staff mobile), Live Calls showed the business DID as the caller, and the CDR src stored the DID — not the real customer number. SIP-agent calls were unaffected (they always showed the real caller — confirmed in prod CDRs).

Root cause¶

The dialplan generator emitted, on the phone-forward leg:

exten => <ext>,n,Set(CALLERID(num)=<org DID>)
exten => <ext>,n,Dial(PJSIP/<mobile>@<trunk>,...)

The Set(CALLERID(num)=DID) was there so the staff mobile shows the business number AND so the Tata SBC gets a valid owned-DID From (it rejects/substitutes an unowned From — see Error from 2026-05-16). But it overwrote the inbound channel's CallerID, erasing the customer's number — which is what Live Calls (AMI CoreShowChannels) and the CDR src read. The NUC does not touch caller ID; this was purely the VPS generator.

Fix (PR #374)¶

Present the DID on the outbound leg only, via a Dial b() pre-dial handler, instead of overwriting the inbound channel:

; in the org internal context (emitted once):
exten => fwdcid,1,Set(CALLERID(num)=${ARG1})
exten => fwdcid,n,Return()
; phone-forward Dial:
exten => <ext>,n,Dial(PJSIP/<mobile>@<trunk>,30,tTb(<ctx>_internal^fwdcid^1(<DID>)))

The b() runs the Gosub on the outbound channel before it dials, so the mobile sees the DID and Tata gets a valid From — while the inbound channel keeps CALLERID = customer, so Live Calls + CDR store the real caller. Applied to both phone paths: the direct user extension and the queue-member (qm) helper.

Prevention rule¶

To set an outbound caller ID without losing the inbound one, use the Dial b(context^exten^pri(arg)) pre-dial handler — never Set(CALLERID(num)=...) on the bridged inbound channel. The inbound channel's CallerID is what AMI/CDR report.

Error 65: Live Calls From/To swaps every refresh; Transfer hits the wrong number¶

Dates: 2026-06-06 Severity: P2 — operator confusion; wrong-number transfers on prod Fix: PR #375 (staging) → promoted to prod in the On Call Features batch

Symptom¶

The same live call showed its From/To reversed between 3-second auto-refreshes, and Transfer sometimes redirected the wrong party (e.g. moved the agent/destination instead of the customer).

Root cause¶

A bridged call has 2+ Asterisk channels (legs) sharing a linkedid; each leg's CallerID/ConnectedLine are opposite (caller leg: from=customer/to=agent; callee leg: the reverse). /calls/live dedupes to one row per linkedid, but the representative-leg pick was order-dependent (an isLocal/isTrunk/direction heuristic) — so a different leg was chosen on each AMI snapshot, flipping From/To. And the chosen leg's channel_id is what Transfer redirects — if it was the callee leg, Transfer moved the wrong party.

Fix (PR #375)¶

Pick the representative deterministically — always the PRIMARY leg (uniqueid === linkedid, the originating/caller channel), ties broken on channel_id so the result is identical regardless of AMI ordering:

const repScore = (ch) => (ch.uniqueid === ch.linkedid ? 100 : 0) + (ch.channel_id.startsWith('Local/') ? 0 : 10);
// keep highest score per linkedid; tie-break: smaller channel_id

The primary leg's CallerID = the real caller and ConnectedLine = the real destination (stable, correct From/To), and its channel_id is the caller leg — so Transfer redirects the customer.
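
Extending the repScore line into a complete pick, a minimal sketch could look like the following. The pickRepresentatives name and the Map-based reduction are illustrative and not necessarily how PR #375 implements it; the channel fields (uniqueid, linkedid, channel_id) follow the text.

```javascript
// Score from the fix above: primary leg first, non-Local channels preferred.
const repScore = (ch) =>
  (ch.uniqueid === ch.linkedid ? 100 : 0) + (ch.channel_id.startsWith('Local/') ? 0 : 10);

// Returns one representative channel per linkedid, independent of input order.
function pickRepresentatives(channels) {
  const best = new Map(); // linkedid -> best channel seen so far
  for (const ch of channels) {
    const cur = best.get(ch.linkedid);
    const better =
      !cur ||
      repScore(ch) > repScore(cur) ||
      // Tie-break on the smaller channel_id so AMI ordering cannot matter.
      (repScore(ch) === repScore(cur) && ch.channel_id < cur.channel_id);
    if (better) best.set(ch.linkedid, ch);
  }
  return [...best.values()];
}
```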

Prevention rule¶

When collapsing multi-leg Asterisk calls to one display row, choose the representative deterministically (prefer uniqueid === linkedid), not by a heuristic over the AMI snapshot — snapshot ordering is not stable across polls.

Error 66: Webhooks silently deliver nothing on MariaDB (`Op.contains` is Postgres-only)¶

Dates: 2026-06-07 Severity: P2 — call/queue webhooks never fired; integrations got no events Fix: PR #381 (staging)

Symptom¶

call.* and queue.* webhooks (and the POST /api/v1/test/call-event simulator) delivered nothing, while CRM webhooks (crm.client.created, …) delivered fine.

Root cause¶

triggerWebhooks() in api/src/server.js selected matching webhooks with Sequelize Op.contains, which is PostgreSQL-only. On MariaDB it matched zero rows, so every webhook fired through that path was silently dropped. CRM webhooks used a different path (webhookService.deliverWebhook, which filters in JS) and worked.

Fix¶

triggerWebhooks() now delegates to webhookService.deliverWebhook — JS-side event matching (events.includes(eventType)), plus HMAC signing (X-PBX-Signature), retries, and delivery-status recording. Verified live on staging (test → call.initiated → call.answered all delivered).

Prevention rule¶

Never use Op.contains / array-containment operators in this codebase — the DB is MariaDB, not Postgres. Match JSON-array membership in JS after a plain findAll. See Webhooks.
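
JS-side matching after a plain findAll can be as small as the sketch below. The function name is hypothetical, and it assumes the events column may come back either as an array or as a JSON string, depending on how the model reads it.

```javascript
// Hypothetical filter: keep webhooks whose events list contains eventType.
// ASSUMPTION: w.events is an array, or a JSON-encoded array stored as text.
function webhooksForEvent(webhooks, eventType) {
  return webhooks.filter((w) => {
    let events = w.events;
    if (typeof events === 'string') {
      try {
        events = JSON.parse(events);
      } catch {
        return false; // unparseable config never matches
      }
    }
    return Array.isArray(events) && events.includes(eventType);
  });
}
```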

Error 67: STD-landline (single-channel FXO) inbound call's transfer/forward hits BUSY and drops¶

Dates: 2026-06-07 Severity: P2 — Mithra (Airtel FXO) calls dropped on transfer Fix: PR #379 (staging)

Symptom¶

A call that arrived on a single-channel FXO / STD-landline DID, when transferred or forwarded out, got an instant BUSY and the caller was dropped.

Root cause¶

With per-line outbound routing on, the inbound-DID subroutine pinned __OB_TRUNK_EP to the trunk the call arrived on. For a 1-channel FXO line that one channel is already occupied by the call itself, so its own outbound leg (transfer/forward/queue-ring) dialed back out the busy trunk → BUSY → caller dropped.

Fix¶

In dialplanGenerator.js generateDidSubroutine, skip the __OB_TRUNK_EP pin when the DID trunk is single-channel (max_channels === 1) or trunk_type === 'inbound' — the resolver then falls back to the org's default multi-channel trunk (e.g. Tata). Multi-channel and max_channels = 0 (unlimited) still pin; gate-off orgs unchanged. Proven on staging (AstraPrivate BSNL): the forward leg used org_..._tata, caller kept.
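
The pin decision reduces to a small predicate. This sketch covers only orgs with per-line outbound routing enabled; the function name is hypothetical, and the trunk fields (max_channels, trunk_type) follow the text above.

```javascript
// Hypothetical predicate: should outbound legs be pinned to the arrival trunk?
function shouldPinOutboundTrunk(trunk) {
  if (trunk.trunk_type === 'inbound') return false; // cannot carry outbound legs
  if (trunk.max_channels === 1) return false;       // the only channel is already in use
  return true; // multi-channel, or max_channels = 0 (unlimited)
}
```

When this returns false, the caller would skip setting __OB_TRUNK_EP and let the resolver fall back to the org's default trunk.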

Prevention rule¶

A single-channel inbound line can never carry a second concurrent leg — don't pin a call's outbound legs to the trunk it arrived on when that trunk is 1-channel/inbound-type. See the per-line outbound routing generator.

Error 68: Failed phone-forward drops the caller into dead air¶

Dates: 2026-06-07 Severity: P2 — transferred/forwarded callers silently disconnected Fix: PR #379 (staging)

Symptom¶

A queue caller transferred to a ring_target=phone extension whose external number didn't answer was silently disconnected — no announcement.

Root cause¶

The phone-forward branch in generateUserExtension ended a failed external Dial with an unconditional Goto(end) → Hangup() — no DIALSTATUS handling.

Fix¶

After the external Dial, branch on DIALSTATUS (NOANSWER → announce, BUSY → busy) like the SIP path — never to (unreachable), preserving the no-failover-on-busy/no-pickup rule. The caller now hears the unavailable announcement instead of dead air.

Prevention rule¶

Every outbound-leg Dial that can fail must route the caller somewhere audible on DIALSTATUS, not Hangup() silently.

Error 69: "Person at extension N is not available" announce plays only the digits¶

Dates: 2026-06-07 Severity: P3 — confusing UX (caller hears a bare "1004") Fix: PR #380 (staging)

Symptom¶

The no-answer / unavailable announce played only the extension digits (e.g. "one zero zero four") with no surrounding words.

Root cause¶

The announce referenced custom sound files the-person-at-exten and is-not-available, which do not ship with any Asterisk install. Playback failed silently (File … does not exist in any format); only SayDigits worked. The bug was pre-existing on every no-answer announce and was surfaced by the Error 68 fix, which routes failed forwards to this announce.

Fix¶

Use the stock voicemail prompts present in every install: the-person-at-exten → vm-theperson ("the person at extension"), is-not-available → vm-isunavail ("is unavailable"). Applies to both announce blocks (active-user no-answer + inactive-user-with-failover).

Prevention rule¶

Only reference sound files that ship with Asterisk (or are deployed to /var/lib/asterisk/sounds). Verify with ls /var/lib/asterisk/sounds/en/<name>.* before adding a Playback.

Error 70: API docs "Try It" → "Network Error" behind Cloudflare Access¶

Dates: 2026-06-07 Severity: P3 — in-browser API testing broken in the Reference Fix: PR #386 (staging)

Symptom¶

"Send API Request" in the /reference (Stoplight Elements) docs returned "Network Error", while the same call worked in Swagger (/api-docs).

Root cause¶

The docs hostnames are behind Cloudflare Access, and the API returns Access-Control-Allow-Origin: * with Access-Control-Allow-Credentials: true — an invalid combo browsers block for cross-origin credentialed requests. Swagger worked because it called the same origin; Elements was hitting an absolute server (and not sending the CF Access cookie).

Fix¶

Make the spec's first/default server a relative /api/v1 so Try-It always calls the same origin → no CORS check, CF Access cookie sent. Dropped the unusable http://<ip>:3000 / localhost servers.
Set tryItCredentialsPolicy="include" on <elements-api>.

Prevention rule¶

For browser try-it behind Cloudflare Access, keep the OpenAPI server relative (same-origin) — never an absolute cross-origin URL with ACAO: * + credentials. See Interactive API Docs.
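
A response can be checked for the blocked header combination with a few lines. This is a sketch: the function name is hypothetical, and it assumes the headers are passed as a plain object with lower-cased names.

```javascript
// Flags the combination browsers reject for credentialed cross-origin requests:
// a wildcard Access-Control-Allow-Origin together with Allow-Credentials: true.
// ASSUMPTION: header names in the input object are lower-cased.
function hasInvalidCredentialedCors(headers) {
  const origin = headers['access-control-allow-origin'];
  const credentials = headers['access-control-allow-credentials'];
  return origin === '*' && credentials === 'true';
}
```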

Error 71: "Many calls not reaching" — root cause was a loose cable, not software¶

Dates: 2026-06-07 Severity: P1 (customer-reported) — turned out physical Fix: Customer reseated the cable; no code change

Symptom¶

Mithra Scans reported "many calls not reaching." A queue call showed Queue MRI ANSWERED 47s / Not answered / Missed with no member-ring leg — the queue appeared to not ring available agents.

Root cause¶

Intermittent loose cable at the customer premises: the landline agent phones lost connectivity intermittently, so when a call landed during a drop the queue (ring_inuse=no) found no reachable member and the caller abandoned. The cloud was healthy throughout — over 3h, 57/58 inbound calls reached someone (~98%); both trunks Avail; DIDs routed; a live monitor showed the queue ringing free agents correctly and "In use" readings matched real CDR calls.

Prevention rule / diagnosis pattern¶

When "calls not reaching" is reported with no cloud-side errors: confirm the agent reach rate (CDR qmem legs with billsec>0 per linkedid) and whether the queue ever rang members (no member-dial leg = not rung). If infra is clean (trunks Avail, no SIP errors, queue rings free agents in a live queue show monitor), suspect the premises (cable/NAT/registration), not the dialplan. Calls that bounce off a busy single-channel line never create CDRs and are invisible, so check per-DID concurrency limits too.
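
The reach-rate figure can be computed from CDR rows grouped by linkedid. This is a sketch with an assumed row shape: each row carries linkedid, billsec, and a boolean isQueueMemberLeg flag. How a qmem leg is actually identified in the CDR is not shown here, so that flag is a placeholder.

```javascript
// Fraction of calls (by linkedid) where at least one queue-member leg was answered.
// ASSUMPTION: rows look like { linkedid, isQueueMemberLeg, billsec }.
function agentReachRate(rows) {
  const reached = new Map(); // linkedid -> was any member leg answered?
  for (const r of rows) {
    const hit = Boolean(r.isQueueMemberLeg) && r.billsec > 0;
    reached.set(r.linkedid, (reached.get(r.linkedid) || false) || hit);
  }
  const total = reached.size;
  const ok = [...reached.values()].filter(Boolean).length;
  return { total, reached: ok, rate: total ? ok / total : null };
}
```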

Error 72: Patient's missed call shows 78001 → callback lands in the staging test org¶

Date: 2026-06-08 Severity: High (latent) — patient-safety fault line for hospitals; no active outage found

Symptom¶

A Thangavelu Hospital patient (9688868241) showed up on the AstraPrivate staging softphone (org_mna9x47k_1001) calling +91 80659 78001. Concern: "calls only show 78001 when something is wrong — a hospital call must have leaked to the fallback DID."

What the trace actually found¶

Full sweep of both VPSes — asterisk_cdr (the only live CDR sink; the cdr_raw/pbx_db sink in cdr_adaptive_odbc.conf is configured but the table does not exist), call_records, and full.log (Apr→Jun):

The screenshot call was a plain inbound: 919688868241 → 918065978001, org_mna9x47k__incoming → queue 5002 → rang ext 1001 → answered 11s. Routed Tata → NUC → prod dispatcher → WireGuard → staging (78001 is routing_environment='staging').
No platform outbound ever presented 78001 to this number, on either box, in two months. The only outbound to the patient (2026-05-16) presented the correct Thangavelu DID 78012.
So the premise "a callback to 78001 ⟹ a call was triggered using 78001" was contradicted. The inbound call itself created the only 78001 association — most likely a misdial of 78012, or a test line (the same number hit Thangavelu bd5706c3 in May and AstraPrivate 7f3d2fd5 today). This specific call was benign.

Root cause (the real issue the trace exposed)¶

+918065978001 had a dual role: it is AstraPrivate's live staging DID and it was the NUC outbound-CID fallback default (from-cloud cid_default). Because 78001 is routing_environment='staging', a PSTN callback to it is bridged into the AstraPrivate test org. So if any prod hospital outbound call ever fell back to 78001 (the Error 33 / per-user-DID class, or an org with no usable DID), the patient returning that missed call would land on a test softphone answered by no clinician — never their hospital. (See Outbound Caller ID resolution and Error 33.)

Fix — Layer 1 (containment, LIVE on prod, verified)¶

NUC from-cloud cid_default: +918065978001 → +918065978000 (a dedicated, unassigned sentinel). Backed up extensions.conf.bak-2026-06-08, graceful dialplan reload (no restart); live dialplan confirms [cid_default] Set(CALLERID(all)=+918065978000).
Reserved +918065978000 in the marketplace: did_numbers.pool_status available → reserved, no owning org. A callback to it hits tata-did-route's _X. catch-all → Playback(number-not-in-service) → Hangup() — safe, never a tenant/test org.

78001 remains AstraPrivate's working staging DID (its inbound bridge and its own outbound CID in ext_from_cloud.conf are untouched).

Fix — Layer 2 (prevention, PR #424 — merged to staging, verified; prod promotion pending)¶

generateOutboundContext blocks outbound for an org with an active route but no assignable caller-ID DID (no override, no default, no assigned+active DID) — refuses the call (cannot-complete-as-dialed, CDR userfield=no_caller_id_did) instead of letting an extension/empty CID leak to the sentinel. Prod blast-radius audit: only Fintax Advisory (DID inactive; last outbound 2026-05-25, already leaked the sentinel 2×) and PVS GLOBAL NETWORK (0 outbound ever) — both effectively inactive; assign each a valid DID if they should dial out.
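
The guard's decision can be sketched as a resolve-or-refuse function. The names and the input shape are hypothetical, and the lookup order (override, then default, then any assigned and active DID) is an assumption based on the order in which the text lists them; the refusal values are the ones quoted above.

```javascript
// Hypothetical caller-ID resolution; returns null when no DID is assignable.
// ASSUMPTION: org looks like { override, defaultDid, dids: [{ number, assigned, active }] }.
function resolveCallerIdDid({ override, defaultDid, dids = [] }) {
  if (override) return override;
  if (defaultDid) return defaultDid;
  const usable = dids.find((d) => d.assigned && d.active);
  return usable ? usable.number : null;
}

// Dial with a real DID, or refuse instead of leaking to the sentinel.
function outboundDecision(org) {
  const did = resolveCallerIdDid(org);
  return did
    ? { action: 'dial', callerId: did }
    : { action: 'refuse', playback: 'cannot-complete-as-dialed', cdrUserfield: 'no_caller_id_did' };
}
```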

Status: merged to staging (commit c3a1d89) and deployed — the guard is live in the deployed generator and AstraPrivate's regenerated outbound context is unaffected (resolves its DID 78001 → Dial, not blocked), confirming no regression for DID-holding orgs. Still pending before prod promotion: (1) live block-path E2E (a throwaway no-DID org → confirm cannot-complete-as-dialed); (2) decide Fintax/PVS (assign a DID or accept the block). Do NOT treat the guard as shipped to prod until promoted.

Prevention rule¶

A caller-ID fallback sentinel must always be an unassigned/reserved DID — never a live tenant's number, and never one routed into another (especially a test) org. When asserting a negative in a call trace ("no call ever did X"), enumerate every CDR sink first (cdr_adaptive_odbc.conf) — here a configured-but-nonexistent cdr_raw would otherwise look like an unchecked gap.

Error 73: MOH dropping + choppy voice on ALL orgs — hypervisor vCPU stalls on the prod VPS host node (invisible: steal=0)¶

Dates: reported 2026-06-10, fixed (Contabo host migration) + verified 2026-06-11
Severity: P1 (degraded audio on every call, all orgs, 24/7)
Fix: Contabo live-migrated the VM to another host node (ticket with evidence); zero downtime

Symptom¶

Music-on-hold cutting out, low/choppy voice quality on hospital calls — on the Mumbai prod box (82.180.146.80), i.e. after the France→India migration had already fixed the transit-loss problem. Affected prod AND staging-bridged calls (staging media transits prod: Tata → NUC → prod → wg → staging). Standard dashboards showed nothing: CPU 94% idle, 40 GB RAM free, steal 0, no NIC drops, disks fine.

Root cause¶

The Contabo host node was descheduling the VM's vCPUs for 10–140 ms bursts, ~3–5 % of the time, around the clock. Reported steal time was 0 (host didn't expose it), so the contention was invisible to every in-guest metric — only its effects showed. Asterisk needs a steady 20 ms tick to generate MOH/playback and pace RTP; with ~4 % of wakeups stalled it was emitting bursty, gappy audio. The "packet loss" seen by nuc-linkqual (~4.5 % avg, flat at ALL hours — the tell: no time-of-day pattern) tracked the stall rate, not network traffic, and only affected prod-originated media (stall → burst → drop), which is exactly the direction callers hear (MOH, greetings, agent audio).

Diagnosis (reusable; each test independently confirms)¶

Timer wakeup latency — THE test for this failure mode; healthy KVM p99 < 1 ms:

python3 -c "
import time
n=6000; d=[]
for i in range(n):
    t0=time.monotonic(); time.sleep(0.005); d.append((time.monotonic()-t0-0.005)*1000)
d.sort()
print(f'p50={d[n//2]:.2f}ms p95={d[int(n*.95)]:.2f}ms p99={d[int(n*.99)]:.2f}ms max={d[-1]:.2f}ms; >10ms: {sum(1 for x in d if x>10)}/{n}')"

Prod (broken): p99=32.68ms max=141.85ms; >10ms: 288/6000. Staging same DC (control): p99=1.59ms; 3/6000. A ~100× gap in stall count (288 vs 3) between two same-DC VMs points to the host node, not the guest.

asterisk -rx "timing test" — broken: 18 of ~50 expected timerfd ticks/sec; healthy: 50/50.
External ping (independent clock, kernel-level reply — rules out guest measurement bias): from staging to prod on a 0.3 ms path, spikes to 120 ms; prod→1.1.1.1 (1.6 ms path) avg 23 ms / max 363 ms.
linkqual hourly profile — awk -F, '$2=="NUC"{h=substr($1,12,2); s[h]+=$4; c[h]++} END{...}' over /var/log/linkqual/*.csv: loss flat ~4.5 % across all 24h ⇒ not traffic-correlated.
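
The hourly profile in the last test can also be computed in Python. This is a sketch that mirrors the awk fragment under stated assumptions: column 1 is an ISO-style timestamp whose characters 12 and 13 are the hour, column 2 is the probe name, and column 4 is the loss percentage. The sample rows are invented for illustration and are not real linkqual data.

```python
import csv
import io
from collections import defaultdict


def hourly_mean(csv_text: str, probe: str = "NUC") -> dict:
    """Mean of column 4 per hour of day, for rows whose column 2 equals `probe`."""
    total = defaultdict(float)
    count = defaultdict(int)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 4 or row[1] != probe:
            continue
        try:
            value = float(row[3])
        except ValueError:
            continue                  # skip header or malformed rows
        hour = row[0][11:13]          # same slice as awk substr($1,12,2)
        total[hour] += value
        count[hour] += 1
    return {h: total[h] / count[h] for h in sorted(count)}


# Invented sample: a flat profile shows roughly the same mean at every hour.
sample = (
    "2026-06-10T09:00:05,NUC,10.10.10.2,4.0\n"
    "2026-06-10T09:30:05,NUC,10.10.10.2,5.0\n"
    "2026-06-10T21:00:05,NUC,10.10.10.2,4.5\n"
    "2026-06-10T21:00:05,TIMER,localhost,1.1\n"
)
print(hourly_mean(sample))
```

A traffic-driven problem would show a clear business-hours bump in this profile; a flat line across all 24 hours is the signature described above.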

Ruled out: CPU/RAM (idle, huge headroom), guest config (no cgroup quotas, THP, IRQ storms, D-state), NIC (0 drops), NAT/RTP config (correct), codecs (G.711 end-to-end, no transcoding; the V7 opus warnings ended Jun 3 with the codec_opus fix).

Fix¶

Contabo ticket for vmi3341755 with the evidence pack (Dump/contabo-ticket-evidence-vmi3341755-2026-06-10.txt in the dev workspace): the reproducible 30-second test, the same-DC control VM comparison, idle-guest proof, and an explicit ask to migrate the VM to a less-contended host node. Contabo live-migrated it within a day — uptime unbroken, all services survived. Verified after: timer 0/6000 >10ms, p99=1.11ms; Asterisk 50/50 ticks; prod→1.1.1.1 back to 1.9 ms ±0.5 ms.

Prevention rule¶

"VPS limited?" ≠ CPU/RAM count — check scheduling latency. When audio is choppy but every resource graph is green and steal=0, run the timer-wakeup test (above) before touching code or codecs; compare against a known-good VM in the same DC. Re-run it as the acceptance gate after any provider migration or new-VPS validation (criteria: p99 < 2 ms and timing test ≈ 50/50). Follow-up — DONE 2026-06-22: a host vCPU-scheduling probe is now built into nuc-linkqual. Each round it runs a load-free 2000-sample 5 ms-sleep wakeup test + worst-of-3 asterisk -rx "timing test", logs a TIMER row to /var/log/linkqual/linkqual-*.csv (ts,TIMER,localhost,p99_ms,max_ms,stalls>10ms,ticks), and ntfy-pages on p99 ≥ 2 ms OR ticks < 48 for ≥3 consecutive rounds (same hysteresis as the tunnel probes; tunables TIMER_P99_WARN / TIMER_TICKS_WARN / TIMER_BAD_THRESHOLD). So this failure mode now alerts instead of hiding behind steal=0 — no more waiting on a customer complaint. Local copy of the script: AstradialDevelopment/NUC/nuc-linkqual.sh.

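The paging rule described in the follow-up (p99 ≥ 2 ms OR ticks < 48, for at least 3 consecutive rounds) can be restated as a small Python sketch. How nuc-linkqual.sh implements it is not shown here, so treat this as an illustration only: the tunable names are taken from the text, their values are assumed from the thresholds quoted above, and `round_is_bad` and `first_page_round` are hypothetical helpers.

```python
TIMER_P99_WARN = 2.0      # ms; a round is bad when p99 is at or above this
TIMER_TICKS_WARN = 48     # a round is bad when ticks/sec fall below this
TIMER_BAD_THRESHOLD = 3   # consecutive bad rounds required before paging


def round_is_bad(p99_ms: float, ticks: int) -> bool:
    return p99_ms >= TIMER_P99_WARN or ticks < TIMER_TICKS_WARN


def first_page_round(rounds):
    """Return the 1-based round at which a page would fire, or None."""
    streak = 0
    for i, (p99_ms, ticks) in enumerate(rounds, start=1):
        streak = streak + 1 if round_is_bad(p99_ms, ticks) else 0
        if streak >= TIMER_BAD_THRESHOLD:
            return i
    return None


# One isolated bad round does not page; three bad rounds in a row do.
print(first_page_round([(0.2, 50), (29.0, 18), (0.3, 50), (0.2, 50)]))
print(first_page_round([(0.2, 50), (7.0, 50), (3.0, 47), (29.0, 18)]))
```

The streak counter is what provides the hysteresis: a single noisy round resets nothing permanent and never pages on its own.
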
Recurrence log — host-node steal RECURS; re-migration is temporary, dedicated CPU is the cure¶

The host-node contention comes back — a re-migration only moves you to a currently quieter shared host, and neighbours change. Timeline on the Mumbai box (82.180.146.80):

Date	timer p99	max spike	stalls >10ms	state
2026-06-10	32 ms	142 ms	4.8%	broken (original P1)
2026-06-11	1.1 ms	<10 ms	0%	after 1st Contabo re-migration ✅
2026-06-13	7 ms	30 ms	0.5%	drifted back within ~2 days ⚠️
2026-06-14	3–5 ms	40–43 ms	0.1–0.3%	after 2nd re-migration — improved, not pristine
2026-06-16	0.6–0.95 ms	3–15 ms	~0 (1/24000)	settled to near-dedicated ✅
2026-06-18	29 ms	111 ms	6.6% (396/6000)	regressed AGAIN — worse than the original P1. Business-hours pattern (clean overnight; 37 ms jitter at noon IST). 3rd ticket raised w/ evidence (`Dump/contabo-ticket-vmi3341755-2026-06-18.txt`)
2026-06-19	0.19 ms	1.2 ms	0%	3rd Contabo live-migration ✅ — verified clean through the noon-IST business-hours peak, under live calls (linkqual jitter 37 ms → 0.4 ms)
2026-06-21	0.19 ms	0.25 ms	0%	held 24h+ → acceptance met, ticket closed. linkqual jitter flat ~0.4 ms across 06-19/20/21 (first time it held past the ~2-day mark)

Lessons reinforced:

- It recurs. The fix degraded within ~2 days the first time, then again on 2026-06-18 (3rd occurrence, worse than the original). Treat host re-migration as a reprieve, not a cure; re-run the timer test periodically. This is now automated: nuc-linkqual pages on it (see the Prevention-rule follow-up). The 2026-06-19 migration was the first to hold past 24h, but the permanent cure is still a dedicated VDS.
- A fresh re-migration can land worse transiently: right after the 2nd migration the new host showed 40 ms spikes, then settled clean 2 days later as neighbours quieted. Re-test a day or two later, not just immediately.
- Permanent fix = dedicated CPU (Contabo VDS in the same Mumbai region, or Linode-Mumbai / E2E dedicated). Pinned cores make steal structurally impossible. ⚠️ Verify the dedicated tier is in Mumbai: a dedicated VPS in the EU trades ~5 ms of steal-jitter for ~140 ms of distance (net catastrophic).

Latency vs jitter (the 2026-06-13 full-path audit — reusable)¶

When "high latency" is reported, decompose the path before blaming any one hop. The 2026-06-13 audit found the raw latency was fine (~17 ms NUC→Mumbai, direct route, no fragmentation, no transcoding, no Asterisk jitter buffer); the problem was jitter, and it traced back to this same host-node steal. Exonerated, with measurements:

- WireGuard adds ~0.1 ms latency and ~0.6 ms jitter (voice-payload test: 200 B @ 50 pps, WG path vs raw underlay to the same box; isolate WG's share as the delta). Under 50-call-equivalent load, no rise. Its crypto cost for 50 calls ≈ 0.25% of a core. WG is not a jitter/quality culprit: removing it gains 0.6 ms and loses encryption.
- NUC is cleaner than the cloud box (timer p99 0.38 ms). Not the source.
- The big jitter (mdev 4.4 ms, 54 ms spikes) was measured on the WG path but generated at the far end: the cloud VM's vCPU stalls make its RTP egress bursty; the tunnel just carries it. Jitter inflates the endpoint jitter buffer (~50–60 ms vs ~20–30 ms clean), which is real added mouth-to-ear delay, so killing the steal also recovers ~15–30 ms of effective delay, even though raw RTT is unchanged. Lead with "latency is fine, the problem is jitter" when explaining this.
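
The latency-versus-jitter distinction can be made concrete with the interarrival jitter estimator from RFC 3550 (section 6.4.1). The sketch below is illustrative only: the two packet streams are synthetic, the 17 ms figure is reused as a constant one-way delay purely for the example, and the helper names are hypothetical. Both streams have the same mean delay; only one has jitter.

```python
def rfc3550_jitter(send_ms, recv_ms):
    """Running interarrival jitter estimate, RFC 3550 section 6.4.1."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # D = change in transit time between consecutive packets
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter


def mean_delay(send_ms, recv_ms):
    return sum(r - s for s, r in zip(send_ms, recv_ms)) / len(send_ms)


send = [i * 20.0 for i in range(200)]        # 50 pps, one packet per 20 ms
steady = [s + 17.0 for s in send]            # constant 17 ms delay
# Same average delay, but every 4th packet is delayed 29 ms and the rest 13 ms.
bursty = [s + (29.0 if i % 4 == 0 else 13.0) for i, s in enumerate(send)]

print("steady:", mean_delay(send, steady), rfc3550_jitter(send, steady))
print("bursty:", mean_delay(send, bursty), rfc3550_jitter(send, bursty))
```

A receiver sizes its jitter buffer from the second number, not the first, which is why a path with a perfectly acceptable average delay can still add mouth-to-ear delay.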

Error 74: Persistent ~2 % loss + jitter on the NUC leg — USB WAN NIC at 1000/Half duplex + WiFi dual-homing ARP flux¶

Dates: found 2026-06-10 during the Error 73 audit; both faults resolved 2026-06-13
Severity: P2 (residual call roughness on all Tata/PSTN calls; the surviving cause after Error 73's fix)
Fix (both done 2026-06-13):

- USB WAN NIC: the flaky ASIX enx9c69d3197bc6 was replaced with a Realtek 2.5G (0bda:8156, iface enx80691adcd880). Live first-hop + tunnel tests now 0 % loss, no half-duplex.
- WiFi dual-homing: disabled on the NUC (nmcli radio wifi off + rfkill block wifi, both persist across reboots). Default route is now the single dongle path only (wlo1 down, its metric-600 route gone). This also fixed the cloudflared SSH dropouts (inbound was randomly landing on the WiFi NIC), which is live proof the flux was real.
- Same session, related: the networking.service boot failure (duplicate enp86s0 defined inline in /etc/network/interfaces and in interfaces.d/enp86s0) was de-duplicated (kept the modular interfaces.d/ definition, no interface bounce; done with calls live); and the NUC timezone was corrected America/Chicago → Asia/Kolkata (logs were 10.5 h off IST and nearly caused a misdiagnosis in the Error 75 trace).

Symptom¶

After the Error 73 host migration cured prod, the prod→NUC tunnel still shows bursty loss (0–14 % depending on the window; ~1.7 % steady at the NUC's first hop) while RTT is now rock-steady. User history: "NUC internet used to be unstable."

Root cause (two compounding faults at the NUC site)¶

The USB WAN adapter (ASIX AX88179B, enx9c69d3197bc6) negotiated 1000 Mb/s HALF duplex with the router — not a real mode on modern gear; it means broken autoneg (adapter, cable, or router port). Result: collisions under simultaneous TX+RX (exactly RTP's pattern), 256 K+ RX drops (~1.5–2.7 %), 1.7 % loss to the router itself (3000-packet test), spikes to 86 ms. The Tata NNI on the built-in port (enp86s0) is pristine — only the broadband leg is hurt.
WiFi (wlo1) is up on the SAME subnet with a second default route (metrics 100 vs 600) → ARP flux: the router's ARP entry for the wired IP flaps to the WiFi MAC, so inbound traffic — including WireGuard call media — randomly arrives via WiFi (power-save latency spikes, 24 ms). Evidence: lifetime counters show WiFi RX 7.9 GiB vs TX 0.1 GiB (inbound landing on the wrong NIC), and tcpdump on wlo1 caught echo replies addressed to the wired IP.

Diagnosis (reusable)¶

ethtool enx9c69d3197bc6 | grep -E "Speed|Duplex"      # must say Duplex: Full
ping -q -i 0.02 -c 3000 192.168.0.1 | tail -2          # first-hop loss, unbound (default path)
cat /sys/class/net/enx9c69d3197bc6/statistics/rx_dropped  # sample twice, check growth
# ARP flux proof: replies to the WIRED IP arriving on WiFi
tcpdump -i wlo1 -c 50 -nn "icmp and dst host 192.168.0.14" &
ping -i 0.05 -c 200 -I enx9c69d3197bc6 192.168.0.1

⚠️ Bound-interface pings (-I) on a dual-homed LAN exaggerate loss (the 79 % seen was mostly flux artifact — replies landed on the other NIC). Trust the unbound test + drop counters + tcpdump, not -I loss numbers.
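
The duplex and drop-counter checks above are easy to script. The following is a small Python sketch, assuming the usual `Speed:` / `Duplex:` lines in `ethtool <iface>` text output; the function names are hypothetical and the sample text is an abbreviated illustration, not a captured log.

```python
import re


def link_mode(ethtool_output: str) -> dict:
    """Extract Speed and Duplex from `ethtool <iface>` text output."""
    pairs = re.findall(r"^[ \t]*(Speed|Duplex):[ \t]*(\S+)", ethtool_output, re.M)
    fields = dict(pairs)
    return {"speed": fields.get("Speed"), "duplex": fields.get("Duplex")}


def link_ok(ethtool_output: str) -> bool:
    """Acceptance check from the text: the link must report full duplex."""
    return link_mode(ethtool_output)["duplex"] == "Full"


def rx_drop_growth(first_sample: int, second_sample: int) -> int:
    """Growth between two reads of statistics/rx_dropped; 0 means flat."""
    return second_sample - first_sample


# Abbreviated, illustrative output for the faulty adapter.
sample = "Settings for enx9c69d3197bc6:\n\tSpeed: 1000Mb/s\n\tDuplex: Half\n"
print(link_mode(sample), link_ok(sample))
```

Pair `link_ok` with a non-zero `rx_drop_growth` over a short interval to separate a negotiation fault from ordinary congestion.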

Fix (site steps; completed 2026-06-13)¶

Swap the ethernet cable between dongle and router (most likely autoneg culprit), or another router port.
If still Half: replace the dongle with a Realtek-chip USB NIC (RTL8153-class, e.g. TP-Link UE306 / UGREEN USB-C).
Kill the dual-homing: disable WiFi, move it off-subnet, or set arp_ignore=1 + arp_announce=2 so inbound stays pinned to the wired NIC.
Acceptance: Duplex: Full, 0 % loss in the 3000-packet first-hop test, rx_dropped flat, no replies on wlo1 in the tcpdump test.

Prevention rule¶

On any media gateway, two interfaces on one subnet with two default routes is a standing fault — inbound path selection is left to router ARP timing. Keep exactly one L3 path per subnet (or pin ARP). And when a "flaky internet" report arrives, check ethtool duplex + first-hop loss before blaming the ISP: both NUC WAN faults lived inside the room. Bandwidth was irrelevant (~30 GiB/month, ~10 Mbps worst case) — never fix loss/jitter with a bigger plan.
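
The "one L3 path per subnet" rule can be checked mechanically from `ip route show` output. This is a minimal sketch, assuming the common `default via <gw> dev <iface> ... metric <n>` line shape; the function names are hypothetical and the sample is reconstructed from the facts above (same gateway, metrics 100 and 600), not copied from the NUC.

```python
def _after(tokens, key):
    """Token following `key`, or None if `key` is absent or last."""
    return tokens[tokens.index(key) + 1] if key in tokens[:-1] else None


def default_routes(ip_route_output: str):
    """List (device, gateway, metric) for every default route line."""
    routes = []
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default":
            continue
        metric = _after(tokens, "metric")
        routes.append((_after(tokens, "dev"), _after(tokens, "via"),
                       int(metric) if metric is not None else 0))
    return routes


def dual_homed_gateways(ip_route_output: str) -> dict:
    """Gateways reachable through more than one interface (the standing fault)."""
    by_gateway = {}
    for dev, gateway, _metric in default_routes(ip_route_output):
        by_gateway.setdefault(gateway, []).append(dev)
    return {gw: devs for gw, devs in by_gateway.items() if len(devs) > 1}


sample = (
    "default via 192.168.0.1 dev enx9c69d3197bc6 proto dhcp metric 100\n"
    "default via 192.168.0.1 dev wlo1 proto dhcp metric 600\n"
    "192.168.0.0/24 dev wlo1 proto kernel scope link metric 600\n"
)
print(dual_homed_gateways(sample))
```

An empty result is the healthy state; any entry means inbound path selection is being left to the router's ARP table.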

Error 71: Self-call to a phone-forward (testing your own line from the forward SIM) rings but never arrives¶

Dates: 2026-06-13 (two-day debugging saga; first understood here)
Severity: P3 (testing-only path; real-caller forwarding unaffected, verified live both days)

Superseded — this was NOT ultimately carrier-side

The conclusion below ("delivery ultimately at the carrier's discretion") was disproven later the same day. A side-by-side SIP diff showed the platform sends a byte-identical INVITE for a working click-to-call and the failing self-forward — the real cause was that the dialplan never answered the inbound leg, so the SIM stayed in "dialing-out" state and the network refused the second (call-waiting) call. It is fixable in the platform. See Error 76 for the root cause and the Answer() fix. Requirements 1–4 below are still useful background, but #3 (CID) and the carrier-discretion framing are wrong.

Symptom¶

The standard self-test of a forwarded line — call the org DID from the same mobile the line forwards to (e.g. dial 08065978006 from 9944421125 when ext 1008 forwards to 9944421125) — plays ringback in the caller's ear but the second call never presents on the handset, or the call drops immediately. Sometimes there is no CDR at all for an attempt.

The ringback proves nothing: it is early media (SIP 183) generated upstream, not evidence the handset is ringing.

The four independent requirements (ALL must hold)¶

A self-call is a second incoming call to a SIM that is busy placing the very call being tested. It only ever worked by an accident of CID handling, and it needs all four of:

#	Requirement	Where it lives	How it fails
1	Voice call waiting enabled on the SIM	carrier/handset; check `*#43#`, enable `*43#`	network never delivers a 2nd call to a busy line
2	Outbound trunk allows a 2nd concurrent channel (`max_channels > 1`)	trunk config (editor)	Asterisk declines the leg outright — BUSY, not call-waiting (Error 67)
3	Second call presents a number ≠ the dialled DID	dialplan generator (`_cidSetLine` self-call ExecIf, PR #454)	handset/network collapses "a second call from the number I'm already on a call with"
4	Attempts spaced ≥2 min apart	carrier behaviour	repeated short attempts trip call-waiting damping / spam suppression of the presenting number

How the CID requirement (#3) is met — and its history¶

_phoneForwardLeg / qm helpers emit a self-call-aware CID set:

exten => fwd_X,n,ExecIf($["${CALLERID(num):-10}"!="<phone10>"]?Set(CALLERID(num)=<orgDID>))

Real caller → org DID presented (the #442 / ACR fix — unchanged). Self-call (inbound CID's last 10 digits == dialled number) → caller's own CID kept → the NUC from-cloud gate (Error 33) substitutes the out-of-range CID with the sentinel pilot +918065978000 → the handset receives a call from a different number and can present it as call waiting.
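
The ExecIf's decision can be restated in a few lines of Python. This is a sketch of the comparison only (last 10 characters of the inbound CID versus the forward target), with a hypothetical function name; it does not model the later NUC from-cloud substitution.

```python
def forward_leg_cid(inbound_cid: str, phone10: str, org_did: str) -> str:
    """CID left on the forward leg after the ExecIf shown above."""
    # ${CALLERID(num):-10} is the last 10 characters of the inbound CID.
    if inbound_cid[-10:] != phone10:
        return org_did        # real caller: present the org DID
    return inbound_cid        # self-call: keep the caller's own CID


# Real caller versus the owner testing from the forward-target SIM.
print(forward_leg_cid("+919876543210", "9944421125", "08065978006"))
print(forward_leg_cid("+919944421125", "9944421125", "08065978006"))
```

The first caller number is a placeholder; the second pair is the self-test example from the Symptom section.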

Timeline that produced this understanding: pre-#442 code accidentally always sent the raw caller CID → pilot → self-calls worked 14/14 (CDR, Jun 7–9). #442 fixed real-caller CID but made the forward leg present the dialled DID → self-calls collided and died. #451/#454 added the ExecIf so both cases work. (A revert in between, #452/#453, was based on a misread — real forwards were never broken by the ExecIf; it is a no-op for any non-self caller.)

Diagnosis (reusable)¶

# 1. SIM: voice call waiting must say Enabled
#    (on the handset) dial *#43#

# 2. Trunk: max_channels must be >1 on the OUTBOUND trunk the forward uses
mysql -u root pbx_api_db -e "SELECT name,max_channels,trunk_type FROM sip_trunks WHERE org_id='<org>';"

# 3. Wire: capture the forward leg's From end-to-end
asterisk -rx "logger add channel sct notice,warning,error,verbose"; asterisk -rx "pjsip set logger on"
#    place the test call, then:
grep -A12 "INVITE sip:<mobile>@" /var/log/asterisk/sct | grep -E "INVITE|^From:|SIP/2.0"
#    prod fwd hop should show From=<caller> on a self-call; NUC log should show
#    "Outbound CID resolved: +918065978000". 183 then 487 with no 180/486 = the
#    mobile network swallowed delivery (carrier side, not platform).
asterisk -rx "pjsip set logger off"; asterisk -rx "logger remove channel sct"

# 4. History: did self-calls ever work? (they leave src==forward-target rows)
mysql -u root pbx_api_db -e "SELECT DATE(calldate) d, disposition, COUNT(*) FROM asterisk_cdr
  WHERE lastdata LIKE '%<mobile>%' AND lastapp='Dial' AND src LIKE '%<mobile>%'
  GROUP BY d, disposition ORDER BY d DESC;"

What this is NOT¶

Not packet loss / NUC / tunnel — verify with the SIP trace: if the INVITE reaches Tata and gets 183, the platform path is done and healthy.
Not proof forwarding is broken — test from any other phone; the forward will ring and show the org DID. A real-caller forward answering (e.g. CDR ANSWERED from a different src) clears the whole production path.
A no-CDR attempt usually means the call died before the org dialplan — check the NUC was actually up (see the NUC crash-reboot issue: 37 unclean reboots/30 days as of 2026-06-13; UPS pending).

Prevention rule¶

Never debug a forwarded line by calling it from the forward target's own SIM as the primary test: it stacks four failure modes (two of them, call waiting and attempt spacing, outside the platform) on top of whatever you meant to test. Test forwards from a second phone first; treat the self-call as a convenience that needs requirements 1–4, with delivery ultimately at the carrier's discretion.

Error 75: NUC unexpected crash-reboots → intermittent "calls drop before reaching prod" (mains power, no UPS)¶

Dates: root-caused 2026-06-13 (heavy-rain power cuts; 2 crashes that night)
Severity: P1 during each window (every inbound Tata call dies for ~1–2 min while the NUC reboots)
Fix: PENDING. Install a line-interactive UPS with AVR at the NUC site (physical; awaiting hardware)

Symptom¶

Customer reports "I called the DID, heard ringing, the call dropped, and nothing appears in the call logs." Often intermittent, no pattern the operator can see. Real-caller forwards work fine moments before/after.

Why no CDR¶

A prod asterisk_cdr row is only created once the call reaches the org dialplan on the cloud. If the NUC is down (rebooting), the Tata INVITE hits a dead gateway → never forwards over WireGuard → no cloud CDR exists. So "not in call logs" == the call died upstream of prod (NUC or Tata), not in the app. The cloud is innocent.

Root cause¶

The NUC (ASUS NUC13ANHH5) was hard-resetting ~37 times in 30 days (~1/day), clustered 00:00–09:00 IST, on raw AC with no UPS. Indian grid sags/cuts — acute during rain — drop or brown-out the box; with no battery to ride through, it hard-resets. Each reboot = a 1–2 min total inbound outage.

Diagnosis (reusable — the tells that say "power", not software/hardware)¶

# 1. Reboot history — "crash" = unclean shutdown (no graceful poweroff before next boot)
last reboot -F | head -20          # many entries marked '- crash', clustered overnight

# 2. No kernel panic captured (points to power loss / hard reset, NOT a software panic)
ls /sys/fs/pstore/                 # empty after the crash
journalctl -b -1 | tail            # log just CUTS OFF mid-line, no shutdown sequence

# 3. Rule OUT the usual suspects:
for z in /sys/class/thermal/thermal_zone*/temp; do echo $(($(cat $z)/1000))C; done  # cool (~40C) => not thermal
free -h                            # RAM free => not OOM
systemctl show asterisk -p NRestarts   # 0 => systemd didn't crash-loop it; the MACHINE rebooted
ls /sys/class/power_supply/        # no battery/UPS device => running on raw AC

⚠️ Two red herrings on the NUC13:

- EDAC IBECC MEMORY ERROR in the kernel log looks alarming but is the cosmetic igen6_edac driver false-positive on Intel 12/13th-gen NUCs: it fires only at boot, exactly once per memory controller, with an undecodable all-1s address 0x1ffffffffff, and there are zero real MCEs (journalctl -k | grep -i mce). Do not buy RAM over it.
- The NUC clock was on America/Chicago (CDT), so log timestamps were 10.5 h off IST; a line stamped "16:56" was really the 03:26 IST call, nearly causing a wrong "call never reached prod" conclusion. Fixed (timedatectl set-timezone Asia/Kolkata); see Error 74.

How to confirm it's power (not the link)¶

When the NUC is up, the whole call path is healthy — from prod: wg show wg0 recent handshake, ping 10.10.10.2 0% loss, pjsip show endpoint tata_gateway Avail, and the NUC log shows every Tata INVITE forwarding to cloud-endpoint and ringing. The drops correlate only with the reboot windows. The external Healthchecks.io dead-man's-switch (nuc-heartbeat) records the exact down windows — cross-check them against the customer's "dropped call" times.
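
The cross-check between heartbeat down windows and reported drop times is a simple interval lookup. The sketch below is an illustration with a hypothetical function name and invented timestamps; it assumes the down windows have already been exported as (start, end) pairs in the same timezone as the customer's report.

```python
from datetime import datetime


def calls_in_down_windows(call_times, down_windows):
    """Map each reported call time to the outage window containing it, or None."""
    result = {}
    for call in call_times:
        result[call] = next(
            ((start, end) for start, end in down_windows if start <= call <= end),
            None,
        )
    return result


# Invented example data, all in IST.
windows = [(datetime(2026, 6, 13, 1, 10), datetime(2026, 6, 13, 1, 12))]
calls = [datetime(2026, 6, 13, 1, 11), datetime(2026, 6, 13, 14, 30)]
for call, window in calls_in_down_windows(calls, windows).items():
    print(call, "-> NUC was rebooting" if window else "-> NUC was up, look elsewhere")
```

A reported drop that matches no window is the interesting case: it means the reboot explanation does not cover it and the link checks above should be run.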

Fix¶

Primary: line-interactive UPS with AVR (AVR matters more than battery here — it corrects the voltage sags that reset the box without a full outage; battery rides the full cuts). ~600–1000 VA is ample for a NUC (~20–40 W). PENDING hardware install.
Then: a UPS with USB data + NUT lets the NUC shut down gracefully on low battery during long outages — turns a hard crash into a clean poweroff (protects the filesystem). Offer to wire up NUT after the UPS is in.
Honest limit: no small UPS survives a multi-hour cut (like today's rain) — during genuinely long outages the gateway will go down; that needs a generator/large battery, rarely worth it for one gateway. The UPS's job is to kill the frequent short sags/dips (the bulk of the 37 crashes), not to survive every outage.

Prevention rule¶

A media gateway on raw AC in India is a standing P1 risk — UPS+AVR is not optional. And when a "calls drop, nothing in logs" report arrives, check NUC uptime / last reboot FIRST — no CDR = the call died before the cloud, and an overnight crash-reboot is the prime suspect.

Error 76: Self-call phone-forward rings but never connects — inbound leg never answered (call-waiting can't engage)¶

Dates: root-caused + fixed 2026-06-13 (supersedes the carrier-side conclusion in Error 71)
Severity: P3 (owner self-test path; real-caller forwarding always worked)
Fix: PR #456 (direct-dial self-call branch) + PR #457 (the actual fix: Answer() the inbound leg); staging-verified, prod promotion pending

Symptom¶

Dial the org DID from the very SIM the line forwards to (e.g. 9944421125 dials +91 80659 78003, where ext 1004 forwards to 9944421125). The caller hears ringback but the handset never rings. Meanwhile click-to-call to the same mobile works — including as a "second call" when the SIM is already on an answered call. CDR for the forward leg: NO ANSWER, billsec 0.

Root cause¶

A self-call's forward leg is a second call to a SIM that is mid-dial-out on the call that triggered it. Mobile call-waiting only delivers a 2nd call to a line in an active (answered) call — not one that is still dialing/alerting.

The inbound leg (SIM → DID) was never answered: the generated ext_<org>.conf for a ring_target=phone user went straight from setup to Dial(<forward>) with no Answer(). So the SIM's outbound call to the DID only ever reached early media (183), never 200 OK — it stayed in "dialing-out" state, and the network refused the call-waiting leg.

How it was proven (the method that mattered)¶

A side-by-side full-SIP diff at the NUC→Tata hop (10.79.215.102), capturing a working click-to-call and the failing self-forward to the same number:

The INVITE to Tata is byte-identical on both: same From: +918065978001, same To: 09944421125, same SDP (PCMA/alaw). So it is NOT the caller ID, the 0/91/+91 prefix (single Dial(PJSIP/0${EXTEN}@tata-endpoint,…) line on the NUC), the Local CID hop, the ring timeout, or any header (no Diversion/History-Info either side).
The only difference is Tata's response: click-to-call → 180 Ringing; self-forward → 183 then 487, never 180. 180 is emitted by the mobile network only when it actually rings the handset.
The staging trace also showed the inbound leg stuck at 183, never 200 — confirming the SIM was never put into an active call.

Lesson: for any "works in path A, fails in path B" telephony bug, capture full SIP on both paths and diff them FIRST. The signal here was in the response (180 vs none) and the inbound answer state, not the request — several hours were lost guessing request-side variables (CID, prefix, Local hop, timeout) that the diff would have ruled out immediately.
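
The response-side comparison from this lesson can be automated once both traces are saved as text. A minimal sketch, assuming each SIP response appears in the log as a status line beginning `SIP/2.0 <code>`; the helper names are hypothetical and the two sample traces contain only the codes reported above (a real capture would include others, such as 100 Trying).

```python
import re


def status_codes(sip_log: str) -> list:
    """SIP response codes in order of appearance."""
    return [int(code) for code in re.findall(r"^SIP/2\.0 (\d{3})", sip_log, re.M)]


def handset_alerted(codes) -> bool:
    """180 Ringing is the signal that the handset was actually alerted."""
    return 180 in codes


def response_diff(working, failing) -> dict:
    return {
        "only_in_working": sorted(set(working) - set(failing)),
        "only_in_failing": sorted(set(failing) - set(working)),
    }


click_to_call = status_codes("SIP/2.0 180 Ringing\n")
self_forward = status_codes(
    "SIP/2.0 183 Session Progress\nSIP/2.0 487 Request Terminated\n"
)
print(handset_alerted(click_to_call), handset_alerted(self_forward))
print(response_diff(click_to_call, self_forward))
```

Running the same diff on the request headers first, and finding nothing, is what rules out the request-side variables quickly.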

Diagnosis¶

# CDR: forward self-call rows are src==dst==mobile, lastapp=Dial. billsec 0 = never answered.
mysql … -e "SELECT calldate,dst,disposition,billsec,LEFT(lastdata,40)
  FROM asterisk_cdr WHERE src LIKE '%<mobile>%' AND lastdata LIKE '%<mobile>%'
  AND lastapp='Dial' ORDER BY calldate DESC LIMIT 10;"

# SIP diff at NUC→Tata: does the forward leg get 180, or only 183 then 487?
asterisk -rx "logger add channel sct notice,warning,verbose"; asterisk -rx "pjsip set logger host 10.79.215.102"
# place a click-to-call AND a self-call, then compare the INVITEs + response codes
asterisk -rx "pjsip set logger off"; asterisk -rx "logger remove channel sct"

# Live dialplan: is there an Answer() before the forward Dial?  (there should be, on the self-call branch)
asterisk -rx "dialplan show <ext>@org_<prefix>__internal" | grep -iE "Answer|Dial\(|GotoIf"

Fix¶

In api/src/services/asterisk/dialplanGenerator.js (generateUserExtension, ring_target='phone' path): branch on the self-call trigger (inbound CID last-10 == dial target — the same test as _cidSetLine). On a self-call, Answer() then Wait(1) the inbound leg before dialing the forward directly (no Local CID hop — a self-call keeps the caller's own CID anyway):

exten => <ext>,n,GotoIf($["${CALLERID(num):-10}"="<mobile>"]?<ext>_selfcall)
exten => <ext>,n,Dial(Local/fwd_<ext>@…/n,30,tT)          ; normal: Local CID hop (DID-on-From, Error 64)
exten => <ext>,n,Goto(<ext>_fwddone)
exten => <ext>,n(<ext>_selfcall),Answer()                  ; 200 OK → SIM's outbound call CONNECTS
exten => <ext>,n,Wait(1)                                   ; let the 200 reach the SIM first
exten => <ext>,n,Set(OB_EP=…)/Set(OB_PFX=…)
exten => <ext>,n,Dial(PJSIP/${OB_PFX}<mobile>@${OB_EP},30,tT)   ; direct trunk dial
exten => <ext>,n(<ext>_fwddone),NoOp(…DIALSTATUS=${DIALSTATUS})

The 200 OK propagates back to the SIM → its outbound call to the DID becomes connected → the SIM is now in an active call → the forward arrives as a proper call-waiting call → 180 → it rings.

Gated to self-calls only. Normal callers (a different, idle SIM) keep the unchanged Local-hop path — answering early would wrongly flip their missed-call / billing semantics. Applied by the generator to every ring_target=phone user extension automatically.

Known remaining gaps¶

Queue-member (qm) and failover phone legs still route through the Local hop without the self-call Answer() — same latent self-call quirk, left untouched to keep the blast radius minimal. Extend the same pattern there if a self-test through a queue/failover is ever needed.
Staging-only quirk that masked the CID during debugging: staging outbound egresses via prod-cloud ext_from_cloud.conf, which hardcodes the From to +918065978001 ("AstraPrivate") regardless of what the org sets — see Outbound Caller ID resolution. Real per-call CID is presented on prod, not on staging.

Prevention rule¶

A phone-forward leg that must ring a mobile already engaged on the triggering call (only the self-test case) needs the inbound leg answered first so the mobile is in an active call and call-waiting can engage. More broadly: when one call path works and a "similar" one doesn't, diff the full SIP of both at the same hop before forming any hypothesis — and remember 180 Ringing comes from the carrier, so its presence/absence is the cleanest signal of whether the handset is actually being alerted.

Error 77: All Upptime status-page workflows fail in ~10 s — node-libcurl native ABI vs the GitHub runner's default Node¶

Dates: root-caused + fixed 2026-06-21
Severity: P4 (affects only the status page, status.astradial.com; no call-path or customer impact). The visible symptom is a flood of GitHub "workflow failed" emails plus a stale status page.
Fix: astradial/uptime-monitor rebuilt for Node 24 → tag v1.41.4-astradial; all 5 Upptime workflows repointed to it.

Symptom¶

Every scheduled Upptime workflow — Uptime CI, Graphs CI, Response Time CI, Static Site CI, Summary CI — fails within ~10 s, on every 5-minute cron, so GitHub emails a failure each time and the status page stops updating. The monitored services themselves are fine; only the monitor's own automation is broken. All workflows failing fast and together = systemic, not a monitored site being down.

Root cause¶

GitHub flipped the Actions runner's default Node from 20 → 24 (the Node 20 deprecation). The custom action astradial/uptime-monitor bundles a precompiled native addon — node-libcurl (dist/lib/binding/node_libcurl.node) — built for Node 20's ABI. Under Node 24 it can't load:

Error: ... node_libcurl.node was compiled against a different Node.js version using
NODE_MODULE_VERSION 115 (Node 20). This version of Node.js requires NODE_MODULE_VERSION 137 (Node 24).

The action crashes on load, so every workflow that uses: it dies immediately. This recurs every time GitHub advances the runner's default Node past the version the bundled binary was built for — treat it as a maintenance event, not a one-off.

Diagnosis¶

# All workflows red + failing fast (~10s) = systemic:
gh run list --repo astradial/upptime --limit 20

# The actual error — look for the NODE_MODULE_VERSION mismatch on node_libcurl.node:
gh run view <run-id> --repo astradial/upptime --log-failed | grep -A3 NODE_MODULE_VERSION

# What Node the action declares vs what the runner now forces:
gh api "repos/astradial/uptime-monitor/contents/action.yml?ref=<tag>" \
  -H "Accept: application/vnd.github.raw" | grep using:

Fix (durable)¶

The action is a fork (astradial/uptime-monitor, default branch master) whose build.yml builds dist/ and commits it back. The bundled binary must be linux-x64 of the runner's current ABI, so it has to be built on Linux — never on macOS (that bundles a darwin binary that won't run on ubuntu-latest). Steps that produced v1.41.4-astradial:

package.json: bump node-libcurl to a version that ships a prebuild for the new ABI (4.1.0 → 5.1.2). Confirm the asset exists first, else the build falls back to a slow/fragile source compile:
```
gh api repos/JCMais/node-libcurl/releases/tags/v5.1.2 --jq '.assets[].name' | grep 'node-v137.*linux-x64'
```
action.yml: using: node20 → node24. .github/workflows/build.yml: node-version: 20 → 24.
tsconfig.json: add "skipLibCheck": true — node-libcurl 5's .d.ts trips the old @types/node (error TS2315: Type 'Buffer' is not generic), which would otherwise break the fork's CI build.

Rebuild dist/ in a real Linux Node-24 container and commit it (the fork's CI has no GH_PAT secret, so its auto-commit would fail — build locally instead):

docker run --rm -v "$PWD":/app -w /app node:24-bookworm \
  bash -lc 'npm install && npm run build && npm run package'
file dist/lib/binding/node_libcurl.node   # must be: ELF 64-bit x86-64 (GNU/Linux)

Commit to master with [skip ci], tag v1.41.4-astradial, push.
Repoint all 5 workflows in astradial/upptime (uptime/graphs/response-time/site/summary.yml) from the old tag to the new one. Test one (response-time) green before rolling out the rest — and don't repoint until the tag is tested, so the (already-broken) page is never made worse.

Stopgap (temporary, if you need it green immediately): add env: { ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: "true" } to the workflows to force the action back onto Node 20. Unblocks at once, but the escape hatch disappears when GitHub fully removes Node 20 — it is not a substitute for the rebuild.

Prevention rule¶

Any GitHub Action shipping a precompiled native module (here node-libcurl) breaks when the runner's default Node advances past the version its dist/ was built for. When that happens: rebuild dist/ on the matching runner OS/arch (linux-x64 — in CI or a node:NN-bookworm container, never macOS), bump the native dep to a version that publishes a prebuild for the new ABI, set both action.yml using: and the build CI node-version: to the new major, then re-tag and repoint. The NODE_MODULE_VERSION X vs Y line in the failed log names exactly which ABI you need (115 = Node 20, 127 = Node 22, 137 = Node 24).
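
As an illustration of reading that log line, here is a minimal Python sketch that pulls the two ABI numbers out of a failed-run log and maps them to Node majors using only the three values listed above. The function name and the sample message are illustrative; the sketch assumes the usual Node error wording, where the first number is the ABI the module was compiled for and the second is the ABI the running Node requires.

```python
import re

# ABI -> Node major, limited to the three values named in the prevention rule.
NODE_ABI_TO_MAJOR = {115: 20, 127: 22, 137: 24}

_ABI_RE = re.compile(r"NODE_MODULE_VERSION (\d+)")


def diagnose_abi_mismatch(log_text):
    """Return (built_for_major, required_major), or None if no mismatch is found.

    Unknown ABI numbers map to None rather than a guessed Node version.
    """
    found = [int(n) for n in _ABI_RE.findall(log_text)]
    if len(found) < 2:
        return None
    built_abi, required_abi = found[0], found[1]
    return NODE_ABI_TO_MAJOR.get(built_abi), NODE_ABI_TO_MAJOR.get(required_abi)


# Sample text in the shape Node prints for a native-module ABI mismatch.
SAMPLE = (
    "was compiled against a different Node.js version using "
    "NODE_MODULE_VERSION 115. This version of Node.js requires "
    "NODE_MODULE_VERSION 137."
)
print(diagnose_abi_mismatch(SAMPLE))
```

The second value of the pair is the Node major to rebuild dist/ against.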

Error 78: Killed-app iOS calls route to phone failover — wake CURL regenerated AFTER the DEVSTATE check¶

Date: 2026-07-01 → fixed 2026-07-03 (PR #494) · Env: staging

Symptom: killed iOS app stops receiving calls; CDRs show 1001→1005 with dstchannel=Local/fwd_fo_1005@… (phone-number failover) instead of the app ringing. Worked the previous day.

Root cause: the Jun-30 E2E-validated dialplan design (wake CURL before Set(DEVSTATE…)) existed only in an uncommitted working tree. The next deploy's regen-all-org-configs step (Rule 7) regenerated every org from the committed generator, which still emitted the wake inside (available) — i.e. after the DEVSTATE branch. A killed app has no PJSIP contact → DEVSTATE=UNAVAILABLE → (unreachable)/failover, and the push never fires.

Fix: dialplanGenerator.js emits the wake block before the DEVSTATE read (PR #494); unit test W1 locks the ordering.

Prevention: never leave live-tested VPS behaviour uncommitted — the deploy regen silently reverts it (same lesson as the VPS-drift rules in the repo CLAUDE.md).

Error 79: First wake call races to failover although the app re-registered — DEVSTATE is stale for ~2s after REGISTER¶

Date: 2026-07-03 (fixed same day, PR #495) · Env: staging

Symptom: killed app: first call rings the app and forwards to the failover number; app shows a nameless call that dies. Second call works.

Root cause: wake endpoint confirmed the re-REGISTER in ~2s, but the next dialplan step read DEVICE_STATE — still UNAVAILABLE because qualify hadn't completed for the 1-second-old contact — and branched to failover.

Fix: the wake CURL result (plain=1 → yes/no) is stored in WAKERES and GotoIf($["${WAKERES}"="yes"]?available) dials immediately, skipping the stale DEVSTATE read. Test W4.

Error 80: Dead/offline phone rings 30s of dead air and failover never fires — plus the general fix (wake decision v2)¶

Date: 2026-07-03 (PRs #496 #497) · Env: staging

Symptom: switch the iPhone off (or drop its internet): calls to it ring ~30s then announce "not available" — the configured phone-number failover never triggers. Registrar shows the contact as reachable for minutes.

Root causes (three stacked):

1. Registration expiry is 3600s, so a dead device stays "registered" for up to an hour.
2. Qualify only pings every 60s, leaving up to a minute of stale "reachable".
3. The wake endpoint trusted the cached status and answered yes.

Fix (wake decision v2):

- maximum_expiration=300 + minimum_expiration=30 on every user AOR; the app has a Settings picker (30s/1m/2m/5m) that forces an immediate re-REGISTER.
- The wake endpoint now always sends the push (with a per-call push-id) and decides on live evidence, hidden behind Ringing(): qualify-verified → dial (~3.5s) · app ack → wait for its re-REGISTER → dial · neither → failover in ~5-6s. The app's instant POST to /api/v1/users/voip-push-ack is the delivery receipt, because APNs provides none (its 200 means accepted by Apple, not delivered).

Debug: api logs show the decision reason (verified / acked / no-ack / push-4xx); dialplan logs show VoIP wake: yes|no.
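
The wake decision v2 branches can be sketched roughly as below. This is a Python illustration of the decision table only, not the endpoint's code. The function name, parameters, and action strings are hypothetical, and two points are assumptions the text does not state: that a qualify-verified contact wins over a failed push, and that a push-4xx without verification goes to failover.

```python
def wake_decision(qualify_verified, app_acked, push_status):
    """Return (action, reason) for one inbound call to a possibly-dead app.

    qualify_verified: a live qualify round-trip succeeded for the contact.
    app_acked:        the app POSTed its voip-push-ack for this push-id.
    push_status:      HTTP status returned by the push provider.
    """
    if qualify_verified:
        # Live evidence that the contact answers: dial straight away.
        return ("dial", "verified")
    if 400 <= push_status <= 499:
        # Assumption: a rejected push cannot wake the app, so fail over.
        return ("failover", "push-4xx")
    if app_acked:
        # The ack is the delivery receipt; wait for the re-REGISTER, then dial.
        return ("wait-for-register-then-dial", "acked")
    # No ack and no verified contact: treat the device as unreachable.
    return ("failover", "no-ack")


print(wake_decision(False, True, 200))
```

The reason strings match the ones named in the Debug line, so a sketch like this can be checked against real api logs.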

Error 81: Fake incoming call when the phone comes back online¶

Date: 2026-07-03 (PR #497) · Env: staging

Symptom: phone offline during an inbound call (which correctly failed over); when its internet returns — minutes later — the app rings with a call that doesn't exist.

Root cause: the VoIP push carried no apns-expiration, so Apple queued it for the offline device and delivered it on reconnect; iOS 13+ forces the app to report a CallKit call for every VoIP push → ghost ring.

Fix: apns-expiration: 0 (deliver now or drop) in apnsService.js.

Error 82: "Wrong username or password" toast on every iOS app open¶

Date: 2026-07-03 (app commit, DEF-008) · Env: app (all)

Symptom: every cold open shows the credentials-error toast although the SIP line registers fine seconds later.

Root cause (two layers): upstream Linphone code force-enables presence PUBLISH after each successful registration — Asterisk has no presence publish handler (res_pjsip_pubsub: No registered publish handler for event presence, 489) — and the resulting transient Failed state hit an unscoped toast in CoreContext.onAccountRegistrationStateChanged that fires for ANY non-Ok state while the network is up.

Fix: presence PUBLISH kept disabled (Asterisk doesn't support it), and the toast now requires state == .Failed && interactiveLoginPending — i.e. only a login the user just attempted; background re-register blips stay silent (the status chip already shows red/green).

Error 83: iPhone hangup doesn't clear the far-side softphone — third-party client ignores BYE (Telephone 1.6)¶

Date: 2026-06-30 → verdict 2026-07-03 · Env: staging, macOS "Telephone" 1.6

Symptom: app-side hangup ends the call everywhere except the caller's softphone UI, which stays "in call".

Diagnosis (packet-proven): Asterisk hangup handler shows SOURCE=PJSIP/org_mna9x47k__1005, CAUSE=16 (iPhone sent BYE, far leg torn down) on 4/4 calls; pjsip set logger on shows the BYE to the Mac's contact retransmitted 11× with no 200 OK, while qualify OPTIONS to the same addr:port round-trip fine. The client (User-Agent Telephone 1.6) receives and ignores the BYE — our stack is correct.

Fix: replace Telephone 1.6 with a modern client (Linphone desktop / Zoiper) for test setups. Nothing to change in Astradial.

Error 84: Editor dashboard shows Cloudflare 502 body + "No organizations found" — pipecat gateway down (missing MariaDB user)¶

Date: 2026-07-04 (aborted the France→Mumbai cutover, Attempt 1) · Env: New Prod 147.93.168.216

Symptom: editor.astradial.com/dashboard renders raw Cloudflare 502 HTML at the top of the page and the org list is empty ("No organizations found"), even though the astrapbx API is healthy (/health 200) and the DB has all orgs. Reads as "prod is down".

Why the editor breaks when the gateway is down: nginx maps editor.astradial.com/api/gateway/* → 127.0.0.1:7860 (pipecat gateway). Gateway dead → those subrequests 502 → the dashboard injects the error body and the org list never renders. The editor dashboard hard-depends on pipecat — it is NOT just "AI bots".

Diagnosis chain (each layer masked the next — peel in this order):

systemctl status pipecat-flow            # crash-looping? (NRestarts high)
cd /opt/pipecat-flow && .venv/bin/python -c "from gateway.main import app"  # real import error
journalctl -u pipecat-flow -n 300 | grep -vE "merged_lifespan|anext|contextlib"  # real startup error

Wrong Python: venv must be 3.11.15 via uv (uv sync --python 3.11, same uv.lock as France). py3.13 → llvmlite build fail; py3.12 also wrong.
NLTK punkt_tab BadZipFile race: pipecat imports nltk punkt_tab at module load; 8 gunicorn workers all missing it download concurrently and trample /root/nltk_data/tokenizers/punkt_tab.zip → BadZipFile. Verify: python -c "import zipfile;print(zipfile.is_zipfile('/root/nltk_data/tokenizers/punkt_tab.zip'))". Fix: delete + re-download once (single process) before starting the service.
Root cause — missing DB user: init_db() → aiomysql → ERROR 1045 Access denied for 'pipecat'@'localhost'. The box's MariaDB was provisioned with only pbx_api; the gateway's pipecat user was never created. Compare: mysql -N -e "SELECT User,Host FROM mysql.user" against the reference box. Fix: replicate the user from the working box via SHOW CREATE USER 'pipecat'@'localhost' + SHOW GRANTS (password hash — no plaintext handling).

Prevention rule: a server restore/rebuild is NOT complete until you diff the config that lives outside dumps and images against the working reference: mysql.user, PostgreSQL roles, per-app .env key sets, nginx vhosts, systemd units. And before any cutover: log in to the editor dashboard on the preview hostname and see the orgs render. That one E2E gate catches this entire class.

Full cutover post-mortem: Prod Cutover — France → Mumbai.

Error 85: Call recording 404 — "No recording legs found on disk or storage" (ffmpeg missing)¶

Date: 2026-07-04 (first post-cutover day on New Prod) · Env: prod 147.93.168.216

Symptom: editor Play button → NotSupportedError: no supported source; the API GET /api/pbx/calls/:id/recording?token=… returns 404 with body {"error":"No recording legs found on disk or storage"} — even though the WAV exists at the exact DB-referenced path in /var/spool/asterisk/monitor/.

How the handler works (server.js GET /api/v1/calls/:callId/recording): anchor lookup (call_records by id+org, fallback asterisk_cdr id+accountcode) → gather sibling legs by linkedid → for each leg resolveLocal() (monitor → alt dir → rclone fetch from firebase: remote) → probeDuration() via ffprobe → stitch multi-leg via ffmpeg. A leg that fails probeDuration is dropped; zero probed legs → this exact 404, even when files exist.
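
To make the failure mode concrete, here is a small Python sketch of the "probe, drop, 404" step. It is an illustration of the logic described above, not the server.js handler. The function names are hypothetical; the ffprobe arguments are one common way to print a file's duration and are an assumption about what probeDuration() runs.

```python
import subprocess


def probe_duration(path):
    """Return the duration in seconds, or None if the leg cannot be probed.

    A missing ffprobe binary raises FileNotFoundError (an OSError), so an
    uninstalled ffmpeg package makes every leg come back as None.
    """
    cmd = [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True, timeout=10)
        return float(out.stdout.strip())
    except (OSError, ValueError, subprocess.SubprocessError):
        return None


def recording_response(paths, probe=probe_duration):
    """Keep only legs that probe successfully; zero legs yields the 404."""
    legs = []
    for path in paths:
        duration = probe(path)
        if duration is not None:
            legs.append((path, duration))
    if not legs:
        return 404, {"error": "No recording legs found on disk or storage"}
    return 200, {"legs": legs}


# With a probe that always fails (as when ffprobe is absent), files that
# exist on disk still produce the 404.
print(recording_response(["leg-a.wav"], probe=lambda p: None))
```

This is why the 404 body is misleading: it reports "not found" for any leg whose probe fails, including a missing ffprobe binary.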

Root causes (three restore gaps, one incident):

1. ffmpeg/ffprobe not installed → every leg failed duration probing → 404. (apt-get install -y ffmpeg)
2. /root/.config/gcloud/application_default_credentials.json missing → the hourly move-recordings.sh Firebase upload was failing silently → 384-file local backlog.
3. rclone firebase: remote absent from /root/.config/rclone/rclone.conf (only gcs: existed) → API's Firebase fetch fallback + mover both broken.

Fix: install ffmpeg; copy the ADC json + the [firebase] rclone stanza from the reference box; re-run move-recordings.sh to drain the backlog. Verified: the exact failing request then streamed 200 WAVE audio.

Prevention rule: restore/rebuild parity must include dpkg -l package diff against the reference box (this caught ffmpeg; the same sweep-class earlier caught missing watchdog scripts) plus ~/.config/{rclone,gcloud} — credentials and CLI remotes live outside images, dumps, AND /opt.

Error 86: Live Calls page — From == To, CallerID == dialed number, calls flicker/linger¶

Date: 2026-07-06/07 · Env: staging (code shipped to prod later via promote PR)

Symptoms: Live Calls rows showing From identical to To (agent's own extension both sides), CallerID column showing the dialed number instead of the DID, "Internal" badge on inbound/outbound calls, rows flickering or lingering ~3s after hangup; values self-corrected a few seconds into a call.

Diagnosis that worked: passive AMI event capture on staging + replaying the events through liveCallsService locally — reproduced the exact wrong rows deterministically. Do this before touching display logic; see Live Calls.

Root causes (several, one family):

1. Per-request sendAction('CoreShowChannels') resolved on the FIRST TCP chunk → randomly truncated/empty channel lists (the frontend "anti-flicker" hack papered over this).
2. Trunk endpoints are named org_<prefix>_tata/tata_gateway, but the direction test looked only for "trunk", so trunk legs classed "internal" and the outbound enrichment overwrote From with the agent extension.
3. Trunk leg's CallerIDNum is the DIALED party; the presented DID is stamped on the CALLER leg via NewCallerid, so reading the trunk leg showed CallerID == To.
4. Gateway caller legs are org-invisible at ring time (shared context, empty accountcode) until NewAccountCode fires.

Fix (PRs #505/#506): event-driven in-memory state on the persistent AMI connection + SSE push; TRUNK_LEG_RE; CID from the caller leg; AccountCode org attribution; buffered coreShowChannels().

Prevention: the six verified AMI facts in Live Calls — check any new display heuristic against them; regression tests api/tests/live-calls-service.test.js are built from the captured event sequences.
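
The first-chunk truncation in root cause 1 and the buffered approach can be illustrated with a short Python sketch. It is not the liveCallsService code; the function names are hypothetical, and it assumes a stream carrying only the CoreShowChannels reply (a real AMI connection interleaves other events, which would normally be matched by ActionID). It relies on two AMI conventions: messages end with a blank line, and the channel list ends with a CoreShowChannelsComplete event.

```python
def parse_ami_messages(buffer):
    """Split complete AMI messages off the buffer; return (messages, remainder)."""
    messages = []
    while b"\r\n\r\n" in buffer:
        raw, buffer = buffer.split(b"\r\n\r\n", 1)
        fields = {}
        for line in raw.decode("utf-8", "replace").split("\r\n"):
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        messages.append(fields)
    return messages, buffer


def collect_channels(chunks):
    """Buffer TCP chunks until CoreShowChannelsComplete arrives.

    Returns the list of CoreShowChannel events, or None if the stream ends
    before the completion event (an incomplete list is never returned).
    """
    buffer = b""
    channels = []
    for chunk in chunks:
        buffer += chunk
        messages, buffer = parse_ami_messages(buffer)
        for msg in messages:
            if msg.get("Event") == "CoreShowChannel":
                channels.append(msg)
            elif msg.get("Event") == "CoreShowChannelsComplete":
                return channels
    return None


STREAM = (
    b"Response: Success\r\nEventList: start\r\n\r\n"
    b"Event: CoreShowChannel\r\nChannel: PJSIP/1001-00000001\r\n\r\n"
    b"Event: CoreShowChannel\r\nChannel: PJSIP/1005-00000002\r\n\r\n"
    b"Event: CoreShowChannelsComplete\r\nEventList: Complete\r\nListItems: 2\r\n\r\n"
)
# Chunk boundaries fall mid-message, as they can on a real socket.
print(collect_channels([STREAM[:50], STREAM[50:90], STREAM[90:]]))
```

Resolving on the first chunk alone would have seen only the Response header and part of one event, which matches the empty or truncated lists described above.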

Error 87: Call Logs — outbound shows From = org's own DID, direction "Internal"¶

Date: 2026-07-07 · Env: staging

Symptoms: Call Logs rows for outbound calls show From = the org DID (resolved to the DID's contact name, e.g. "In use on production") and direction "Internal"; inbound rows show To = the DID instead of the extension that rang.

Diagnosis: ran the production direction CASE from api/src/lib/callDirection.js directly against asterisk_cdr on two ground-truth linkedids → outbound row classified internal; 216 calls misclassified in 30 days. CDR facts: outbound src = the PRESENTED CID (dialplan stamps it before dialing), caller extension is in channel; lastdata='PJSIP/<num>@org_<prefix>_tata,…' fails the %@%trunk% pattern. Empirical guardrail: a bare %@% classifier would misfire on Local dials targeting dialplan contexts (117 rows/90d) — verified 0 false positives with name-based patterns (trunk|tata|gateway).

Fix (PR #507): trunkDial() in callDirection.js (single source for /calls, /export, /history, /stats); shared fromNumberCaseSql (extension from t.channel for outbound) and toNumberInboundExtWhenSql (rung extension from t.dstchannel for inbound); pollCdr's JS heuristic aligned (feeds call.ended webhooks + auto-tickets). Tests: call-direction.test.js CD1–CD6, sql-invariants S17/S18.

Prevention: direction/from/to logic exists in exactly three places — callDirection.js (SQL), liveCallsService.js (live view), pollCdr (JS copy). Change trunk-name assumptions in ALL of them together.
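
A minimal Python sketch of the name-based trunk test follows, assuming lastdata has the shape shown in the diagnosis (PJSIP/<num>@<endpoint>,options). It is not the trunkDial() SQL from callDirection.js; the function name and the org_abc_* endpoint and context names are hypothetical.

```python
import re

# Name-based patterns from the diagnosis; a bare "@" test is deliberately
# not used because Local dials to dialplan contexts also contain "@".
TRUNK_NAME_RE = re.compile(r"trunk|tata|gateway", re.IGNORECASE)


def is_trunk_dial(lastdata):
    """True if the Dial() target after '@' looks like a trunk endpoint name."""
    target = lastdata.split(",", 1)[0]   # drop Dial options such as ,30,tT
    if "@" not in target:
        return False                     # plain endpoint dial, no trunk part
    name = target.split("@", 1)[1]
    return bool(TRUNK_NAME_RE.search(name))


print(is_trunk_dial("PJSIP/919876543210@org_abc_tata,30,tT"))
print(is_trunk_dial("Local/fwd_fo_1005@org_abc_internal,30"))
```

Any new trunk naming scheme has to be added to the pattern here and in the other two places listed above, otherwise those legs fall back to "internal".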

Error 88: Staging disk 63% full — 68 GB of dead Asterisk debug logs (no rotation)¶

Date: 2026-07-07 · Env: staging (prod guarded same day)

Symptoms: the new admin Resource Usage card showed Disk 60% (118.9/196.7 GB) on staging. du walk: /var/log = 70 GB, of which /var/log/asterisk = 67 GB — two 34 GB files (full.log.0, messages.log.0) rotated out on 2026-05-31 and never touched since.

Root cause: Asterisk's full.log/messages.log have no size-based rotation by default; staging logs verbosely (AMI capture sessions). A one-off logger rotate on May 31 parked the giants as .log.0 and the active logs kept growing.

Fix: lsof both files (nothing held them) → rm → disk 63% → 28%. Then /etc/logrotate.d/asterisk on BOTH staging and prod: size 500M, rotate 4, compress, delaycompress, postrotate asterisk -rx "logger reload". Validated with logrotate -d.

Call-data note: these logs are dialplan execution traces only — CDR lives in MySQL asterisk_cdr, recordings in /var/spool/asterisk/monitor (383 MB). Nothing call-related was deleted.

Prod caveat: prod's full/messages.log.0 (1.1 GB each, rotated Jul 4 during the Mumbai cutover) were deliberately KEPT as cutover forensics while the V7 router re-handshake and ATA repoints are outstanding — delete once those close.

Prevention: the logrotate stanza caps growth at ~2.5 GB per log family. The dashboard Disk tile reads the whole root filesystem, so it doubles as the early-warning for any future disk eater (see features/org-resource-usage.md).

Error 89: V7 tunnel ntfy spam — restored prod ran a stale France-era nuc-linkqual (no NOALERT, no TIMER probe)¶

Date: 2026-07-07 · Env: prod 147.93.168.216

Symptoms: constant 🟠/🟢 "V7 tunnel degraded/recovered" ntfy pushes, even though monitoring.md documents V7 as "logged only (no page)".

Root cause: the 2026-07-04 Mumbai restore copied watchdog scripts from the frozen France box, and /usr/local/bin/nuc-linkqual.sh was the stale pre-tuning version: no NOALERT_LABELS knob, trigger-happy thresholds (2% loss / 60 ms jitter / 3 rounds vs the tuned 15% / 120 ms / 6), a hard-coded ntfy topic, and no TIMER/Error-73 vCPU probe at all — so the Error 73 early warning had been silently missing since the restore. Docs described the new script; prod ran the old one.

Diagnosis:

ssh root@147.93.168.216 'grep -n NOALERT /usr/local/bin/nuc-linkqual.sh'   # no hit = stale
diff <(ssh root@147.93.168.216 cat /usr/local/bin/nuc-linkqual.sh) \
     internal-docs/scripts/nuc-linkqual.sh

Fix: backup (nuc-linkqual.sh.bak-2026-07-07), deploy the repo copy internal-docs/scripts/nuc-linkqual.sh → /usr/local/bin/, systemctl restart nuc-linkqual. Verified: CSV logs NUC + V7 + TIMER rows; V7 at 100% loss (wg1 re-handshake still outstanding) produced no page; TIMER read healthy (p99 0.09 ms, 50/50 ticks); nuc-watchdog untouched.

Un-pause V7 later: set NOALERT_LABELS="" in /etc/nuc-watchdog.env and restart the service.

Prevention: after any VPS restore, diff every /usr/local/bin watchdog script against internal-docs/scripts/ — a restore resurrects whatever the image had, not what the docs describe.

Error 90: SSH to prod times out while ping works — asterisk-pjsip fail2ban ban drops ALL ports¶

Date: 2026-07-07 · Env: prod 147.93.168.216

Symptoms: ssh root@147.93.168.216 times out from one machine; ICMP ping is fine (INPUT policy DROP doesn't cover the established ICMP path); staging SSH works. Looks like an sshd ban, but fail2ban-client status sshd doesn't list the IP.

Root cause: the asterisk-pjsip jail banned the office/home public IP because a local SIP device (softphone/ATA with stale credentials) was repeatedly failing auth against prod. Fail2ban's default iptables action inserts an all-ports DROP, so a SIP-triggered ban takes out SSH too.

Diagnosis / fix (from a non-banned host — staging works as a jump):

ssh -J root@94.136.188.221 root@147.93.168.216 \
  'grep <your-ip> /var/log/fail2ban.log | tail; fail2ban-client unban <your-ip>'

Note: staging has no key for prod — use -J (ProxyJump) so your local key authenticates while the TCP connection originates from staging.

Prevention: fix or decommission the SIP device doing the failing registrations, or add the trusted IP to ignoreip (see fail2ban runbook). The runbook's jail table predates the asterisk-pjsip jail — the same unban/ignoreip commands apply with that jail name.

Error 91: Android softphone "no audio" on the emulator — arm-on-x86 libsrtp translation artifact¶

Date: 2026-07-09 · Env: Android app (own AstraSIP+WebRTC engine) on an x86_64 emulator

Symptoms: call connects, signaling clean, but zero / one-way audio; logcat shows Failed to init SRTP err=11 and RTP packets dropped. Looks like a media-engine bug in the app.

Root cause: the debug APK shipped arm-only ABIs, so on an x86_64 emulator it ran arm64 under binary translation — and libsrtp fails to initialize under translation, dropping every RTP packet. Not an engine bug. Run the app natively (add x86_64 to the debug abiFilters) and audio is correct: staging *43 echo → out 600 / in 586 pkts, ICE CONNECTED, DTLS done, zero SRTP errors. The emulator cannot validate media unless the ABI matches the host — this is the classic RCA-rule trap (an environment artifact, not the code). Related: the emulator mic defaults to off in AVD settings → also reads as "no audio". Confirm audio on a real arm64 device, never the emulator.

Secondary (same session): a backgrounded call that reached Established crashed via ForegroundServiceStartNotAllowedException when MicForegroundService.start() fired while the app wasn't foreground — killing the process. A killed process sends no SIP BYE, so with Echo() (no RTP timeout) the channel lived forever; repeated crashed test-dials piled up orphaned *43 channels. Fixed by guarding the FGS start (b4f4a4a84). Belt-and-braces for any hard-kill (swipe-away/OOM) orphan: set rtp_timeout on the webrtc endpoint so client-death channels self-reap server-side.

Error 92: Mobile app can't register on PROD — no WSS listener (prod restored from pre-WSS image)¶

Date: 2026-07-09 · Env: prod 147.93.168.216

Symptoms: the AstraCall mobile engine (iOS + Android) registers fine on staging wss://stagesip:8089/ws but wss://devsip.astradial.com:8089/ws refuses connections on prod. The iOS App Store build points at prod devsip, so this silently breaks the launch.

Root cause: prod was restored from the old Mumbai restic image that predates WSS — prod http.conf was bindaddr=127.0.0.1 with no TLS listener and there was no Let's Encrypt cert for devsip. Staging got WSS during iOS engine testing; prod never did.

Fix (all additive: Asterisk NOT restarted, no calls dropped, nginx untouched):

1. certbot LE cert for devsip.astradial.com via webroot -w /var/www/html (nginx catch-all already serves :80 from there → zero nginx change).
2. Append a TLS listener to /etc/asterisk/http.conf [general] (keep bindaddr=127.0.0.1; add tlsenable=yes, tlsbindaddr=0.0.0.0:8089, tlscertfile/tlsprivatekey). Backup: http.conf.bak-wss-2026-07-09.
3. KEY GOTCHA: prod Asterisk runs as the asterisk user (staging runs as root); LE /etc/letsencrypt/{live,archive} are root-only 700, so asterisk can't read the key → TLS silently doesn't start. Copy cert+key to /etc/asterisk/keys/devsip-*.pem (chown asterisk, chmod 600), point http.conf there, core reload → HTTPS Server Bound to 0.0.0.0:8089, /ws → 426, cert publicly trusted.
4. Renewal durability: /etc/letsencrypt/renewal-hooks/deploy/astradial-asterisk-devsip.sh re-copies + chowns the cert and core reloads on every renewal (LE live dir would revert to root-only otherwise). Dry-run passed.

Note: a mobile endpoint must also be sip_port='wss' (transport=ws + webrtc=yes) to actually place a WebRTC call — see the Android/iOS status docs.

Error 93: Analytics call counts dropped after 2026-07-11 — now PSTN-only (expected)¶

Symptom: dashboard/app "total calls", "answered", "missed", "avg duration" show lower numbers than before, for the same window.

Cause (not a bug): as of 2026-07-11, the analytics headline metrics count PSTN calls only. Internal ext-to-ext calls are excluded because they inflated the numbers that operators read as customer traffic. Two places changed, both keyed off callDirection.js isInternal / the row direction field:

- Backend GET /api/v1/calls/stats (editor dashboard): total_calls/answered/missed/avg_duration use per-metric CASE WHEN NOT (isInternal) …. The inbound/outbound/internal breakdown series + the weekly chart are UNCHANGED: the internal split still shows real counts; only the top-line numbers went PSTN-only.
- iOS Pulse (computes locally from /api/v1/calls): PulseSnapshot.compute filters out direction == "internal" rows. Recents/call-log screens are untouched.

Verify: headline totals should now equal inbound + outbound (not + internal). If a number looks wrong, check the row's direction classification in callDirection.js.
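
The check above can be sketched in a few lines of Python. This is an illustration only; the row shape (a direction string plus a boolean answered flag) and the function name are assumptions, and avg_duration is left out because its exact definition is not given here.

```python
def headline_metrics(rows):
    """Headline counts over PSTN rows only (internal rows are excluded)."""
    pstn = [r for r in rows if r["direction"] != "internal"]
    answered = sum(1 for r in pstn if r["answered"])
    return {
        "total_calls": len(pstn),
        "answered": answered,
        "missed": len(pstn) - answered,
    }


ROWS = [
    {"direction": "inbound", "answered": True},
    {"direction": "inbound", "answered": False},
    {"direction": "outbound", "answered": True},
    {"direction": "internal", "answered": True},
]
metrics = headline_metrics(ROWS)
inbound = sum(1 for r in ROWS if r["direction"] == "inbound")
outbound = sum(1 for r in ROWS if r["direction"] == "outbound")
print(metrics, metrics["total_calls"] == inbound + outbound)
```

If the equality fails on real data, the usual suspect is a row whose direction was classified differently by the stats query and by the breakdown series.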

Error 94: iOS in-call screen shows "Calling…" with no number¶

Symptom: on the AstraCall in-call screen the caller/callee identity is blank — only the static "Calling…" / "Incoming Call…" label shows. (The native CallKit banner was fine.)

Cause: the app's own in-call view (NativeCallUI) reads CallViewModel.displayName, which is fed only by AstraCallManager.onCallerName — a callback that was declared and consumed but never invoked. So the name stayed empty.

Fix (2026-07-11): AstraCallManager.call(_:didChange:) now fires onCallerName with the cleaned remote party — outbound remoteParty is a raw sip:1005@host:port URI, stripped to the user part; inbound is already the parsed From (Name (1001)). Shows the extension/number for both directions. (Edge case left: onCallerName is global, not per-UUID — a call-waiting second INVITE could overwrite the active call's name; revisit if call-waiting display matters.)
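
The "stripped to the user part" step can be illustrated as follows. The app itself is Swift; this Python sketch only shows the string handling, with a hypothetical function name, and it assumes the outbound value is a bare sip: or sips: URI without angle brackets.

```python
def clean_remote_party(remote_party):
    """Reduce a raw SIP URI to its user part; pass other strings through."""
    text = remote_party.strip()
    if text.lower().startswith(("sip:", "sips:")):
        rest = text.split(":", 1)[1]      # drop the scheme
        user = rest.split("@", 1)[0]      # keep what precedes the host
        return user.split(";", 1)[0]      # drop any ;param on the user part
    # Inbound values are already a parsed display string, e.g. "Name (1001)".
    return text


print(clean_remote_party("sip:1005@host:5060"))
print(clean_remote_party("Name (1001)"))
```

A URI with no user part (for example sip:host:5060) would come back as the host string in this sketch, so a real implementation may want to handle that case explicitly.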

Error 95: Alphion ONT FXS gateway — fake busy / can't-dial-out / no 2nd call¶

Three distinct faults hit on the Alphion ONT FXS gateway at Thangavelu Hospital (org bd5706c3-cf18-424b-a790-368019bc40eb / org_mp3t4g5m, 2026-07-12). The Alphion is an analog telephone adapter with a web UI at its LAN IP (e.g. http://192.168.1.118) that registers each FXS port as a SIP user to devsip.astradial.com:5080 UDP. All three faults are device-side — the prod dialplan and pjsip config were correct throughout.

Provisioning reference for the other ATA families (Grandstream GXW/HT8xx, Dinstar DAG) is Provision an ATA onto the VPN. Alphion is a different family; the fixes below are Alphion-specific web-UI paths.

95a — Fake busy tone, 0s, on every internal call¶

Symptom: every phone on the ATA gets a busy tone the instant it dials any internal extension. Call logs show the calls reaching Asterisk with correctly-resolved extension names (311 → Pharmacy, 308 → Ward), each Busy / 0s duration.

Root cause: the ATA's own SIP ALG (Application Layer Gateway). It intercepts the device's SIP on its control port (default 64888), rewrites SIP/SDP on the way out, then cannot correlate the return INVITE Asterisk sends back for the callee's line — so the ATA rejects it as 486 Busy. This is not a digit-map, dialplan, or DEVICE_STATE problem.

Diagnosis (rule the server out first):

# Endpoints ARE registered + reachable and device state is clean:
asterisk -rx "pjsip show endpoint <epid>"        # State: Not in use, Contact: Avail
asterisk -rx "dialplan eval function DEVICE_STATE(PJSIP/<epid>)"   # NOT_INUSE
# All FXS lines share one NAT source port (normal for a multi-account ATA, NOT the bug):
asterisk -rx "pjsip show contacts" | grep <public-ip>   # e.g. 16 × 103.164.181.2:2050

If the calls resolve correctly and the endpoints are NOT_INUSE/Avail yet callers still get busy, the fault is on the ATA, not Asterisk.

Fix (on the Alphion web UI): Network → Advanced Options → ALG → uncheck SIP (leave FTP/H323/L2TP/RTSP/IPSEC/PPTP) → Save → Apply. Registrations blip and re-register (~30 s); then internal calls ring.

Prevention rule: disable SIP ALG on every Alphion ONT at provisioning. (VPN-over-tunnel is the cleaner fix — no NAT in the path at all — but hospital sites often can't run OpenVPN, so ALG-off on the device is the field fix.)

95b — Extension can receive incoming but cannot dial out (ext 315)¶

Symptom: one extension (OT Room, ext 315) rings fine on inbound but cannot originate a call.

Root cause A (the actual fix): the analog phone's P/T switch was on P (Pulse). Pulse dialing sends no DTMF, so the ATA never collects the dialed digits → no outbound INVITE. Set the handset to T (Tone).

Root cause B (also present, worth clearing): a stuck/orphaned channel pinning the FXS port busy. These Alphion FXS ports do not reliably send BYE, and the endpoint had no RTP timeout, so a dropped call leaks an Up channel forever:

asterisk -rx "core show channels" | grep <epid>
#   PJSIP/<epid>-XXXX   Up   Dial(PJSIP/<other-epid>,30,tT)   315→311
asterisk -rx "dialplan eval function DEVICE_STATE(PJSIP/<epid>)"   # INUSE  (should be NOT_INUSE)

A pinned INUSE port still receives (contact is registered) but can't originate.

Fix:

# Set the handset to Tone (T), then clear any leaked channel:
asterisk -rx "channel request hangup PJSIP/<epid>-XXXX"

Prevention rule: standardise handsets on Tone dialing. Add RTP timeouts to the org endpoints so leaked channels self-clear ~2 min after the phone hangs up (needs prod pjsip edit + pjsip reload, per Rules 1–3):

rtp_timeout=120
rtp_timeout_hold=120

95c — Enable call waiting / allow a 2nd call¶

Symptom: a phone already on a call can't accept a second incoming call (or the caller gets busy).

Fix (on the Alphion web UI): Voice → Voice Mode → Supplementary → for each user 01–16 click Check/Edit → tick cw-service (call waiting) + Call hold (required to flash between the two calls; without it call waiting can't switch) → optionally Three-Way Calling for conferences → Save → next user. After all 16, Apply on the list page. Leave call forwarding as-is; don't touch Hotline type / special dial tone.

cw-service only engages when the line is already on a call. It does not fix the 95a first-call fake-busy — that's SIP ALG.

On-call feature codes vs the digit map (context)¶

The Digit Map (Voice → Digit Map Settings, Max match) only parses the idle/dialtone dial string. Of the on-call feature codes, only *8 (directed pickup) is an idle dial that belongs in the map. ## (blind xfer), *2 (attended xfer), *5 (agent number), and *1/*3/*4 (abort / 3-way / swap) are in-call DTMF codes from features.conf (featuremap/applicationmap) — the digit map cannot enable them; they require dtmf_mode=rfc4733 / RFC2833 DTMF on the ATA (verified on all Thangavelu endpoints). Listing them as standalone idle patterns would itself return busy from idle.

Error 96: GCS "Cloud Storage" bill exploding (₹4k+/mo) — rclone list-egress, not storage¶

Symptom: GCS Cloud Storage cost on the Zazmic billing account spikes from ~₹0 to ₹600–680/day, flat (₹4,482 over Jul 1–13 2026, 2346%), projecting ~$125/mo vs the $2/mo cap. Storage volume is unchanged and no data is lost — every backup is healthy.

Root cause (not what it looks like): the cost is list-response egress, not stored bytes. move-recordings.sh (prod) runs every 5 min and each rclone copy/move traversed the flat astra_pbx/recordings/ prefix — now ~100,692 objects — in a us-central1 bucket. rclone lists the entire destination before transferring (~101 pages × ~1.1 MB ≈ 90 MB JSON per sweep); from Mumbai that's inter-continent "Download APAC" egress (₹4,141 = 91%) plus Class A ListObjects ops (₹291, +513%). Trigger: the Jul-4 box cutover crontab set */5 (12× the documented hourly) on an already-huge corpus → the flat daily curve. SKU proof: billing report Group by SKU shows Download APAC + Regional Standard Class A Operations, storage only ~₹99.

Diagnose:

# billing console → Reports → Group by SKU (the "Download APAC" line is the tell)
ssh root@147.93.168.216 'grep rclone /opt/astrapbx/scripts/move-recordings.sh'   # traversal on a flat prefix?
# Cloud Monitoring: api_request_count by method on the bucket → ListObjects ~2,500/hr = smoking gun

Fix: add --no-traverse to every rclone copy/move into the large flat prefix → rclone does a per-source-file HEAD instead of listing the whole destination. Only a few new WAVs exist per run (--min-age 5m; move deletes sources on success) → strictly cheaper, idempotent, identical upload behaviour, no data-loss risk. Cuts ~95% of the cost even at 5-min cadence. Verify: after a cron run tail /var/log/rclone-recordings.log still shows Copied (new)/Deleted + Sync complete, and ListObjects/hr collapses to the low tens. (PR astradial-platform#534.)

Prevent: never sync a flat, high-object-count prefix repeatedly without --no-traverse. Long-term: date-partition (recordings/YYYY/MM/) so any listing stays small (also needed for the 2029 purge), and/or host the bucket in asia-south1 to make egress India-local. See the Backup & Restore Runbook → Cost.
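
For a rough sense of scale, the figures above can be combined as in this Python sketch. The page size of 1,000 objects is an assumption inferred from "~101 pages" for ~100,692 objects, and the result is per rclone command that traverses the prefix; how many such commands the script runs per sweep is not stated here.

```python
import math

OBJECTS = 100_692            # objects under the flat prefix (from the text)
PAGE_SIZE = 1000             # assumption: objects returned per list page
RUNS_PER_HOUR = 60 // 5      # */5 cron cadence
MB_PER_TRAVERSAL = 90        # listing JSON per full traversal (from the text)

pages = math.ceil(OBJECTS / PAGE_SIZE)
list_calls_per_hour = pages * RUNS_PER_HOUR
gb_per_day = MB_PER_TRAVERSAL * RUNS_PER_HOUR * 24 / 1000

print(pages)                 # list pages per traversal
print(list_calls_per_hour)   # list calls per hour, per traversing command
print(gb_per_day)            # listing egress in GB per day, per traversing command
```

One traversing command gives 101 pages and about 1,212 list calls per hour. The observed ~2,500/hr would be consistent with roughly two traversals per run, which is an inference rather than something confirmed from the script. With --no-traverse the per-run listing disappears, so both numbers should fall to whatever the handful of per-file checks costs.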

Error 97: Android app one-way audio on a real device — RECORD_AUDIO never granted¶

Date: 2026-07-14 · Env: AstraDial Android (com.astradial.phone, own AstraSIP+WebRTC engine) on the V2037 (real device). Distinct from Error 91 (emulator artifact) — this is a real-device permission bug.

Symptoms: call connects, the user hears the far side, but the far side hears nothing (device transmits no voice); echo test plays back silence.

Root cause: the mic permission was not granted. dumpsys package com.astradial.phone | grep RECORD_AUDIO showed granted=false, flags=[… ONE_TIME]. The app only requested RECORD_AUDIO once, at the main tab shell (MainActivity), so (a) a user who tapped "Only this time" lost it on next cold start, and (b) after one or two denials Android stops showing the dialog (permanent deny) → the in-app request becomes a silent no-op. No mic permission → WebRTC AudioRecord produces no outgoing RTP.

Fix: request RECORD_AUDIO at call time, not just at startup — the VoIP dial path checks the grant and prompts right before the call, and toasts "Enable Microphone in Settings" if permanently denied (SIM-calls path needs no app mic). Immediate unblock on a test device: adb shell pm grant com.astradial.phone android.permission.RECORD_AUDIO then force-stop. Commit 77c5aeb.

Prevent: for any calling feature, gate the action on the runtime permission at the point of use; never rely on a one-shot startup request. Tell users to pick "While using the app", not "Only this time".

Error 98: Android incoming call is silent + shows as a heads-up notification (not full-screen) when unlocked¶

Date: 2026-07-14 · Env: AstraDial Android — self-managed Telecom ConnectionService.

Symptoms: incoming call plays no ringtone; when the phone is unlocked it appears as a heads-up notification instead of the full-screen call UI; unlocking shows the call screen but still no ring audio.

Root cause: two things. (1) A self-managed ConnectionService gets no OS ringtone and no OS incoming UI — the app must play its own ring, and there was zero ringtone code. (2) A setFullScreenIntent is demoted to a heads-up banner by Android whenever the device is unlocked/interactive — that's platform policy, not a bug (full-screen only fires when locked/screen-off).

Fix: AstraConnection.showIncomingUi() now starts a looping ringtone (MediaPlayer, USAGE_NOTIFICATION_RINGTONE) + vibration, gated on AudioManager.ringerMode (normal/vibrate/silent), and stops it on answer/reject/finish. The notification gained Answer / Decline actions so the unlocked heads-up is audible + actionable. Also added the VIBRATE permission. Commit 5abf7c8. The unlocked heads-up itself can't be forced full-screen (that's Android policy); making it ring + actionable resolves the complaint.

Prevent: any self-managed calling app must own its ringtone + incoming UI; don't expect the OS to ring for PROPERTY_SELF_MANAGED connections.

Error 99: Android declined call keeps re-arriving — 486 Busy + a wake-race sent no final SIP response¶

Date: 2026-07-14 · Env: AstraDial Android — self-managed Telecom + astrasip.

Symptoms: the user declines an incoming call but it rings again shortly after (caller/dialplan retries).

Root cause: two bugs. (1) Reject sent 486 Busy Here, which makes Asterisk treat the endpoint as busy and retry/fork → the call comes back. A user decline should be 603 Decline (definitive). (2) On the FCM-wake path the placeholder Telecom Connection is created before the INVITE arrives; a reject during that window called decline() on a null CallSession → no final SIP response at all, so the INVITE kept ringing until timeout.

Fix: AstraConnection.onReject() now sends decline(busy = false) → 603 Decline, sets a rejected flag, and bind() declines the INVITE the instant it lands — closing the pre-INVITE wake race. Commit 5abf7c8.
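The pre-INVITE race is independent of Android, so it can be modelled in a few lines. The sketch below is a language-neutral illustration in Python, not the app's Kotlin code: `CallSession`, `Connection`, `on_reject` and the `sent` list are hypothetical stand-ins for the real AstraConnection / astrasip objects. It shows the two properties the fix relies on: a user reject always maps to 603, and a reject that arrives before the session exists is remembered and applied on bind.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CallSession:
    """Hypothetical stand-in for the SIP session; records final responses sent."""
    sent: List[int] = field(default_factory=list)

    def decline(self, busy: bool) -> None:
        # busy=True -> 486 Busy Here, busy=False -> 603 Decline
        self.sent.append(486 if busy else 603)


@dataclass
class Connection:
    """Placeholder connection that may be created before the INVITE arrives."""
    session: Optional[CallSession] = None
    rejected: bool = False

    def on_reject(self) -> None:
        # Remember the decision even if there is no session to decline yet.
        self.rejected = True
        if self.session is not None:
            self.session.decline(busy=False)

    def bind(self, session: CallSession) -> None:
        # Called when the INVITE lands and a real session exists.
        self.session = session
        if self.rejected:
            # The user already declined during the wake window.
            session.decline(busy=False)
```

In this model, rejecting before or after `bind()` both end with exactly one 603 on the session; the pre-fix behaviour corresponds to `on_reject()` doing nothing when `session` is `None` and `bind()` never checking the flag.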

Prevent: for a user-initiated reject use 603, not 486; and any placeholder-before-INVITE design must remember a pre-INVITE reject and apply it on bind.

Error 100: Android call screen doesn't blank when held to the ear¶

Date: 2026-07-14 · Env: AstraDial Android — in-call.

Symptoms: during a call, bringing the phone to the ear does not turn the screen off (unlike a normal phone call) → cheek touches.

Root cause: no proximity wake lock was acquired for the call.

Fix: Engine.enterCallAudio() acquires a PROXIMITY_SCREEN_OFF_WAKE_LOCK; exitCallAudio() releases it (kernel/sensor-driven, no UI needed). Commit 77c5aeb.

Prevent: acquire the proximity lock whenever call audio is active, release it on call end.

Error 101: FXS gateway returns 486 Busy on an IDLE port — "fake busy" from a lost BYE¶

Date: 2026-07-18 · Env: Mithra Scans (org_mpzgc0af, Grandstream GXW4216V2, fw 1.0.25.2), SIP on the public path. Same class of symptom possible on any NAT'd FXS gateway. Severity: P2 — intermittent; ~21 internal calls/day failed to connect, patients (queue/inbound) largely unaffected.

Symptom¶

Staff report "many calls go to busy" and — critically — "the phone was in front of me, idle, and never rang, yet the caller got a busy tone." Call Logs show many Busy rows (retry-inflated). The device's Status → Port Status shows the port On Hook / Registered.

Root cause (proven on the wire, not inferred)¶

A call ends but its BYE is lost crossing the hospital NAT / SIP-ALG path. Asterisk tears down its leg and repeatedly BYEs the device; the GXW never receives a BYE it accepts, so its FXS port stays allocated ("in use") and returns 486 Busy Here with Warning: 399 GS "All channels are in use" to every new INVITE — while the handset is physically idle — for ~10 min, until a retried BYE finally lands and the port frees. rtp_timeout=0 on the endpoints (the default) means Asterisk has no reaper to shorten that window.

Packet capture (tcpdump ... "host <site-public-ip> and udp port 5080", SIP only) caught it directly: 28× 486 Busy Here on ports the CDR proved had ZERO overlapping calls; distinct, non-sequential INVITE CSeqs (many separate callers, one stuck port); and Asterisk kept sending BYEs to the device while the port was refusing with 486, until a burst landed and the port recovered.

Differential diagnosis — this is NOT the other fake-busy causes¶

NOT Error 95 SIP-ALG 486: that mangles the inbound INVITE correlation → total, persistent busy the instant any extension is dialled. This one is intermittent, per-port, self-healing, and hits idle ports selectively.
NOT the DEVSTATE pre-empt (PR #484/#485, intercom fake-busy): the dialplan already does engaged → Dial; lastapp=Busy appears on almost no rows (the Busy(20) path is rarely reached — a hangup-handler overwrites lastapp, so that CDR field is NOT a reliable "heard busy" count).
NOT Error 71 (the earlier Mithra "not reaching" = loose cable / queue found no member). Different symptom (no ring vs 486 busy), different cause.
NOT genuine occupancy: a real busy is the port actually on another call.

The real-vs-fake verdict method (reusable)¶

For each captured 486, resolve the endpoint → extension and ask the CDR whether that endpoint had an ANSWERED call spanning that timestamp (match on the PJSIP/<endpoint>- prefix — the full dstchannel suffix is unique per call and will never self-match — and check both channel and dstchannel so an outbound leg counts). 0 overlap = fake (idle port). Also beware: the Call Logs "Busy" filter and raw disposition=BUSY counts are retry-inflated ~5×; the number that matters is distinct linkedids that never reached an ANSWERED leg.
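The overlap check can be sketched as a small function. This is a minimal illustration, assuming the CDR rows have already been loaded into memory with the standard Asterisk CDR fields `channel`, `dstchannel`, `disposition`, `answer` and `end`; the `CdrRow` type and the `is_fake_busy` name are hypothetical, and how the rows are fetched (SQL, CSV export) is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CdrRow:
    channel: str        # e.g. "PJSIP/2001-0000abcd"
    dstchannel: str     # e.g. "PJSIP/2002-0000abce"
    disposition: str    # "ANSWERED", "BUSY", "NO ANSWER", ...
    answer: datetime    # when the call was answered
    end: datetime       # when the call ended


def is_fake_busy(endpoint: str, busy_at: datetime, rows: list[CdrRow]) -> bool:
    """Return True if no ANSWERED call on `endpoint` spans `busy_at`."""
    # Match on the "PJSIP/<endpoint>-" prefix only: the suffix after the
    # dash is unique per call, so a full-string match would never hit.
    prefix = f"PJSIP/{endpoint}-"
    for row in rows:
        if row.disposition != "ANSWERED":
            continue
        # Check both legs so an outbound call from the port also counts.
        on_endpoint = row.channel.startswith(prefix) or row.dstchannel.startswith(prefix)
        if on_endpoint and row.answer <= busy_at <= row.end:
            return False  # genuine occupancy: the port really was on a call
    return True  # zero overlap: the port was idle, so the 486 was fake
```

Run it once per captured 486, passing the endpoint resolved from the capture and the timestamp of the response. The trailing dash in the prefix also keeps endpoint `2001` from matching channels of `20011`.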

Fix¶

Root cure — repoint SIP onto the VPN tunnel (SIP Server 10.21.0.1:5080, NAT Traversal = VPN). No NAT/ALG in the path → BYEs are delivered → ports free normally → ghosts don't form. Same repoint as ACR (L-1); needs a maintenance window (16 ports blip ~1–2 min) and, because tunnel-only SIP is a new single point of failure, set Failover SIP Server + Prefer Primary = Yes (see the ATA provisioning guide). ⚠ Measured caveat: the repoint reduces but does not fully eliminate fake busy (a rare BYE can still drop even over the tunnel; ACR post-repoint still saw a few) — verify with a follow-up capture, don't assume.
Defense-in-depth — rtp_timeout reaper (per-org, shrinks each stuck window ~10 min → N s). Shipping in PR #535 (OPEN, not yet merged/deployed) — reference only; do not treat as available yet. Only safe where the device has silence suppression OFF (VAD-on endpoints stream no RTP during silence → a short timeout would drop a live call on a long pause).

Prevention rule¶

When staff say "idle phone, never rang, caller got busy": believe the staff over the CDR aggregate (the metric is retry-inflated and lastapp-overwritten). Confirm with a bounded SIP capture + the overlap verdict above. The presence of Warning: 399 "All channels are in use" on an idle port = stuck channel from a lost BYE, not real occupancy. Root fix is removing NAT from the SIP path (tunnel); rtp_timeout only shortens the damage. See also memory mithra-fake-busy-lost-bye-nat, intercom-fake-busy-rca-and-fix.
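To see past the retry inflation, count calls rather than rows. The sketch below is a minimal illustration, assuming the CDR has been reduced to `(linkedid, disposition)` pairs; the function name is hypothetical, and restricting the count to calls that have at least one BUSY leg is an assumption of this sketch (drop that condition to count every linkedid that never reached an ANSWERED leg).

```python
from collections import defaultdict
from typing import Iterable, Tuple


def distinct_busy_failures(rows: Iterable[Tuple[str, str]]) -> int:
    """Count distinct linkedids with a BUSY leg and no ANSWERED leg."""
    seen = defaultdict(set)
    for linkedid, disposition in rows:
        # Retries of the same call share one linkedid, so they collapse here.
        seen[linkedid].add(disposition)
    return sum(
        1
        for dispositions in seen.values()
        if "BUSY" in dispositions and "ANSWERED" not in dispositions
    )
```

Comparing this number with the raw `disposition=BUSY` row count gives the inflation factor for the site and period in question.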