Troubleshooting¶
This page documents all known errors encountered in the Astradial infrastructure, their root causes, and fixes.
Error 1: Asterisk Won't Start -- Bind to NNI IP Fails¶
Symptom¶
Asterisk fails to start with an error about being unable to bind to the NNI interface IP address.
Root Cause¶
The PJSIP transport was configured to bind to the NNI interface IP (e.g., 172.16.x.x). When the NNI interface (enp86s0) is DOWN -- for example after a reboot before the interface comes up -- Asterisk cannot bind to that IP and refuses to start.
Diagnosis¶
Fix¶
Change the PJSIP transport bind address to 0.0.0.0 instead of the specific NNI IP, for example bind=0.0.0.0:5060 in the transport section.
Prevention Rule¶
Never bind PJSIP transports to a specific interface IP. Always use 0.0.0.0 to bind on all interfaces.
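The rule above can be checked mechanically before a restart. The following is a minimal sketch, assuming the transports are defined in pjsip.conf-style text with IPv4 bind addresses. The function name findSpecificTransportBinds, the sample config, and the address 172.16.0.10 are hypothetical and not taken from the Astradial codebase.

```javascript
// Hypothetical lint helper: flags PJSIP transport sections whose bind
// address is a specific IPv4 address instead of 0.0.0.0.
function findSpecificTransportBinds(confText) {
  const offenders = [];
  let section = null;
  let fields = {};

  // Record the section that just ended if it is a transport bound to one IP.
  const flush = () => {
    if (section && fields.type === 'transport' && fields.bind) {
      const host = fields.bind.split(':')[0];
      if (host !== '0.0.0.0') offenders.push({ section, bind: fields.bind });
    }
  };

  for (const rawLine of confText.split('\n')) {
    const line = rawLine.split(';')[0].trim(); // drop ; comments
    if (!line) continue;
    const header = line.match(/^\[([^\]]+)\]/);
    if (header) {
      flush();
      section = header[1];
      fields = {};
      continue;
    }
    const eq = line.indexOf('=');
    if (eq > 0) fields[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  flush();
  return offenders;
}

// Illustrative config only; the first transport violates the rule.
const sample = `
[transport-udp]
type=transport
protocol=udp
bind=172.16.0.10:5060

[transport-udp-alt]
type=transport
protocol=udp
bind=0.0.0.0:5080
`;
console.log(findSpecificTransportBinds(sample));
```

An empty result means every transport binds on all interfaces, so Asterisk can start even while enp86s0 is still DOWN.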
Error 2: Tata Calls "Number Incorrect"¶
Symptom¶
Outbound or inbound calls via Tata fail with "number incorrect" or similar rejection. Calls may partially work but SIP responses are not reaching Tata.
Root Cause¶
Two issues combined:
- Routes for the Tata network were missing from the NUC routing table.
- SIP responses were going out the wrong network interface (default route instead of the NNI interface).
Diagnosis¶
# Check routing table
ip route show
# Check which interface SIP traffic uses
tcpdump -i enp86s0 port 5060
tcpdump -i any port 5060
Fix¶
Add explicit routes to ensure Tata SIP traffic goes through the NNI gateway:
Make routes persistent in /etc/network/interfaces.d/enp86s0.
Prevention Rule¶
Always verify routing after network changes. SIP responses must leave through the same interface the request arrived on.
Error 3: tata-endpoint Shows "Unavailable"¶
Symptom¶
pjsip show endpoints shows the tata_gateway endpoint status as "Unavailable".
Root Cause¶
Asterisk sends SIP OPTIONS to qualify endpoints. Tata's SBC ignores OPTIONS requests and never responds, so Asterisk marks the endpoint as unavailable.
Diagnosis¶
Fix¶
Disable qualify by setting qualify_frequency=0 in the AOR section for the Tata gateway.
Prevention Rule¶
This is expected behavior -- the endpoint will always show "Unavailable". Do not use endpoint status to determine if the Tata trunk is working. Test with actual calls instead.
Error 4: PJSIP_EFAILEDCREDENTIAL¶
Symptom¶
SIP authentication fails with PJSIP_EFAILEDCREDENTIAL in Asterisk logs when the NUC tries to authenticate with the cloud Asterisk.
Root Cause¶
There is an authentication incompatibility between Asterisk 22 and Asterisk 20 (or between certain PJSIP versions). Username/password-based SIP authentication fails across these versions.
Diagnosis¶
Fix¶
Replace username/password authentication with IP-based identification:
[tata_gateway]
type=endpoint
identify_by=ip
[tata_gateway]
type=identify
endpoint=tata_gateway
match=10.10.10.2
Prevention Rule¶
Use identify_by=ip for machine-to-machine SIP connections where both sides have known, fixed IPs (e.g., over WireGuard).
Error 5: systemctl Shows "active (exited)" but Nothing Running¶
Symptom¶
systemctl status asterisk shows active (exited) but Asterisk is not actually running. No process is found.
Root Cause¶
The default Asterisk package installs an init.d script that systemd wraps. The init.d script has a bug where it reports success (exits 0) even when Asterisk fails to start or is not running.
Diagnosis¶
systemctl status asterisk
# Shows "active (exited)" instead of "active (running)"
ps aux | grep '[a]sterisk'
# No asterisk process found
Fix¶
Create a proper systemd service file that manages Asterisk directly instead of relying on the init.d wrapper:
[Unit]
Description=Asterisk PBX
After=network.target
[Service]
Type=simple
ExecStart=/usr/sbin/asterisk -f -C /etc/asterisk/asterisk.conf
ExecReload=/usr/sbin/asterisk -rx "core reload"
Restart=on-failure
[Install]
WantedBy=multi-user.target
Prevention Rule¶
Always use native systemd service files. Do not rely on init.d compatibility wrappers.
Error 6: enp86s0 DOWN After Reboot¶
Symptom¶
After a NUC reboot, the NNI interface (enp86s0) is DOWN. The Tata trunk has no network connectivity.
Root Cause¶
The NNI interface was configured manually (via ip commands) but not persisted in network configuration files. After reboot, the interface remains unconfigured and DOWN.
Diagnosis¶
Fix¶
Create a persistent configuration in /etc/network/interfaces.d/enp86s0:
Prevention Rule¶
Never configure network interfaces with transient ip commands alone. Always persist configuration in /etc/network/interfaces.d/.
Error 7: First Call After Restart Fails¶
Symptom¶
The first inbound or outbound call after an Asterisk restart fails. Subsequent calls work fine.
Root Cause¶
After Asterisk restarts, SIP registrations and endpoint discovery take time to complete. If a call arrives before the Tata gateway endpoint is fully initialized (30-60 seconds), it will fail.
Diagnosis¶
# Check how long ago Asterisk started
asterisk -rx "core show uptime"
# Check endpoint status
asterisk -rx "pjsip show endpoints"
Fix¶
Wait 30-60 seconds after restarting Asterisk before routing live calls. There is no configuration fix -- this is inherent to the SIP registration process.
Prevention Rule¶
After any Asterisk restart, verify the trunk is ready by placing a test call before routing production traffic. Avoid restarting Asterisk during business hours when possible.
Error 8: NUC Public IP Changed -- Cloud Rejected¶
Symptom¶
The NUC's SIP connection to the cloud Asterisk stops working. The cloud Asterisk rejects packets from the NUC because its public IP has changed (ISP dynamic IP).
Root Cause¶
The NUC's ISP assigns a dynamic public IP. When it changes, the cloud Asterisk's IP-based identification no longer matches, and all traffic from the NUC is rejected.
Diagnosis¶
# On NUC: check current public IP
curl ifconfig.me
# On cloud: check what IP the identify section expects
asterisk -rx "pjsip show identify tata_gateway"
Fix¶
Use a WireGuard VPN tunnel between the NUC and cloud server. WireGuard assigns fixed tunnel IPs (10.10.10.1 for cloud, 10.10.10.2 for NUC) regardless of the underlying public IP.
; Cloud Asterisk uses WireGuard IP for identification
[tata_gateway]
type=identify
endpoint=tata_gateway
match=10.10.10.2
Prevention Rule¶
Never rely on public IPs for SIP endpoint identification when either side has a dynamic IP. Always use a VPN tunnel with fixed IPs.
Error 9: Zoiper 401 Unauthorized¶
Symptom¶
Zoiper softphone on the NUC's local network receives 401 Unauthorized when trying to register with the cloud Asterisk, even though credentials are correct.
Root Cause¶
The NUC and Zoiper share the same public IP (both are behind the same NAT). The cloud Asterisk's identify_by=ip matches the source IP to the tata_gateway endpoint before evaluating Zoiper's credentials, causing an identification conflict.
Diagnosis¶
# On cloud Asterisk
asterisk -rx "pjsip set logger on"
# Observe that Zoiper REGISTER is matched to tata_gateway endpoint
Fix¶
The WireGuard tunnel solves this. The NUC's Asterisk traffic goes through the tunnel (source IP 10.10.10.2), while Zoiper's traffic goes through the public IP. These are now different source IPs, so identification works correctly.
Prevention Rule¶
When using identify_by=ip, ensure each endpoint has a unique source IP. Use VPN tunnels to separate traffic from hosts sharing a public IP.
Error 10: ISP Blocks UDP 5060¶
Symptom¶
SIP softphones on certain networks cannot register or make calls. Error messages include "SIP UDP not found" or connection timeouts.
Root Cause¶
Some ISPs block outbound UDP traffic on port 5060 as a measure against SIP abuse or toll fraud.
Diagnosis¶
# From the affected network
nc -zuv 89.116.31.109 5060
# Timeout or connection refused
# Try alternate port
nc -zuv 89.116.31.109 5080
# Success
Fix¶
Configure an alternate SIP transport on port 5080:
[transport-udp-alt]
type=transport
protocol=udp
bind=0.0.0.0:5080
external_media_address=89.116.31.109
external_signaling_address=89.116.31.109
Clients behind restrictive ISPs connect to port 5080 instead of 5060.
Prevention Rule¶
Always provide an alternate SIP port (5080) as a standard part of the deployment. Document it for end users experiencing connectivity issues.
Error 11: NUC Crashed During Netdata Build¶
Symptom¶
The NUC became unresponsive and powered off during Netdata compilation. After power-on, no Netdata installation was present.
Root Cause¶
Compiling Netdata from source is CPU-intensive. The NUC's passive cooling was insufficient, causing the CPU to overheat and trigger a thermal shutdown.
Diagnosis¶
Fix¶
Install Netdata using the --static-only flag, which downloads a pre-built static binary instead of compiling from source:
Prevention Rule¶
Never compile large software from source on passively-cooled or low-power hardware. Always use pre-built binaries or static builds.
Error 12: Upptime 404¶
Symptom¶
The Upptime status page at status.astradial.com returns a 404 error. The GitHub Actions workflows run successfully and the site appears to build, but the page is not accessible.
Root Cause¶
The Upptime workflows build the static site but do not deploy it to GitHub Pages. Without a deployment step, the gh-pages branch is never updated and GitHub Pages has nothing to serve.
Diagnosis¶
# Check if gh-pages branch exists and has recent commits
gh api repos/astradial/upptime/branches/gh-pages
Fix¶
Add the peaceiris/actions-gh-pages action to the workflow to deploy the built site to the gh-pages branch:
- uses: peaceiris/actions-gh-pages@v3
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    publish_dir: ./build
Prevention Rule¶
When using GitHub Pages with a build step, always include an explicit deployment action. Verify the gh-pages branch is being updated after workflow runs.
Error 13: POST /api/v1/users Returns 500 "Validation error"¶
Symptom¶
Creating a user via POST /api/v1/users returns HTTP 500 with {"error":"Validation error"}, even though all required fields are provided and valid.
Root Cause¶
The username column in the users table had a global unique constraint. In a multi-tenant system, this meant no two organizations could have a user with the same username. MySQL/MariaDB's default case-insensitive collation made this worse -- "Hari" and "hari" were treated as duplicates.
The catch block in the endpoint returned error.message directly, and Sequelize's SequelizeUniqueConstraintError has the message "Validation error" -- making the real cause invisible.
Diagnosis¶
# Check PM2 logs for the 500
pm2 logs astrapbx --lines 50 --nostream
# Check existing users for duplicate usernames
cd /opt/astrapbx && node -e "
  const db = require('./src/models');
  db.sequelize.authenticate().then(() =>
    db.User.findAll({attributes: ['username','org_id']})
      .then(u => { console.log(JSON.stringify(u, null, 2)); process.exit(); })
  );
"
# Check indexes on users table
cd /opt/astrapbx && node -e "
  const db = require('./src/models');
  db.sequelize.getQueryInterface().showIndex('users').then(indexes => {
    console.log(JSON.stringify(indexes.filter(i => i.unique), null, 2));
    process.exit();
  });
"
Fix¶
1. Changed the unique constraint from global to per-org (src/models/User.js):
Removed unique: true from the username field and added a composite unique index:
// Before
username: { type: DataTypes.STRING(50), allowNull: false, unique: true }
// After
username: { type: DataTypes.STRING(50), allowNull: false }
// With composite index in model options:
indexes: [{ unique: true, fields: ['org_id', 'username'], name: 'unique_org_username' }]
2. Added proper error handling (src/server.js -- POST /users catch block):
} catch (error) {
  if (error.name === 'SequelizeUniqueConstraintError') {
    return res.status(409).json({ error: 'Username already exists' });
  }
  if (error.name === 'SequelizeValidationError') {
    return res.status(400).json({ error: error.errors.map(e => e.message).join(', ') });
  }
  res.status(500).json({ error: error.message });
}
3. Added app-level username uniqueness check (before User.create):
const existingUsername = await User.findOne({ where: { org_id: req.orgId, username } });
if (existingUsername) {
  return res.status(409).json({ error: 'Username already exists' });
}
4. Database migration (run once on the server):
cd /opt/astrapbx && node -e "
  const db = require('./src/models');
  const qi = db.sequelize.getQueryInterface();
  (async () => {
    // Drop the old global unique index, then add the per-org composite one
    await qi.removeIndex('users', 'username');
    await qi.addIndex('users', ['org_id', 'username'], { unique: true, name: 'unique_org_username' });
    process.exit();
  })().catch(err => { console.error(err); process.exit(1); });
"
Prevention Rule¶
In multi-tenant systems, unique constraints on user-facing fields (username, extension, etc.) must always be scoped to the organization (org_id), never global. Always handle SequelizeUniqueConstraintError and SequelizeValidationError explicitly in catch blocks -- never let them fall through as generic 500 errors.
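Before adding the composite index, it can help to confirm that no existing rows would violate it. The sketch below is illustrative only: the helper names usernameKey and findConflicts are hypothetical, and lowercasing is an assumption that stands in for the case-insensitive collation described above.

```javascript
// Hypothetical helper: builds the key that a per-org unique index effectively
// enforces. Lowercasing mimics a case-insensitive collation (assumption).
function usernameKey(orgId, username) {
  return `${orgId}:${String(username).toLowerCase()}`;
}

// Returns pairs of rows that would collide under the composite index.
function findConflicts(users) {
  const seen = new Map();
  const conflicts = [];
  for (const u of users) {
    const key = usernameKey(u.org_id, u.username);
    if (seen.has(key)) conflicts.push([seen.get(key), u]);
    else seen.set(key, u);
  }
  return conflicts;
}

// Illustrative rows, not real data.
const users = [
  { org_id: 'org_a', username: 'Hari' },
  { org_id: 'org_b', username: 'hari' }, // different org: allowed
  { org_id: 'org_a', username: 'hari' }, // same org, differs only by case: conflict
];
console.log(findConflicts(users).length);
```

The rows passed in could come from the User.findAll query shown in the Diagnosis section.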
Error 14: Outbound Calls 403 Forbidden -- Caller ID Format¶
Symptom¶
Outbound calls via the Tata trunk return 403 Forbidden - 6034. The call connects to the NUC gateway successfully, the NUC forwards to Tata's SBC, but Tata rejects immediately.
Root Cause¶
Two issues combined:
- No caller ID set on outbound route: The outbound route's caller_id_override was null, so Asterisk sent the internal extension number (e.g., 1002) or anonymous as the caller ID. Tata rejects calls without a valid DID as caller ID.
- NUC double-prefixed the caller ID: The NUC's from-cloud context did Set(CALLERID(num)=+91${CALLERID(num)}). When the cloud sent 08065978002 (with leading 0), the NUC produced +9108065978002, which is an invalid E.164 number.
Diagnosis¶
# On the cloud server - check outbound route caller ID
curl -s -H 'X-API-Key: org_XXXX' \
http://localhost:8000/api/v1/outbound-routes | python3 -m json.tool
# On the NUC - enable SIP logger and check the INVITE sent to Tata
asterisk -rx 'pjsip set logger on'
# Look for the From: header in the INVITE to 10.79.215.102
SIP trace showing the problem:
INVITE sip:07400464659@10.79.215.102:5060
From: "GrandEstancia" <sip:+9108065978002@192.168.0.14>
^^^^^^^^^^^^ double-prefixed
SIP/2.0 403 Forbidden - 6034
Fix¶
1. Set caller ID on the outbound route (cloud API):
curl -X PUT -H 'X-API-Key: org_XXXX' \
-H 'Content-Type: application/json' \
-d '{"caller_id_override": "08065978002", "caller_id_name_override": "CustomerName"}' \
http://localhost:8000/api/v1/outbound-routes/{routeId}
Then redeploy: POST /api/v1/config/deploy with {"reload": true}.
2. Fix NUC caller ID transformation (/etc/asterisk/extensions.conf on NUC):
; Before (wrong - double-prefixes numbers starting with 0)
same => n,Set(CALLERID(num)=+91${CALLERID(num)})
; After (correct - strips leading 0, then adds +91)
same => n,Set(CALLERID(num)=+91${CALLERID(num):1})
Reload NUC dialplan: asterisk -rx 'dialplan reload'
Prevention Rule¶
Every outbound route must have caller_id_override set to the org's DID number. The NUC's from-cloud context must strip the leading 0 before adding +91 — use ${CALLERID(num):1} (Asterisk substring notation: skip first character).
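Note that ${CALLERID(num):1} removes the first character unconditionally, so it only produces a valid number when the caller ID really starts with 0. If the same normalization is ever done in application code, a conditional version is safer. This is a minimal sketch under that assumption; the function name toE164India is hypothetical, and formats other than the three handled in the comments are out of scope.

```javascript
// Hypothetical helper mirroring the NUC dialplan fix: convert a national
// Indian number to +91 form without producing +910... numbers.
function toE164India(callerId) {
  const digits = String(callerId).replace(/[^\d+]/g, ''); // keep digits and '+'
  if (digits.startsWith('+91')) return digits;                // already E.164
  if (digits.startsWith('0')) return '+91' + digits.slice(1); // strip trunk prefix 0
  return '+91' + digits;                                      // bare national number
}

console.log(toE164India('08065978002'));
console.log(toE164India('+918065978002'));
```

Both calls yield the same number, which is the property the dialplan fix is meant to guarantee for caller IDs with a leading 0.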
Error 15: Codec Translation Error ulaw to opus¶
Symptom¶
Internal calls between extensions fail. Asterisk logs show a codec translation error between ulaw and opus.
The call goes straight to "offline" playback without ringing.
Root Cause¶
Asterisk has res_format_attr_opus.so loaded (format description module) but not codec_opus.so (actual transcoder). When a user endpoint allows opus and the other side uses ulaw (e.g., Local channels for outbound routing, or trunk calls), Asterisk cannot transcode between them.
Diagnosis¶
# Check if opus codec module exists
asterisk -rx 'module show like opus'
# Shows res_format_attr_opus but NOT codec_opus
# Check translation paths
asterisk -rx 'core show translation paths ulaw' | grep opus
# Shows "No Translation Path"
Fix¶
Remove opus from user endpoint allow lists. The endpoints still get HD audio via g722:
// In userProvisioningService.js
// Before
config += `allow=ulaw,alaw,g722,opus\n`;
// After
config += `allow=ulaw,alaw,g722\n`;
Redeploy config and reload PJSIP: asterisk -rx 'pjsip reload'
Users must re-register their softphones (disconnect/reconnect in Zoiper) to renegotiate codecs.
Prevention Rule¶
Only include codecs in endpoint allow lists if the corresponding codec translator module is installed. Use core show translation paths to verify translation paths exist before adding a codec.
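One way to enforce this rule in the config generator is to filter the requested codecs against the set that Asterisk can actually transcode. The sketch below is an assumption about how that could look; buildAllowLine is a hypothetical helper, not the code in userProvisioningService.js, and the transcodable set would have to be filled in from the output of core show translation paths.

```javascript
// Hypothetical helper: keep only codecs that Asterisk can transcode.
// `transcodable` is supplied by the caller (assumption: derived from
// inspecting `core show translation paths`).
function buildAllowLine(requested, transcodable) {
  const usable = requested.filter((codec) => transcodable.has(codec));
  return `allow=${usable.join(',')}\n`;
}

const allowLine = buildAllowLine(
  ['ulaw', 'alaw', 'g722', 'opus'],
  new Set(['ulaw', 'alaw', 'g722']) // assumption: codec_opus.so is not installed
);
console.log(allowLine);
```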
Error 16: Phone Number Forwarding Not Working (ring_target=phone)¶
Symptom¶
User extension has ring_target: "phone" and phone_number set, but calling the extension plays "is not available" instead of ringing the external phone.
Root Cause¶
Multiple issues can cause this:
- No outbound route configured: ring_target: "phone" uses Dial(Local/number@outbound_context), which requires a working outbound route
- Caller ID not set on outbound route: Tata returns 403 (see Error 14)
- Opus codec error: Local channel cannot transcode (see Error 15)
- Config not redeployed: Database updated but Asterisk config not regenerated
Diagnosis¶
# Check the user's settings
curl -s -H 'X-API-Key: org_XXXX' \
http://localhost:8000/api/v1/users/{userId} | python3 -m json.tool | grep -E 'ring_target|phone_number'
# Check the generated dialplan
asterisk -rx 'dialplan show {extension}@{context_prefix}__internal'
# Should show: Dial(Local/{phone_number}@{context_prefix}__outbound/n,30,tT)
# Enable verbose logging and make a test call
asterisk -rx 'core set verbose 5'
# Check full.log for DIALSTATUS=CHANUNAVAIL
Fix¶
Ensure all prerequisites are in place:
- Outbound route exists with caller_id_override set to the DID
- Opus removed from endpoint allow lists (or codec_opus installed)
- NUC from-cloud context correctly transforms caller ID
- Config deployed: POST /api/v1/config/deploy with {"reload": true}
Prevention Rule¶
ring_target: "phone" depends on the full outbound call chain working: cloud outbound context → trunk → NUC → Tata. Test outbound calling from Zoiper first before enabling phone forwarding.
Error 17: Queue Rings Zoiper Instead of Phone Number¶
Symptom¶
User has ring_target=phone and phone_number set, but when called through a queue (e.g., reception 5001), their Zoiper SIP client rings instead of their phone number. If Zoiper is not registered, the queue skips the member entirely (shows "Unavailable").
Root Cause¶
Queue members in queues.conf were generated as PJSIP/endpoint which rings the SIP endpoint directly, bypassing the extension dialplan where ring_target=phone logic lives.
Additionally, the state_interface was set to PJSIP/endpoint — when Zoiper wasn't registered, Asterisk marked the member as "Unavailable" and skipped them.
Fix¶
Modified queueService.js → generateQueueMemberString():
- For users with ring_target=phone: use Local/extension@internal_context/n (no state_interface)
- For users with routing_type=ai_agent: use Local/extension@internal_context/n (no state_interface)
- For regular SIP users: keep PJSIP/endpoint (unchanged)
This routes queue calls through the extension dialplan where ring_target routing occurs.
# Before (broken)
member => PJSIP/org_mnd5khym_1001,0,"Hari",PJSIP/org_mnd5khym_1001
# After (fixed — phone user, no state_interface)
member => Local/1001@org_mnd5khym__internal/n,0,"Hari"
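The branching described above can be sketched as follows. The real generateQueueMemberString() in queueService.js is not reproduced on this page, so the signature, the member field names (extension, name, penalty), and the orgPrefix parameter are assumptions made for illustration.

```javascript
// Sketch only: signature and field names are assumptions, not the actual
// implementation in queueService.js.
function generateQueueMemberString(member, orgPrefix) {
  const penalty = member.penalty ?? 0;
  const viaDialplan =
    member.ring_target === 'phone' || member.routing_type === 'ai_agent';

  if (viaDialplan) {
    // Local channel: the call passes through the extension dialplan, where
    // ring_target routing happens. No state_interface is set.
    return `member => Local/${member.extension}@${orgPrefix}__internal/n,${penalty},"${member.name}"`;
  }
  // Regular SIP user: ring the endpoint directly and track its device state.
  const endpoint = `PJSIP/${orgPrefix}_${member.extension}`;
  return `member => ${endpoint},${penalty},"${member.name}",${endpoint}`;
}

console.log(generateQueueMemberString(
  { extension: '1001', name: 'Hari', ring_target: 'phone' },
  'org_mnd5khym'
));
```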
Verification¶
# Check queue member interfaces
asterisk -rx "queue show {queue_name}"
# Should show Local/... for phone users, PJSIP/... for SIP users
# All phone users should show "Not in use", not "Unavailable"
Error 18: Queue Timeout Not Working — Calls Ring Forever¶
Symptom¶
Queue max_wait_time set to 45 seconds via API, but calls keep ringing members indefinitely (5+ minutes) without routing to the timeout destination (e.g., AI agent on ext 1003).
Root Cause¶
The Queue() dialplan application had an extra comma, placing the timeout value in the AGI parameter slot (position 6) instead of the timeout slot (position 5):
# Bug — 45 is in position 6 (AGI), not position 5 (timeout)
Queue(org_mnd5khym_5001,cCtr,,,,45)
# Correct syntax
Queue(queuename,options,URL,announceoverride,timeout,AGI,...)
Fix¶
Fixed in dialplanGenerator.js → generateQueueExtension():
// Before (extra comma)
Queue(${name},${options},,,,${maxWait})
// After (correct)
Queue(${name},${options},,,${maxWait})
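Counting commas by hand is what caused this bug, so a generator can instead fill the Queue() arguments by name. The following is a hedged sketch rather than the code in dialplanGenerator.js; buildQueueApp and QUEUE_ARG_ORDER are hypothetical names, and only the first six positional arguments from the syntax line above are modelled.

```javascript
// Hypothetical builder: places each Queue() argument in its documented
// position so the timeout cannot drift into the AGI slot.
const QUEUE_ARG_ORDER = ['queuename', 'options', 'URL', 'announceoverride', 'timeout', 'AGI'];

function buildQueueApp(args) {
  const values = QUEUE_ARG_ORDER.map((key) => (args[key] ?? '').toString());
  // Drop trailing empty arguments so the call stays compact.
  while (values.length > 1 && values[values.length - 1] === '') values.pop();
  return `Queue(${values.join(',')})`;
}

console.log(buildQueueApp({ queuename: 'org_mnd5khym_5001', options: 'cCtr', timeout: 45 }));
```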
Verification¶
# Check generated dialplan
asterisk -rx "dialplan show 5001@org_mnd5khym__queue"
# Queue() should have exactly 4 commas before the timeout value
# e.g., Queue(org_mnd5khym_5001,cCtr,,,45)
# Test: make a call and verify it exits queue after timeout
# Check logs for: QUEUESTATUS=TIMEOUT and Goto to failover
Error 19: Queue Linear Strategy Rings Same Member Repeatedly¶
Symptom¶
With strategy=linear, the queue always rings the same member (lowest penalty) and never cycles to the next member, even after the first member doesn't answer.
Root Cause¶
linear strategy always starts from the first available member. With Local/ channel queue members (used for phone routing), there's no persistent PJSIP device state — all members appear as "Not in use" after each ring attempt. The queue picks the first member every time.
Fix¶
Changed queue strategy to rrmemory (round-robin with memory) via API:
curl -X PUT "https://devpbx.astradial.com/api/v1/queues/{id}" \
-H "X-API-Key: org_xxx" \
-H "Content-Type: application/json" \
-d '{"strategy": "rrmemory"}'
Set equal penalties for all members so rrmemory cycles through them evenly.
Strategy Reference¶
| Strategy | Behavior | Best For |
|---|---|---|
| ringall | Ring all members simultaneously | Small teams, fastest answer |
| rrmemory | Round-robin, remembers last called | Equal distribution |
| linear | Always starts from first member | Priority-based (use with PJSIP only) |
| leastrecent | Rings member who was called least recently | Fair distribution |
| fewestcalls | Rings member with fewest completed calls | Load balancing |
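The difference between linear and rrmemory in this scenario can be illustrated with a deliberately simplified model. This is not Asterisk's implementation: it assumes every member looks available on every attempt (as described above for Local/ members) and ignores penalties. The picker function names are hypothetical.

```javascript
// Simplified model, not Asterisk's code: every member is assumed to be
// available on each attempt, as with Local/ members without device state.
function makeLinearPicker(members) {
  return () => members[0]; // always restarts from the first member
}

function makeRrmemoryPicker(members) {
  let last = -1; // position of the member rung most recently
  return () => {
    last = (last + 1) % members.length;
    return members[last];
  };
}

const queueMembers = ['1001', '1002', '1003'];
const pickLinear = makeLinearPicker(queueMembers);
const pickRrmemory = makeRrmemoryPicker(queueMembers);

console.log([pickLinear(), pickLinear(), pickLinear()]);
console.log([pickRrmemory(), pickRrmemory(), pickRrmemory(), pickRrmemory()]);
```

In this model the linear picker keeps returning the first member, while the rrmemory picker advances through the list and wraps around.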
Error 20: Auto-Tickets Created for Unanswered Outbound AI Bot Calls¶
Symptom¶
Tickets with category missed_call appear in the editor's Tickets tab for phone numbers the system itself just dialled out to via the AI bot. The ticket creation timestamp lines up exactly with a workflow-engine scheduled job firing that originated an outbound AI welcome call.
Root Cause¶
For every AI bot outbound call, two CDR rows end up in asterisk_cdr:
- The manual row AstraPBX inserts at server.js:3741 (dcontext='ai-outbound', disposition='ANSWERED' hardcoded). Filterable.
- The Asterisk-auto-generated row from cdr_adaptive_odbc for the actual PJSIP outbound dial. This second row looks identical to a real inbound call: src=<customer phone>, dst='s', dcontext='org_xxx__incoming', disposition='NO ANSWER', channel PJSIP/org_xxx_trunk_xxx-000000XX.
The CDR poller's old direction logic at /opt/astrapbx/src/server.js:4297 was:
That matched both real inbound calls AND the auto-generated outbound bot ringback row. The poller forwarded the row to events.astradial.com/auto-ticket/{org_id} with direction: "inbound". LogsUpdate's classifier then saw disposition=NO ANSWER plus a non-queue dcontext and created a missed_call ticket.
Diagnosis¶
Query the asterisk_cdr table for the suspicious phone number:
ssh root@89.116.31.109 \
"mysql -upbx_api -ppbx_secure_password pbx_api_db -e \"
SELECT id, calldate, src, dst, dcontext, channel, disposition
FROM asterisk_cdr
WHERE (src='<phone>' OR dst='<phone>')
AND calldate > '<today>'
ORDER BY calldate DESC LIMIT 10\\G\""
You should see two rows: one with dcontext='ai-outbound', disposition='ANSWERED', dst=<customer> (the real outbound) and one with dcontext='*_incoming', disposition='NO ANSWER', src=<customer> (the bug-triggering ringback row).
Fix¶
In the CDR poller pollCdr() function, build a set of phone numbers that received an ai-outbound row in the same poll batch, and skip any inbound-looking row whose src matches one. The set is built from allRows (pre-dedup) so the manual ai-outbound row isn't eaten by the linkedid dedup step.
const outboundBotDests = new Set();
for (const r of allRows) {
if ((r.dcontext || '') === 'ai-outbound' && r.dst) {
outboundBotDests.add(String(r.dst).replace(/\D/g, ''));
}
}
// in the per-row classification loop:
const srcDigits = String(r.src || '').replace(/\D/g, '');
if (srcDigits && outboundBotDests.has(srcDigits)) {
console.log('CDR poll: skip ' + r.id + ' — matches outbound bot dest ' + srcDigits);
continue;
}
Live in /opt/astrapbx/src/server.js:4275-4317 since 2026-04-10.
Prevention Rule¶
Whenever you add a new outbound-call code path that goes through the trunk (AI bots, click-to-dial, originate-from-API, etc.), think about whether Asterisk's cdr_adaptive_odbc will write a ringback row that looks inbound, and add a same-batch correlation guard.
Error 21: Queue Edits in Editor Don't Save Most Fields¶
Symptom¶
Open Edit Queue in the editor → set Greeting, Ring Sound, Periodic Announce Frequency, Service Level, etc. → click Save → success toast → reload the queue → those fields are blank again. Only Name, Strategy, Timeout, Retry, Music On Hold, Recording Enabled, and Active stick.
Root Cause¶
PUT /api/v1/queues/:id in /opt/astrapbx/src/server.js had a hand-curated allowedFields whitelist with only 7 entries. Everything else was silently filtered out before reaching queue.update(updateData). Affected dropped fields included greeting_id, periodic_announce, ring_sound, announce_frequency, announce_position, all queue_* audio prompts, autopause, service_level, timeout_destination, and ~25 others.
Diagnosis¶
ssh root@89.116.31.109 \
"mysql -upbx_api -ppbx_secure_password pbx_api_db -e \"
SELECT name, periodic_announce, greeting_id, ring_sound, announce_frequency
FROM queues WHERE id='<queue-id>'\\G\""
If the editor shows the field set but the DB row has it as NULL or default, the PUT endpoint dropped it.
Fix¶
Expand allowedFields to cover every editable column. Live in /opt/astrapbx/src/server.js:1876-1890 since 2026-04-10:
const allowedFields = [
'name', 'strategy', 'timeout', 'retry', 'music_on_hold', 'recording_enabled', 'active', 'status',
'max_wait_time', 'wrap_up_time', 'weight', 'max_callers', 'max_len',
'greeting_id', 'periodic_announce', 'periodic_announce_frequency',
'min_announce_frequency', 'relative_periodic_announce',
'ring_sound', 'announce_frequency', 'announce_holdtime',
'announce_position', 'announce_position_limit', 'announce_round_seconds',
'autopause', 'autopausedelay', 'autopausebusy', 'autopauseunavail',
'service_level', 'timeoutpriority', 'memberdelay',
'join_empty', 'leave_when_empty', 'ring_inuse', 'ringinuse', 'reportholdtime',
'queue_youarenext', 'queue_thereare', 'queue_callswaiting', 'queue_holdtime',
'queue_minutes', 'queue_seconds', 'queue_thankyou', 'queue_reporthold',
'timeout_destination', 'timeout_destination_type'
];
There's also a separate greeting_id → periodic_announce resolution step in the same endpoint: when greeting_id is provided, look up greetings.audio_file and write greetings/<basename> (without extension) to periodic_announce. This is what the Asterisk dialplan generator actually reads.
Prevention Rule¶
Don't use a Pick<>-style allowlist for any "edit this DB row" endpoint unless you generate it from the model definition. Hand-maintained whitelists silently rot as schemas grow. Future fix: switch to Object.keys(Queue.rawAttributes).filter(...) or similar, with a small explicit blocklist (id, org_id, created_at, updated_at).
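The future fix mentioned above could look roughly like this. It is a sketch under stated assumptions: the Queue object below is a stand-in with a handful of columns, not the real model, and editableFields and pickUpdates are hypothetical helper names. In Sequelize the attribute map would come from the model's rawAttributes.

```javascript
// Sketch of the "future fix": derive editable fields from the model instead
// of a hand-maintained list. Only the blocklist is maintained by hand.
const BLOCKED_FIELDS = new Set(['id', 'org_id', 'created_at', 'updated_at']);

function editableFields(model) {
  return Object.keys(model.rawAttributes).filter((f) => !BLOCKED_FIELDS.has(f));
}

// Keep only the request body keys that map to editable columns.
function pickUpdates(model, body) {
  const allowed = new Set(editableFields(model));
  return Object.fromEntries(Object.entries(body).filter(([k]) => allowed.has(k)));
}

// Stand-in model with a few columns, for illustration only.
const Queue = {
  rawAttributes: {
    id: {}, org_id: {}, name: {}, greeting_id: {}, ring_sound: {},
    created_at: {}, updated_at: {},
  },
};

console.log(pickUpdates(Queue, { id: 'x', name: 'Front Office', ring_sound: 'ring1', bogus: 1 }));
```

With this approach a new column becomes editable as soon as it is added to the model, and unknown or protected keys are still dropped.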
Error 22: Queue Changes Don't Take Effect Until astrapbx Restart¶
Symptom¶
Save a queue change in the editor → success toast → on-disk /etc/asterisk/{ext,queues}_<org>.conf files are updated → but the live Asterisk dialplan and queue state still show the old configuration. New greetings don't play, new members don't ring, removed members are still in the live queue. Running pm2 restart astrapbx "fixes" it temporarily because Asterisk reloads on its own when configs are written during init.
Root Cause — Two Separate Bugs¶
1. deployOrganizationConfiguration() writes files but never tells Asterisk to reload them. The function regenerates pjsip_<org>.conf, ext_<org>.conf, and queues_<org>.conf, then returns. No dialplan reload, no queue reload all. Asterisk's in-memory config keeps the old version forever.
2. Even after adding core reload, queue static members didn't update. asterisk -rx "core reload" reloads dialplan, MOH, PJSIP, etc., but in testing it did not make app_queue.so pick up new member => lines from queues.conf. You need an explicit module reload app_queue.so (equivalent: queue reload all).
persistentmembers=yes in [general] makes this worse: dynamic members added via queue add member get persisted in astdb and survive across reloads, so the queue looks like it has members (the stale ones) and you don't notice the static members weren't loaded.
Diagnosis¶
# 1. Compare on-disk file to live dialplan — if file has changes the live doesn't, reload is broken
ssh root@89.116.31.109 'grep -A20 "5002 - Front Office" /etc/asterisk/ext_grandestancia.conf'
ssh root@89.116.31.109 'asterisk -rx "dialplan show 5002@org_mnd5khym__queue"'
# 2. Compare DB queue members to live queue members
ssh root@89.116.31.109 "mysql ... -e 'SELECT * FROM queue_members WHERE queue_id=\"<id>\"'"
ssh root@89.116.31.109 'asterisk -rx "queue show org_mnd5khym_5002"'
# 3. If they diverge, force-reload
ssh root@89.116.31.109 'asterisk -rx "module reload app_queue.so"'
ssh root@89.116.31.109 'asterisk -rx "queue show org_mnd5khym_5002"' # should now match DB
Fix¶
Two changes in /opt/astrapbx/src/:
(a) services/asterisk/configDeploymentService.js:524 — reloadAsteriskConfiguration() now does both reloads:
await execAsync('asterisk -rx "core reload"');
await execAsync('asterisk -rx "module reload app_queue.so"');
(b) server.js — every queue PUT/POST/DELETE-member endpoint now calls await configDeploymentService.reloadAsteriskConfiguration() immediately after deployOrganizationConfiguration(). Three call sites: PUT queue (line 1942), POST member (line 2047), DELETE member (line 2105).
Prevention Rule¶
Whenever you add a new endpoint that modifies any DB row whose value ends up in an Asterisk config file, the chain MUST be: DB write → regenerate config files → reload Asterisk modules. The reload step is non-optional. Document and code-review for it.
For queue-specific changes, always explicitly reload app_queue.so in addition to core reload. Verified 2026-04-10 — core reload alone does NOT cover queues.conf static members.
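To keep the two reloads together, the command list can be produced in one place. This is a minimal sketch, not the code in configDeploymentService.js; reloadCommands and its queuesChanged option are hypothetical, and the default assumes queue config may have changed.

```javascript
// Hypothetical helper: returns the shell commands to run, in order, after
// the config files have been regenerated.
function reloadCommands({ queuesChanged = true } = {}) {
  const cli = ['core reload'];
  // core reload alone did not pick up static members from queues.conf,
  // so the queue module is reloaded explicitly.
  if (queuesChanged) cli.push('module reload app_queue.so');
  return cli.map((cmd) => `asterisk -rx "${cmd}"`);
}

console.log(reloadCommands());
```

Each returned string can be passed to execAsync in sequence, as in the fix shown above.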
Error 23: Queue Member Add/Remove Shows "Failed" Toast Despite Succeeding¶
Symptom¶
Click the X next to a queue member in the editor → toast says "Failed: Unexpected end of JSON input" or just "Failed" → but if you refresh the page, the member is actually gone. Same with "+ Add member..." in some flows: success in DB, error in UI. Users think the editor is broken.
Root Cause — Two Bugs Stacked¶
1. pbx/client.ts request helper called res.json() on every response. The DELETE member endpoint at /api/v1/queues/:queueId/members returns 204 No Content (res.status(204).send()). Calling .json() on a 204 response throws SyntaxError: Unexpected end of JSON input. The frontend's catch block fires showToast("Failed", "error") even though the DB delete succeeded.
2. The POST member endpoint had a body-shape mismatch. Frontend sent { user_ids: ["uuid"] } (plural array), backend destructured const { user_id, penalty } = req.body (singular). Backend read user_id = undefined, the User lookup failed, returned 400 {"error":"Invalid user"}. Frontend showed the toast but the row was never created — actually a real failure, not a 204 false positive.
Diagnosis¶
For Bug 1 (false positive on 204): open the browser DevTools Network tab, repeat the failing action, look for a DELETE request that returns 204 followed by a frontend-only error in the Console.
For Bug 2 (real POST failure): hit the endpoint directly with curl and the frontend-style payload to confirm the 400:
curl -X POST 'https://devpbx.astradial.com/api/v1/queues/<id>/members' \
-H 'X-API-Key: org_xxx' -H 'Content-Type: application/json' \
-d '{"user_ids":["<uuid>"]}'
# Old behavior: 400 {"error":"Invalid user"}
# New behavior: 201 {"created":[...]}
Fix¶
(a) Frontend — /opt/pipecat-flow-editor/lib/pbx/client.ts:37:
async function request<T>(path: string, opts: RequestInit = {}): Promise<T> {
  const res = await fetch(`${BASE}${path}`, { ...opts, headers: headers() });
  if (!res.ok) { /* throw */ }
  if (res.status === 204) return undefined as unknown as T; // ← NEW
  return res.json();
}
(b) Backend — /opt/astrapbx/src/server.js:1986 (POST member). Accept both user_id (single, legacy) and user_ids (array, batch). Validate all in one query, skip already-existing members instead of erroring out the whole batch, deploy + reload config exactly once at the end. Returns the single member object for legacy single calls (back-compat) or {created, skipped} for batch calls.
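The "accept both shapes" step in (b) can be sketched as a small normaliser that runs before any validation. This is a minimal illustration, not the code in server.js: the names `AddMemberBody`, `NormalizedAddMember` and `normalizeAddMemberBody` are hypothetical, and de-duplicating the batch is an assumption about the desired behaviour.

```typescript
// Hypothetical request-body normaliser; not the code in server.js.
interface AddMemberBody {
  user_id?: string;    // legacy single form
  user_ids?: string[]; // batch form sent by the editor
  penalty?: number;
}

interface NormalizedAddMember {
  userIds: string[];
  batch: boolean; // true when the caller used the plural user_ids form
}

function normalizeAddMemberBody(body: AddMemberBody): NormalizedAddMember {
  if (Array.isArray(body.user_ids)) {
    // Drop duplicates so one user is not validated or inserted twice (assumption).
    const unique = body.user_ids.filter(
      (id, index, all) => all.indexOf(id) === index
    );
    if (unique.length === 0) throw new Error("user_ids must not be empty");
    return { userIds: unique, batch: true };
  }
  if (typeof body.user_id === "string" && body.user_id !== "") {
    return { userIds: [body.user_id], batch: false };
  }
  throw new Error("Expected user_id or user_ids");
}
```

The `batch` flag is what would let the handler choose between the legacy single-object response and the {created, skipped} response described above.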
(c) UI dropdown reset — /opt/pipecat-flow-editor/app/dashboard/[orgId]/queues/page.tsx:599 — added a key prop on the "+ Add member..." <Select> so it remounts after each successful add and the placeholder returns:
Without that, the uncontrolled Select held the just-picked value internally and refused to fire onValueChange on the next click, blocking sequential adds.
Prevention Rule¶
Any HTTP client wrapper that handles arbitrary endpoints must branch on Content-Length: 0 or 204 No Content before calling .json(). Same applies to error responses with no body. Recommended pattern:
if (res.status === 204 || res.headers.get('content-length') === '0') {
return undefined as unknown as T;
}
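A header check alone can miss an empty body, because a response may arrive without a Content-Length header (for example under chunked transfer encoding). A slightly more defensive variant, sketched below, reads the body as text first and only parses when there is something to parse. `parseJsonBody` is a hypothetical helper name, not part of pbx/client.ts.

```typescript
// Hypothetical helper; takes the status and the already-read body text.
function parseJsonBody(status: number, bodyText: string): unknown {
  // 204 No Content and 205 Reset Content never carry a body.
  if (status === 204 || status === 205) return undefined;
  // An empty or whitespace-only body is treated as "no value".
  if (bodyText.trim() === "") return undefined;
  // Anything else must be valid JSON; a malformed body still throws here.
  return JSON.parse(bodyText);
}
```

Inside a fetch wrapper this would be called as `parseJsonBody(res.status, await res.text())`, which keeps the parsing rule testable without a network call.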
For request/response shape mismatches, write a one-liner curl test against the deployed endpoint as part of any frontend client change. The toast saying "Failed" is not enough information — always check the actual HTTP status and body in DevTools Network tab.
Error 24: VPS Path is /opt/astrapbx (lowercase), not /opt/AstraPBX¶
Symptom¶
Following docs/guides/deploy-apps.md, ssh root@89.116.31.109 'ls /opt/AstraPBX/src/' returns "No such file or directory". Searching for code with grep -r "..." /opt/AstraPBX/ yields nothing.
Root Cause¶
The directory on the VPS is /opt/astrapbx (all lowercase). The doc had it as /opt/AstraPBX (mixed case), and Linux paths are case-sensitive. PM2 confirms via pm2 info astrapbx | grep "script path" → /opt/astrapbx/src/server.js.
Fix¶
Always use /opt/astrapbx. The deploy guide is being updated to reflect this. Until then, treat the case in any doc as untrustworthy and verify with:
Prevention Rule¶
Don't trust path documentation — always verify against the running process. Add a CI check or doc-test that reads pm2 info and asserts the documented paths exist.
Error 25: NUC tata-endpoint CHANUNAVAIL After Asterisk Restart¶
Symptom¶
After restarting NUC Asterisk, all outbound calls via Dial(PJSIP/...@tata-endpoint) return CHANUNAVAIL. Inbound calls from Tata still work. Contact shows NonQual.
Root Cause¶
Asterisk 22 with qualify_frequency=0 and max_contacts=0 on the tata-aor AOR results in the static contact being NonQual forever after restart. The Dial() application requires the contact to be Available to create an outbound channel. Inbound works because Tata initiates the SIP dialog.
Diagnosis¶
sudo asterisk -x "pjsip show contacts" | grep tata
# Shows: tata-aor/sip:10.79.215.102:5060 ... NonQual -nan
# Must show "Avail" for outbound to work
sudo asterisk -x "channel originate PJSIP/919944421125@tata-endpoint application Wait 10"
# Returns immediately with 0 active channels = CHANUNAVAIL
Fix¶
Enable qualify and set max_contacts on the NUC's tata-aor in /etc/asterisk/pjsip.conf:
[tata-aor]
type=aor
contact=sip:10.79.215.102:5060
qualify_frequency=30 ; was 0 — must be >0 for Avail status
max_contacts=1 ; was 0 — required for outbound channel creation
Reload: sudo asterisk -x "module reload res_pjsip.so" and wait 30 seconds for qualify.
Prevention Rule¶
Any PJSIP AOR that needs outbound dialing MUST have qualify_frequency > 0 and max_contacts >= 1. Without these, outbound Dial() silently fails with CHANUNAVAIL after any Asterisk restart. This is an Asterisk 22 regression — older versions were more lenient.
Error 26: Staging Outbound Calls — Wrong Number Format for Tata NNI¶
Symptom¶
Staging outbound calls reach NUC from-cloud context but Tata returns 403 Forbidden or the call makes progress then drops after 20 seconds.
Root Cause¶
Tata NNI expects specific number and CallerID formats:
| What | Wrong format | Correct format |
|---|---|---|
| CallerID | CALLERID(num)=+91${CALLERID(num):1} → +91918065978001 (double country code) | CALLERID(all)=+918065978001 |
| Dial number | PJSIP/919944421125@tata-endpoint (with 91 prefix) | PJSIP/09944421125@tata-endpoint (with 0 prefix = local format) |
Using 91 prefix: Tata SBC accepts the INVITE (100 Trying) but can't route it → drops after 20s. Using 0 prefix: Tata routes successfully → phone rings.
Fix¶
NUC /etc/asterisk/extensions.conf [from-cloud]:
[from-cloud]
exten => _X.,1,NoOp(Cloud Outbound via Tata: ${EXTEN} CID: ${CALLERID(all)})
same => n,Set(CALLERID(all)=+918065978001)
same => n,Dial(PJSIP/0${EXTEN}@tata-endpoint,60)
same => n,NoOp(Tata dial status: ${DIALSTATUS})
same => n,Hangup()
Key points:
- Use CALLERID(all), not CALLERID(num): it sets both name and number atomically.
- Hardcode a valid Tata DID as CallerID.
- Use the 0${EXTEN} prefix for local dialing format (what Tata NNI expects for outbound).
Prevention Rule¶
Always test outbound number formats using NUC's testonly-outbound context first. Check DIALSTATUS in the log — CHANUNAVAIL means AOR/qualify issue, busy/congested means format rejected.
Error 27: rsync --delete Removes Server-Side Files on Deploy¶
Symptom¶
After CI/CD deploy, firebase-sa-key.json disappears from the API server. Firebase auth breaks with "Invalid Firebase token". Editor .env files also removed.
Root Cause¶
The CI/CD workflow uses rsync -a --delete which removes files on the destination that don't exist in the source. Server-only files (firebase-sa-key.json, .env, .env.local, recordings) are not in git and get deleted.
Fix¶
Exclude all server-side files in rsync:
# API deploy
rsync -a --delete \
--exclude=node_modules \
--exclude=.git \
--exclude=.env \
--exclude=.env.local \
--exclude='*.log' \
--exclude=firebase-sa-key.json \
--exclude=recordings/ \
api/ /opt/astrapbx/
# Editor deploy
rsync -a --delete \
--exclude=node_modules \
--exclude=.git \
--exclude=.next \
--exclude=.env \
--exclude=.env.local \
editor/ /opt/pipecat-flow-editor/
Prevention Rule¶
Every rsync --delete in a CI/CD workflow must have an --exclude for every file that exists only on the server. Keep a checklist: .env, .env.local, firebase-sa-key.json, recordings/, any uploaded media. Test deploys on staging first.
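The checklist can be enforced mechanically. The sketch below compares the --exclude arguments of an rsync invocation against a list of server-only paths. It is a hypothetical CI guard, not part of the existing workflow: `SERVER_ONLY` and `missingExcludes` are illustrative names, and the comparison is plain string equality rather than rsync's own pattern matching.

```typescript
// Hypothetical CI guard; the entries come from the checklist above.
const SERVER_ONLY = [".env", ".env.local", "firebase-sa-key.json", "recordings/"];

// Returns the server-only paths that the rsync arguments do not exclude.
function missingExcludes(rsyncArgs: string[], serverOnly: string[]): string[] {
  const prefix = "--exclude=";
  const excluded: string[] = [];
  for (const arg of rsyncArgs) {
    if (arg.indexOf(prefix) === 0) {
      // Strip optional shell quotes, e.g. --exclude='*.log'
      excluded.push(arg.slice(prefix.length).replace(/^['"]|['"]$/g, ""));
    }
  }
  return serverOnly.filter((path) => excluded.indexOf(path) === -1);
}
```

A deploy step could fail the build whenever the returned list is non-empty, before rsync runs.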
Error 28: request-org Self-Serve Crashing with notNull Violation¶
Date: 2026-04-17
Severity: High — blocked every client self-serve org signup
Symptoms¶
POST /api/v1/auth/request-org returns 500. pm2 logs show:
request-org error: notNull Violation: Organization.context_prefix cannot be null,
notNull Violation: Organization.api_key cannot be null
50+ of these over an hour on prod (at least 50 failed signups).
Root Cause¶
The self-serve handler's Organization.create() was passing only name, status, api_secret, contact_info — missing context_prefix and api_key which are allowNull: false on the model.
Fix¶
Added the missing fields:
context_prefix: generateContextPrefix(),
api_key: `org_${uuidv4().replace(/-/g, '')}`,
domain: `${org_name.toLowerCase().replace(/[^a-z0-9]/g, '')}.local`,
Prevention¶
Every Organization.create() path must include all notNull fields. Add a unit test that hits both auth/request-org and POST /organizations with minimal bodies.
Error 29: Admin create-org Leaves Org Unusable (no org_users owner)¶
Date: 2026-04-17
Symptoms¶
Admin creates an org via UI → org appears in admin list → client can't log in with their email → Firebase auth returns "User not found".
Root Cause¶
POST /api/v1/organizations (admin flow) created the organisation + attempted a SIP users row but never inserted an org_users row. The org_users table is what the editor queries for Firebase login (via /api/v1/auth/user-login). No owner row → no login possible.
The SIP user creation also failed silently with "password_hash cannot be null" because the handler set password instead of password_hash and missed asterisk_endpoint.
Fix¶
After creating the organisation, insert an org_users row with role='owner', email=contact_info.email, status='active'. Also fixed the SIP user creation to hash the password and set asterisk_endpoint=<context_prefix>1001.
Prevention¶
Pair admin-created orgs with at least one owner login. Optionally send the owner an invite email with the Firebase sign-up link.
Error 30: Duplicate DID Records — Same Number, Different Formats/Orgs¶
Date: 2026-04-17
Symptoms¶
Inbound calls to a DID land on the wrong org. For +91 80659 78002, the call went to TechStart instead of GrandEstancia.
Root Cause¶
did_numbers had two rows for the same physical number:
- 08065978002 → GrandEstancia (Indian local format, created later)
- +918065978002 → TechStart (E.164, created earlier by a test setup)
Tata sends inbound as 918065978002, so the dispatcher matched the TechStart record first.
Fix¶
Deleted the duplicate row (the TechStart one). Extended the dispatcher generator to emit format aliases (08... AND 91...) for every Indian DID so customers dialing either way reach the same org.
Prevention¶
Enforce a canonical format when inserting DIDs. Suggested: strip all non-digit characters on insert, then replace a leading 0 with +91 (so 08065978002 becomes +918065978002). The generator handles both formats for display compatibility.
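A minimal sketch of that canonicalisation, assuming only the number shapes that appear on this page (0-prefixed local, 91-prefixed with or without +, optional spaces). `canonicalizeIndianDid` and `didAliases` are hypothetical names; anything outside those shapes is rejected rather than guessed.

```typescript
// Hypothetical helpers; only the formats seen on this page are handled.
function canonicalizeIndianDid(raw: string): string {
  const digits = raw.replace(/\D/g, ""); // drops "+", spaces, dashes
  if (digits.length === 11 && digits.charAt(0) === "0") {
    return "+91" + digits.slice(1); // 08065978002 -> +918065978002
  }
  if (digits.length === 12 && digits.slice(0, 2) === "91") {
    return "+" + digits; // 918065978002 -> +918065978002
  }
  throw new Error("Unrecognised DID format: " + raw);
}

// The two dial-in aliases for one canonical DID (0-prefixed and 91-prefixed).
function didAliases(canonical: string): string[] {
  const national = canonical.slice(3); // strip "+91"
  return ["0" + national, "91" + national];
}
```

With every insert passing through one function, two rows for the same physical number collapse to the same string, so a unique index on the column would reject the duplicate.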
Error 31: Queue Calls Fall Through to AI Agent Without Ringing Members¶
Date: 2026-04-17
Severity: High — real customer inbound calls to GE's reservation queue going to AI instead of agents
Symptoms¶
Customer dials +91 80659 78002 → hears silence briefly → routed to AI voice agent (ext 1003) instead of the human reservation queue (ext 5003). No MOH during wait, no ring to agents.
Root Cause¶
The generated ext_grandestancia.conf queue 5003 section had no Answer() before Queue(). Queue 5001/5002 worked because they had greetings and therefore got Answer(). Queues without greetings (5003, 5004, 5005, 5006) skipped it.
Without Answer(), Queue() runs on an unanswered channel:
- Caller hears upstream ringback, no MOH
- Queue announces "your hold time is X" to the member when they pick up (the leg isn't properly bridged)
- Member hangs up because they just hear the announce
- Queue fails → fallback routes to the configured timeout destination (ext 1003 = AI Agent)
Fix¶
Dialplan generator now emits Answer() + Wait(0.5) unconditionally before Queue():
exten => 5003,n,Set(CHANNEL(musicclass)=org_mnd5khym__default)
exten => 5003,n,Answer() ; always
exten => 5003,n,Wait(0.5) ; settle before Queue() runs
exten => 5003,n,Queue(org_mnd5khym_5003,ct,,,45)
Hot-fixed prod live, then committed to main so future config regenerations preserve it.
Prevention¶
Always Answer() before Queue() regardless of greeting. If a greeting exists, play it after the answer. Asterisk will not play MOH or ring members properly on an unanswered channel.
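The rule can be sketched as a generator fragment in which Answer() is emitted before anything else and the greeting is the only optional step. This is an illustration under assumptions, not the code in dialplanGenerator.js: `queueLines` is a hypothetical name, Playback() is assumed as the greeting step, and the Queue() options are copied from the example above.

```typescript
// Hypothetical generator fragment; lines use priority "n" and are meant to
// follow an existing priority-1 line for the same extension.
function queueLines(exten: string, queueName: string, greetingFile?: string): string[] {
  const lines: string[] = [
    `exten => ${exten},n,Answer()`,  // always, greeting or not
    `exten => ${exten},n,Wait(0.5)`, // settle before Queue() runs
  ];
  if (greetingFile) {
    // The greeting plays after the answer, never instead of it.
    lines.push(`exten => ${exten},n,Playback(${greetingFile})`);
  }
  lines.push(`exten => ${exten},n,Queue(${queueName},ct,,,45)`);
  return lines;
}
```

Because Answer() sits outside the greeting branch, a queue without a greeting can no longer skip it.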
Error 32: Staging Inbound Forwarding Failed — PJSIP "no auth ids available"¶
Date: 2026-04-17
Symptoms¶
After the monorepo cutover, calls to +91 80659 78001 (routing_environment='staging') were reaching prod's dispatcher and getting Dial(PJSIP/918065978001@cloud-endpoint-stage) — but ringing then failing with:
ERROR[...] res_pjsip_outbound_authenticator_digest.c:
cloud-endpoint-stage:10.10.10.3: There were no auth ids available
Root Cause¶
Staging's tata_gateway_identify PJSIP rule only matched 10.10.10.2 (NUC). When prod started forwarding from 10.10.10.1 (its WireGuard IP), staging didn't recognise the source → treated the INVITE as unauthenticated.
Fix¶
Added match=10.10.10.1 to staging's tata_gateway_identify in /etc/asterisk/pjsip_tata_gateway.conf:
[tata_gateway_identify]
type=identify
endpoint=tata_gateway
match=10.10.10.2
match=10.10.10.1 ; prod cloud can now forward calls in
Reload with pjsip reload.
Prevention¶
Any new source IP that forwards calls into a staging/prod Asterisk must be added to the relevant identify rule. Document source IPs per tunnel.
Error 33: NUC Clobbers Outbound Caller ID to 78001¶
Date: 2026-04-17
Severity: High — customer-visible (recipients saw wrong caller ID for GE outbound)
Symptoms¶
Customers receiving calls from GrandEstancia's agents saw +91 80659 78001 (Hari Surya's number) instead of GE's 08065978002. This had been happening since the monorepo cutover.
Root Cause¶
The NUC's from-cloud context validates the incoming caller ID against our owned Tata range 918065978000-029. If the CID is outside that range, it falls back to +918065978001 (the platform default).
GE's dialplan set CALLERID(num)=08065978002 (Indian local format, 11 digits starting 0). NUC's range check was:
→ fell through to default +918065978001.
Fix¶
Added a format normalisation step BEFORE the range check in NUC's from-cloud:
exten => _X.,n,GotoIf($["${LEN(${IN_CID})}" = "11" & "${IN_CID:0:1}" = "0"]?cid_normalize:cid_skip_norm)
exten => _X.,n(cid_normalize),Set(IN_CID=91${IN_CID:1})
exten => _X.,n(cid_skip_norm),GotoIf(...) ; existing range check
Now 08065978002 → 918065978002 → in range → passes as +918065978002. Recipients see the right number.
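The order of the two steps is what matters: normalise first, then range-check. The sketch below mirrors that logic outside the dialplan so it can be unit-tested. It is illustrative only: the original range check is not shown on this page, so the numeric comparison is an assumption about how it is expressed, and `resolveOutboundCid` is a hypothetical name.

```typescript
// Owned Tata block 918065978000-029 and the platform default, as described above.
const RANGE_START = 918065978000;
const RANGE_END = 918065978029;
const DEFAULT_CID = "+918065978001";

function resolveOutboundCid(inCid: string): string {
  let cid = inCid;
  // Step 1: local format (11 digits, leading 0) becomes 91 + national number.
  if (cid.length === 11 && cid.charAt(0) === "0") {
    cid = "91" + cid.slice(1);
  }
  // Step 2: range check; non-numeric input becomes NaN and fails both comparisons.
  const n = /^\d+$/.test(cid) ? Number(cid) : NaN;
  if (n >= RANGE_START && n <= RANGE_END) return "+" + cid;
  return DEFAULT_CID;
}
```

Swapping the two steps reproduces the bug: 08065978002 would be compared as-is, fall outside the range, and come back as the default.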
Prevention¶
Outbound CALLERID should ideally always be in E.164 format before it reaches the NUC. The NUC is defence-in-depth. Long-term: normalise CID in the per-user dialplan generator so NUC never has to translate.
Error 34: Admin UI "Environment" Dropdown Silently Ignored¶
Date: 2026-04-17
Symptoms¶
Admin selects Prod / Staging / OSS in the DID Management /admin/dids dropdown → UI shows success → nothing changes in the dispatcher → reload the page and the value reverts.
Root Cause¶
PUT /api/v1/did-pool/admin/:id has an explicit allowed-fields list:
routing_environment was missing → the PUT returned 200 but never persisted the field.
Fix¶
Added routing_environment to the allowed list. Pattern note: every new column that admin UIs can edit must be explicitly allowed in this endpoint's filter.
Prevention¶
When adding a new editable column, grep for the model name in route files and add the column to all allowed-field filters. Consider deriving the allowed list from the model schema automatically.
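One way to make this class of bug visible is to have the filter report what it dropped instead of discarding it silently. The sketch below is a hypothetical helper, not the existing endpoint code: `filterAllowedFields` and `FilterResult` are illustrative names, and the field names in the usage note other than routing_environment are made up.

```typescript
interface FilterResult {
  updates: { [field: string]: unknown }; // fields that may be persisted
  rejected: string[];                    // fields the caller sent but the list does not allow
}

function filterAllowedFields(
  body: { [field: string]: unknown },
  allowed: string[]
): FilterResult {
  const result: FilterResult = { updates: {}, rejected: [] };
  for (const key of Object.keys(body)) {
    if (allowed.indexOf(key) !== -1) {
      result.updates[key] = body[key];
    } else {
      result.rejected.push(key);
    }
  }
  return result;
}
```

A handler could then answer 400 (or at least log a warning) whenever `rejected` is non-empty, so a call such as `filterAllowedFields({ routing_environment: "staging" }, ["notes"])` surfaces the missing column immediately rather than returning 200.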
Error 35: Dispatcher Generator Required an Org for Every DID¶
Date: 2026-04-17
Symptoms¶
To route a staging DID (e.g., +91 80659 78001) from prod to staging, prod's DB needed an org that owned the DID. This forced us to keep shell/test orgs on prod just to preserve inbound flow to staging — couldn't clean up prod's DB.
Root Cause¶
Generator loop skipped DIDs without an organization include:
Fix¶
Collect orphan DIDs separately and emit Dial(PJSIP/<did>@cloud-endpoint-stage) for those marked routing_environment='staging'. Non-staging orphans are logged as a warning and skipped.
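The orphan handling described above can be sketched as a single pass that separates forwardable staging DIDs from everything else. This is a simplified illustration, not the generator itself: `DidRow`, `OrphanPlan`, `planOrphanDids` and the `organizationId` field are hypothetical, and the real generator emits more than a bare Dial() line.

```typescript
interface DidRow {
  number: string;                // digits as delivered inbound, e.g. "918065978001"
  organizationId: string | null; // null = orphan (no owning org on this env)
  routing_environment: string;
}

interface OrphanPlan {
  dialLines: string[];
  warnings: string[];
}

function planOrphanDids(dids: DidRow[]): OrphanPlan {
  const plan: OrphanPlan = { dialLines: [], warnings: [] };
  for (const did of dids) {
    if (did.organizationId !== null) continue; // owned DIDs use the normal per-org path
    if (did.routing_environment === "staging") {
      plan.dialLines.push(`Dial(PJSIP/${did.number}@cloud-endpoint-stage)`);
    } else {
      plan.warnings.push(
        `Orphan DID ${did.number} skipped (routing_environment=${did.routing_environment})`
      );
    }
  }
  return plan;
}
```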
Prevention¶
Keep the generator tolerant of partial data. The alternative — requiring a complete org graph — couples prod's tenant list to staging's forwarding needs, which is the wrong direction of dependency.
Error 36: Cross-Org Call Data Leak via Unscoped ai-outbound Clause¶
Date: 2026-04-18
Severity: Critical (cross-tenant data leak)
Exposure: Platform-authenticated admin users of one tenant could see another tenant's AI-outbound call metadata (src, dst, time, duration, disposition). No external exposure.
Symptoms¶
Within minutes of approving a brand-new org ("Zauto AI", org_id 728e57ec-…, zero call activity), the owner's dashboard Recent Calls panel listed GrandEstancia's AI outbound calls (from 08065978002 to various 9xxxxxxxxx numbers, timestamps matching GE's actual activity).
Root Cause¶
GET /api/v1/calls and GET /api/v1/calls/history used this WHERE clause for org ownership:
The OR t.dcontext = 'ai-outbound' was unscoped — it matched AI-outbound rows regardless of which org originated them. Any authenticated org admin querying /calls received every AI-outbound CDR platform-wide.
Verified on the leaked rows: accountcode = ba50c665-… (GE's id), but the unconditional dcontext clause let them match Zauto's query.
Fix¶
Dropped the unscoped clause. AI-outbound rows are still matched via accountcode, which the workflow engine always sets on Originate:
PRs #54 (staging) + #55 (prod) — deployed 2026-04-18 10:53 UTC.
Verification¶
- curl /api/v1/calls?org_id=<zauto-id> → total: 0 (correct: brand-new org, no calls)
- curl /api/v1/calls?org_id=<ge-id> → total: 457 (GE's own calls, unchanged)
Prevention¶
- Any future CDR query must org-scope every OR branch in the WHERE clause. Never add a fallback that widens org ownership to "everyone for this dcontext/route/endpoint type".
- Follow-up audit required for: /api/v1/calls/:linkedId/journey (currently unscoped), /api/v1/calls/:callId/recording (scoped via CDR row lookup, but worth re-verifying), and any future endpoint touching asterisk_cdr.
- Consider a helper orgScopedCdrWhere(orgId, prefix) to centralise the predicate so it can't drift per-endpoint.
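A rough sketch of what such a helper might look like follows. The real WHERE clause is not reproduced on this page, so everything here is an assumption: `prefix` is taken to be the org's context prefix, the dcontext LIKE branch is hypothetical, the table alias t is borrowed from the clause quoted above, and backslash is assumed to be the LIKE escape character (the MariaDB default).

```typescript
interface ScopedWhere {
  sql: string;      // parameterised predicate
  params: string[]; // one value per "?" placeholder, in order
}

// Escape LIKE wildcards so "_" in a context prefix is matched literally.
function escapeLike(value: string): string {
  return value.replace(/[\\%_]/g, "\\$&");
}

// Every OR branch is bound to a value derived from this org;
// there is no branch that matches on dcontext or route type alone.
function orgScopedCdrWhere(orgId: string, prefix: string): ScopedWhere {
  return {
    sql: "(t.accountcode = ? OR t.dcontext LIKE ?)",
    params: [orgId, escapeLike(prefix) + "%"],
  };
}
```

The point of the design is that a new branch can only be added by also adding an org-derived parameter, which makes an unscoped fallback hard to introduce by accident.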
Error 37: Queue Save Returns 500 — timeout_destination_type: "phone" Not in DB ENUM¶
Date: 2026-04-21
Symptoms¶
PUT /api/v1/queues/:id returns 500. The queue form in the editor has "Phone Number" selected as the Timeout Destination type. Other queues save fine.
Root Cause¶
The editor's queue form offers "phone" as a timeout_destination_type option. The queues table ENUM only allowed ('extension','queue','ivr','external','hangup'). When the editor sent timeout_destination_type: "phone", Sequelize's queue.update() threw a DB ENUM constraint error, which the catch block converted to a 500 response.
The dialplan generator (dialplanGenerator.js:748) already handled "phone" correctly — only the DB schema was missing it.
Fix¶
ALTER TABLE queues MODIFY timeout_destination_type
ENUM('extension','queue','ivr','external','hangup','phone') NULL DEFAULT NULL;
Applied directly on prod MariaDB. No restart needed.
Prevention Rule¶
Whenever the editor adds a new ENUM value to any form dropdown, check that the corresponding DB ENUM column includes that value before shipping. The dialplan generator and API allowedFields must all agree with the DB schema.
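That check can be automated by comparing the form's option values against the column type string that MariaDB reports for the column (for example from SHOW COLUMNS). The sketch below is a hypothetical pre-ship test helper: `parseEnumType` and `missingEnumValues` are illustrative names, and the form option list in the usage note is an assumption.

```typescript
// Parse a MariaDB column type such as "enum('a','b')" into its values.
function parseEnumType(columnType: string): string[] {
  const match = /^enum\((.*)\)$/i.exec(columnType.trim());
  if (!match) throw new Error("Not an ENUM column type: " + columnType);
  const values: string[] = [];
  const valueRe = /'((?:[^']|'')*)'/g; // a doubled quote inside a value is an escaped quote
  let m: RegExpExecArray | null;
  while ((m = valueRe.exec(match[1])) !== null) {
    values.push(m[1].replace(/''/g, "'"));
  }
  return values;
}

// Returns the form options that the DB column would reject.
function missingEnumValues(formOptions: string[], columnType: string): string[] {
  const allowed = parseEnumType(columnType);
  return formOptions.filter((option) => allowed.indexOf(option) === -1);
}
```

Run against the pre-fix schema, `missingEnumValues(["extension", "phone"], "enum('extension','queue','ivr','external','hangup')")` reports the "phone" gap before any user hits the 500.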
Error 38: MOH Upload Returns 413 Request Entity Too Large¶
Date: 2026-04-21
Symptoms¶
POST /api/pbx/moh/upload returns 413. The MOH upload dialog in the editor fails immediately when a user selects an audio file.
Root Cause¶
The nginx client_max_body_size defaults to 1 MB when unset. Audio files for MOH (.wav, .mp3) are typically 2–10 MB. Nginx rejects the upload before the request reaches astrapbx.
Fix¶
Added client_max_body_size 50M; to the /api/pbx/ location block in /etc/nginx/sites-enabled/editor.astradial.com:
location ~ ^/api/pbx/(.*) {
client_max_body_size 50M;
set $upstream_uri /api/v1/$1;
proxy_pass http://127.0.0.1:8000$upstream_uri$is_args$args;
...
}
Reload: systemctl reload nginx
Prevention Rule¶
Any nginx location block that proxies file-upload endpoints must have an explicit client_max_body_size. Default 1 MB is never appropriate for audio, image, or document uploads.
Error 39: Greetings TTS Play Button Missing — audio_file is NULL¶
Date: 2026-04-21
Symptoms¶
A greeting is created successfully (shown in the list), but no play button appears. The greeting's audio_file column is NULL in the DB.
Root Cause¶
The POST /api/v1/greetings endpoint only created the DB record; it never called TTSService.saveGreetingAudio(). The TTSService existed at src/services/ttsService.js but was not wired into the creation endpoint.
The editor hides the play button with {g.audio_file && <Button ...>}, so it simply doesn't render for greetings without audio.
Fix¶
Updated POST /api/v1/greetings in server.js to call TTSService.saveGreetingAudio() immediately after creating the DB record:
audio_file = await tts.saveGreetingAudio(id, text, language, voice);
greeting = await Greeting.create({ id, ..., audio_file });
TTS failure is caught silently so the greeting is still created (without audio) if Google TTS is unavailable.
Also added:
- PUT /api/v1/greetings/:id: updates metadata; if text/voice/language changes, regenerates TTS audio
- DELETE /api/v1/greetings/:id: deletes DB record and audio file
Existing greetings with audio_file: NULL can be fixed by calling PUT with the same text to trigger a regeneration.
Prevention Rule¶
Any endpoint that creates a resource with an associated file must generate that file in the same request. Don't create the DB record first and rely on a follow-up step — if the follow-up fails, the record is permanently orphaned without an obvious error.
Error 40: IVR Greeting Silently Fails — Call Hangs Up with No Audio¶
Symptom¶
Inbound call to a DID routed to an IVR connects, but no greeting plays and the call drops after a few seconds.
Diagnosis¶
grep Background /etc/asterisk/ext_<org>.conf | head
# If you see: Background(greeting_ivr_<uuid>) ← WRONG (bare filename)
# Should be: Background(greetings/greeting_ivr_<uuid>)
ls /var/lib/asterisk/sounds/greetings/greeting_ivr_<uuid>.wav
# File must exist.
Root Cause¶
TTSService.saveGreetingAudio() writes to /var/lib/asterisk/sounds/greetings/<prompt>.wav, but Asterisk's Background() only searches the language subdir under astsoundsdir (e.g. /var/lib/asterisk/sounds/en/) — NOT the greetings/ subdir. A bare filename fails silently; Asterisk logs no error, just moves past.
Compound cause: if the user set greeting_text via the UI but "Generate greeting" failed or wasn't clicked, the DB has greeting_prompt set but no .wav file exists.
Fix¶
Fixed in api/src/services/asterisk/dialplanGenerator.js:569:
After deploying the code fix, regenerate each org's dialplan via configDeploymentService.deployOrganizationConfiguration(orgId, name).
To regenerate a missing .wav, run TTS directly or have the user click "Generate greeting" in the IVR UI:
Error 41: SIP Phone → IVR Extension Returns SIP 404¶
Symptom¶
Zoiper or other softphone registered against the PBX dials an IVR extension (e.g. 7002) and gets Not Found (code: 404). The IVR works for external callers via DID routing but not from internal SIP.
Diagnosis¶
asterisk -rx "dialplan show 7002@<prefix>__internal"
# If: "There is no existence of context 'X'" or only matches _X. wildcard,
# the _ivr context is not included.
grep -A 5 "^\[<prefix>__internal\]" /etc/asterisk/ext_<org>.conf
# Should show: include => <prefix>__ivr
Root Cause¶
generateInternalContext historically only included _outbound and _queue. IVR extensions only exist in _ivr, so internal SIP callers had no way to reach them.
Fix¶
Fixed in dialplanGenerator.js:114-120 — added include => <prefix>_ivr AND reordered includes so exact-match contexts come first (see Error 47). Regenerate org dialplan after deploy.
Error 42: pjsip Endpoint Rejected — Could not find option suitable for category¶
Symptom¶
Outbound calls fail with pjsip error. Asterisk log shows:
ERROR[...] config_options.c: Could not find option suitable for category
'<endpoint-name>' named 'system_trunk' at line N of pjsip_<org>.conf
ERROR[...] res_sorcery_config.c: Could not create an object of type
'endpoint' with id '<endpoint-name>' from configuration file 'pjsip.conf'
asterisk -rx "pjsip show endpoint <endpoint-name>" returns "Unable to find object". Any Dial(PJSIP/num@<endpoint-name>) dies.
Diagnosis¶
mariadb ... -e "SELECT name, configuration FROM sip_trunks WHERE asterisk_peer_name='<endpoint-name>'"
# If configuration contains non-pjsip keys like system_trunk, nuc_gateway,
# channels, max_channels, routing_environment, notes — that's the bug.
Root Cause¶
sipTrunkService.js previously splatted EVERY key from the sip_trunks.configuration JSON column verbatim as a pjsip endpoint option. When ops added metadata fields to that column (intended for admin UI display), pjsip rejected the whole endpoint because system_trunk=true is not a valid option.
Fix¶
Fixed in api/src/services/asterisk/sipTrunkService.js:109-131 — added a METADATA_KEYS deny list:
const METADATA_KEYS = new Set([
  'system_trunk', 'nuc_gateway', 'channels', 'max_channels',
  'routing_environment', 'notes',
]);
if (trunk.configuration && typeof trunk.configuration === 'object') {
  Object.entries(trunk.configuration).forEach(([key, value]) => {
    if (METADATA_KEYS.has(key)) return;
    config += `${key}=${value}\n`;
  });
}
After code deploy, regenerate the affected org's pjsip config and asterisk -rx "pjsip reload".
Error 43: Admin Impersonation — Users Empty, No Auto-Logout After 24h¶
Symptom¶
Admin impersonates an org, comes back 24h+ later. Dashboard loads but shows empty users list, zero call stats, no "Session expired" redirect. API calls return 401 silently.
Diagnosis¶
In browser devtools on editor.astradial.com:
localStorage.getItem('pbx_org_token_exp') // compare to Date.now() — expired?
localStorage.getItem('gateway_admin_key') // truthy = admin session active
JSON.parse(localStorage.getItem('org_access')).impersonating // true = admin impersonating
If all three signal impersonation-with-expired-JWT but no redirect fires, the watcher isn't handling this case.
Root Cause¶
AuthExpiryWatcher and handleUnauthorized both bailed when any admin key was present, ignoring the fact that an impersonating admin ALSO has a PBX JWT that expires in 24h. The impersonation JWT silently expired, 401s were swallowed by the catch handlers, and the UI showed empty data.
Fix¶
Fixed in PR #62:
- AuthExpiryWatcher.scheduleFromStorage() now schedules a timer whenever pbx_org_token exists, regardless of admin-key state.
- handleUnauthorized distinguishes three session types:
  - Normal org user → full logout + Firebase signOut.
  - Admin impersonating → clear ONLY impersonation state (pbx_org_token*, org_access, user_role, user_permissions), redirect to /dashboard. Keep Firebase + gateway_admin_key intact.
  - Pure admin (no JWT) → swallow 401 as before (admin auth uses a different mechanism).
Also added admin_session_start stamped at admin login + handleAdminSessionExpiry for a separate 24h admin-session auto-logout that DOES sign out of Firebase.
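The three-way distinction can be sketched as a pure function over the storage keys named above, which keeps it testable without a browser. This is not the code from PR #62: `classifySession`, `isOrgTokenExpired` and `StorageGetter` are hypothetical names, pbx_org_token_exp is assumed to hold an epoch timestamp in milliseconds, and treating a missing expiry as "not expired" is a design choice of this sketch.

```typescript
type SessionKind = "org-user" | "impersonating-admin" | "pure-admin";
type StorageGetter = (key: string) => string | null; // e.g. (k) => localStorage.getItem(k)

function classifySession(get: StorageGetter): SessionKind {
  const hasAdminKey = !!get("gateway_admin_key");
  const hasOrgToken = !!get("pbx_org_token");
  if (hasAdminKey && hasOrgToken) return "impersonating-admin"; // admin key plus org JWT
  if (hasAdminKey) return "pure-admin";                         // admin key, no JWT
  return "org-user";
}

// The expiry check deliberately ignores the admin key: an impersonating
// admin's JWT expires like anyone else's.
function isOrgTokenExpired(get: StorageGetter, now: number): boolean {
  const exp = Number(get("pbx_org_token_exp"));
  return isFinite(exp) && exp > 0 && exp <= now;
}
```

The original bug corresponds to returning early whenever `hasAdminKey` is true; keeping the expiry check independent of the admin key is what lets the impersonation case fire.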
Error 44: Outbound E.164 Number with + Prefix — "Extension Not Found"¶
Symptom¶
Softphone (Zoiper, Bria, etc.) dials +919944421125. Asterisk logs:
Call (UDP:.../...) to extension '+919944421125' rejected because
extension not found in context '<prefix>__internal'.
Dialling the same number without + (919944421125) works.
Root Cause¶
Asterisk pattern matching treats + as a literal character, not a digit. _X. in outbound context matches 9944421125 (starts with digit) but does NOT match +919944421125 (starts with +). No rule matched → 404.
Fix¶
Fixed in dialplanGenerator.js generateOutboundContext — added a catch-all pattern at the top of the outbound context:
This matches any +<digits>, strips the +, and re-enters the dialplan at the same context with the digits-only form. Internal context includes _outbound, so SIP phones get it for free.
Error 45: Staging Outbound Calls Get Congestion / 403 Forbidden¶
Symptom¶
SIP phone on staging dials a PSTN number. Staging logs show the call reaches Dial(PJSIP/num@<trunk>,60), sends INVITE to prod (10.10.10.1:5060) over WireGuard, then Asterisk reports "Everyone is busy/congested at this time (1:0/0/1)" and returns 403 to the caller. Prod logs show cloud-endpoint-stage matched the incoming INVITE but the call dies in the from-cloud context.
Diagnosis¶
# On prod:
cat /etc/asterisk/ext_from_cloud.conf
# Check the Goto target — must be a context that exists.
asterisk -rx "dialplan show <target-context>"
# If "There is no existence of context" — that's the bug.
Root Cause¶
Prod's ext_from_cloud.conf Goto'd a per-org outbound context (org_mna9x47k__outbound) that was never generated on prod because the corresponding org was never provisioned there. Asterisk silently reported congestion because Dial couldn't resolve anything.
Fix¶
Prod hand-edit (documented here because ext_from_cloud.conf is not in the monorepo — see Prod Direct-Edit):
[from-cloud]
exten => _X.,1,NoOp(Staging Cloud Outbound: ${EXTEN} from ${CALLERID(all)})
same => n,Set(CALLERID(num)=+918065978001)
same => n,Set(CALLERID(name)=AstraPrivate)
-same => n,Goto(org_mna9x47k__outbound,${EXTEN},1)
+same => n,Goto(staging-outbound,${EXTEN},1)
staging-outbound (in ext_staging_outbound.conf) correctly dials PJSIP/${EXTEN}@tata_gateway. Hot-reload via asterisk -rx "dialplan reload".
Prevention Rule¶
Any Goto target in a hand-maintained Asterisk conf must be verified via dialplan show <target> before reload. Asterisk does NOT validate Goto targets at load time — bad targets only surface at call time as congestion.
Error 46: Dialplan Regen Fails — Unknown column 'ivrs.greeting_text'¶
Symptom¶
configDeploymentService.deployOrganizationConfiguration throws when called on prod or a target env:
Staging works fine for the same code; only this env fails.
Root Cause¶
Schema drift. The Sequelize model's SELECT includes columns that were added to staging via migration-like ALTER commands but never propagated to the target env. Idempotent ALTERs in migration files (ADD COLUMN IF NOT EXISTS) are safe to re-run.
Diagnosis¶
# Diff schemas between healthy env and broken env:
ssh root@healthy 'mariadb ... -e "DESCRIBE ivrs"' > /tmp/healthy.txt
ssh root@broken 'mariadb ... -e "DESCRIBE ivrs"' > /tmp/broken.txt
diff /tmp/healthy.txt /tmp/broken.txt
Fix¶
Re-run the relevant migration file (all IVR migrations are idempotent):
scp api/database/migrations/<migration>.sql root@<vps>:/tmp/
ssh root@<vps> 'mariadb -u<user> -p<pw> pbx_api_db < /tmp/<migration>.sql'
Prevention Rule¶
Before deploying code that depends on a schema change, diff schema between the healthy env and the target env and re-run any missing migrations first. CI/CD does NOT run migrations — it's deliberately a human step, precisely because auto-migration of an out-of-date prod is a footgun.
Error 47: Queue/IVR Dialed from Internal SIP Silently Goes to PSTN¶
Symptom¶
SIP phone on <org>_internal context dials a known queue number or IVR extension (e.g. 5001 or 7002). Instead of reaching the queue/IVR, the call goes OUT to PSTN with the number as destination. Asterisk happily bills the minute against the trunk.
Diagnosis¶
asterisk -rx "dialplan show <number>@<prefix>__internal"
# If the match shown is '_X.' from _outbound context, the wildcard is
# winning over the exact match in _queue or _ivr.
grep "include => " /etc/asterisk/ext_<org>.conf
# If order is: _outbound then _queue then _ivr — that's the bug.
Root Cause¶
Asterisk searches includes in declaration order and returns on the first include that has a match. Specificity (exact vs pattern) works WITHIN a single include but does NOT cross include boundaries. If _outbound is first, its _X. catches any digit sequence before _queue/_ivr get searched — even though those have exact matches.
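A toy model of the lookup described above makes the effect of include order concrete. It is a simplification for illustration only, not Asterisk's matcher: it supports exact extensions and the _X. pattern (a digit followed by one or more characters) and nothing else, and `DialContext`, `matchesPattern` and `resolveInclude` are hypothetical names.

```typescript
interface DialContext {
  name: string;
  patterns: string[];
}

// Supports exact extensions and the "_X." pattern only.
function matchesPattern(pattern: string, exten: string): boolean {
  if (pattern === "_X.") return /^[0-9].+$/.test(exten);
  return pattern === exten;
}

// Returns the first included context that has any match for the extension.
function resolveInclude(includes: DialContext[], exten: string): string | null {
  for (const ctx of includes) {
    for (const pattern of ctx.patterns) {
      if (matchesPattern(pattern, exten)) return ctx.name;
    }
  }
  return null;
}
```

With includes ordered outbound, queue, ivr, dialing 5001 resolves to the outbound context; with ivr, queue, outbound it resolves to the queue context, while a PSTN number still falls through to outbound.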
Fix¶
Fixed in dialplanGenerator.js:114-120 — reordered to put exact-match contexts first:
include => <prefix>_ivr ; exact IVR extensions
include => <prefix>_queue ; exact queue numbers
include => <prefix>_outbound ; _X. wildcard, must be last
Historical detail: prod orgs that had queues (Om Chamber 5001, etc.) had been silently broken for internal-dial-to-queue since launch. Nobody reported it because users normally reach queues via DID routing, not internal dial. The fix corrects the behaviour; no config change is needed on the call-flow side.
Prevention Rule¶
In every include chain, wildcard-match contexts must come LAST. Put exact-match (queue/IVR) before _X.-style patterns. Same rule applies to any new context types added later.
Error 48: DID Edit Dialog — IVR Dropdown Missing / Free-Text Input Deselects Per Keystroke¶
Symptom¶
Two tightly linked UI bugs in the DIDs edit dialog on the editor:
- Selecting routing_type=ivr shows a free-text Input instead of a dropdown listing IVRs. Users type the IVR's extension number, save, then calls to the DID return "number not in service".
- Typing in any free-text Destination input (external / ai_agent) deselects the cursor after every single keystroke.
Root Cause¶
Both bugs share a root cause: DestinationField was declared INSIDE the DidsPage component function. Every render of DidsPage creates a new function reference; React treats each render's DestinationField as a new component and remounts the <Input> / <Select>, dropping focus.
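The identity problem can be reproduced without React. In the sketch below, plain functions stand in for components: `renderPage` plays the role of DidsPage with a nested DestinationField, and `renderPageHoisted` returns a top-level one. All names other than DestinationField are hypothetical stand-ins.

```typescript
// Stand-in for a component declared inside another component's body.
function renderPage(): () => string {
  function DestinationField(): string {
    return "<input>";
  }
  return DestinationField; // a fresh function object on every call
}

// Stand-in for a hoisted, top-level component.
function HoistedDestinationField(): string {
  return "<input>";
}

function renderPageHoisted(): () => string {
  return HoistedDestinationField; // the same function object every time
}

const nestedIsStable = renderPage() === renderPage();                // false
const hoistedIsStable = renderPageHoisted() === renderPageHoisted(); // true
```

React decides whether to keep or remount a subtree by comparing the element type by reference, so the `false` case is exactly what discards the focused input on each keystroke.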
The IVR dropdown was missing entirely — no ivr branch in the conditional. Users fell through to the free-text Input and typed the extension. The dialplan generator looks up IVRs by UUID (org.ivrs.find(i => i.id === did.routing_destination)), so extension-number strings never match and calls fail with number-not-in-service.
Fix¶
Fixed in editor/app/dashboard/[orgId]/dids/page.tsx:
- Hoisted DestinationField out of DidsPage to a top-level component with userList / queueList / ivrList passed as props (not closures).
- Added ivr branch:
if (routingType === "ivr") {
  return (
    <Select value={value} onValueChange={onChange}>
      <SelectTrigger><SelectValue placeholder="Select IVR" /></SelectTrigger>
      <SelectContent>
        {ivrList.filter(i => i.status === "active").map((i) => (
          <SelectItem key={i.id} value={i.id}>{i.extension} — {i.name}</SelectItem>
        ))}
      </SelectContent>
    </Select>
  );
}
- Added displayDestination helper so the DIDs table shows "<ext> — <name>" instead of the raw UUID.
Prevention Rule¶
Never declare a sub-component inside another component function. Either (a) hoist it outside, (b) wrap it in useCallback/useMemo with correct deps, or (c) accept that it'll remount on every parent render — only acceptable for stateless read-only presentations.
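The remount happens because React compares component types by reference. The plain-JavaScript sketch below shows only the identity part, with no React involved; the function names are hypothetical stand-ins for DidsPage and DestinationField.

```javascript
// A function declared inside another function is a NEW object on every call.
function ParentWithInnerChild() {
  function Child() { return "child"; }
  return Child;
}

// A hoisted function keeps one identity for the lifetime of the module.
function HoistedChild() { return "child"; }
function ParentWithHoistedChild() {
  return HoistedChild;
}

const innerIsStable = ParentWithInnerChild() === ParentWithInnerChild();
const hoistedIsStable = ParentWithHoistedChild() === ParentWithHoistedChild();
console.log(innerIsStable, hoistedIsStable); // false true
```

Each "call" here stands in for one render of the parent: a different reference per render means a different component type as far as the reconciler is concerned.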
Error 49: Upptime CI — Cannot read properties of undefined (reading 'tag_name')¶
Symptom¶
Roughly half of the scheduled Uptime CI, Response Time CI, and Graphs CI runs in astradial/upptime fail within ~15 seconds with:
ERROR TypeError: Cannot read properties of undefined (reading 'tag_name')
at getUptimeMonitorVersion (.../uptime-monitor/v1.41.0/webpack:/@upptime/uptime-monitor/dist/helpers/workflows.js:17)
The other half succeed. status.astradial.com updates intermittently, with gaps in coverage.
Root Cause¶
upptime/uptime-monitor@v1.41.0 calls octokit.repos.listReleases({ owner: "upptime", repo: "uptime-monitor", per_page: 1 }) and accesses releases.data[0].tag_name with no nil-check. The GitHub API endpoint GET /repos/upptime/uptime-monitor/releases flaps — sometimes it returns the release list, sometimes []. Hand-verified: same auth, two consecutive calls, one full payload and one empty array. When the action lands on the empty response, it crashes.
Upstream upptime/uptime-monitor has issues disabled and last released 2025-09-04; the project is effectively unmaintained.
Diagnosis¶
Reproduce the flap directly:
for i in 1 2 3 4 5; do
  echo "Try $i:"
  gh api 'repos/upptime/uptime-monitor/releases?per_page=1' --jq '.[0].tag_name // "EMPTY"'
done
Inspect the bundled helper that crashes:
gh api repos/upptime/uptime-monitor/contents/dist/helpers/workflows.js \
--jq '.content' | base64 -d | head -25
Fix¶
Forked the action to astradial/uptime-monitor and patched the bundled dist/index.js directly (single line, surgical edit — no rebundle):
// Before (around line 1214):
release = releases.data[0].tag_name;
// After:
try {
  release = releases.data[0]?.tag_name ?? "v1.41.0";
} catch {
  release = "v1.41.0";
}
Shipped as v1.41.3-astradial. All 5 workflows in astradial/upptime pin:
We also keep the matching patch in src/helpers/workflows.ts so a future rebundle picks it up — but the bundle is the source of truth at runtime.
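The guard in the patch can be exercised in isolation. This is a minimal sketch under assumptions: `pickReleaseTag` is a hypothetical helper, not a function in the action, and the response objects below are hand-built stand-ins for what listReleases returns.

```javascript
// Hypothetical helper mirroring the patched line: tolerate a missing response,
// a missing data array, and an empty array, then fall back to a pinned tag.
function pickReleaseTag(releases, fallback) {
  return releases?.data?.[0]?.tag_name ?? fallback;
}

const fromFullPayload = pickReleaseTag({ data: [{ tag_name: "v1.41.0" }] }, "fallback");
const fromEmptyList = pickReleaseTag({ data: [] }, "v1.41.0");
const fromMissingResponse = pickReleaseTag(undefined, "v1.41.0");
console.log(fromFullPayload, fromEmptyList, fromMissingResponse);
```

With optional chaining on every hop, the empty-array case no longer throws, so a surrounding try/catch only matters for failures of the API call itself.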
Tag history (read this before re-touching the fork)¶
| Tag | What it shipped | State |
|---|---|---|
| v1.41.1-astradial | Patched src/helpers/workflows.ts only, never rebundled. dist/index.js still ran the unpatched code. | Broken: kept crashing on empty listReleases. |
| v1.41.2-astradial | Bumped action.yml runtime node20 → node24 to clear the deprecation warning. Did not rebundle. | Broken: node_libcurl.node mismatch (NODE_MODULE_VERSION 115 vs 137). Reverted via astradial/upptime#7. |
| v1.41.3-astradial (first attempt) | Re-bundled dist/index.js from scratch with ncc on macOS + npm install --ignore-scripts. ncc embedded a @@notfound.js stub for node_libcurl.node because the post-install never compiled the binary on macOS. | Broken: Cannot find module '../lib/binding/node_libcurl.node'. Reverted same hour. |
| v1.41.3-astradial (current) | Surgical patch. Reset to v1.41.1 commit, edited the single offending line in the existing dist/index.js to add ?. + ?? + try/catch. No rebundle, no rebuild; original node_libcurl.node stays untouched. | Working. |
Prevention Rules¶
- GitHub Actions runs the bundle, not the source. A src/-only patch is silently a no-op until someone rebuilds. Either patch dist/<entry>.js directly (surgical edit), or do a full rebundle. If rebundling, you must include the prebuilt native modules (see rule 2).
- Don't rebundle this action on macOS without first compiling node-libcurl for Linux. The bundle will only work on the GitHub Actions Linux runner if node_modules/node-libcurl/lib/binding/node_libcurl.node exists at bundle time; otherwise ncc inserts a @@notfound.js stub and the action throws Cannot find module at runtime. npm install --ignore-scripts skips the binary, so a clean rebundle from macOS is broken-by-default. For small fixes, prefer a surgical edit of the existing dist/index.js (we did this for v1.41.3-astradial).
- Don't rev action.yml runs.using without recompiling node_libcurl. The .node file in dist/lib/binding/ is ABI-locked to a specific Node major. Bumping node20 → node24 without rebuilding the binary against node24 yields the NODE_MODULE_VERSION error above. The Node 20 deprecation warning is fine to live with until someone gets a node24-compiled prebuilt.
- Treat upstream actions with issues-disabled + dormant repos as unmaintained. Fork before adopting for production-impacting work; pin to your fork's tag.
Verification¶
After bumping the pin, manually trigger a run and watch the logs:
A green run with the 🔼 Upptime @v1.41.0 banner in the generated workflow header confirms the fallback path is working (since getUptimeMonitorVersion now returns the literal "v1.41.0" when the API call returns empty).
Error 50: Editor "Add Trunk" Form Has No Password Field — Inbound Trunk Save Silently 400s¶
Symptom¶
In editor.astradial.com → org → Trunks → + Add Trunk, you fill in Name, Host, Username, pick Inbound as the trunk type, click Create, and:
- The dialog closes
- A toast briefly flashes "Failed to create" (easy to miss)
- The trunk does NOT appear in the list
- Some users report being "logged out" — actually a navigation glitch where the editor lands on a different org's overview page after the failure
Root Cause¶
The editor's Create Trunk dialog (editor/app/dashboard/[orgId]/trunks/page.tsx) was missing the Password input entirely:
- The form's React state declared password: "" but no <Input> rendered it.
- handleCreate() did not pass password in the API call.
The API requires it — api/src/server.js:1357:
if (trunk_type === 'inbound' && (!username || !password)) {
  return res.status(400).json({ error: 'Username and password are required ...' });
}
So inbound and outbound trunk creation always 400'd. peer2peer trunks (the only existing type at the time, e.g. the Tata trunk per org) didn't need a password, so the bug went unnoticed until VSEVEN HOTELS became the first customer to use an inbound trunk to connect their on-prem UCM6301 to the cloud.
Diagnosis¶
Cloud access log will show repeated POST /api/v1/trunks returning 400:
ssh root@89.116.31.109 'tail -200 /root/.pm2/logs/astrapbx-out.log | \
grep -E "POST /api/v1/trunks.*400"'
DB will have no new row for the trunk:
SELECT id, name, trunk_type FROM sip_trunks WHERE org_id='<org-id>'
ORDER BY created_at DESC LIMIT 5;
Fix¶
Add a Password input to the Create Trunk dialog with show/hide toggle and a Generate button (32-char hex via WebCrypto). Send password in the API call. After successful create, open a Credentials dialog showing server, port, transport, username, password (masked) with copy buttons. Add a "View credentials" entry in the row dropdown that GETs /trunks/:id and re-opens the same dialog.
Files: editor/lib/pbx/client.ts (add password?: string to PbxTrunk type), editor/app/dashboard/[orgId]/trunks/page.tsx (form + dialog).
Branch: fix/trunks-form-password-field (commit a0be1d0).
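The "32-char hex via WebCrypto" generator mentioned in the fix can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `generateTrunkPassword` is a hypothetical name, and the `require` fallback only exists so the snippet also runs on Node versions without a global `crypto`.

```javascript
// 16 random bytes rendered as hex give a 32-character password.
const webcrypto = globalThis.crypto ?? require("node:crypto").webcrypto;

function generateTrunkPassword() {
  const bytes = new Uint8Array(16);
  webcrypto.getRandomValues(bytes); // cryptographically strong, unlike Math.random()
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

const examplePassword = generateTrunkPassword();
console.log(examplePassword.length); // 32
```

The padStart call matters: without it, bytes below 0x10 would render as a single character and the result would be shorter than 32.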
Prevention Rule¶
When adding new trunk types or other multi-shape API endpoints, always smoke-test the editor flow against ALL trunk types before shipping, not just the default. The bug existed since the trunks page was first written but was invisible until a customer hit the non-default branch.
For new endpoints whose form-required fields differ by trunk_type (or any other discriminator), prefer:
- Conditional rendering based on the discriminator, OR
- A single shared form that always sends every field, with the API ignoring irrelevant ones. This is much harder to silently break.
Error 51: Editor User Routing — ring_target=phone / routing_type=ai_agent Silently Reverts to Defaults¶
Symptom¶
In the editor's org → Users page, you create a user with Routing → Phone + a phone number, or edit an existing user to switch routing. After Save:
- The user appears in the list (or the existing user looks updated for a moment)
- After the list refreshes, the routing column shows SIP again
- The phone number field appears empty
- For the edit flow, a toast briefly shows "Failed to update"
Root Cause¶
Two independent bugs that together produced one symptom. The CREATE flow had one bug, the EDIT flow had a different one.
Bug A — POST /api/v1/users dropped routing fields. The handler destructured only:
phone_number, ring_target, routing_type, routing_destination were never read from the request body, never passed to User.create(), so the model defaults applied (ring_target='ext', routing_type='sip').
Bug B — PUT /api/v1/users/:id/routing did not exist on the API. The editor's handleEdit calls two endpoints in sequence:
await users.update(editUser.id, { full_name, email, extension, role, outbound_did, password });
await users.updateRouting(editUser.id, { routing_type, routing_destination, ring_target, phone_number });
The first PUT was fine (didn't touch routing fields). The second hit PUT /users/:id/routing — an endpoint the API never had — so it returned 404. The 404 caused handleEdit to throw and loadUsers() to never run, leaving the UI showing the pre-edit state.
Why this lay dormant for so long¶
The only customer using ring_target=phone before VSEVEN HOTELS was GrandEstancia. Per docs/guides/grandestancia-setup.md, all GrandEstancia routing changes were made via direct curl to PUT /api/v1/users/{userId} (the main update endpoint, which was correctly mapping ring_target and phone_number from allowedFields), not via the editor UI. So the editor's edit-routing path had never been exercised by a real customer until VSEVEN.
Diagnosis¶
Access log shows the dual-call pattern with the second 404'ing:
DB row stays unchanged:
SELECT id, extension, routing_type, ring_target, phone_number, updated_at
FROM users WHERE id='<uuid>';
For Bug A specifically: a user just created appears in the response with routing_type='sip' and ring_target='ext' regardless of what the client sent.
Fix¶
Bug A — POST /api/v1/users: destructure the routing fields from the body and pass them to User.create() only when the client explicitly sent them (so omission preserves model defaults rather than overwriting with undefined):
const {
  extension, username, password, full_name, email, role = 'agent',
  phone_number, ring_target, routing_type, routing_destination,
} = req.body;
// ...
const user = await User.create({
  /* ...existing fields... */
  ...(phone_number !== undefined && { phone_number }),
  ...(ring_target !== undefined && { ring_target }),
  ...(routing_type !== undefined && { routing_type }),
  ...(routing_destination !== undefined && { routing_destination }),
});
Also extend PUT /api/v1/users/:id allowedFields with routing_type, routing_destination (it already had ring_target, phone_number).
Bug B — Add the missing PUT /api/v1/users/:id/routing endpoint, mirroring the existing PUT /api/v1/dids/:id/routing pattern:
app.put('/api/v1/users/:id/routing', authenticateOrg, async (req, res) => {
  const user = await User.findOne({ where: { id: req.params.id, org_id: req.orgId } });
  if (!user) return res.status(404).json({ error: 'User not found' });

  const { routing_type, routing_destination, ring_target, phone_number } = req.body;
  const updateData = {};
  if (routing_type !== undefined) updateData.routing_type = routing_type;
  if (routing_destination !== undefined) updateData.routing_destination = routing_destination || null;
  if (ring_target !== undefined) updateData.ring_target = ring_target;
  if (phone_number !== undefined) updateData.phone_number = phone_number || null;

  // ring_target='phone' without phone_number is unreachable: fail loudly.
  const finalRingTarget = updateData.ring_target ?? user.ring_target;
  const finalPhoneNumber = updateData.phone_number ?? user.phone_number;
  if (finalRingTarget === 'phone' && !finalPhoneNumber) {
    return res.status(400).json({ error: 'phone_number is required when ring_target is "phone"' });
  }

  await user.update(updateData);
  const { password_hash, sip_password, ...userData } = user.toJSON();
  res.json(userData);
});
Branch: fix/users-routing-fields-not-saved (commits 8794837 for Bug A, 802f233 for Bug B).
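One edge case in the validation above may be worth a check. The `??` operator treats an explicit null the same as "not sent", so a request that clears phone_number on a user who already has ring_target='phone' would be validated against the old number. Whether the editor ever sends such a request is not stated in the source. The sketch below uses hypothetical helper names to contrast the two behaviours.

```javascript
// Hypothetical helpers, not taken from server.js.
// "Was this field part of the update?" is answered by key presence, not by value.
function resolveFinal(updateData, user, field) {
  return Object.prototype.hasOwnProperty.call(updateData, field)
    ? updateData[field]
    : user[field];
}

function validateRouting(updateData, user) {
  const ringTarget = resolveFinal(updateData, user, "ring_target");
  const phoneNumber = resolveFinal(updateData, user, "phone_number");
  if (ringTarget === "phone" && !phoneNumber) {
    return 'phone_number is required when ring_target is "phone"';
  }
  return null; // valid
}

const existingUser = { ring_target: "phone", phone_number: "9944421125" };
const clearingUpdate = { phone_number: null }; // client sent "", handler stored null

// With ??, the cleared value falls through to the old number and passes.
const nullishResult = clearingUpdate.phone_number ?? existingUser.phone_number;
const strictError = validateRouting(clearingUpdate, existingUser);
console.log(nullishResult, strictError);
```

The key-presence check rejects the clearing update, while the `??` form would let it through and leave a phone-routed user without a number.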
Prevention Rules¶
- When the editor calls a new API endpoint, grep the API to confirm it actually exists. A 404 in the access log is the only evidence this kind of bug ever happened, and it is easy to miss if you're not looking.
- req.body destructuring with hardcoded fields silently drops everything else. Whenever a model gains optional fields, every endpoint that creates/updates that model needs to be audited for whether it forwards those fields. Better still: use a single allowlist constant and have both POST and PUT iterate it.
- Smoke-test create AND edit flows before declaring a feature done. Bug A would have been caught by trying to create a phone-routed user, Bug B by trying to edit one to phone-routed. We had docs that documented this feature working (GrandEstancia) but only via curl to the main PUT, so the docs themselves masked the editor bug.
- For paired endpoints like PUT /resource/:id and PUT /resource/:id/routing: keep them next to each other in server.js and document that they're paired so the next person who removes one notices the other.
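The "single allowlist constant" idea from the rules above could look roughly like this. It is a sketch under assumptions: the constant name, the helper name, and the exact field list are hypothetical (the fields are the ones mentioned in this section, not a copy of the real allowedFields).

```javascript
// Hypothetical shared allowlist: POST and PUT handlers both iterate this list,
// so a field added here is forwarded by every endpoint at once.
const USER_WRITABLE_FIELDS = [
  "full_name", "email", "extension", "role", "outbound_did",
  "phone_number", "ring_target", "routing_type", "routing_destination",
];

function pickAllowed(body, allowed = USER_WRITABLE_FIELDS) {
  const picked = {};
  for (const key of allowed) {
    // Skip fields the client did not send so model defaults still apply.
    if (body[key] !== undefined) picked[key] = body[key];
  }
  return picked;
}

const picked = pickAllowed({
  ring_target: "phone",
  phone_number: "9944421125",
  is_admin: true, // not on the allowlist, must be dropped
});
console.log(Object.keys(picked).sort().join(","));
```

Unknown keys are dropped on purpose, so the allowlist also guards against clients setting fields they should not control.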
Error 52: SIP REGISTERs silently dropped after PJSIP reload — fail2ban storm masks it¶
Symptom¶
After a PROD config push, multiple customer softphones across multiple orgs report SIP 408 Request Timeout. tcpdump on the PROD VPS confirms the REGISTERs arrive at eth0, but Asterisk sends no response and /var/log/asterisk/full.log has no entry for the attempts. The PROD VPS appears completely unresponsive to new registrations while existing long-lived ones (e.g., a customer PBX over WireGuard) keep working.
Root Cause¶
Two stacked failures producing one symptom:
1. PJSIP module left in a stuck state by back-to-back reloads. The dialplan generator triggered module reload res_pjsip.so twice within ~1 second (19:14:39 then 19:14:40). PJSIP transports are not fully reloadable — Asterisk logs:
NOTICE: Transport 'transport-udp' is not fully reloadable, not reloading:
protocol, bind, TLS, TCP, ToS, or CoS options.
The second reload arrived while the first was still processing. The end state was a sorcery cache where new endpoints/auths weren't fully wired into the distributor, so PJSIP silently failed to dispatch incoming REGISTER messages to any matched endpoint. Existing in-memory contacts continued working; new registrations went into a black hole.
2. fail2ban then banned the customers retrying. With registrations failing, every softphone on the customer side started retrying every 4 seconds. Each first-REGISTER (sent without auth, expected to receive a 401 challenge) is logged by Asterisk as:
NOTICE: Request 'REGISTER' from '<sip:...>' failed for '<HOST>' (callid: ...)
- No matching endpoint found
This is the regular handshake log line — fail2ban matches it against the asterisk filter's failregex and counts it as a "failed registration". With maxretry=3 / findtime=600 / bantime=86400, three retries inside 10 minutes triggered a 24-hour ban. So even after the PJSIP state was unstuck, banned IPs kept getting silently dropped at the iptables layer with no further log entry — making it look like the underlying bug was still present.
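The ban arithmetic can be worked through with a small model. It assumes a single device retrying at a fixed interval and that a ban fires once maxretry matching lines fall inside the findtime window; `secondsUntilBan` is a hypothetical helper, not part of fail2ban.

```javascript
// Seconds between the first logged failure and the ban, or null if a steady
// retry rate can never put `maxretry` failures inside one findtime window.
function secondsUntilBan({ retryIntervalSecs, maxretry, findtime }) {
  // The Nth matching line arrives (N - 1) intervals after the first one.
  const elapsed = (maxretry - 1) * retryIntervalSecs;
  return elapsed <= findtime ? elapsed : null;
}

const withMaxretry3 = secondsUntilBan({ retryIntervalSecs: 4, maxretry: 3, findtime: 600 });
const withMaxretry10 = secondsUntilBan({ retryIntervalSecs: 4, maxretry: 10, findtime: 600 });
console.log(withMaxretry3, withMaxretry10); // 8 36
```

Under these assumptions one softphone retrying every 4 seconds is banned about 8 seconds after its first attempt, and several phones behind the same public IP reach the threshold sooner.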
Diagnosis¶
# 1) Confirm packets reach eth0 (rules out network / GeoIP firewall)
tcpdump -ni eth0 'host <CUSTOMER_IP> and udp port 5080'
# 2) Confirm Asterisk doesn't see them (no log entries)
grep '<CUSTOMER_IP>' /var/log/asterisk/full.log | tail -10
# 3) Check fail2ban bans
fail2ban-client status asterisk
# Banned IP list: <CUSTOMER_IP> ...
# 4) Check pjsip state freshness — uptime vs last reload vs disk mtime
asterisk -rx 'core show uptime' # 'Last reload: ...'
ls -la /etc/asterisk/pjsip*.conf # mtime of disk configs
# If disk mtime > last reload time → reload didn't happen
# If disk mtime ~= last reload time → reload happened, may have been double-fire
Fix¶
# 1) Clear PJSIP stuck state
ssh root@89.116.31.109
asterisk -rx 'module reload res_pjsip.so'
# Verify: tail /var/log/asterisk/full.log for any ERROR/WARNING
# 2) Unban affected customer IPs (one at a time per the chosen scope)
fail2ban-client set asterisk unbanip <CUSTOMER_IP>
# Or unban everyone. Note: `unban --all` clears bans in every jail, not only asterisk:
fail2ban-client unban --all
# 3) Tune fail2ban so legitimate retry-flood doesn't keep re-banning
sudo cp /etc/fail2ban/jail.local /etc/fail2ban/jail.local.bak-$(date +%F)
sudo sed -i '/^\[asterisk\]/,/^\[/{ s/^maxretry = 3$/maxretry = 10/ }' /etc/fail2ban/jail.local
fail2ban-client reload asterisk
fail2ban-client get asterisk maxretry # → 10
Prevention Rules¶
- Avoid back-to-back PJSIP reloads. If the API/generator triggers a reload, debounce so two writes within N seconds collapse to one reload. Specifically: when multiple dialplan files are written in sequence (e.g., during a multi-org regeneration), do all writes first, then reload once.
- Prefer asterisk -rx 'pjsip reload' over module reload res_pjsip.so. The former reloads only PJSIP config; the latter restarts the whole module, which can race with active requests.
- fail2ban filter regex matches No matching endpoint found, which fires on every legitimate first-REGISTER. With default maxretry=3, normal users on flaky networks trip the ban during a retry storm. PROD ran maxretry=10 as an interim measure from 2026-05-06. As of 2026-05-11 the long-term fix is in place: split into asterisk-auth (strict, maxretry=3) and asterisk-scan (lenient, maxretry=50/hour) + customer-PBX ignoreip whitelist. See Fail2Ban Runbook and Error 55.
- Diagnose silent SIP drops with tcpdump on eth0 first. If packets arrive but Asterisk has no log entry, check fail2ban before chasing PJSIP: iptables drops happen before PJSIP sees them.
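The debounce rule above could be implemented along these lines. This is a sketch, not the generator's actual code: `createReloadDebouncer` and its parameters are hypothetical, and the timer functions are injectable only so the behaviour can be tested without waiting.

```javascript
// Collapse a burst of reload requests into one reload that fires `quietMs`
// after the LAST request in the burst.
function createReloadDebouncer(reload, quietMs, timers = { setTimeout, clearTimeout }) {
  let handle = null;
  return function requestReload() {
    if (handle !== null) timers.clearTimeout(handle); // cancel the pending reload
    handle = timers.setTimeout(() => {
      handle = null;
      reload();
    }, quietMs);
  };
}
```

A caller would create one debouncer per Asterisk instance and call the returned function after every config write. Two writes one second apart, as in this incident, would then produce a single reload once the quiet period has passed.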
Error 53: Auto-ticket classifier tags human calls as bot_dropped for any org reusing extensions 1003/1012/1013¶
Symptom¶
A customer's Tickets page shows bot_dropped urgent-priority tickets ("Call ended without ticket (40s) from
Root Cause¶
The auto-ticket classifier in the LogsUpdate Cloud Run service had this constant:
Any ANSWERED inbound where the destination channel's extension matched this set was treated as a bot call → 8-second wait for the bot to file its own ticket → if no ticket appeared, classifier created a bot_dropped ticket.
This was a global set with no org awareness. At inception only one org used these numbers for bots. Later orgs onboarded humans on the same numbers (e.g. Thangam Communication's 1003 = GokulRaj, 1004 = Raju) and all got false-positive bot_dropped tickets every time the human extension's mobile-forward Dial timed out.
Diagnosis¶
# 1) Confirm the answered ext is human, not bot
ssh root@89.116.31.109
grep -A20 '^exten => 1003,1' /etc/asterisk/ext_<org>.conf
# If you see Dial(PJSIP/<mobile>@<trunk>,...) — it's a human/mobile-forward, NOT a bot
# 2) In the editor's Users page, the target user's routing_type
# should be 'ai_agent' for true bots; 'sip' means human/mobile.
# 3) Cloud Run logs show the misclassification path
gcloud run services logs read logsupdate --region us-central1 --limit 50 \
| grep -E 'AUTO-TICKET|Bot call detected'
Fix¶
Make bot-detection per-org by querying User.routing_type='ai_agent' instead of a hardcoded list.
API side (api/src/server.js) — new internal endpoint:
// POST /api/v1/users/internal/bot-extensions
// Returns the org's extensions where routing_type='ai_agent'.
app.post('/api/v1/users/internal/bot-extensions', async (req, res) => {
  if (req.headers['x-internal-key'] !== process.env.INTERNAL_API_KEY) {
    return res.status(401).json({ error: 'Unauthorized' });
  }
  const users = await User.findAll({
    where: { org_id: req.body.org_id, routing_type: 'ai_agent', status: 'active' },
    attributes: ['extension'],
  });
  res.json({ extensions: users.map(u => u.extension).filter(Boolean) });
});
Note: the User model's foreign-key column is org_id, not organization_id — easy to miss; staging surfaces it as Unknown column 'User.organization_id'.
LogsUpdate side (server.py) — replace the constant with a 5-minute-cached lookup:
_bot_ext_cache: dict[str, tuple[float, set[str]]] = {}
_BOT_EXT_TTL_SECS = 300

async def get_bot_extensions(org_id: str) -> set[str]:
    cached = _bot_ext_cache.get(org_id)
    if cached and (time.time() - cached[0]) < _BOT_EXT_TTL_SECS:
        return cached[1]
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            resp = await client.post(
                f"{PBX_BASE_URL}/api/v1/users/internal/bot-extensions",
                headers={"X-Internal-Key": PBX_INTERNAL_KEY,
                         "Content-Type": "application/json"},
                json={"org_id": org_id},
            )
            exts = ({str(e) for e in resp.json().get("extensions", []) if e}
                    if resp.status_code == 200 else set())
    except Exception as e:
        # Fail closed: empty set means we don't tag anything as bot_dropped
        # for this org during the failure window. Better than tagging humans.
        logger.warning(f"[AUTO-TICKET] bot-extensions lookup failed for org {org_id}: {e}")
        exts = set()
    _bot_ext_cache[org_id] = (time.time(), exts)
    return exts

# At the call site:
bot_extensions = await get_bot_extensions(org_id)
if answered_ext in bot_extensions:
    ...
Cleanup of existing false-positive tickets — LogsUpdate/scripts/cleanup_bot_dropped_tickets.py lists then optionally deletes bot_dropped tickets for a given org. Dry-run by default; requires re-typing the org UUID at the prompt to actually execute.
Prevention Rules¶
- Classifier rules that branch on extension number must be org-scoped. Extension 1003 means different things across orgs.
- Hardcoded sets that depend on customer behaviour are short-lived. When a feature has 1 customer the assumption "they're all the same" silently breaks at customer 2. Introduce the per-org config at the same time as the feature, not later.
- Fail closed on classifier-side lookup errors. If the bot-detection API call fails, default to an empty set (no calls are bots) rather than falling back to the old global list. False-positive tickets create user trust issues; missing a real bot_dropped ticket for a few minutes during an outage is acceptable.
- For Cloud Run + AstraPBX dual-deploy fixes: deploy the API endpoint first (LogsUpdate's old code still works), then deploy LogsUpdate. If LogsUpdate ships first it'd 404 on every call.
Error 54: Outbound dial fails for 12-digit Indian numbers (919XXXXXXXXX) but works for 10-digit¶
Symptom¶
Agent dials 919944421125 (12-digit, with India country code) from a softphone — call doesn't connect or fails fast. The same destination dialled as 9944421125 (10-digit) works fine.
This is a real production issue because Zoiper / most softphones render incoming CallerID with the 91 prefix (E.164-style), so when the agent uses tap-to-call from a missed-call notification or call history, the number includes the 91 and the call fails.
Root Cause¶
Tata's outbound termination expects the bare 10-digit subscriber number, not the 12-digit form with country code. Both formats reach Asterisk and match the catch-all _X. outbound pattern, but only the 10-digit one connects through Tata's network. There was no normalization step in the dialplan to strip the 91 before dialling the trunk.
Diagnosis¶
ssh root@89.116.31.109
# Look up which dialplan match Asterisk picked for the dialed number
asterisk -rx "dialplan show 919944421125@org_<prefix>__outbound"
# Before the fix: only `_X.` matched, EXTEN passed unchanged to Dial(...)
# After the fix: `_91XXXXXXXXXX` is most-specific; Goto strips 91, re-enters
# as 9944421125, then `_X.` Dials the trunk
# Confirm the catch-all is dialing the raw number
grep "Dial(PJSIP" /etc/asterisk/ext_<org>.conf | head -3
Fix¶
Dialplan generator (api/src/services/asterisk/dialplanGenerator.js) emits a strip rule in every outbound context, mirroring the existing E.164 leading-+ strip:
; Normalize Indian country code '91' (Tata trunk expects 10-digit)
exten => _91XXXXXXXXXX,1,NoOp(Stripping 91 country code from ${EXTEN})
exten => _91XXXXXXXXXX,n,Goto(${EXTEN:2},1)
Pattern is intentionally narrow — _91XXXXXXXXXX matches exactly 12 digits starting with 91, so it cannot affect:
- 10-digit dials (still match _X. directly)
- Internal 4-digit extension dials
- Conference / queue / call-forward / speed-dial patterns
- Inbound calls (different context)
- Other trunks for the same org (route ID matching, not EXTEN)
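The reach of the narrow pattern can be checked with an equivalent regular expression. This models only the `_91XXXXXXXXXX` rule and its `${EXTEN:2}` strip; `normalizeIndianNumber` is a hypothetical name and the real normalization happens in the Asterisk dialplan, not in JavaScript.

```javascript
// _91XXXXXXXXXX = literal "91" followed by exactly ten digits (12 digits total).
// ${EXTEN:2} drops the first two characters, which slice(2) reproduces.
function normalizeIndianNumber(dialed) {
  return /^91\d{10}$/.test(dialed) ? dialed.slice(2) : dialed;
}

const samples = ["919944421125", "9944421125", "5001", "9100", "9199444211250"];
for (const dialed of samples) {
  console.log(dialed, "->", normalizeIndianNumber(dialed));
}
```

Only the 12-digit input is rewritten. The 10-digit number, the 4-digit extensions (including one that happens to start with 91) and the 13-digit string all pass through unchanged.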
After deploying the API change, regenerate all org dialplans + reload Asterisk:
ssh root@89.116.31.109
cd /opt/astrapbx
node -e '
(async () => {
  const ConfigDeploymentService = require("./src/services/asterisk/configDeploymentService");
  const inst = new ConfigDeploymentService();
  const { Organization } = require("./src/models");
  const orgs = await Organization.findAll({ where: { status: "active" } });
  for (const o of orgs) {
    process.stdout.write(" - " + o.name + " ... ");
    try { await inst.deployOrganizationConfiguration(o.id, o.name); console.log("ok"); }
    catch (e) { console.log("FAIL: " + e.message); }
  }
  await inst.reloadAsteriskConfiguration();
  process.exit(0);
})();
'
See Outbound Dialplan Normalization for the full reference of all normalization rules.
Prevention Rules¶
- Trunk-format assumptions belong in the dialplan, not in user habit. Don't tell agents "always dial 10-digit" — bake it into the generator so any input format reaches the trunk in the format the trunk expects.
- When introducing a new trunk that needs a different format, gate the strip rules per-trunk. Today all customer trunks share Tata via the NUC tunnel (same upstream); a future trunk that requires 91 preserved would need a per-trunk flag on OutboundRoute that the generator reads.
- Use narrow pattern matching, not wildcards, for normalization rules. _91XXXXXXXXXX (exactly 12 digits) is safe; _91X. (any number starting with 91) would swallow extensions starting with 91 if they ever existed.
- After a generator change, regenerate ALL org configs and reload Asterisk. A change in the generator only affects future runs; existing files on disk keep the old behaviour until re-deployed. Run the loop above.
Error 55: Customer PBX IP change → fail2ban storm → all of that customer's endpoints go Unavail¶
Symptom¶
A customer reports "no SIP user is registering for the last N minutes" — was working until recently. pjsip show contacts for the affected org shows the customer's extensions as Unavail RTT=-nan against a specific public IP. Other orgs are fine.
Root Cause¶
The customer's ISP-assigned public IP changed (common for SMB-grade internet in India). When the change happens, all of the customer's PBX-connected phones / their on-prem UCM re-register from the new IP simultaneously — many REGISTER attempts in a few seconds.
Each first REGISTER (before the 401 challenge) emits the log line:
This line is also matched by the legacy fail2ban asterisk filter as a "failed registration". With the old monolithic jail's maxretry=10, a customer with 7+ phones exceeded the threshold in seconds and got banned for 24 h. Asterisk's OPTIONS qualify probes to the customer's old NAT'd ports then arrived at dead pinholes, marking every endpoint Unavail.
Diagnosis¶
ssh root@89.116.31.109
# 1) Is the customer's IP banned?
fail2ban-client status asterisk-auth # current jails (post-2026-05-11)
fail2ban-client status asterisk-scan
# Look for the customer's IP in "Banned IP list"
# 2) Find the IP — current contacts for the org
asterisk -rx "pjsip show contacts" | grep <org_prefix>
# Unavail rows show the IP in the contact URI
# 3) Confirm by tcpdump — packets arrive but Asterisk doesn't reply
tcpdump -ni eth0 'host <CUSTOMER_IP> and udp port 5080'
Fix¶
Immediate: unban the IP.
fail2ban-client set asterisk-auth unbanip <CUSTOMER_IP>
fail2ban-client set asterisk-scan unbanip <CUSTOMER_IP>
# Existing stale contacts auto-clear within ~5 min as the customer's phones
# send fresh REGISTERs and create new NAT bindings.
Recurrence prevention: add the customer's PBX subnet to fail2ban's ignoreip whitelist + ensure the two-jail split (auth vs scan) is in place. See Fail2Ban Runbook — full operational reference.
Prevention Rules¶
- Whitelist known customer PBX subnets. Use /24 (256 IPs): narrow enough to avoid hiding real attackers, broad enough to absorb ISP renumbering within a subnet. Add via [DEFAULT] ignoreip in /etc/fail2ban/jail.local. The runbook documents the format.
- Keep the asterisk filter split (asterisk-auth strict, asterisk-scan lenient). The single-jail design pre-2026-05-11 conflated handshake-noise with attack-noise; never go back without a replacement.
- /24, not /16. A /16 whitelists 65k IPs across the same ISP block, which is too broad. /24 keeps the surface narrow while covering ISP renumbering within a sub-block.
- When a customer's IP changes legitimately and falls outside their whitelisted /24, update the whitelist promptly. Don't expand to /16 as a shortcut; keep the discipline.
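The /24 versus /16 trade-off is easy to quantify. The sketch below uses hypothetical helper names and addresses from the 203.0.113.0/24 documentation range, not any real customer subnet.

```javascript
// Convert dotted-quad IPv4 to a number (plain arithmetic avoids signed 32-bit issues).
function ipToInt(ip) {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

function cidrSize(cidr) {
  return 2 ** (32 - Number(cidr.split("/")[1]));
}

function inCidr(ip, cidr) {
  const size = cidrSize(cidr);
  const start = Math.floor(ipToInt(cidr.split("/")[0]) / size) * size; // align to block
  const n = ipToInt(ip);
  return n >= start && n < start + size;
}

console.log(cidrSize("203.0.113.0/24"), cidrSize("203.0.0.0/16")); // 256 65536
console.log(inCidr("203.0.113.77", "203.0.113.0/24")); // true: renumbered inside the /24
console.log(inCidr("203.0.114.5", "203.0.113.0/24"));  // false: needs a whitelist update
```

The second address would be covered by a /16, but only at the cost of exempting 65,536 addresses from fail2ban instead of 256.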
Error 56: SIP phone shows "unregistered" with zero packets reaching cloud — transport TCP/UDP mismatch¶
Symptom¶
A new IP phone is configured with the correct SIP server (10.20.0.1:5080), correct username + auth ID + password, account is marked Active, and the phone is on a network whose WireGuard tunnel to cloud is healthy — but the phone reports "Not registered" / "Failed" in its account status.
On the cloud side:
- pjsip show endpoint <endpoint_name> returns the endpoint with state Unavailable and no Contact row.
- pjsip show contacts | grep <org_prefix> shows zero contacts for that extension.
- wg show wg1 shows the customer's peer connected with a recent handshake and bidirectional byte counters, so the tunnel is fine.
- Another extension on the same customer LAN is registering normally (e.g. ext 09 from 192.168.0.76 Avail, ext 03 from 192.168.0.62 Unavailable).
- tcpdump -i wg1 -n 'udp port 5080' captures zero packets in a 5–10 second window. REGISTER attempts from the broken phone never reach the cloud at all.
Root Cause¶
The phone's account is configured with TCP transport while the Astradial cloud Asterisk PJSIP endpoint is configured with transport-udp (UDP only, bound on 0.0.0.0:5080). The phone happily opens a TCP socket toward 10.20.0.1:5080; Asterisk doesn't listen for TCP on that port, so the SYN is either dropped by a firewall or answered with a TCP reset by the kernel. Either way no TCP connection is established, and the REGISTER is never sent, let alone parsed.
UDP REGISTERs would arrive and be visible in tcpdump and Asterisk logs. TCP REGISTERs to a UDP-only listener leave no trace on either side — which is why this diagnosis is initially confusing.
This is a v7-style edge case but applies to any customer onboarding a new phone after the initial install. The first phone gets configured correctly (often by Astradial directly), then the hotel's IT person duplicates the config for a second account and the transport setting gets toggled (often by accident in vendor web UIs where the dropdown defaults to TCP/TLS for "security").
Diagnosis¶
ssh root@89.116.31.109
# 1) Endpoint exists + auth password matches DB?
asterisk -rx "pjsip show endpoint <endpoint_name>"
# Look for State + Aor + Contact rows. Unavailable + zero contacts = phone never registered.
asterisk -rx "pjsip show auth <endpoint_name>_auth"
# Compare the `password` field with the users.sip_password column in the DB.
# 2) Tunnel health
wg show wg1 | grep -A2 "<customer's wg public key>"
# 'latest handshake' should be < 3 min ago.
# 3) Is the phone even sending packets to us?
timeout 10 tcpdump -i wg1 -n 'udp port 5080 and net <customer-LAN-CIDR>'
# Zero packets in 10s = phone isn't sending UDP, OR isn't sending at all.
# 4) Confirm Asterisk's transport binding (sanity check)
asterisk -rx "pjsip show transports"
# Expect transport-udp on 0.0.0.0:5080. We do NOT run a TCP listener on 5080.
If steps 1–3 all check out (endpoint right, auth right, tunnel up, zero packets), the phone's transport is almost certainly TCP/TLS pointing at a UDP-only port.
Fix¶
On the phone's web UI, under the account's SIP Settings → Basic / Network, change the SIP Transport from TCP (or TLS) to UDP. Click Save and Apply (not just Save — Save alone persists config but doesn't restart the SIP stack).
For Grandstream phones (GHP6xx, GRP26xx) specifically: the Transport setting lives under Account N → SIP Settings → Basic Settings → SIP Transport in some firmware versions, and under Account N → Network Settings in others. Search the page for "Transport".
The registration attempt should land within seconds — confirm by re-running step 3 (tcpdump) and asterisk -rx "pjsip show endpoint <endpoint>".
Prevention Rules¶
- When pre-provisioning a new IP phone for an Astradial customer, set SIP Transport = UDP in the master template before handing the device over. Don't trust the vendor's default.
- For Grandstream-family phones, the recommended firmware default is UDP — confirm in the auto-provisioning template if one is used.
- The cloud Asterisk PBX is UDP-only on the SIP port (0.0.0.0:5080). TLS+TCP support is on the roadmap but not currently enabled, so don't tell customers "use TLS for security" until the listener actually exists.
- The "zero packets in tcpdump" diagnostic is the cheapest single test for this class of failure. If a phone says unregistered and tcpdump -i wg1 udp port 5080 is silent, suspect transport mismatch first.
Error 57: Greeting/IVR audio silent after TTS upgrade — Asterisk format_wav.so rate mismatch¶
Symptom¶
A queue greeting or IVR prompt was working before; an operator regenerates it (or creates a new one); now callers hear dead silence where the greeting should play. The queue's MOH starts immediately as if no greeting was configured. Reception from editor.astradial.com → Departments → greeting → Preview plays fine in the browser, which makes the diagnosis confusing.
On the cloud side:
- The greeting row exists, has a non-NULL audio_file (e.g. greeting_<uuid>.wav).
- The WAV exists on disk under /var/lib/asterisk/sounds/greetings/.
- asterisk -rx "dialplan show <queue-num>@<context>_queue" shows the expected Playback(/var/lib/asterisk/sounds/greetings/greeting_<uuid>) priority.
- No error in /var/log/asterisk/full mentioning the greeting file.
- The .wav was generated AFTER the most recent TTS-service deploy (e.g. PR #156 / PR #157).
Root Cause¶
Asterisk's .wav format module (format_wav.so) only handles 8 kHz and 16 kHz mono LINEAR16 PCM. From asterisk -rx "module show like wav":
Higher rates (24 / 32 / 44.1 / 48 kHz) are supported only via the format_slin* family with the matching .sln24 / .sln32 / .sln44 / .sln48 extensions — NOT .wav. Putting a 24 kHz file in .wav loads the format module, the module fails to decode, and Playback() silently no-ops. There's no error in the Asterisk log because the file load itself succeeds; the failure is inside the codec's frame loop after the RIFF header parse.
This regression shipped in the Chirp 3 HD TTS upgrade (PR #156 / promoted in #157), where the synthesis rate was changed from 8 kHz to 24 kHz under the (incorrect) belief that Asterisk's WAV reader handles arbitrary rates via the RIFF header. Higher rates are only available through format_slin* (.sln*); format_wav (.wav) does not support them. The Google Cloud Text-to-Speech API does accept and honor sampleRateHertz: 16000, downsampling its 24 kHz native output server-side, so 16 kHz .wav gives us wideband quality AND Asterisk compatibility.
V7's queue 5001 was the first prod casualty (2026-05-13). Other orgs were spared only because they hadn't yet regenerated any greeting since the upgrade.
Diagnosis¶
ssh root@89.116.31.109
# 1) What sample rate does Asterisk's WAV module actually support?
asterisk -rx "module show like wav"
# Look for "8kHz/16kHz" in the description. If the description doesn't list
# your file's rate, that's the bug.
# 2) Confirm the actual rate of the file on disk
file /var/lib/asterisk/sounds/greetings/greeting_<uuid>.wav
# Expect output like ... mono <RATE> Hz. If RATE is 24000+ AND module supports
# only 8k/16k, you're hit.
# 3) Cross-check the TTS service synthesizes at a supported rate
grep -n SAMPLE_RATE_HZ /opt/astrapbx/src/services/ttsService.js
# Should be 8000 or 16000. If 24000+, this is the bug source.
Fix¶
Two parts: the broken file on disk and the source code that creates new broken files.
a) Restore audio for an already-affected greeting (one file):
ssh root@89.116.31.109
cd /var/lib/asterisk/sounds/greetings/
F=greeting_<uuid>.wav
cp "$F" "$F.bak-24khz"
ffmpeg -y -loglevel error -i "$F.bak-24khz" -ar 16000 -ac 1 -c:a pcm_s16le "$F"
file "$F" # confirm "16000 Hz"
# No Asterisk reload needed — Playback re-opens the file on each call.
Greeting should play correctly on the next call. The backup stays on disk so you can roll back if needed.
b) Fix the source so new greetings come out right:
In /opt/astrapbx/src/services/ttsService.js, the SAMPLE_RATE_HZ constant. Change 24000 to 16000, save, restart astrapbx:
ssh root@89.116.31.109
cd /opt/astrapbx
cp src/services/ttsService.js src/services/ttsService.js.bak-24khz
sed -i 's/^const SAMPLE_RATE_HZ = 24000;$/const SAMPLE_RATE_HZ = 16000;/' src/services/ttsService.js
pm2 restart astrapbx
Then sync the same change back to git on a hotfix branch (otherwise the next CI deploy rsyncs the broken constant back over your hotfix — see the Prod Direct-Edit runbook).
c) (Optional) Re-render every other greeting on prod that was generated at the bad rate:
# Identify any 24kHz files still on disk
for f in /var/lib/asterisk/sounds/greetings/*.wav; do
rate=$(file "$f" | grep -oE '[0-9]+ Hz' | head -1)
[ "$rate" = "24000 Hz" ] && echo "BROKEN: $f"
done
For each, either re-run ffmpeg ... -ar 16000 like (a), or have the operator click "Generate greeting" in the editor (it'll re-synthesize against the patched ttsService and produce a correct file).
Update (post-mortem) — the actual final fix is .ulaw-only, not .wav at 16 kHz¶
The 16 kHz .wav fix above looked correct on paper but failed in practice on the V7 incident the same day because of a second bug under it: this Asterisk build's Playback() doesn't fall back to .wav when a .ulaw sibling is missing on a G.711 mu-law channel. Log signature:
WARNING file.c: Unable to open .../greeting_<id> (format (ulaw)): No such file or directory
WARNING app_playback.c: Playback failed on PJSIP/... for .../greeting_<id>
Asterisk tried .ulaw first (cheapest format for a mu-law channel), failed, and gave up instead of trying the .wav we'd just produced. The PSTN-bound greeting was silent again.
The final answer: synthesize the greeting as MULAW 8 kHz via Google's audioEncoding=MULAW and save as .ulaw (one file, no .wav sibling). PSTN/SIP softphone channels are virtually all G.711 mu-law, so Asterisk reads our bytes and writes them directly with zero transcoding. Wideband softphones (Opus/AMR-WB) get a clean mu-law → slin8 → opus transcode with imperceptible degradation on spoken audio.
Quirk handled by the code: Google's audioEncoding=MULAW response is wrapped in a RIFF/WAVE container (verified empirically — file reports WAVE audio, ITU G.711 mu-law). Asterisk's format_g711.c reader needs RAW bytes. The TTS service strips the WAVE header by locating the data chunk marker and slicing past its 8-byte ID + size prefix before writing.
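The header-stripping step described above can be sketched roughly as follows. This is an illustration only, not the code in ttsService.js: the function name stripWavHeader is hypothetical, and it walks the RIFF chunk list to find the data chunk, which is one of several reasonable ways to locate the marker.

```javascript
// Hypothetical sketch; the real TTSService._stripWavHeader may differ.
// Returns the raw audio payload of a RIFF/WAVE buffer, or the input
// unchanged when it is not RIFF-wrapped (assumed to be raw already).
function stripWavHeader(buf) {
  const isRiff =
    buf.length >= 12 &&
    buf.toString('ascii', 0, 4) === 'RIFF' &&
    buf.toString('ascii', 8, 12) === 'WAVE';
  if (!isRiff) return buf;

  let offset = 12; // first chunk starts after "RIFF" + size + "WAVE"
  while (offset + 8 <= buf.length) {
    const id = buf.toString('ascii', offset, offset + 4);
    const size = buf.readUInt32LE(offset + 4);
    const start = offset + 8; // skip the 4-byte chunk ID and 4-byte size
    if (id === 'data') {
      // Clamp in case the declared size overruns the buffer.
      return buf.subarray(start, Math.min(start + size, buf.length));
    }
    offset = start + size + (size % 2); // chunks are padded to even length
  }
  throw new Error('RIFF/WAVE buffer has no data chunk');
}
```

Returning the input untouched for non-RIFF buffers keeps the helper safe to call twice on the same bytes.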
Final ttsService.js shape:
const AUDIO_ENCODING = 'MULAW';
const SAMPLE_RATE_HZ = 8000;
// ...
async saveGreetingAudio(greetingId, text, language, voice, opts = {}) {
const wrapped = await this.generateAudio(text, language, voice, opts);
const raw = TTSService._stripWavHeader(wrapped);
await fs.writeFile(`greeting_${greetingId}.ulaw`, raw);
}
The dialplan generator continues to emit Playback(/var/lib/asterisk/sounds/greetings/greeting_<id>) with no extension — Asterisk's filename-extension lookup finds the .ulaw automatically.
Prevention Rules¶
- TTS audio for Asterisk playback should ship as .ulaw (G.711 mu-law) by default. PSTN itself is mu-law; saving as mu-law guarantees bit-perfect preservation of the synthesis output through the dominant call path. Wideband (.wav 16 kHz, .sln16, etc.) is only worth the complexity if you have a documented use case where the call leg is HD-codec end-to-end.
- Don't trust Asterisk's file-format auto-fallback to do what you'd want. On a mu-law channel, Playback("greeting_x") tries .ulaw first; if missing, fall-back behavior is build-dependent and on our deployed Asterisk it just fails. Always provide a file in the format the channel will actually use.
- Google's audioEncoding=MULAW returns a WAV-wrapped payload, NOT raw mu-law. Always strip the RIFF header before writing as .ulaw (or save as .wav if you don't care about Asterisk's no-transcode path, but see rule 2).
- The Asterisk module's description string IS the contract. When module show like <name> says "8kHz/16kHz" in the description, that's the literal range. Don't infer support from related modules.
- Verify the file format AFTER the first TTS synth on prod every time the TTS service changes. file /var/lib/asterisk/sounds/greetings/greeting_* should report either raw data (good: raw mu-law) OR WAVE audio, Microsoft PCM, 16 bit, mono 8000/16000 Hz (also good: legacy 8k/16k .wav). Anything else (24 kHz, mu-law inside WAV, etc.) is broken.
Error 58: PSTN inbound silent on Indian Tata trunk — Asterisk needs .alaw sibling for G.711 a-law channels¶
Symptom¶
After fixing Error 57 (switching greetings to .ulaw-only), inbound calls from a softphone (e.g. extension 1009 dialing 5001) work fine and play the new Chirp 3 HD voice. But inbound calls from the PSTN via the Tata trunk (e.g. v7's external DID dialed from a personal mobile) still play either silence or the OLD Allison Smith voice for system prompts ("This call may be recorded for quality and training") even though we regenerated all 44 system prompts in Chirp 3 HD.
Symptoms specifically:
- Softphone-originated calls → new voice. ✅
- PSTN inbound via Tata trunk → old voice OR silence. ❌
- /var/lib/asterisk/sounds/en/<prompt>.ulaw exists with the new Chirp 3 HD audio.
- /var/lib/asterisk/sounds/en/<prompt>.gsm (or .wav) still exists with the OLD voice, a leftover from the legacy stock Asterisk install.
- Asterisk full.log shows lines like: AND
Root Cause¶
Indian PSTN signals G.711 a-law, not mu-law. Tata's NNI trunk negotiates PCMA (a-law) in SDP by convention — this is the European/Indian convention vs. the North American mu-law (PCMU). On a Tata-bound channel, Asterisk's Playback() lookup order is:
1. <file>.alaw (channel-native, zero transcoding)
2. <file>.sln or <file>.sln* (linear, will transcode)
3. <file>.gsm (legacy, will transcode)
4. <file>.wav (only if 8 kHz/16 kHz LINEAR16; see Error 57)
5. <file>.ulaw (transcodes to a-law)
Because we only wrote .ulaw files after the TTS migration, Asterisk on an a-law channel tried .alaw (missing) → .sln/.gsm → found the OLD stock-Allison-Smith .gsm in /var/lib/asterisk/sounds/en/ → played the old voice. For operator-created greetings that don't have an old .gsm sibling, it fell through to the broken .wav (Error 57 leftovers) or to nothing at all → silence.
This is invisible during softphone testing because softphones negotiate mu-law and the .ulaw lookup succeeds on path 1 for them — .alaw is never tried.
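To make the "old voice" outcome concrete, the lookup order described above can be modelled in a few lines. This is only a model of the order as documented here, not Asterisk's actual implementation; the function name pickPlaybackFile is hypothetical and the .sln* variants are collapsed into a single sln entry for brevity.

```javascript
// Lookup order on an a-law channel, as documented in this section.
// Illustrative model only; Asterisk's real selection logic is more involved.
const ALAW_CHANNEL_ORDER = ['alaw', 'sln', 'gsm', 'wav', 'ulaw'];

// Given the extensions that exist on disk for one prompt, return the
// extension this model says would be played, or null if none exist.
function pickPlaybackFile(existingExtensions, order = ALAW_CHANNEL_ORDER) {
  const present = new Set(existingExtensions);
  return order.find((ext) => present.has(ext)) ?? null;
}

// System prompt after the TTS migration: new .ulaw plus a stock .gsm.
console.log(pickPlaybackFile(['ulaw', 'gsm']));
// Same prompt once the .alaw sibling has been generated.
console.log(pickPlaybackFile(['ulaw', 'gsm', 'alaw']));
```

Under this model the first call selects the stale .gsm and the second selects the new .alaw, which matches the behavior reported in the Symptom section.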
Diagnosis¶
- Confirm Tata trunk is negotiating a-law. During an active inbound call:
- Compare file presence by codec extension:
ls /var/lib/asterisk/sounds/en/queue-thankyou.{ulaw,alaw,gsm,sln} 2>/dev/null
ls /var/lib/asterisk/sounds/greetings/greeting_<uuid>.{ulaw,alaw} 2>/dev/null
If .ulaw exists but .alaw is missing, this error applies.
- Tail /var/log/asterisk/full during a test call and watch for Unable to open ... (format (alaw)): No such file or directory. That's the smoking gun.
Fix¶
Two scopes, two fixes:
Scope 1 — system prompts (/var/lib/asterisk/sounds/en/): one-shot regen script writes both .ulaw and .alaw from the same Google TTS call. The script (api/scripts/regen-system-prompts.js) uses Google's audioEncoding=MULAW, strips the RIFF header, and converts mu-law → a-law via ffmpeg locally.
Scope 2 — operator-created greetings (/var/lib/asterisk/sounds/greetings/): ttsService.saveGreetingAudio() writes both .ulaw and .alaw from a single Google TTS call, using a pure-JS ITU-T G.711 mu-law → a-law byte-table converter (no ffmpeg subprocess; ~1 ms overhead). See PR #163.
The byte-table converter was verified byte-for-byte against ffmpeg -c:a pcm_alaw: all 256 mu-law input values produce identical a-law output, and a real 29.5 KB greeting file matches the ffmpeg-generated equivalent exactly.
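A byte-table converter of this kind might look like the sketch below. It is not the PR #163 code: it builds the table by decoding each mu-law byte to linear PCM and re-encoding it as a-law, following the widely circulated G.711 reference routines, and the names ulawToLinear, linearToAlaw, and ulawBufferToAlaw are hypothetical. Whether every entry matches the production table or ffmpeg's output is not verified here.

```javascript
const BIAS = 0x84; // mu-law encoding bias

// Decode one mu-law byte to a signed linear sample.
function ulawToLinear(u) {
  const v = ~u & 0xff; // mu-law bytes are stored bit-inverted
  const t = (((v & 0x0f) << 3) + BIAS) << ((v & 0x70) >> 4);
  return v & 0x80 ? BIAS - t : t - BIAS;
}

const SEG_END = [0xff, 0x1ff, 0x3ff, 0x7ff, 0xfff, 0x1fff, 0x3fff, 0x7fff];

// Encode one signed linear sample as an a-law byte.
function linearToAlaw(sample) {
  let mask = 0xd5; // positive samples: sign bit set after XOR
  let pcm = sample;
  if (pcm < 0) {
    mask = 0x55;
    pcm = Math.max(0, -pcm - 8);
  }
  const seg = SEG_END.findIndex((end) => pcm <= end);
  if (seg === -1) return 0x7f ^ mask; // out of range: clip to maximum
  const quant = seg < 2 ? (pcm >> 4) & 0x0f : (pcm >> (seg + 3)) & 0x0f;
  return ((seg << 4) | quant) ^ mask;
}

// 256-entry table built once at load time.
const ULAW_TO_ALAW = Uint8Array.from({ length: 256 }, (_, u) =>
  linearToAlaw(ulawToLinear(u))
);

// Convert a raw mu-law buffer to raw a-law, one table lookup per byte.
function ulawBufferToAlaw(buf) {
  const out = Buffer.alloc(buf.length);
  for (let i = 0; i < buf.length; i++) out[i] = ULAW_TO_ALAW[buf[i]];
  return out;
}
```

Because the table is computed once, the per-greeting cost is a single pass over the bytes, which is consistent with the small overhead quoted above.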
Backfill for existing operator greetings:
cd /var/lib/asterisk/sounds/greetings/
for f in *.ulaw; do
base="${f%.ulaw}"
if [ ! -f "${base}.alaw" ]; then
ffmpeg -y -loglevel error -f mulaw -ar 8000 -ac 1 -i "$f" \
-ar 8000 -ac 1 -c:a pcm_alaw -f alaw "${base}.alaw"
fi
done
chown asterisk:asterisk *.alaw
Prevention Rules¶
- For any new TTS or sound-file work on this stack, generate BOTH .ulaw AND .alaw. Don't assume mu-law-only is enough: every Indian PSTN customer routes through an a-law trunk. The cost is a 1 ms in-process byte conversion; the benefit is correctness on every codec path.
- Channel codec governs file-lookup order, not the dialplan. Playback(/path/to/greeting_x) with no extension is the only correct form; Asterisk picks the codec that matches the channel. Hard-coding .wav or .ulaw in the dialplan breaks the lookup chain.
- Test the PSTN path explicitly after any TTS change. Calling from a softphone is not sufficient, because softphone codec ≠ PSTN codec. Place a real call from a mobile through Tata to the affected DID before declaring done.
- When you see Unable to open ... (format (alaw)): No such file or directory in full.log, the next prompt Asterisk WILL play is whatever lower-priority file it finds. This is the source of "wrong voice played" reports: old .gsm files from the legacy Asterisk install act as silent overrides for missing .alaw siblings.
- The ITU-T G.711 mu-law → a-law byte translation is well-defined and lossless within G.711 quantization. A 256-byte lookup table generated once at startup is enough. Don't shell out to ffmpeg in the hot path.
Error 59: Queue "Timeout Destination" silently ignored on save¶
Date: 2026-05-20 (Om Chambers prod, queue 5003 "test")
Severity: P1, operator-visible misroute on queue timeout
Fix: PR #251 (allow-list + validator), PR #252 (picker UX), PR #253 (picker state fix)
Symptom¶
Operator configures Queue → Edit → Timeout Destination in the editor, saves, re-opens the dialog: the type dropdown reverts to whatever was there before. Depending on the stale data, live calls that time out on Max Wait either:
- Play "this number is incorrect" (when stale type=phone, destination=5004 got dialled out the Tata trunk and rejected), OR
- Fall through to (unavail) and play "all agents are busy".
Root cause¶
api/src/routes/queues.js PUT handler's allowedFields allow-list was missing both timeout_destination and timeout_destination_type (and greeting_id, found alongside). The editor sent all three fields on every save; the API filtered them out before calling Queue.update(...). Whatever combo was last persisted by some other path (admin SQL, an earlier code version, a manual DB poke) stayed in the DB forever, the editor showed the stale data, and the dialplan generator used those stale values.
When the stale combo was {type:'phone', destination:'5004'} (Om Chambers' supervisors queue extension), the generator emitted Dial(PJSIP/5004@<trunk>, 30, tT). Tata's SBC rejected the call (5004 is not a valid PSTN destination), the dial returned CHANUNAVAIL, and the timeout routing silently failed.
Diagnostic commands¶
# 1. Read the DB row directly (this is what the dialplan actually uses)
ssh root@89.116.31.109 'cd /opt/astrapbx && node -e "
const {Queue} = require(\"./src/models\");
Queue.findOne({where:{number:\"5003\"}}).then(q =>
console.log(JSON.stringify({
number:q.number, name:q.name,
timeout_destination:q.timeout_destination,
timeout_destination_type:q.timeout_destination_type
}, null, 2))
);"'
# 2. Check what the generated dialplan emits
ssh root@89.116.31.109 'asterisk -rx "dialplan show 5003@org_<ctx>__queue" | grep -A2 "n(timeout)"'
# 3. Confirm the type-dialled context exists. For type=queue:
# Goto(org_<ctx>__queue, <dest>, 1)
# For type=extension:
# Goto(org_<ctx>__internal, <dest>, 1)
# For type=phone:
# Dial(PJSIP/<digits>@<trunk>, 30, tT) ← only valid for real phone numbers
Fix¶
Three PRs landed the full fix:
- PR #251 added timeout_destination + timeout_destination_type + greeting_id to the PUT allowedFields array AND added a server-side validator (queues-helpers.validateTimeoutDestination) that rejects (type, destination) pairs the dialplan generator would misroute. 12 unit tests cover the rule set including the exact regression (4-digit "phone number" → 400).
- PR #252 replaced the type+destination two-field combo in the editor with a smart picker: kind buttons [ No routing | User | Queue | Phone ] + a contextual SearchableSelect per kind. Operators no longer have to know which Asterisk context their destination lives in.
- PR #253 fixed a derived-state bug in #252 where clicking the kind button visually toggled but the dropdown below didn't appear (mode was being derived from form values; clicking a button cleared the destination, which made the derived mode fall back to "none").
Cleanup for existing stale data¶
The code fix doesn't repair existing bad rows. After deploy, open the queue in the editor — the picker shows the stale state clearly (e.g. Phone: 5004) so the operator can correct it and save. The validator now rejects bad combos at save time, so it can't be re-broken.
Rule for future allow-lists¶
When adding columns the editor sends, add them to the PUT handler's allowedFields array in the same PR. The silent-drop pattern is hard to detect — the API returns 200, the editor shows a success toast, but nothing persisted. Test by saving + reopening, not by checking the response status.
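One way to make the silent-drop pattern visible is to have the allow-list filter report what it discarded, so the handler can log or reject unknown keys. The sketch below is an assumption about how that could look, not the code in queues.js: the helper name partitionByAllowList, the abbreviated allow-list, and the max_wait field are all illustrative.

```javascript
// Hypothetical helper: split a request body into allowed and dropped fields
// so a route handler can surface dropped keys instead of ignoring them.
function partitionByAllowList(body, allowedFields) {
  const allowed = {};
  const dropped = [];
  for (const [key, value] of Object.entries(body)) {
    if (allowedFields.includes(key)) allowed[key] = value;
    else dropped.push(key);
  }
  return { allowed, dropped };
}

// Field names come from this incident; the list itself is abbreviated,
// and max_wait is a made-up field that is not on the list.
const allowedFields = ['name', 'timeout_destination', 'timeout_destination_type', 'greeting_id'];
const { allowed, dropped } = partitionByAllowList(
  { name: 'test', timeout_destination: '5004', max_wait: 30 },
  allowedFields
);
console.log('persisting:', allowed);
console.log('dropped (log or reject these):', dropped);
```

Logging the dropped list at warn level, or returning it in the response, would have exposed this bug on the first save.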
Error 60: PJSIP reload deadlock after concurrent reloadAsteriskConfiguration() calls¶
Dates: 2026-05-19 (Tata path silent-drop), 2026-05-20 (Kolathur DID approve)
Severity: P0, all inbound DIDs play "number not in service"
Fix: PR #255, serialize reload calls + targeted module reloads
Symptom¶
After an admin action that triggers autoDeploy() (DID approve, queue save, user update, IVR save), every inbound Tata call plays "this number is incorrect" / "number not in service." The Asterisk dialplan looks correct on inspection; the dispatcher has the right Goto for every DID. But:
- tcpdump shows OPTIONS from NUC (10.10.10.2:5060) arriving on prod's wg0 interface, with no reply transmitted.
- pjsip show aor cloud-aor on NUC reports Unavailable (no response to qualify probes).
- asterisk -rx 'pjsip reload' on prod returns "A module reload request is already in progress; please be patient" for any command involving a reload, indefinitely.
- pm2 logs astrapbx may show one or more 🔄 Reloading Asterisk configuration... lines without the matching ✅ Asterisk configuration reloaded line.
- core show channels shows 0 active channels.
The only recovery is to SIGKILL the asterisk process and start it fresh.
Root cause¶
reloadAsteriskConfiguration() in api/src/services/asterisk/configDeploymentService.js was fired from ~18 call sites in server.js with no serialization. Two admin actions in close succession (e.g. a DID approve immediately after a queue save) launched two concurrent asterisk -rx "core reload" shell calls. Asterisk's loader.c rejected the overlap with:
Normally harmless. But when the first reload wedged inside res_pjsip (observed in both incidents while it was loading a brand-new endpoint file, pjsip_<new_org>.conf, for the first time after a DID approve), the queue piled up. Every subsequent reload returned "previous reload didn't finish yet" forever, the CLI mutex never released, and SIP processing broke. tcpdump showed receives but no transmits because the reply path couldn't acquire the wedged mutex.
The smoking gun in /var/log/asterisk/full.log at the 2026-05-20 incident:
[May 20 17:17:37] VERBOSE[484509] loader.c: The previous reload command didn't finish yet
[May 20 17:17:37] VERBOSE[484512] loader.c: The previous reload command didn't finish yet
[May 20 17:17:37] VERBOSE[484515] loader.c: The previous reload command didn't finish yet
Three concurrent rejections, all 11 minutes after the 17:06 reload fired and stuck.
Diagnostic commands¶
# 1. Confirm the symptom: tcpdump shows receives but no transmits from NUC
ssh root@89.116.31.109 'timeout 6 tcpdump -i any -n -s 0 -nn -l "udp and host 10.10.10.2"'
# Expected (broken): only `wg0 In` lines, no `wg0 Out` from 10.10.10.1
# Expected (healthy): both In and Out, with `200 OK` replies
# 2. Probe the reload queue (this hangs when deadlocked)
ssh root@89.116.31.109 'timeout 6 asterisk -rx "pjsip reload"'
# Broken: "A module reload request is already in progress; please be patient" repeating
# Healthy: short response, then prompt returns
# 3. Smoking gun in full.log
ssh root@89.116.31.109 'grep "previous reload command" /var/log/asterisk/full.log | tail -5'
# Multiple lines at the same timestamp = concurrent reload pile-up
Immediate recovery (when deadlocked)¶
This procedure worked in both incidents. It causes ~15 s of downtime; 0 active calls is the expected state when deadlocked.
ssh root@89.116.31.109
PID=$(pgrep -x asterisk)
kill -9 $PID
sleep 3
rm -f /var/run/asterisk/* # stale control socket
/usr/sbin/asterisk # daemon start
sleep 10
asterisk -rx "core show uptime" # confirm CLI responsive
asterisk -rx "pjsip show aor tata_gateway" | grep -i avail
# tata_gateway should be Avail with RTT ~150ms within 30s
Permanent fix (PR #255, deployed 2026-05-20)¶
reloadAsteriskConfiguration() now serializes via an instance promise chain:
class ConfigDeploymentService {
constructor() {
// …
this._reloadLock = Promise.resolve();
}
async reloadAsteriskConfiguration() {
const previous = this._reloadLock.catch(() => {});
this._reloadLock = previous.then(() => this._doReload());
return this._reloadLock;
}
async _doReload() { /* the actual reload work */ }
}
Concurrent callers queue in JS — exactly one asterisk -rx shell call is in flight at a time.
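The queuing behavior can be checked in isolation with a small stand-alone demo. The sketch below reuses the same promise-chain pattern but is not the service code: ReloadSerializer and fakeReload are hypothetical names, and the timer stands in for the real asterisk -rx call.

```javascript
// Stand-alone demo of the promise-chain lock; names are illustrative.
class ReloadSerializer {
  constructor(doReload) {
    this._doReload = doReload;
    this._lock = Promise.resolve();
  }

  run() {
    // Swallow a previous failure so one bad reload cannot block the queue.
    const previous = this._lock.catch(() => {});
    this._lock = previous.then(() => this._doReload());
    return this._lock;
  }
}

async function demo() {
  let inFlight = 0;
  let maxInFlight = 0;
  const fakeReload = async () => {
    inFlight += 1;
    maxInFlight = Math.max(maxInFlight, inFlight);
    await new Promise((resolve) => setTimeout(resolve, 20)); // stand-in for the shell call
    inFlight -= 1;
  };

  const serializer = new ReloadSerializer(fakeReload);
  // Three callers fire at once, as three admin actions might.
  await Promise.all([serializer.run(), serializer.run(), serializer.run()]);
  return maxInFlight;
}

demo().then((max) => console.log('max concurrent reloads:', max));
```

Each caller still gets its own promise back, so a failure is reported to the caller whose reload failed while later callers proceed normally.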
Also replaced the heavyweight core reload (which reloaded every module including res_pjsip even when only ext_*.conf changed) with targeted module reloads matching exactly what the service rewrites:
dialplan reload → ext_*.conf
module reload res_pjsip.so → pjsip_*.conf
module reload app_queue.so → queues_*.conf
Plus a 750 ms settle delay before seedQueueMemberDevstates() so the per-member asterisk -rx "devstate change …" CLI calls don't race the tail-end of the reload sequence on Asterisk's CLI mutex.
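The file-to-command mapping above could be expressed as a small lookup, sketched below. The three commands and file patterns are taken from the list; the function name reloadCommandsFor and the idea of choosing commands per changed file are assumptions, and the deployed service may simply run all three in sequence.

```javascript
// Mapping from generated config files to targeted reload commands.
// Patterns and commands mirror the list above; the selection logic is illustrative.
const RELOAD_RULES = [
  { pattern: /^ext_.*\.conf$/, command: 'dialplan reload' },
  { pattern: /^pjsip_.*\.conf$/, command: 'module reload res_pjsip.so' },
  { pattern: /^queues_.*\.conf$/, command: 'module reload app_queue.so' },
];

// Return the reload commands needed for a set of rewritten file names,
// in a fixed order and without duplicates.
function reloadCommandsFor(changedFiles) {
  const commands = [];
  for (const { pattern, command } of RELOAD_RULES) {
    if (changedFiles.some((file) => pattern.test(file))) commands.push(command);
  }
  return commands;
}

console.log(reloadCommandsFor(['ext_acme.conf', 'pjsip_acme.conf']));
```

Keeping the rules in one table also makes the "each new file type needs its own targeted reload" rule easy to follow: a new generator adds one row.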
Verification post-fix¶
# Fire two concurrent regen-gateway calls
ssh root@89.116.31.109
KEY=$(grep ^INTERNAL_API_KEY /opt/astrapbx/.env | cut -d= -f2-)
curl -s -X POST -H "X-Internal-Key: $KEY" http://localhost:8000/api/v1/admin/regenerate-gateway &
sleep 0.3
curl -s -X POST -H "X-Internal-Key: $KEY" http://localhost:8000/api/v1/admin/regenerate-gateway &
wait
# Confirm BOTH logged as separate reloads (serial), no collision errors
pm2 logs astrapbx --lines 60 --nostream | grep -E 'Reloading|reloaded \(dialplan'
# Expect 2 "🔄 Reloading…" + 2 "✅ … reloaded (dialplan + res_pjsip + app_queue + devstate seed)"
grep "previous reload command" /var/log/asterisk/full.log | tail -5
# Expect empty (no collisions)
Rule for future reload paths¶
- Never call core reload from the API. Targeted module reloads only. Each new file type you generate needs its own targeted reload command, not a blanket "reload everything."
- Any new call site that fires reload-affecting CLI commands MUST go through reloadAsteriskConfiguration() so the serialization applies. Don't add raw asterisk -rx "module reload …" calls elsewhere.
- The 30-second grace window in pollCdr is unrelated: that's the CDR ingest classifier, not the reload path. Don't conflate them.
Updated 2026-05-22: PR #255 doesn't fully eliminate the wedge¶
The 2026-05-22 ~10:23 IST incident on prod showed that even with PR #255's _reloadLock promise-chain serialization, a single reload can still wedge inside res_pjsip. The wedge happens when the reload thread walks PJSIP sessions and hits one in a transient state (mid-call cleanup, mid-hangup-handler, etc.) — Asterisk's session-walk has a known race that occasionally hangs there.
Once one reload wedges, the reload mutex is held forever:
asterisk -rx "module reload res_pjsip.so"
# returns: "A module reload request is already in progress; please be patient"
# (forever, until the asterisk process is killed)
Calls that are mid-hangup at the moment of wedge get permanently stuck in the hangup handler. Their channel structs survive in core show channels listing, accumulate in the Editor's Live Calls UI as "zombies", and won't release until Asterisk is restarted.
PR #255 reduces the frequency of wedges (no concurrent reloads piling up to amplify the bug surface) but does not eliminate it. The recovery procedure is unchanged: SIGKILL + restart fresh.
Detection: zombie-channel watchdog (2026-05-22)¶
The API now runs a defensive periodic check (every 15 min) via api/src/services/zombieChannelWatchdog.js. Per tick:
- Reads core show channels concise and classifies stuck channels:
  - context = '*__hangup' AND exten = 'h' AND age > 5 min → hangup-handler stuck
  - state = 'Down' AND age > 2 min → Down state stuck
- Tries channel request hangup on each. Works for simple zombies.
- If most/all stuck channels survive the hangup → confirms Error 60 signature (channel request hangup is a no-op because the reload mutex is wedged).
- On confirmed signature: opens a GitHub issue with the auto-zombie-alert label, @-mentions the on-call operator (env: GH_OPS_MENTION, default @harisuryaa). GitHub Mobile push notification fires. Issue body contains the exact SIGKILL+restart command sequence.
- Auto-closes the issue after 2 consecutive clean ticks (signals successful recovery).
Watchdog never auto-restarts Asterisk — operator authorization required per CLAUDE.md Rule 3.
Zombie-channel safe characteristics¶
The watchdog targets only patterns impossible-by-design for real calls. Documented for posterity so the rules can be re-derived if the code is lost:
- A channel in h@*__hangup exten only enters that state AFTER the caller has disconnected. The hangup handler exists to stamp CDR fields + fire webhooks; it must finish in milliseconds. Anything >5 min there is a zombie.
- A channel in Down state has no audio path. Asterisk's reaper normally cleans these in seconds. Anything >2 min there is a zombie.
The watchdog deliberately ignores long-running Up/Ring/Dial channels — those can legitimately last 30+ min on hospital queue lines, agent-on-call sessions, etc.
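Since the rules are meant to be re-derivable, here is a minimal sketch of the classification and the Error 60 signature check. It is not the code in zombieChannelWatchdog.js: the function names are hypothetical, the input is assumed to be already parsed into { name, context, exten, state, ageSeconds } objects, and the 50% survival threshold is an assumption standing in for "most/all".

```javascript
const MINUTE = 60;

// Classify one parsed channel; returns null for anything that could be a real call.
function classifyChannel({ context, exten, state, ageSeconds }) {
  if (context.endsWith('__hangup') && exten === 'h' && ageSeconds > 5 * MINUTE) {
    return 'hangup-handler-stuck';
  }
  if (state === 'Down' && ageSeconds > 2 * MINUTE) {
    return 'down-state-stuck';
  }
  return null; // long-running Up/Ring/Dial channels are deliberately ignored
}

// After attempting a hangup on every stuck channel, decide whether the
// survivors match the wedged-reload signature. The threshold is an assumption.
function looksLikeReloadWedge(stuckNamesBefore, stuckNamesAfter, threshold = 0.5) {
  if (stuckNamesBefore.length === 0) return false;
  const after = new Set(stuckNamesAfter);
  const survivors = stuckNamesBefore.filter((name) => after.has(name));
  return survivors.length / stuckNamesBefore.length > threshold;
}

console.log(classifyChannel({ context: 'org_x__hangup', exten: 'h', state: 'Up', ageSeconds: 400 }));
```

Both checks depend only on age, context, extension, and state, so they can be unit-tested without a running Asterisk.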
Error 61: Trunk max_channels stored in DB but never enforced — UI lies about concurrent-call cap¶
Dates: 2026-05-20
Severity: P2. Trunk "Channels" field on the editor is a knob with no effect; org's limits.concurrent_calls is the only real cap
Fix: PR #260 (dialplan enforcement) + PR #262 (effective-cap UI) + PR #263 (CDR cap_rejected userfield + call-logs badge)
Symptom¶
In the editor Trunks page, every trunk shows a Channels field (e.g. Tata SIP Trunk = 50). Operator sets it lower (e.g. 10) and saves successfully. But the trunk continues to accept 11+ concurrent outbound calls — the cap appears to do nothing.
Simultaneously, the admin Organization page shows a different "Concurrent Calls" number (e.g. 10) under organizations.limits.concurrent_calls, with no indication of how it relates to the trunk's "Channels" field.
Root cause¶
Two issues stacked:
- POST /api/v1/trunks allowedFields did not include max_channels. The editor sent the value; the API silently dropped it; the DB stored the model default (50). This is the silent-drop pattern (see also Error 50, Error 51, Error 59): API returns 200, editor shows success toast, but the field never persists.
- The dialplan never read sip_trunks.max_channels at all. dialplanGenerator.generateOutboundContext() only counted concurrent calls against the org-level limits.concurrent_calls via a single GROUP(orgCap) / GROUP_COUNT(orgCap@app) pair. There was no per-trunk counter, so the trunk "Channels" knob had nothing reading it on the call path.
So the editor showed two caps (org + trunk) but only one (org) was ever enforced.
Diagnostic commands¶
# Confirm DB value matches what UI showed (per-org)
ssh root@89.116.31.109 'mysql -u root pbx_api_db -e "
SELECT name, max_channels FROM sip_trunks WHERE organization_id = '\''<org-uuid>'\'';"'
# Confirm dialplan does NOT have a per-trunk GROUP() block (broken state)
ssh root@89.116.31.109 'grep -A 2 "GROUP(.*trunkCap" /etc/asterisk/ext_<org>.conf'
# Empty output = bug present
Fix (PR #260)¶
generateOutboundContext() now emits a per-trunk concurrency check before the actual Dial:
exten => _X.,n,Set(GROUP(trunkCap_<trunkId>)=trunkCap_<trunkId>)
exten => _X.,n,GotoIf($[${GROUP_COUNT(trunkCap_<trunkId>@trunkCap)} > ${TRUNK_MAX}]?trunk_limit_reached,1)
Both org and trunk caps fire before the Dial. The effective cap a caller actually experiences is min(org_cap, trunk_cap).
The earlier allowedFields silent-drop was fixed in the same PR by adding max_channels to the POST handler's allowed-fields array.
How operators tell which cap blocked a call¶
PR #263 stamps the CDR userfield column at the rejection point:
trunk_limit_reached,1,Set(CDR(userfield)=trunk_cap_rejected)
org_limit_reached,1,Set(CDR(userfield)=org_cap_rejected)
The /api/v1/calls SQL pulls userfield into cap_rejected ('org' or 'trunk') on each row. The call-logs page shows a destructive-variant badge "Org cap" or "Trunk cap" instead of the normal Completed/Missed status (PR #263, editor/app/dashboard/[orgId]/calls/page.tsx).
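The mapping from the stamped userfield to the badge can be summarized as below. In production this happens in SQL and in the React page; the JavaScript here is only an illustration, and the function names capRejectedFromUserfield and capBadgeLabel are hypothetical.

```javascript
// Translate the CDR userfield stamped at the rejection point into the
// cap_rejected value described above ('org', 'trunk', or null).
function capRejectedFromUserfield(userfield) {
  if (userfield === 'trunk_cap_rejected') return 'trunk';
  if (userfield === 'org_cap_rejected') return 'org';
  return null;
}

// Badge text for the call-logs page; null means "show the normal status".
function capBadgeLabel(capRejected) {
  if (capRejected === 'org') return 'Org cap';
  if (capRejected === 'trunk') return 'Trunk cap';
  return null;
}

console.log(capBadgeLabel(capRejectedFromUserfield('trunk_cap_rejected')));
```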
How operators see the effective cap (PR #262)¶
Trunks page shows the effective cap on the "Channels" column (e.g. 10 / 50 = trunk allows 50 but org caps at 10, so 10 wins). Edit and Create dialogs include a helper line "Effective cap: min(trunk, org)". Admin org page also shows the same effective number.
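The effective-cap rule is small enough to state as code. The sketch below assumes both caps are plain positive integers; the function names effectiveCap and channelsColumn are hypothetical and not taken from the editor source.

```javascript
// The cap a caller actually experiences: whichever limit is hit first.
function effectiveCap(orgCap, trunkCap) {
  return Math.min(orgCap, trunkCap);
}

// Text for the "Channels" column: effective cap, then the trunk's own setting.
function channelsColumn(orgCap, trunkCap) {
  return `${effectiveCap(orgCap, trunkCap)} / ${trunkCap}`;
}

// Org caps at 10, trunk allows 50: the org limit wins.
console.log(channelsColumn(10, 50));
```

With the values from the example above (org 10, trunk 50) this yields the 10 / 50 display described for the Trunks page.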
Prevention rule¶
When adding a UI knob whose effect requires dialplan enforcement:
- Read the dialplan generator. If there's no code that consumes the field at call time, the knob is decorative.
- Test by exceeding the cap on real traffic (place N+1 concurrent calls), not by checking that the field persisted.
- Stamp CDR userfield at rejection points so operators can tell which cap blocked the call. Stats endpoints + call-logs UI should surface this distinction.
- See features/concurrent-call-cap.md for the full architecture.
Error 62: Call-logs shows "Internal" for softphone-originated PSTN outbound calls¶
Dates: 2026-05-20
Severity: P2, ~19% of outbound calls miscategorised in call-logs + dashboard stats
Fix: PR #265 (staging) → PR #266 (production)
Symptom¶
Agent picks up a softphone, dials a PSTN number (e.g. customer mobile), call connects and completes normally. The corresponding row in /dashboard/<orgId>/calls shows Direction = Internal instead of Outbound.
The dashboard's per-day call-volume chart undercounts outbound by the same magnitude.
Root cause¶
Asterisk's CDR records dcontext as the originating context, not whichever included context contained the matched extension pattern.
Softphone PJSIP endpoints have context=<org>__internal. The outbound dial patterns (_X., etc.) live in <org>__outbound and are pulled into __internal via include => <org>__outbound. When the caller dials a PSTN number, Asterisk:
- Matches the pattern in __outbound (via the include)
- Executes the Dial against the trunk
- Writes CDR with dcontext='<org>__internal' (the originating context)
The previous direction CASE in api/src/server.js was:
WHEN t.dcontext = 'ai-outbound' THEN 'outbound'
WHEN t.dcontext LIKE '%incoming%' THEN 'inbound'
WHEN t.dcontext LIKE '%outbound%' THEN 'outbound'
ELSE 'internal'
So <org>__internal fell to the ELSE branch and got labeled "internal" — even when the call went out to PSTN via a trunk.
Fix (PR #265/#266)¶
Added a fourth CASE branch that catches the softphone→trunk pattern:
WHEN t.dcontext LIKE '%internal'
AND t.lastapp = 'Dial'
AND t.lastdata LIKE '%@%trunk%' THEN 'outbound'
ELSE 'internal'
All three signals are required:
- dcontext LIKE '%internal': call originated in a softphone's home context
- lastapp = 'Dial': last app was a Dial (rules out PlayBack, queue announce, etc.)
- lastdata LIKE '%@%trunk%': Dial argument referenced a trunk endpoint (rules out internal extension-to-extension Dial)
Applied in 3 places in api/src/server.js:
- Row-level CASE in main GET /api/v1/calls query
- Row-level CASE in secondary single-line query
- Stats SUMs (SELECT … FROM asterisk_cdr weekly + totals) so dashboard counters agree with row labels
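For reasoning about edge cases outside the database, the full CASE (three original branches plus the new one) can be mirrored in a small function. This is an illustrative sketch, not code from api/src/server.js: the `CdrRow` type, the `classifyDirection` name, and the endpoint names in the usage note are hypothetical. String comparisons here are case-sensitive, whereas SQL LIKE case sensitivity depends on the database collation.

```typescript
// Hypothetical mirror of the SQL CASE; field names match the CDR columns used above.
interface CdrRow {
  dcontext: string;
  lastapp: string;
  lastdata: string;
}

type Direction = "inbound" | "outbound" | "internal";

// Equivalent of lastdata LIKE '%@%trunk%': an '@' followed later by 'trunk'.
function referencesTrunk(lastdata: string): boolean {
  const at = lastdata.indexOf("@");
  return at !== -1 && lastdata.indexOf("trunk", at + 1) !== -1;
}

function classifyDirection(row: CdrRow): Direction {
  if (row.dcontext === "ai-outbound") return "outbound";
  if (row.dcontext.includes("incoming")) return "inbound"; // LIKE '%incoming%'
  if (row.dcontext.includes("outbound")) return "outbound"; // LIKE '%outbound%'
  // New branch: softphone context + Dial + trunk endpoint in the Dial argument.
  if (
    row.dcontext.endsWith("internal") && // LIKE '%internal'
    row.lastapp === "Dial" &&
    referencesTrunk(row.lastdata)
  ) {
    return "outbound";
  }
  return "internal";
}
```

Under these assumptions, a row with dcontext `acme__internal`, lastapp `Dial`, and lastdata `PJSIP/9876543210@acme_trunk,60` classifies as outbound, while the same context with lastdata `PJSIP/1001,30` stays internal because no trunk endpoint is referenced.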
Verification on real prod data before merge¶
Ran the old vs new CASE side-by-side over the last 7 days (2257 rows):
| Old direction | New direction | Count |
|---|---|---|
| inbound | inbound | 1202 |
| internal | internal | 974 |
| internal | outbound | 429 ← bug fix |
| outbound | outbound | 52 |
Zero false flips. Spot-checked 5 of 429 — all genuinely outbound (org's outbound DID as src, 10-digit customer number as dst, channel from softphone endpoint, dstchannel to trunk endpoint).
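A side-by-side comparison like this can also be sketched in application code: classify every row with both versions, count the (old, new) pairs, and flag any transition that is not on an allow-list. The helper names below are hypothetical, and the original check may well have been run directly in SQL.

```typescript
// Hypothetical helpers for comparing two classifiers over the same rows.
type Classifier<R> = (row: R) => string;

// Counts rows per "old -> new" label pair.
function tallyTransitions<R>(
  rows: R[],
  oldFn: Classifier<R>,
  newFn: Classifier<R>
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = `${oldFn(row)} -> ${newFn(row)}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Returns transitions where the label changed but the change was not expected.
function unexpectedFlips(counts: Map<string, number>, allowed: string[]): string[] {
  const flips: string[] = [];
  counts.forEach((_count, key) => {
    const [from, to] = key.split(" -> ");
    if (from !== to && allowed.indexOf(key) === -1) flips.push(key);
  });
  return flips;
}
```

For this fix the only allowed transition would be `internal -> outbound`; an empty result from `unexpectedFlips` corresponds to "zero false flips", and the rows behind the allowed transition are the ones to spot-check by hand.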
Prevention rule¶
dcontext alone is unreliable for direction inference. When context A pulls in context B via include =>, Asterisk records A as the dcontext even when the matched extension lives in B. Future direction classifiers must look at lastapp and lastdata together, not just dcontext. For new outbound patterns, prefer either:
- A dedicated outbound context that's the originating context for the channel (set via Goto before the Dial), so dcontext alone is sufficient, OR
- Multi-signal CASE branches like the one above, with verification on real CDR data before merging.