Queue Routing Architecture
Canonical reference for how inbound queue calls flow through the cloud Asterisk on Astradial, what the dialplan generator emits per member, and the non-obvious behaviours that have bitten us. Read this before touching dialplanGenerator.generateQueueMemberContext, queueService.generateSingleQueueConfig, or the auto-ticket classifier.
Call path for a queue call
PSTN → Tata NNI → NUC Asterisk → (WireGuard) → Cloud Asterisk:
1. tata-did-route (match DID → org context)
2. org_<ctx>__incoming (Answer, MOH, CDR setup, MixMonitor)
3. org_<ctx>__queue (queue extension wrapper)
4. Queue(org_<ctx>_<num>, ct, , , <max_wait>)
5. queues.conf member → Local/qm<id>@org_<ctx>__qmem/n
6. org_<ctx>__qmem qm<id> helper extension:
- Set per-leg CDR + recording filename
- Set CALLERID(num) to org's outbound DID ← critical, see below
- Dial(PJSIP/<phone>@<trunk>) or Dial(PJSIP/<endpoint>) or Stasis()
- Hangup()
7. Member's phone rings; on answer the call bridges via the Local channel pair
The per-member helper context org_<ctx>__qmem is what enables per-member ring_timeout_seconds, per-leg recording, and the outbound CallerID rewrite. Without the helper, queues.conf can only set one global ring time for all members of a queue.
Member dial types
Queue members emit one of three dial paths depending on users.ring_target and users.routing_type:
| Type | Condition | qm helper emits |
|---|---|---|
| softphone | ring_target='ext' and asterisk_endpoint present | Dial(PJSIP/<endpoint>, <ringTime>, tT) |
| phone-target | ring_target='phone' and phone_number present | Set(CALLERID(num)=<org-DID>) then Dial(PJSIP/<10-digit>@<trunk>, <ringTime>, tT) |
| ai_agent | routing_type='ai_agent' | Stasis(pbx_api, ai_agent, <routing_destination>) |
The state_interface written to queues.conf for each member follows the same partition:
- softphone → PJSIP/<endpoint> (real device state, queue can reliably track busy)
- phone-target / ai_agent → Custom:qm<id> (placeholder; nothing publishes its state)
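The partition above can be sketched as a single decision function. This is a minimal illustration and not the generator's actual code: the function name `memberDialPlan`, the `opts` shape, and the precedence (ai_agent checked first) are assumptions, and returning `null` for a member with no usable target is a placeholder for whatever the real generator does in that case.

```javascript
// Hypothetical helper; the real generator's names and data shapes may differ.
// Returns the helper steps and the state_interface for one queue member.
function memberDialPlan(user, opts) {
  const { ringTime, trunk, orgDid } = opts;
  const custom = 'Custom:qm' + user.id; // placeholder devstate name

  // Assumption: routing_type='ai_agent' takes precedence over ring_target.
  if (user.routing_type === 'ai_agent') {
    return {
      type: 'ai_agent',
      steps: ['Stasis(pbx_api,ai_agent,' + user.routing_destination + ')'],
      stateInterface: custom,
    };
  }
  if (user.ring_target === 'phone' && user.phone_number) {
    return {
      type: 'phone-target',
      steps: [
        'Set(CALLERID(num)=' + orgDid + ')',
        'Dial(PJSIP/' + user.phone_number + '@' + trunk + ',' + ringTime + ',tT)',
      ],
      stateInterface: custom,
    };
  }
  if (user.ring_target === 'ext' && user.asterisk_endpoint) {
    return {
      type: 'softphone',
      steps: ['Dial(PJSIP/' + user.asterisk_endpoint + ',' + ringTime + ',tT)'],
      stateInterface: 'PJSIP/' + user.asterisk_endpoint, // real device state
    };
  }
  return null; // no usable target; handling is outside this sketch
}
```

Keeping the dial steps and the state_interface in one return value makes it harder for the two to drift apart when a new member type is added.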
Critical gotchas
1. Custom: devstates default to UNKNOWN, not NOT_INUSE
Asterisk reports any Custom:<name> device that has never had DEVICE_STATE() published as UNKNOWN. app_queue with ringinuse=no skips UNKNOWN members. The generator MUST emit ringinuse=yes whenever any active member uses Custom: state_interface — otherwise the queue silently never rings phone-target / ai_agent / no-endpoint members.
configDeploymentService.seedQueueMemberDevstates() publishes NOT_INUSE for every Custom:qm<id> device as a safety net in case ringinuse=no is left in place.
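The ringinuse rule can be expressed as one predicate over the member list. This is a hedged sketch: `mustForceRingInUse` and the `active` / `stateInterface` field names are illustrative and not taken from the codebase.

```javascript
// Hypothetical predicate: true when the queue section must carry ringinuse=yes
// because at least one active member relies on a Custom: placeholder devstate.
function mustForceRingInUse(members) {
  return members.some(
    (m) => m.active && String(m.stateInterface).startsWith('Custom:')
  );
}
```

For example, a queue with one softphone member and one active phone-target member yields `true`, while a queue whose only Custom: member is inactive yields `false`.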
2. queue.timeout is the ROUND budget, not per-member
timeout= in queues.conf governs the maximum time the queue will spend trying members in one round before falling back to retry. It is NOT a per-member cap. If member 1 has ring_time=60 and the queue.timeout=60, member 2 never gets a turn — the round budget is consumed by member 1 alone.
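The generator computes the round budget from SUM(member ring_timeout_seconds) + a 10 s buffer and treats that value as a floor: the effective budget is max(queue.timeout, SUM(member ring_timeout_seconds) + 10). The operator does NOT see this field in the editor; in practice it is derived from per-member ring times.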
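As a worked example of the arithmetic, the sketch below computes the round budget for a member list. The function name and the `storedTimeout` parameter are illustrative; the max() against the stored queue.timeout follows the description in the Operator-facing knobs table.

```javascript
const ROUND_BUFFER_SECONDS = 10;

// Hypothetical helper: sum of per-member ring windows plus the buffer,
// never lower than whatever value is still stored on the queue row.
function roundBudgetSeconds(members, storedTimeout = 0) {
  const sum = members.reduce((total, m) => total + m.ring_timeout_seconds, 0);
  return Math.max(storedTimeout, sum + ROUND_BUFFER_SECONDS);
}
```

With two members at 60 s and 20 s the budget is 60 + 20 + 10 = 90 s, so member 2 still gets its turn even if member 1 rings for the full 60 s.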
3. GotoIf(${X}=Y?...) is always TRUE: wrap in $[...]
Asterisk does NOT do string comparison on bare GotoIf conditions. After variable substitution it evaluates the string for truthiness, and any non-empty string is true. Every conditional emit MUST wrap the comparison in $[...]:
WRONG: GotoIf(${QUEUESTATUS}=TIMEOUT?timeout) ← always jumps to timeout
RIGHT: GotoIf($[${QUEUESTATUS}=TIMEOUT]?timeout) ← actual string equality
Same rule applies to DEVSTATE / DIALSTATUS / any string check. Tests in api/tests/sql-invariants.test.js enforce this generator-wide.
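One way to make the rule hard to violate is to build every conditional through a single helper and lint emitted lines for the bare form. Both functions below are hypothetical and are not the checks in api/tests/sql-invariants.test.js; the regex is a coarse heuristic that flags any GotoIf( not immediately followed by $[.

```javascript
// Build a GotoIf whose comparison is wrapped in $[...].
// Single-quoted strings keep "${" literal, so no JS interpolation happens here.
function gotoIfEquals(variable, value, label) {
  return 'GotoIf($[${' + variable + '}=' + value + ']?' + label + ')';
}

// Coarse lint: true when a line contains GotoIf( without the $[ wrapper.
function hasBareGotoIf(line) {
  return /GotoIf\((?!\$\[)/.test(line);
}
```

`gotoIfEquals('QUEUESTATUS', 'TIMEOUT', 'timeout')` produces the RIGHT form shown above, and `hasBareGotoIf` returns true for the WRONG form.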
4. Outbound CallerID on phone-target queue members
When the qm helper dials the trunk for a phone-target member, the From header inherits the parent channel's CallerID — which is the external INBOUND caller's number (e.g., a customer dialing in). Tata's SBC will NOT accept a From that isn't a registered DID for that trunk: it substitutes with the trunk's GLOBAL default DID.
On prod this default is +918065978001, an unassigned staging-routed DID, which made it look like a cross-org leak (the dialled member's phone showed 918065978001 ringing them for an Om Chambers call). Fixed: qm helper now sets CALLERID(num) to the org's outbound DID before the trunk Dial:
Set(CALLERID(num)=<user.outbound_did or org.default_did or org.first_did>)
Dial(PJSIP/<phone>@<trunk>, <ringTime>, tT)
The selection priority matches generateUserExtension. org.dids is scoped to the org's assignments, so cross-org pickup is impossible by construction.
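The selection priority can be sketched as a simple fallback chain. This is an illustration only: the function name is hypothetical, and it assumes org.dids is an array of DID strings already scoped to the org, with the first element standing in for org.first_did.

```javascript
// Hypothetical helper mirroring the priority described above:
// user.outbound_did, then org.default_did, then the org's first assigned DID.
function pickOutboundDid(user, org) {
  const firstDid =
    Array.isArray(org.dids) && org.dids.length > 0 ? org.dids[0] : null;
  return user.outbound_did || org.default_did || firstDid || null;
}
```

Because every candidate comes from the user row or the org's own DID list, the function has no path that could return another org's number. A `null` result means the org has no DID at all, and the caller would need to decide how to handle that.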
5. Native mobile "decline" is NOT a SIP REJECT
When the dialled number is a mobile via Tata trunk, native phone dialer behaviour for "decline" depends on the carrier:
- Softphone (Zoiper, Linphone, Astradial app) → sends SIP 486 Busy Here → Asterisk Dial returns BUSY → queue advances in <1s
- Native Indian mobile dialer (most carriers) → carrier intercepts; either rings out the full window before reporting NOANSWER, OR routes to operator voicemail (which Asterisk sees as ANSWERED and the queue bridges to)
No dialplan change can override this — the SBC is delivering what the carrier says. For fast decline-handling, the member needs a SIP softphone or a call-screening layer (e.g., "press 1 to accept"). Otherwise expect up to ring_timeout_seconds of wait before queue advance on a mobile decline.
6. penalty is a tier GATE for every strategy except linear, not a sort key
In Asterisk's app_queue the penalty field on a member => line has strategy-dependent semantics:
- linear: file order + penalty is a sort tiebreak. Lower penalty rings first; distinct penalties order the members.
- ringall / leastrecent / fewestcalls / random / rrmemory: penalty is a tier gate. Only members at the lowest currently-active penalty tier are eligible per round; higher-penalty members join later rounds after timeout (the round budget) elapses with no answer. Distinct penalties under any of these strategies reduce the queue to serial-by-priority instead of the operator-expected parallel ring.
The editor's "priority up/down arrows" mutate each queue_members.penalty field. Pre-fix, the generator wrote those penalty values verbatim into queues.conf regardless of strategy — so an operator reordering members on a ringall queue (say Landline, Raghavi, Pavithra) ended up with penalties 0, 1, 2 and Asterisk rang them one at a time across rounds. Reproduced 2026-05-18 on Thangavelu Hospital queue 5001 (ringall with 5 members at penalties 0–5) and Om Chambers queue 5003 (ringall with a single member at penalty 1).
Fix (PR #235 / 2026-05-18): queueService.generateQueueMemberString accepts the parent queue and flattens emitted penalty to 0 for every non-linear strategy, so all members sit at the same tier and ring in parallel. File order remains penalty-ascending (purely cosmetic, kept for UI/file-readability consistency). For linear, stored penalty values are preserved exactly. The editor's UI now hides the priority arrows whenever queue.strategy !== 'linear' to prevent operators from creating bad penalty data in the first place. Tests parameterise across all 5 non-linear strategies in api/tests/queue-service.test.js (Q15b/Q15c).
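The flattening rule described in the fix can be sketched as follows. The function names and the member shape are illustrative and are not the signature of queueService.generateQueueMemberString.

```javascript
// Hypothetical helper: the penalty actually written to queues.conf.
// Only linear keeps the stored value; every other strategy is flattened to 0.
function emittedPenalty(queue, member) {
  return queue.strategy === 'linear' ? (member.penalty ?? 0) : 0;
}

// File order stays penalty-ascending for readability, whatever gets emitted.
function orderedMembers(members) {
  return [...members].sort((a, b) => (a.penalty ?? 0) - (b.penalty ?? 0));
}
```

For a ringall queue with stored penalties 0, 1, 2, every member is emitted with penalty 0 and all three ring in parallel. The same rows under linear keep 0, 1, 2.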
Operational follow-up after the fix shipped: node /opt/astrapbx/scripts/regen-all-org-configs.js was run on staging and prod to flush every active org's queues_*.conf so the on-disk files actually pick up penalty=0 (the script writes the files + does ONE AMI reload at the end — see Regen org configs workflow below). New emission-shape changes that affect generated configs should follow the same pattern.
Regen org configs workflow
PR #237 / 2026-05-18 added .github/workflows/regen-org-configs.yml — a manual workflow_dispatch button (Actions → "Regen org configs" → Run workflow → pick staging or production) that:
- Lands on the matching [self-hosted, <env>] runner (same labels as Deploy API workflows).
- Pre-flight grep aborts if the deployed queueService.js doesn't carry the expected fix signature.
- Backs up every /etc/asterisk/queues_*.conf to /root/queues-bak-<ts>/.
- Runs node /opt/astrapbx/scripts/regen-all-org-configs.js.
- Audits non-zero penalty members before and after (post-state should only show linear queue members with non-zero penalties).
- Prints the rollback path with the backup directory baked in.
Use this instead of manual SSH whenever an emission-shape change has landed and existing queues_*.conf need to be rewritten — it's audited in the Actions log, runs on the same trusted runner as the deploys, and produces the same outcome as the manual flow.
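The penalty audit step could look roughly like the sketch below, which parses a queues_*.conf body and lists every member with a non-zero penalty alongside its queue's strategy. The source does not describe how the workflow implements its audit, so the function name and output shape are assumptions. The parser relies only on the standard `member => interface,penalty,...` field order.

```javascript
// Hypothetical audit: returns [{ queue, strategy, iface, penalty }] for every
// member whose penalty field is non-zero.
function auditPenalties(confText) {
  const queues = new Map(); // section name -> { strategy, members }
  let current = null;
  for (const raw of confText.split('\n')) {
    const line = raw.split(';')[0].trim(); // drop comments and blank lines
    if (!line) continue;
    const section = line.match(/^\[([^\]]+)\]/);
    if (section) {
      current = { strategy: null, members: [] };
      queues.set(section[1], current);
      continue;
    }
    if (!current) continue;
    const strategy = line.match(/^strategy\s*=\s*(\S+)/);
    if (strategy) {
      current.strategy = strategy[1];
      continue;
    }
    const member = line.match(/^member\s*=>\s*(.+)$/);
    if (member) {
      const fields = member[1].split(',').map((f) => f.trim());
      current.members.push({ iface: fields[0], penalty: Number(fields[1] || 0) });
    }
  }
  const findings = [];
  for (const [queue, q] of queues) {
    for (const m of q.members) {
      if (m.penalty !== 0) {
        findings.push({ queue, strategy: q.strategy, iface: m.iface, penalty: m.penalty });
      }
    }
  }
  return findings;
}
```

After a successful regen, every finding should have `strategy === 'linear'`. Any other strategy in the output points at a file that was not rewritten.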
Post-Queue() routing
After Queue(...) returns, the dialplan branches on QUEUESTATUS:
GotoIf($[${QUEUESTATUS}=TIMEOUT]?timeout) → caller waited > max_wait, route to timeout dest
GotoIf($[${QUEUESTATUS}=ANSWERED]?normal_end) → call was answered, hang up cleanly
GotoIf($[${QUEUESTATUS}=CONTINUE]?normal_end) → caller pressed digit to exit, hang up
Goto(unavail) → empty / full / unavailable queue states
n(normal_end),Hangup() → clean end
n(timeout),Goto(<configured timeout destination>)
n(unavail),Playback(all-agents-busy) → Hangup → "all agents are busy" announcement
If the caller hangs up while in queue, the channel is destroyed and Asterisk does NOT execute this post-Queue() block — the h-extension hangup handler runs instead. The labels above are only reached when the caller is still alive.
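The branch order above amounts to a small lookup with a catch-all. The sketch below mirrors the labels as listed in this section; the function name is illustrative and the mapping is not taken from the generator source.

```javascript
// Hypothetical mirror of the post-Queue() branches: which label the dialplan
// jumps to for a given QUEUESTATUS value, per the listing above.
function postQueueLabel(queueStatus) {
  switch (queueStatus) {
    case 'TIMEOUT':
      return 'timeout'; // Max Wait expired, route to the timeout destination
    case 'ANSWERED':
    case 'CONTINUE':
      return 'normal_end'; // clean hangup
    default:
      return 'unavail'; // empty / full / unavailable queue states
  }
}
```

The default branch matters: any status the dialplan does not name explicitly lands on the "all agents are busy" announcement instead of falling through silently.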
Auto-ticket classifier interaction
pollCdr (api/src/server.js, runs every 30s) reads new CDR rows, dedups by linkedid, and forwards one representative row per call to classifyAndUpsertTicket.
Four behaviours worth knowing:
- 30-second grace window: rows whose calldate + duration is more recent than NOW() - 30s are skipped. This lets all retry CDRs for a single queue session settle in the DB before classification, so a retry-then-answered call never creates a false "Queue Timeout" ticket.
- ANSWERED-preferring dedup: when multiple CDR rows share a linkedid (queue retried, multiple member attempts), the dedup picks the row where disposition='ANSWERED' AND billsec > 0 first, then longest duration as tiebreak.
- Queue bridge recognition: a Local/qm<hex>@… dstchannel with lastapp='Queue' and billsec > 0 is a real bridge, so the classifier returns queue_answered and creates no ticket. Without this, every queue-answered call would create a missed-call ticket because the dstchannel isn't a direct PJSIP endpoint.
- Cross-batch auto-close: if a NO_ANSWER row from round 1 creates a ticket before the round 2 ANSWERED row arrives, the classifier closes the open ticket when it later sees the answered row (matched by org + caller_number + last_call_id within 10 min).
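The ANSWERED-preferring dedup can be sketched as a single pass over a batch of CDR rows. This is an illustration under assumed row fields (`linkedid`, `disposition`, `billsec`, `duration`); the function names are hypothetical and the real pollCdr code may structure this differently.

```javascript
// A row counts as a real answer only with ANSWERED disposition and billsec > 0.
function isAnswered(row) {
  return row.disposition === 'ANSWERED' && row.billsec > 0;
}

// True when candidate should replace the current best row for a linkedid:
// answered rows win; otherwise the longer duration wins.
function isBetter(candidate, best) {
  if (isAnswered(candidate) !== isAnswered(best)) return isAnswered(candidate);
  return candidate.duration > best.duration;
}

// Hypothetical dedup: one representative row per linkedid.
function pickRepresentatives(rows) {
  const byLinkedId = new Map();
  for (const row of rows) {
    const best = byLinkedId.get(row.linkedid);
    if (!best || isBetter(row, best)) byLinkedId.set(row.linkedid, row);
  }
  return [...byLinkedId.values()];
}
```

With this ordering, a call that rang out for 60 s on one member and was then answered for 12 s on another is represented by the answered row, even though the unanswered row has the longer duration.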
Files in scope
| File | Purpose |
|---|---|
| api/src/services/asterisk/dialplanGenerator.js | Emits the qm helper context, queue extension, user extension, IVR, hangup handler |
| api/src/services/asterisk/queueService.js | Emits queues.conf (one section per queue + member => lines) |
| api/src/services/asterisk/configDeploymentService.js | deployOrganizationConfiguration, reloadAsteriskConfiguration, seedQueueMemberDevstates |
| api/src/services/ticketClassifier.js | Per-CDR-row decision: skip, ticket, or auto-close |
| api/src/server.js | pollCdr, /api/v1/calls, /api/v1/calls/live, /api/v1/calls/:linkedId/journey, /api/v1/calls/:callId/recording |
| api/tests/*.test.js | 125+ unit tests via node:test; run with npm test |
Operator-facing knobs
| Editor field | Maps to | What it actually does |
|---|---|---|
| Queue → Strategy | queues.conf strategy | linear / ringall / leastrecent / fewestcalls / random / rrmemory |
| Queue → Max Wait (sec) | Queue() 5th arg | total time caller sits in queue across all rounds → QUEUESTATUS=TIMEOUT |
| Queue → "On timeout, route caller to" (picker) | timeout_destination_type + timeout_destination → post-Queue Goto | Where caller routes when Max Wait expires. Smart picker introduced in PR #251/#252: kind buttons [ No routing | User | Queue | Phone ] + a contextual SearchableSelect per kind. The picker writes both type and destination atomically so they can never disagree — pre-fix, the two-field combo let operators save type=phone, destination=5004 (the supervisors queue extension), which made the generator dial 5004 out the trunk and Tata rejected it. See Error 59. |
| Queue → Member → Ring Timeout (sec) | qm helper Dial(..., N, ...) | this member's individual ring window per attempt |
| Queue → Member → Priority | queues.conf member => …,<penalty>,… (only for linear strategy — see Gotcha #6) | sort/tiebreak order for linear only. Hidden in the editor UI for every other strategy because app_queue interprets penalty as a tier gate, not a sort key. |
| ~~Queue → Timeout (s)~~ — REMOVED from editor 2026-05-20 | queues.conf timeout= round budget | Was a list column + form field. Backend computes the effective round budget as max(queue.timeout, SUM(member ring_timeout_seconds) + 10) regardless, so any operator-typed value below that floor was a silent no-op (see Gotcha #2). DB column kept for back-compat; editor stopped exposing it in PR #251. |
| User → Active toggle | users.status | inactive users are skipped from queues.conf — they don't ring |
| User → Outbound CallerID | users.outbound_did | overrides the org default for trunk-dialled calls (including queue legs) |
Reload semantics
api/src/services/asterisk/configDeploymentService.js → reloadAsteriskConfiguration() is fired by ~18 server.js call sites (DID approve, queue save, user update, IVR save, …) every time configs are regenerated. The reload path was rewritten in PR #255 after two prod incidents (2026-05-19, 2026-05-20) in which concurrent reload calls deadlocked res_pjsip and SIGKILL + restart was the only recovery (see Error 60).
Two invariants the rewrite enforces
1. Serialization in JS. reloadAsteriskConfiguration() is a thin wrapper that chains on a per-instance promise (this._reloadLock):
const previous = this._reloadLock.catch(() => {});
this._reloadLock = previous.then(() => this._doReload());
return this._reloadLock;
Concurrent callers queue behind the in-flight reload in JS. Exactly one asterisk -rx shell call is in flight at a time. The previous-rejection catch keeps a failed reload from poisoning subsequent ones.
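The promise-chain lock can be exercised in isolation. The sketch below wraps the same three lines in a standalone class; `ReloadSerializer`, `demo`, and the fake reload function are illustrative names, not code from configDeploymentService.js.

```javascript
// Minimal standalone version of the promise-chain lock shown above.
class ReloadSerializer {
  constructor(doReload) {
    this._doReload = doReload;
    this._reloadLock = Promise.resolve();
  }

  reload() {
    // Swallow a previous failure so it cannot poison the next reload.
    const previous = this._reloadLock.catch(() => {});
    this._reloadLock = previous.then(() => this._doReload());
    return this._reloadLock;
  }
}

// Fires two overlapping reloads; the first one fails on purpose.
async function demo() {
  const events = [];
  let calls = 0;
  const serializer = new ReloadSerializer(async () => {
    const n = ++calls;
    events.push('start ' + n);
    await new Promise((resolve) => setTimeout(resolve, 10));
    events.push('end ' + n);
    if (n === 1) throw new Error('first reload failed');
  });
  const results = await Promise.allSettled([serializer.reload(), serializer.reload()]);
  return { events, statuses: results.map((r) => r.status) };
}
```

Running `demo()` shows the two properties the text describes: the second reload does not start until the first has ended, and the first reload's rejection reaches only its own caller while the second still runs and resolves.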
2. Targeted reloads, not core reload. core reload reloaded every module that supports reload — including res_pjsip even when only an ext_*.conf file changed. The wider scope was directly correlated with the deadlock window. The new sequence only touches modules whose files the service actually rewrites:
asterisk -rx "dialplan reload" → ext_*.conf
asterisk -rx "module reload res_pjsip.so" → pjsip_*.conf
asterisk -rx "module reload app_queue.so" → queues_*.conf
The sequence is followed by a 750 ms pause before seedQueueMemberDevstates() runs, so that the per-member devstate change CLI commands don't race the tail end of the reload sequence on Asterisk's CLI mutex.
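The file-to-command mapping can be kept as an explicit table, so that an unknown file type fails loudly instead of tempting a fallback to core reload. This is a sketch: `RELOAD_RULES` and `reloadCommandFor` are hypothetical names, and the three commands are the ones listed above.

```javascript
// File-name pattern -> targeted Asterisk CLI reload command.
const RELOAD_RULES = [
  { pattern: /^ext_.*\.conf$/, command: 'dialplan reload' },
  { pattern: /^pjsip_.*\.conf$/, command: 'module reload res_pjsip.so' },
  { pattern: /^queues_.*\.conf$/, command: 'module reload app_queue.so' },
];

// Hypothetical lookup: returns the targeted command for a generated file,
// and throws for file types that have no rule yet.
function reloadCommandFor(fileName) {
  const rule = RELOAD_RULES.find((r) => r.pattern.test(fileName));
  if (!rule) {
    throw new Error(
      'No targeted reload for ' + fileName + '; add a rule instead of using core reload'
    );
  }
  return rule.command;
}
```

Adding a new generated file type then means adding one row to the table, which keeps the reload scope explicit and reviewable.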
Rules to keep this fixed
- Never call core reload from the API. Adding a new file type means adding its specific targeted reload command, not falling back to core reload.
- Don't issue raw asterisk -rx "module reload …" calls anywhere else. Route through reloadAsteriskConfiguration() so the serialization applies.
- Smoke test pattern (also in test-cases.md → Queues): fire two concurrent POST /api/v1/admin/regenerate-gateway requests ~300 ms apart, expect both to return 200 with didCount/orgCount, expect two distinct 🔄 Reloading… / ✅ Asterisk configuration reloaded (dialplan + res_pjsip + app_queue + devstate seed) pairs in pm2 logs astrapbx, expect zero "previous reload command didn't finish yet" in /var/log/asterisk/full.log.
Backup locations on prod
When configs are regenerated by the API or by regen-all-org-configs.js, the previous files are NOT auto-backed-up. Manual backups taken during incident response are at:
- /root/queues-bak-<timestamp>/: queues_*.conf snapshots
- /root/dialplan-bak-<timestamp>/: ext_*.conf snapshots
- /root/prod-bak-pre-*-<timestamp>/: pre-regen full snapshots
Take a fresh backup before any manual edit on /etc/asterisk/.
Related
- Dashboard — Call Pickup Time: operator-facing metric for queue performance. Computed from the answered queue child leg's (duration - billsec); accurate for ringall queues (current hospital default), under-counts for sequential. Includes the three-tier health badge thresholds.
- Concurrent Call Cap Architecture: org + per-trunk caps that gate outbound calls. The trunk cap interacts with queue ring-out via the qm-helper.js paths.