Queue Routing Architecture

Canonical reference for how inbound queue calls flow through the cloud Asterisk on Astradial, what the dialplan generator emits per member, and the non-obvious behaviours that have bitten us. Read this before touching dialplanGenerator.generateQueueMemberContext, queueService.generateSingleQueueConfig, or the auto-ticket classifier.

Call path for a queue call

PSTN → Tata NNI → NUC Asterisk → (WireGuard) → Cloud Asterisk:
  1. tata-did-route                 (match DID → org context)
  2. org_<ctx>__incoming             (Answer, MOH, CDR setup, MixMonitor)
  3. org_<ctx>__queue                (queue extension wrapper)
  4. Queue(org_<ctx>_<num>, ct, , , <max_wait>)
  5. queues.conf member  →  Local/qm<id>@org_<ctx>__qmem/n
  6. org_<ctx>__qmem qm<id> helper extension:
       - Set per-leg CDR + recording filename
       - Set CALLERID(num) to org's outbound DID  ← critical, see below
       - Dial(PJSIP/<phone>@<trunk>) or Dial(PJSIP/<endpoint>) or Stasis()
       - Hangup()
  7. Member's phone rings; on answer the call bridges via the Local channel pair

The per-member helper context org_<ctx>__qmem is what enables per-member ring_timeout_seconds, per-leg recording, and the outbound CallerID rewrite. Without the helper, queues.conf can only set one global ring time for all members of a queue.

Member dial types

For each queue member the generator emits one of three dial paths, depending on users.ring_target and users.routing_type:

| Type | Condition | qm helper emits |
| --- | --- | --- |
| softphone | ring_target='ext' and asterisk_endpoint present | Dial(PJSIP/<endpoint>, <ringTime>, tT) |
| phone-target | ring_target='phone' and phone_number present | Set(CALLERID(num)=<org-DID>) then Dial(PJSIP/<10-digit>@<trunk>, <ringTime>, tT) |
| ai_agent | routing_type='ai_agent' | Stasis(pbx_api, ai_agent, <routing_destination>) |

The state_interface written to queues.conf for each member follows the same partition:

  • softphone → PJSIP/<endpoint> (real device state, queue can reliably track busy)
  • phone-target / ai_agent → Custom:qm<id> (placeholder; nothing publishes its state)
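The partition above can be sketched as one function. This is a minimal illustration, not the generator's code: memberEmit and its options object are hypothetical, the field names come from the table, and checking routing_type='ai_agent' first is an assumption the table does not state.

```javascript
// Hypothetical sketch: choose the qm helper lines and the queues.conf
// state_interface for one member. Not the real dialplanGenerator code.
function memberEmit(user, { trunk, orgDid, ringTime }) {
  const custom = 'Custom:qm' + user.id; // placeholder devstate, nothing publishes it
  if (user.routing_type === 'ai_agent') {
    return {
      type: 'ai_agent',
      lines: ['Stasis(pbx_api,ai_agent,' + user.routing_destination + ')'],
      stateInterface: custom,
    };
  }
  if (user.ring_target === 'phone' && user.phone_number) {
    return {
      type: 'phone-target',
      lines: [
        'Set(CALLERID(num)=' + orgDid + ')',
        'Dial(PJSIP/' + user.phone_number + '@' + trunk + ',' + ringTime + ',tT)',
      ],
      stateInterface: custom,
    };
  }
  if (user.ring_target === 'ext' && user.asterisk_endpoint) {
    return {
      type: 'softphone',
      lines: ['Dial(PJSIP/' + user.asterisk_endpoint + ',' + ringTime + ',tT)'],
      stateInterface: 'PJSIP/' + user.asterisk_endpoint, // real device state
    };
  }
  return null; // no dialable target; that case is outside this sketch
}
```

Keeping the dial lines and the state_interface in one return value makes it harder for the two to drift apart when a new member type is added.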

Critical gotchas

1. Custom: devstates default to UNKNOWN, not NOT_INUSE

Asterisk reports any Custom:<name> device that has never had DEVICE_STATE() published as UNKNOWN. app_queue with ringinuse=no skips UNKNOWN members. The generator MUST emit ringinuse=yes whenever any active member uses Custom: state_interface — otherwise the queue silently never rings phone-target / ai_agent / no-endpoint members.

configDeploymentService.seedQueueMemberDevstates() also publishes NOT_INUSE for every Custom:qm<id> device after each deploy/reload, so the devstate is correct even when ringinuse=no is left in place.
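A minimal sketch of both safeguards, assuming a member object with hypothetical active and stateInterface fields. The function names are illustrative, and using the "devstate change" CLI command for seeding is an assumption based on the reload notes further down, not a copy of seedQueueMemberDevstates().

```javascript
// Hypothetical sketch: force ringinuse=yes when any active member relies on
// a Custom: placeholder, otherwise keep the queue's configured value.
function ringInUseFor(members, configured) {
  const usesCustom = members.some(
    (m) => m.active && m.stateInterface.startsWith('Custom:')
  );
  return usesCustom ? 'yes' : configured;
}

// Build the CLI commands that would publish NOT_INUSE for each placeholder.
function devstateSeedCommands(members) {
  return members
    .filter((m) => m.active && m.stateInterface.startsWith('Custom:'))
    .map((m) => 'devstate change ' + m.stateInterface + ' NOT_INUSE');
}
```

The two checks are deliberately redundant: either one alone keeps Custom: members ringable, and together they cover a missed seed or a hand-edited ringinuse value.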

2. queue.timeout is the ROUND budget, not per-member

timeout= in queues.conf governs the maximum time the queue will spend trying members in one round before falling back to retry. It is NOT a per-member cap. If member 1 has ring_time=60 and the queue.timeout=60, member 2 never gets a turn — the round budget is consumed by member 1 alone.

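The generator computes the round budget as SUM(member ring_timeout_seconds) + 10 s buffer and treats it as a floor: the effective value is max(queue.timeout, that sum), as noted under Operator-facing knobs. The operator does NOT see this field in the editor, so for any stored queue.timeout at or below the floor the budget is derived purely from per-member ring times.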
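The arithmetic is small enough to show directly. This is a sketch, not the queueService code: roundBudget is a hypothetical name, and the max() against a stored queue.timeout follows the description in the Operator-facing knobs table.

```javascript
// Hypothetical sketch of the round budget derivation.
const ROUND_BUFFER_SECONDS = 10;

function roundBudget(members, storedTimeout = 0) {
  const sum = members.reduce((total, m) => total + m.ring_timeout_seconds, 0);
  // The derived value is a floor; a larger stored queue.timeout still wins.
  return Math.max(storedTimeout, sum + ROUND_BUFFER_SECONDS);
}
```

With member 1 at 60 s and member 2 at 30 s the budget is 60 + 30 + 10 = 100 s, so member 2 still gets its full window even though a stored queue.timeout of 60 would have been consumed by member 1 alone.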

3. GotoIf(${X}=Y?...) is always TRUE — wrap in $[...]

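Asterisk does NOT do string comparison on bare GotoIf conditions. After variable substitution it evaluates the resulting string for truthiness: an empty string is false, a string that parses as a number is true only when non-zero, and any other non-empty string (such as TIMEOUT=TIMEOUT or ANSWERED=TIMEOUT) is true. Every conditional emit MUST wrap the comparison in $[...]: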

WRONG: GotoIf(${QUEUESTATUS}=TIMEOUT?timeout)        ← always jumps to timeout
RIGHT: GotoIf($[${QUEUESTATUS}=TIMEOUT]?timeout)     ← actual string equality

Same rule applies to DEVSTATE / DIALSTATUS / any string check. Tests in api/tests/sql-invariants.test.js enforce this generator-wide.
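One way to keep this invariant is to route every comparison through a single emit helper and lint the generated text. The sketch below is illustrative: gotoIfEquals and findBareGotoIf are hypothetical names, and this is not the check in api/tests/sql-invariants.test.js.

```javascript
// Hypothetical emit helper: always wraps the comparison in $[...].
function gotoIfEquals(variable, value, label) {
  return 'GotoIf($[${' + variable + '}=' + value + ']?' + label + ')';
}

// Hypothetical lint: return every line with a GotoIf( whose condition
// does not open with $[ .
function findBareGotoIf(dialplanText) {
  return dialplanText
    .split('\n')
    .filter((line) => /GotoIf\((?!\$\[)/.test(line));
}
```

The lint is intentionally strict: it also flags a deliberate truthiness check such as GotoIf(${FLAG}?label), which would then need to be written as an explicit comparison.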

4. Outbound CallerID on phone-target queue members

When the qm helper dials the trunk for a phone-target member, the From header inherits the parent channel's CallerID — which is the external INBOUND caller's number (e.g., a customer dialing in). Tata's SBC will NOT accept a From that isn't a registered DID for that trunk: it substitutes with the trunk's GLOBAL default DID.

On prod this default is +918065978001, an unassigned staging-routed DID, which made it look like a cross-org leak (the dialled member's phone showed 918065978001 ringing them for an Om Chambers call). Fixed: qm helper now sets CALLERID(num) to the org's outbound DID before the trunk Dial:

Set(CALLERID(num)=<user.outbound_did or org.default_did or org.first_did>)
Dial(PJSIP/<phone>@<trunk>, <ringTime>, tT)

The selection priority matches generateUserExtension. org.dids is scoped to the org's assignments, so cross-org pickup is impossible by construction.
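The priority chain can be sketched as follows. The function names are hypothetical, and reading org.first_did as the first entry of org.dids is an assumption; the real lookup lives in the generator.

```javascript
// Hypothetical sketch of the DID priority:
// user override, then org default, then the org's first assigned DID.
function pickOutboundDid(user, org) {
  return user.outbound_did || org.default_did || (org.dids && org.dids[0]) || null;
}

// Emit the CallerID rewrite immediately before the trunk Dial.
function phoneTargetLines(user, org, trunk, ringTime) {
  const did = pickOutboundDid(user, org);
  if (!did) {
    // Without a DID the SBC would substitute the trunk's global default.
    throw new Error('org has no DID to present on the trunk');
  }
  return [
    'Set(CALLERID(num)=' + did + ')',
    'Dial(PJSIP/' + user.phone_number + '@' + trunk + ',' + ringTime + ',tT)',
  ];
}
```

Failing loudly when no DID is available is a design choice in this sketch; it surfaces the misconfiguration at generation time instead of letting the SBC pick a number.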

5. Native mobile "decline" is NOT a SIP REJECT

When the dialled member is reached on a mobile number via the Tata trunk, what "decline" does depends on the device and the carrier. Compare a SIP softphone with a native dialer:

  • Softphone (Zoiper, Linphone, Astradial app) → sends SIP 486 Busy Here → Asterisk Dial returns BUSY → queue advances in <1s
  • Native Indian mobile dialer (most carriers) → carrier intercepts; either rings out the full window before reporting NOANSWER, OR routes to operator voicemail (which Asterisk sees as ANSWERED and the queue bridges to)

No dialplan change can override this — the SBC is delivering what the carrier says. For fast decline-handling, the member needs a SIP softphone or a call-screening layer (e.g., "press 1 to accept"). Otherwise expect up to ring_timeout_seconds of wait before queue advance on a mobile decline.

6. penalty is a tier GATE for every strategy except linear — not a sort key

In Asterisk's app_queue the penalty field on a member => line has strategy-dependent semantics:

  • linear — file order + penalty is a sort tiebreak. Lower penalty rings first; distinct penalties order the members.
  • ringall / leastrecent / fewestcalls / random / rrmemory — penalty is a tier gate. Only members at the lowest currently-active penalty tier are eligible per round; higher-penalty members join later rounds after timeout (the round budget) elapses with no answer. Distinct penalties under any of these strategies reduce the queue to serial-by-priority instead of the operator-expected parallel ring.

The editor's "priority up/down arrows" mutate each queue_members.penalty field. Pre-fix, the generator wrote those penalty values verbatim into queues.conf regardless of strategy — so an operator reordering members on a ringall queue (say Landline, Raghavi, Pavithra) ended up with penalties 0, 1, 2 and Asterisk rang them one at a time across rounds. Reproduced 2026-05-18 on Thangavelu Hospital queue 5001 (ringall with 5 members at penalties 0–5) and Om Chambers queue 5003 (ringall with a single member at penalty 1).

Fix (PR #235 / 2026-05-18): queueService.generateQueueMemberString accepts the parent queue and flattens emitted penalty to 0 for every non-linear strategy, so all members sit at the same tier and ring in parallel. File order remains penalty-ascending (purely cosmetic, kept for UI/file-readability consistency). For linear, stored penalty values are preserved exactly. The editor's UI now hides the priority arrows whenever queue.strategy !== 'linear' to prevent operators from creating bad penalty data in the first place. Tests parameterise across all 5 non-linear strategies in api/tests/queue-service.test.js (Q15b/Q15c).
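The flattening rule can be sketched in a few lines. This is illustrative and not the body of generateQueueMemberString: the function names and the member object shape are hypothetical, and only the first two fields of the member => line are shown.

```javascript
// Hypothetical sketch: penalty is preserved for linear, flattened to 0 otherwise.
function emittedPenalty(queue, member) {
  return queue.strategy === 'linear' ? member.penalty : 0;
}

// File order stays penalty-ascending for readability, whatever the strategy.
function memberLines(queue, members) {
  return [...members]
    .sort((a, b) => a.penalty - b.penalty)
    .map((m) => 'member => ' + m.interface + ',' + emittedPenalty(queue, m));
}
```

For a ringall queue with stored penalties 0, 1, 2 every emitted line ends in ",0", so all three members share one tier and ring together; the same members under linear keep 0, 1, 2.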

Operational follow-up after the fix shipped: node /opt/astrapbx/scripts/regen-all-org-configs.js was run on staging and prod to flush every active org's queues_*.conf so the on-disk files actually pick up penalty=0 (the script writes the files + does ONE AMI reload at the end — see Regen org configs workflow below). New emission-shape changes that affect generated configs should follow the same pattern.

Regen org configs workflow

PR #237 / 2026-05-18 added .github/workflows/regen-org-configs.yml — a manual workflow_dispatch button (Actions → "Regen org configs" → Run workflow → pick staging or production) that:

  1. Lands on the matching [self-hosted, <env>] runner (same labels as Deploy API workflows).
  2. Pre-flight grep aborts if the deployed queueService.js doesn't carry the expected fix signature.
  3. Backs up every /etc/asterisk/queues_*.conf to /root/queues-bak-<ts>/.
  4. Runs node /opt/astrapbx/scripts/regen-all-org-configs.js.
  5. Audits non-zero penalty members before and after (post-state should only show linear queue members with non-zero penalties).
  6. Prints the rollback path with the backup directory baked in.

Use this instead of manual SSH whenever an emission-shape change has landed and existing queues_*.conf need to be rewritten — it's audited in the Actions log, runs on the same trusted runner as the deploys, and produces the same outcome as the manual flow.
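The penalty audit in step 5 could look roughly like the sketch below. It is an illustration in JavaScript, not the workflow's actual audit step; auditPenalties is a hypothetical name and the parser only understands section headers, strategy lines, and member => lines.

```javascript
// Hypothetical sketch: list members with a non-zero penalty in any
// non-linear queue found in a queues_*.conf text.
function auditPenalties(confText) {
  const queues = [];
  let current = null;
  for (const raw of confText.split('\n')) {
    const line = raw.split(';')[0].trim(); // drop ; comments
    if (!line) continue;
    const section = line.match(/^\[([^\]]+)\]/);
    if (section) {
      current = { name: section[1], strategy: null, members: [] };
      queues.push(current);
      continue;
    }
    if (!current) continue;
    const strategy = line.match(/^strategy\s*=\s*(\S+)/);
    if (strategy) {
      current.strategy = strategy[1];
      continue;
    }
    const member = line.match(/^member\s*=>\s*([^,]+),\s*(\d+)/);
    if (member) {
      current.members.push({ interface: member[1].trim(), penalty: Number(member[2]) });
    }
  }
  const findings = [];
  for (const q of queues) {
    if (q.strategy === 'linear') continue; // non-zero penalties are expected here
    for (const m of q.members) {
      if (m.penalty !== 0) findings.push({ queue: q.name, interface: m.interface, penalty: m.penalty });
    }
  }
  return findings;
}
```

Run before and after a regen, an empty result on the second pass is the expected post-state.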

Post-Queue() routing

After Queue(...) returns, the dialplan branches on QUEUESTATUS:

GotoIf($[${QUEUESTATUS}=TIMEOUT]?timeout)        → caller waited > max_wait, route to timeout dest
GotoIf($[${QUEUESTATUS}=ANSWERED]?normal_end)    → call was answered, hang up cleanly
GotoIf($[${QUEUESTATUS}=CONTINUE]?normal_end)    → caller pressed digit to exit, hang up
Goto(unavail)                                    → empty / full / unavailable queue states

n(normal_end),Hangup()                           → clean end
n(timeout),Goto(<configured timeout destination>)
n(unavail),Playback(all-agents-busy) → Hangup    → "all agents are busy" announcement

If the caller hangs up while in queue, the channel is destroyed and Asterisk does NOT execute this post-Queue() block — the h-extension hangup handler runs instead. The labels above are only reached when the caller is still alive.
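For tests or log analysis it can help to mirror that branch order in code. The sketch below only restates the branches listed above; postQueueLabel is a hypothetical helper, not part of the generator.

```javascript
// Hypothetical sketch: map a QUEUESTATUS value to the label the dialplan
// jumps to, following the branch order listed above.
function postQueueLabel(queueStatus) {
  switch (queueStatus) {
    case 'TIMEOUT':
      return 'timeout';
    case 'ANSWERED':
    case 'CONTINUE':
      return 'normal_end';
    default:
      return 'unavail'; // empty, full, and unavailable states all land here
  }
}
```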

Auto-ticket classifier interaction

pollCdr (api/src/server.js, runs every 30s) reads new CDR rows, dedups by linkedid, and forwards one representative row per call to classifyAndUpsertTicket.

Four behaviours worth knowing:

  • 30-second grace window: rows whose calldate + duration is more recent than NOW() - 30s are skipped and picked up by a later poll. This lets all retry CDRs for a single queue session settle in the DB before classification, so a retry-then-answered call never creates a false "Queue Timeout" ticket.
  • ANSWERED-preferring dedup: when multiple CDR rows share a linkedid (queue retried, multiple member attempts), the dedup picks the row where disposition='ANSWERED' AND billsec > 0 first, then longest duration as tiebreak.
  • Queue bridge recognition: a Local/qm<hex>@… dstchannel with lastapp='Queue' and billsec > 0 is a real bridge — classifier returns queue_answered, no ticket. Without this, every queue-answered call would create a missed-call ticket because the dstchannel isn't a direct PJSIP endpoint.
  • Cross-batch auto-close: if a NO_ANSWER row from round 1 creates a ticket before the round 2 ANSWERED row arrives, the classifier closes the open ticket when it later sees the answered row (matched by org + caller_number + last_call_id within 10 min).
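The first two behaviours can be sketched together. This is a simplified illustration and not the pollCdr implementation: the function names are hypothetical, rows are plain objects with calldate as a Date, and the grace check is assumed to defer rows that ended within the last 30 s.

```javascript
// Hypothetical sketch of the grace window and ANSWERED-preferring dedup.
const GRACE_MS = 30000;

// A row is settled once it ended at least 30 s before "now".
function isSettled(row, nowMs) {
  const endMs = row.calldate.getTime() + row.duration * 1000;
  return endMs <= nowMs - GRACE_MS;
}

// Prefer a genuinely answered row, then the longest duration.
function pickRepresentative(rows) {
  const answered = (r) => Number(r.disposition === 'ANSWERED' && r.billsec > 0);
  return [...rows].sort(
    (a, b) => answered(b) - answered(a) || b.duration - a.duration
  )[0];
}

// One representative row per linkedid, settled rows only.
function dedupByLinkedId(rows, nowMs) {
  const groups = new Map();
  for (const row of rows) {
    if (!isSettled(row, nowMs)) continue;
    if (!groups.has(row.linkedid)) groups.set(row.linkedid, []);
    groups.get(row.linkedid).push(row);
  }
  return [...groups.values()].map(pickRepresentative);
}
```

The sketch filters row by row, so a call with one settled and one unsettled leg would be classified on the settled leg alone; that is the gap the cross-batch auto-close covers.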

Files in scope

| File | Purpose |
| --- | --- |
| api/src/services/asterisk/dialplanGenerator.js | Emits the qm helper context, queue extension, user extension, IVR, hangup handler |
| api/src/services/asterisk/queueService.js | Emits queues.conf (one section per queue + member => lines) |
| api/src/services/asterisk/configDeploymentService.js | deployOrganizationConfiguration, reloadAsteriskConfiguration, seedQueueMemberDevstates |
| api/src/services/ticketClassifier.js | Per-CDR-row decision: skip, ticket, or auto-close |
| api/src/server.js | pollCdr, /api/v1/calls, /api/v1/calls/live, /api/v1/calls/:linkedId/journey, /api/v1/calls/:callId/recording |
| api/tests/*.test.js | 125+ unit tests via node:test; run with npm test |

Operator-facing knobs

| Editor field | Maps to | What it actually does |
| --- | --- | --- |
| Queue → Strategy | queues.conf strategy | linear / ringall / leastrecent / fewestcalls / random / rrmemory |
| Queue → Max Wait (sec) | Queue() 5th arg | total time caller sits in queue across all rounds → QUEUESTATUS=TIMEOUT |
| Queue → "On timeout, route caller to" (picker) | timeout_destination_type + timeout_destination → post-Queue Goto | Where caller routes when Max Wait expires. Smart picker introduced in PR #251/#252: kind buttons [ No routing / User / Queue / Phone ] + a contextual SearchableSelect per kind. The picker writes both type and destination atomically so they can never disagree; pre-fix, the two-field combo let operators save type=phone, destination=5004 (the supervisors queue extension), which made the generator dial 5004 out the trunk and Tata rejected it. See Error 59. |
| Queue → Member → Ring Timeout (sec) | qm helper Dial(..., N, ...) | this member's individual ring window per attempt |
| Queue → Member → Priority | queues.conf member => …,<penalty>,… (only for linear strategy; see Gotcha #6) | sort/tiebreak order for linear only. Hidden in the editor UI for every other strategy because app_queue interprets penalty as a tier gate, not a sort key. |
| Queue → Timeout (s) (REMOVED from editor 2026-05-20) | queues.conf timeout= round budget | Was a list column + form field. Backend computes the effective round budget as max(queue.timeout, SUM(member ring_timeout_seconds) + 10) regardless, so any operator-typed value below that floor was a silent no-op (see Gotcha #2). DB column kept for back-compat; editor stopped exposing it in PR #251. |
| User → Active toggle | users.status | inactive users are skipped from queues.conf, so they don't ring |
| User → Outbound CallerID | users.outbound_did | overrides the org default for trunk-dialled calls (including queue legs) |

Reload semantics

api/src/services/asterisk/configDeploymentService.js → reloadAsteriskConfiguration() is fired by ~18 server.js call sites (DID approve, queue save, user update, IVR save, …) every time configs are regenerated. The reload path was rewritten in PR #255 after two prod incidents (2026-05-19, 2026-05-20) where concurrent reload calls deadlocked res_pjsip and SIGKILL+restart was the only recovery (see Error 60).

Two invariants the rewrite enforces

1. Serialization in JS. reloadAsteriskConfiguration() is a thin wrapper that chains on a per-instance promise (this._reloadLock):

const previous = this._reloadLock.catch(() => {});
this._reloadLock = previous.then(() => this._doReload());
return this._reloadLock;

Concurrent callers queue behind the in-flight reload in JS. Exactly one asterisk -rx shell call is in flight at a time. The previous-rejection catch keeps a failed reload from poisoning subsequent ones.
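The three lines above can be exercised in isolation. The sketch below wraps them in a hypothetical ReloadSerializer class with an injected reload function; the real service holds the lock on its own instance and shells out to asterisk -rx instead.

```javascript
// Hypothetical wrapper around the promise-chain lock shown above.
class ReloadSerializer {
  constructor(doReload) {
    this._doReload = doReload;            // stand-in for the real reload work
    this._reloadLock = Promise.resolve(); // nothing in flight initially
  }

  reload() {
    // Swallow a previous failure so it cannot poison the chain.
    const previous = this._reloadLock.catch(() => {});
    // Start only after the previous reload has settled, success or failure.
    this._reloadLock = previous.then(() => this._doReload());
    return this._reloadLock;
  }
}
```

Each caller still receives the outcome of its own reload: a failed run rejects only the promise returned to that caller, and the next queued reload starts normally.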

2. Targeted reloads, not core reload. core reload reloaded every module that supports reload — including res_pjsip even when only an ext_*.conf file changed. The wider scope was directly correlated with the deadlock window. The new sequence only touches modules whose files the service actually rewrites:

asterisk -rx "dialplan reload"               → ext_*.conf
asterisk -rx "module reload res_pjsip.so"    → pjsip_*.conf
asterisk -rx "module reload app_queue.so"    → queues_*.conf

Then a 750 ms pause before seedQueueMemberDevstates() so the per-member devstate change CLI commands don't race the tail-end of the reload sequence on Asterisk's CLI mutex.
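The file-to-command mapping can be written as a lookup table. The sketch below is an illustration under assumptions: reloadCommandsFor is a hypothetical helper, and narrowing the sequence to the files that actually changed goes one step further than the service, which appears to run all three commands on every reload.

```javascript
// Hypothetical sketch: map rewritten file prefixes to targeted reload commands.
const RELOAD_BY_PREFIX = [
  { prefix: 'ext_', command: 'dialplan reload' },
  { prefix: 'pjsip_', command: 'module reload res_pjsip.so' },
  { prefix: 'queues_', command: 'module reload app_queue.so' },
];

function reloadCommandsFor(changedFiles) {
  const unknown = changedFiles.filter(
    (file) => !RELOAD_BY_PREFIX.some((r) => file.startsWith(r.prefix))
  );
  if (unknown.length > 0) {
    // No silent fallback to "core reload": a new file type needs its own entry.
    throw new Error('no targeted reload for: ' + unknown.join(', '));
  }
  return RELOAD_BY_PREFIX
    .filter((r) => changedFiles.some((file) => file.startsWith(r.prefix)))
    .map((r) => r.command);
}
```

Throwing on an unknown file type keeps the table as the single place where a new reload command has to be added.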

Rules to keep this fixed

  • Never call core reload from the API. Adding a new file type means adding its specific targeted reload command, not falling back to core reload.
  • Don't issue raw asterisk -rx "module reload …" calls anywhere else. Route through reloadAsteriskConfiguration() so the serialization applies.
  • Smoke test pattern (also in test-cases.md → Queues): fire two concurrent POST /api/v1/admin/regenerate-gateway requests ~300 ms apart, expect both to return 200 with didCount/orgCount, expect two distinct 🔄 Reloading… / ✅ Asterisk configuration reloaded (dialplan + res_pjsip + app_queue + devstate seed) pairs in pm2 logs astrapbx, expect zero "previous reload command didn't finish yet" in /var/log/asterisk/full.log.

Backup locations on prod

When configs are regenerated by the API or by regen-all-org-configs.js, the previous files are NOT auto-backed-up. Manual backups taken during incident response are at:

  • /root/queues-bak-<timestamp>/ — queues_*.conf snapshots
  • /root/dialplan-bak-<timestamp>/ — ext_*.conf snapshots
  • /root/prod-bak-pre-*-<timestamp>/ — pre-regen full snapshots

Take a fresh backup before any manual edit on /etc/asterisk/.

Related pages

  • Dashboard: Call Pickup Time. Operator-facing metric for queue performance. Computed from the answered queue child leg's (duration - billsec); accurate for ringall queues (current hospital default), under-counts for sequential. Includes the three-tier health badge thresholds.
  • Concurrent Call Cap Architecture. Org + per-trunk caps that gate outbound calls. The trunk cap interacts with queue ring-out via the qm-helper.js paths.