Customer Tunnels (WireGuard)¶

Per-org WireGuard tunnels that give customer PBXes a stable, encrypted path into Astradial's cloud Asterisk — eliminating CGNAT, multi-WAN failover, and dynamic-IP problems at the network layer.

Managed through the editor (editor.astradial.com), not by hand-editing config files. Per-org, version-controlled, auditable.

Why this exists¶

Customers like V7 sit behind dynamic/CGNAT public IPs and often have multi-WAN failover. From the cloud's perspective, the customer's public IP appears to rotate between several values — sometimes within seconds. This breaks SIP in well-known ways:

NAT pinholes die and get re-bound on different ports
Cloud Asterisk's stored contact for a phone points at an IP that no longer routes
Inbound calls during the recovery window go to the wrong place or nowhere
Each phone independently fights the problem

A WireGuard tunnel between the customer's site router and our cloud gives both sides a fixed, stable tunnel IP regardless of what's happening at the public-IP layer. The customer's site router (e.g., Grandstream GWN7002) initiates the tunnel; our cloud accepts it; SIP traffic flows through the tunnel, and the cloud sees the customer at the same tunnel address for as long as the tunnel is configured.

WireGuard is identity-based, not IP-based: when the customer's WAN flips or CGNAT rotates the public IP, the tunnel survives because authentication is by cryptographic key, not by IP.

Security model (read this first)¶

Customer tunnels handle external traffic and must be strictly isolated from Astradial's internal infrastructure.

Layered defenses¶

                           PUBLIC INTERNET
                                 │
                                 │
        ┌────────────────────────┼────────────────────────┐
        │                                                 │
        │  CLOUD VPS (147.93.168.216)                      │
        │                                                 │
        │  ┌──────────────────┐    ┌──────────────────┐   │
        │  │  wg0 (existing)  │    │  wg1 (NEW)       │   │
        │  │  UDP 51820       │    │  UDP 51821       │   │
        │  │  10.10.10.0/24   │    │  10.20.0.0/16    │   │
        │  │                  │    │                  │   │
        │  │  • NUC           │    │  • V7            │   │
        │  │  • Staging       │    │  • Future cust N │   │
        │  └────────┬─────────┘    └────────┬─────────┘   │
        │           │                       │             │
        │           └──── iptables DROP ────┘             │
        │                  (no cross-traffic)              │
        │                                                 │
        │  Cloud services (Asterisk, AstraPBX, etc.)      │
        └─────────────────────────────────────────────────┘

Defense layers¶

Layer	Mechanism	Purpose
1. Separate interface	`wg1` distinct from existing `wg0`	Cleanly isolates customer traffic from internal infra
2. Separate UDP port	`wg1` listens on 51821 (not 51820)	Per-port firewall rules + audit; fail2ban distinguishes attack surfaces
3. Crypto-tight `AllowedIPs`	Each peer's AllowedIPs = single `/32` of their tunnel IP	WireGuard refuses spoofed source IPs
4. iptables FORWARD drops	`wg1↔wg0` and `wg1↔wg1` are explicit DROP	Customer can't reach NUC, staging, or other customers — even if forwarding is enabled
5. iptables INPUT scope	wg1 traffic only allowed to SIP/RTP ports	Customer can't reach SSH (22), AstraPBX API (8000), editor (3001), MariaDB (3306)
6. PJSIP endpoint `match`	V7's PJSIP endpoints match `10.20.7.0/30` only	Customer's tunnel traffic can only reach their own org's extensions
7. API-level RBAC	Tunnel CRUD requires `admin` role on the org	Customers can't create/modify tunnels themselves
8. Private key never exposed via API	Server's WG private key stays on disk; only public key + PSK returned to UI	Compromise of editor doesn't compromise server identity
9. Audit log	Every tunnel CRUD writes to `audit_log` table	Forensic trail of who did what
10. Input validation	Pubkey format, allowed-IP CIDR, name regex all validated	Reject malformed input early
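
As an illustration of layer 10, the checks could look roughly like the sketch below. The function names and the exact name pattern are assumptions made for this example; the real rules live in the route's express-validator chain and may differ.

```javascript
'use strict';

// Hypothetical name rule: letters, digits, "_" and "-", up to the 64 chars the schema allows.
const NAME_RE = /^[A-Za-z0-9_-]{1,64}$/;

// A WireGuard public key is 32 raw bytes: 43 base64 characters plus one "=" pad.
function isValidWgPubkey(value) {
  if (typeof value !== 'string' || !/^[A-Za-z0-9+/]{43}=$/.test(value)) return false;
  return Buffer.from(value, 'base64').length === 32;
}

// Accepts IPv4 CIDR such as "192.168.0.0/24" and rejects addresses with host bits set.
function isValidIpv4Cidr(value) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\/(\d{1,2})$/.exec(value);
  if (!m) return false;
  const octets = m.slice(1, 5).map(Number);
  const prefix = Number(m[5]);
  if (octets.some((o) => o > 255) || prefix > 32) return false;
  const addr = octets.reduce((acc, o) => acc * 256 + o, 0);
  return addr % 2 ** (32 - prefix) === 0; // network address must be aligned to the prefix
}

function isValidTunnelName(value) {
  return typeof value === 'string' && NAME_RE.test(value);
}
```

Rejecting newlines and brackets in the name matters because the generator writes the name into a comment line of wg1.conf.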

What a customer CAN reach via their tunnel¶

✅ Cloud's wg1 tunnel IP (e.g. 10.20.7.1) — for SIP signaling/RTP only
✅ Cloud Asterisk on 5060/UDP, 5080/UDP, 10000-20000/UDP (RTP)

What a customer CANNOT reach¶

❌ NUC (10.10.10.2)             — different interface + iptables DROP
❌ Staging (10.10.10.3)         — different interface + iptables DROP
❌ Other customer tunnels       — wg1-to-wg1 iptables DROP
❌ SSH (22)                     — INPUT policy on wg1
❌ AstraPBX API (8000)          — INPUT policy on wg1
❌ Editor (3001)                — INPUT policy on wg1
❌ MariaDB / Postgres / Redis   — bind to localhost only + INPUT policy
❌ Any other org's PJSIP endpoints — PJSIP `match` is per-org /30
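
The reachability lists above reduce to a small allow-list on the wg1 INPUT path. The sketch below mirrors that allow-list as a pure function for reasoning and tests; it is a hypothetical helper, not part of the codebase, and enforcement is done by iptables, not by application code.

```javascript
'use strict';

// UDP port ranges a customer may reach on the cloud side of wg1 (inclusive).
const ALLOWED_UDP_RANGES = [
  [5060, 5060],   // SIP
  [5080, 5080],   // SIP (second transport)
  [10000, 20000], // RTP
];

// Returns true only for traffic the wg1 INPUT rules are meant to accept.
function isAllowedFromWg1(protocol, port) {
  if (protocol !== 'udp') return false;
  return ALLOWED_UDP_RANGES.some(([lo, hi]) => port >= lo && port <= hi);
}
```

For example, `isAllowedFromWg1('udp', 5080)` is true, while `isAllowedFromWg1('tcp', 22)` and `isAllowedFromWg1('udp', 8000)` are false.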

Architecture¶

Subnet allocation¶

Range	Use	Notes
`10.10.10.0/24`	Internal infra (`wg0`)	EXISTING. NUC (`.2`), Staging (`.3`). Do not add customers here.
`10.20.0.0/16`	Customer tunnels (`wg1`)	New pool. Allocator picks next free `/30` per customer.
`10.20.N.0/30`	Per-customer `/30`	Cloud peer `.1`, customer peer `.2`. (`.0` net, `.3` broadcast unused.)

Per-customer `/30` layout (example: V7)¶

10.20.7.0/30:
  10.20.7.0  → network address (unused)
  10.20.7.1  → cloud-side tunnel IP
  10.20.7.2  → customer-side tunnel IP
  10.20.7.3  → broadcast (unused)

Each customer's /30 is independently allocated by the API at tunnel creation time.
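
The layout can be derived mechanically from the subnet string. A minimal sketch, assuming a helper named `describeSlash30` (hypothetical, not an existing function in the repo):

```javascript
'use strict';

// Splits a /30 such as "10.20.7.0/30" into its four addresses.
function describeSlash30(cidr) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\/30$/.exec(cidr);
  if (!m) throw new Error(`Not a /30 CIDR: ${cidr}`);
  const [a, b, c, d] = m.slice(1).map(Number);
  if ([a, b, c, d].some((o) => o > 255) || d % 4 !== 0) {
    throw new Error(`Not an aligned /30: ${cidr}`);
  }
  const ip = (offset) => `${a}.${b}.${c}.${d + offset}`;
  return {
    network: ip(0),   // unused
    cloud: ip(1),     // cloud-side tunnel IP
    customer: ip(2),  // customer-side tunnel IP
    broadcast: ip(3), // unused
  };
}
```

Calling `describeSlash30('10.20.7.0/30')` yields cloud `10.20.7.1` and customer `10.20.7.2`, matching the V7 example.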

Components¶

Component	Type	Lives in
`customer_tunnels` table	DB (MariaDB)	`pbx_api_db`
`CustomerTunnel` Sequelize model	Backend	`api/src/models/CustomerTunnel.js`
`customer-tunnels.js` routes	API	`api/src/routes/customer-tunnels.js`
`wireguardGenerator.js` service	Backend	`api/src/services/asterisk/wireguardGenerator.js`
`wireguardApplier.js` service	Backend	`api/src/services/asterisk/wireguardApplier.js`
Subnet allocator	Backend service	`api/src/services/network/subnetAllocator.js`
Network Tunnels UI	Frontend	`editor/app/dashboard/[orgId]/settings/page.tsx` (new tab)
`wg1` interface	System (per-VPS)	`/etc/wireguard/wg1.conf` (generated by `wireguardGenerator`)
iptables rules	System (per-VPS)	`/etc/iptables/customer-tunnels.v4` (set once at bootstrap)

Database schema¶

// api/database/migrations/YYYYMMDDhhmmss-create-customer-tunnels.js
'use strict';

module.exports = {
  up: async (queryInterface, Sequelize) => {
    await queryInterface.createTable('customer_tunnels', {
      id: {
        type: Sequelize.UUID,
        defaultValue: Sequelize.UUIDV4,
        primaryKey: true,
      },
      org_id: {
        type: Sequelize.UUID,
        allowNull: false,
        references: { model: 'organizations', key: 'id' },
        onDelete: 'CASCADE',
      },
      name: {
        type: Sequelize.STRING(64),
        allowNull: false,
      },
      tunnel_subnet: {
        type: Sequelize.STRING(18),       // CIDR (e.g. "10.20.7.0/30")
        allowNull: false,
        unique: true,
      },
      cloud_tunnel_ip: {
        type: Sequelize.STRING(15),
        allowNull: false,
      },
      customer_tunnel_ip: {
        type: Sequelize.STRING(15),
        allowNull: false,
      },
      customer_pubkey: {
        type: Sequelize.STRING(64),       // base64-encoded 32-byte WG pubkey
        allowNull: false,
      },
      preshared_key: {
        type: Sequelize.STRING(64),
        allowNull: false,
      },
      persistent_keepalive: {
        type: Sequelize.INTEGER,
        defaultValue: 25,
      },
      listen_port: {
        type: Sequelize.INTEGER,
        defaultValue: 51821,
      },
      interface_name: {
        type: Sequelize.STRING(16),
        defaultValue: 'wg1',
      },
      status: {
        type: Sequelize.ENUM('active', 'disabled', 'revoked'),
        defaultValue: 'active',
        allowNull: false,
      },
      notes: {
        type: Sequelize.TEXT,
      },
      created_at: { type: Sequelize.DATE, defaultValue: Sequelize.NOW },
      updated_at: { type: Sequelize.DATE, defaultValue: Sequelize.NOW },
      created_by_user_id: {
        type: Sequelize.UUID,
        references: { model: 'org_users', key: 'id' },
        onDelete: 'SET NULL',
      },
    });

    await queryInterface.addIndex('customer_tunnels', ['org_id']);
    await queryInterface.addIndex('customer_tunnels', ['status']);
    await queryInterface.addIndex('customer_tunnels', ['org_id', 'name'], { unique: true });
  },

  down: async (queryInterface) => {
    await queryInterface.dropTable('customer_tunnels');
  },
};

Notes:

customer_pubkey and preshared_key are stored in plaintext, consistent with how sip_trunks.password is stored today (see v7-setup.md and existing models). Plain WireGuard pubkeys are not secrets; PSKs are. TODO: encrypt PSKs at rest as part of a broader DB-secret-encryption initiative.
org_id cascades on org deletion (the tunnel is automatically removed if the org is deleted).
Unique on (org_id, name) prevents duplicate names within an org; unique on tunnel_subnet prevents subnet collisions globally.

API surface¶

Routes live in api/src/routes/customer-tunnels.js, mounted at /api/v1/customer-tunnels. They follow the same auth + org-scoping pattern as queues.js (JWT middleware → req.user.org_id filter on every query).

Method	Path	RBAC	Purpose
`GET`	`/api/v1/customer-tunnels`	`admin`	List tunnels for the requesting org
`GET`	`/api/v1/customer-tunnels/:id`	`admin`	Get one tunnel, including live status (last handshake, bytes transferred)
`POST`	`/api/v1/customer-tunnels`	`admin`	Create — body `{ name, customer_pubkey, notes? }`. Server allocates `/30`, generates PSK, writes wg1.conf, returns full record including the Peer config block for the customer to paste on their side
`PATCH`	`/api/v1/customer-tunnels/:id`	`admin`	Update — `{ status, notes }`. Switching to `disabled` removes peer from wg1 but keeps DB row.
`DELETE`	`/api/v1/customer-tunnels/:id`	`admin`	Revoke — removes peer from wg1, marks status `revoked`. Subnet is NOT immediately reused (kept reserved for 30 days for audit).
`GET`	`/api/v1/customer-tunnels/:id/customer-config`	`admin`	Returns the customer-side `[Peer]` block as plain text for copy/paste into GDMS

Every route runs the existing JWT + RBAC middleware (requireRole('admin')), validates input via express-validator, and writes an entry to audit_log on every mutation.
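
To make the customer-config endpoint concrete, here is a sketch of how the customer-side `[Peer]` block could be rendered. The function name and parameter names are assumptions for illustration; only the WireGuard keys (`PublicKey`, `PresharedKey`, `Endpoint`, `AllowedIPs`, `PersistentKeepalive`) are standard config fields.

```javascript
'use strict';

// Renders the [Peer] block a customer pastes into their router.
// All parameter names are hypothetical; values come from the tunnel record and server config.
function renderCustomerPeerBlock({
  serverPublicKey,
  presharedKey,
  endpointHost,
  listenPort,
  cloudTunnelIp,
  keepalive,
}) {
  return [
    '[Peer]',
    `PublicKey = ${serverPublicKey}`,
    `PresharedKey = ${presharedKey}`,
    `Endpoint = ${endpointHost}:${listenPort}`,
    `AllowedIPs = ${cloudTunnelIp}/32`, // only the cloud's tunnel-side address
    `PersistentKeepalive = ${keepalive}`,
    '',
  ].join('\n');
}
```

Keeping the renderer a pure function of the tunnel record makes it easy to unit-test without touching wg1 or the database.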

Subnet allocator (`api/src/services/network/subnetAllocator.js`)¶

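const { Op } = require('sequelize');
const { CustomerTunnel } = require('../../models');  // import path is an assumption

const POOL_CIDR = '10.20.0.0/16';
const PREFIX_LENGTH = 30;
const REVOKED_HOLD_MS = 30 * 24 * 60 * 60 * 1000;  // revoked subnets stay reserved for 30d

async function allocateNextAvailable() {
  // A subnet counts as taken if its tunnel is active or disabled, or was revoked
  // within the last 30 days (assumes updated_at records the revocation time).
  const holdCutoff = new Date(Date.now() - REVOKED_HOLD_MS);
  const rows = await CustomerTunnel.findAll({
    where: {
      [Op.or]: [
        { status: { [Op.in]: ['active', 'disabled'] } },
        { status: 'revoked', updated_at: { [Op.gt]: holdCutoff } },
      ],
    },
    attributes: ['tunnel_subnet'],
  });
  const used = new Set(rows.map((r) => r.tunnel_subnet));
  // Note: tunnel_subnet is UNIQUE, so an old revoked row must be purged
  // before its subnet can actually be re-issued.

  // Walk 10.20.0.0/30, 10.20.0.4/30, ... 10.20.0.252/30, 10.20.1.0/30, ...
  // First-fit: return the first /30 in the pool not present in `used`.
  for (let octet3 = 0; octet3 < 256; octet3++) {
    for (let octet4 = 0; octet4 < 256; octet4 += 4) {  // /30 = 4 IPs aligned
      if (octet3 === 0 && octet4 === 0) continue;  // skip 10.20.0.0/30 (reserved)
      const candidate = `10.20.${octet3}.${octet4}/30`;
      if (!used.has(candidate)) {
        return {
          subnet: candidate,
          cloud_ip: `10.20.${octet3}.${octet4 + 1}`,
          customer_ip: `10.20.${octet3}.${octet4 + 2}`,
        };
      }
    }
  }
  throw new Error('Subnet pool exhausted (10.20.0.0/16 fully allocated)');
}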

Capacity: 10.20.0.0/16 has 16,384 /30s. We will not exhaust this in any realistic scenario.
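
The capacity figure follows directly from the two prefix lengths. A small sketch, with `nthSlash30` as a hypothetical helper that maps a zero-based index to its subnet:

```javascript
'use strict';

const POOL_PREFIX = 16;
const SUBNET_PREFIX = 30;

// Each extra prefix bit halves the block size, so a /16 holds 2^(30-16) /30s.
const SLASH30_COUNT = 2 ** (SUBNET_PREFIX - POOL_PREFIX); // 16384

// Maps index 0..16383 to the matching /30 inside 10.20.0.0/16.
function nthSlash30(index) {
  if (!Number.isInteger(index) || index < 0 || index >= SLASH30_COUNT) {
    throw new RangeError(`index out of pool: ${index}`);
  }
  const offset = index * 4; // 4 addresses per /30
  return `10.20.${Math.floor(offset / 256)}.${offset % 256}/30`;
}
```

Index 448 maps to `10.20.7.0/30`, and the last index (16383) maps to `10.20.255.252/30`.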

WireGuard config generator (`api/src/services/asterisk/wireguardGenerator.js`)¶

Mirrors the pattern of dialplanGenerator.js. Reads all active tunnels from the DB (disabled and revoked tunnels are skipped, so their peers drop out of the config) and emits the full wg1.conf from scratch on every regeneration:

async function generateWg1Config() {
  const serverPrivateKey = await readServerPrivateKey();  // /etc/wireguard/wg1.private (root-only)
  const tunnels = await CustomerTunnel.findAll({
    where: { status: 'active' },
    order: [['created_at', 'ASC']],
  });

  let conf = `# AUTO-GENERATED by AstraPBX wireguardGenerator. DO NOT EDIT BY HAND.
# Source of truth: customer_tunnels table in pbx_api_db.
# Generated: ${new Date().toISOString()}

[Interface]
Address = 10.20.0.1/16
ListenPort = 51821
PrivateKey = ${serverPrivateKey}
PostUp = /usr/local/sbin/customer-tunnels-iptables.sh up
PostDown = /usr/local/sbin/customer-tunnels-iptables.sh down

`;

  for (const t of tunnels) {
    conf += `# org=${t.org_id} name=${t.name} created=${t.created_at.toISOString()}
[Peer]
PublicKey = ${t.customer_pubkey}
PresharedKey = ${t.preshared_key}
AllowedIPs = ${t.customer_tunnel_ip}/32
PersistentKeepalive = ${t.persistent_keepalive}

`;
  }

  return conf;
}

The server's WG private key is generated once during the wg1 bootstrap (see Bootstrap procedure) and stays in /etc/wireguard/wg1.private with mode 600, owned by root:root. It is never returned via the API and never logged.

WireGuard applier (`api/src/services/asterisk/wireguardApplier.js`)¶

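const fs = require('fs/promises');
const { promisify } = require('util');
const exec = promisify(require('child_process').exec);

async function applyWg1() {
  const conf = await generateWg1Config();

  // 1. Write the new config to a staging file (the rename in step 3 is the atomic part)
  const tmp = '/etc/wireguard/wg1.conf.new';
  await fs.writeFile(tmp, conf, { mode: 0o600 });

  // 2. Back up the current config (ignored if none exists yet)
  const backup = `/etc/wireguard/wg1.conf.bak-${Date.now()}`;
  await fs.copyFile('/etc/wireguard/wg1.conf', backup).catch(() => {});

  // 3. Move into place atomically (same filesystem, so rename replaces in one step)
  await fs.rename(tmp, '/etc/wireguard/wg1.conf');

  // 4. Hot-reload (no tunnel restart for unchanged peers).
  //    Process substitution <(...) needs bash, hence the explicit shell.
  await exec('wg syncconf wg1 <(wg-quick strip wg1)', { shell: '/bin/bash' });

  // 5. Verify wg show succeeds (exec rejects on a non-zero exit code)
  const { stdout } = await exec('wg show wg1');

  // Count peers from the rendered config; `tunnels` is local to the generator.
  const peerCount = (conf.match(/^\[Peer\]$/gm) || []).length;
  return { applied: true, peer_count: peerCount, wg_status: stdout };
}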

wg syncconf is a hot-reload that only changes peers that differ — existing tunnels are not disrupted when a new peer is added. Backups in /etc/wireguard/wg1.conf.bak-<ts> are retained 30 days then garbage-collected by a cron job.
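
The 30-day backup retention could be implemented along these lines. This is a sketch only: the document says a cron job does the cleanup but does not show it, so the function name and the idea of selecting files by the timestamp embedded in the name are assumptions.

```javascript
'use strict';

const RETENTION_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// Given directory entries, returns the backup files older than the retention window.
// Relies on the applier's naming scheme: wg1.conf.bak-<Date.now() in ms>.
function expiredBackups(filenames, nowMs = Date.now()) {
  return filenames.filter((name) => {
    const m = /^wg1\.conf\.bak-(\d+)$/.exec(name);
    return m !== null && nowMs - Number(m[1]) > RETENTION_MS;
  });
}
```

The live `wg1.conf` and any unrelated files never match the pattern, so they are never selected for deletion.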

Bootstrap procedure (per-VPS, one-time)¶

This is infrastructure setup, not feature code. Done once on staging during development, then once on prod before the feature ships. Documented here as a runbook.

# 1. Generate server WG keys (once per VPS)
umask 077
wg genkey > /etc/wireguard/wg1.private
wg pubkey < /etc/wireguard/wg1.private > /etc/wireguard/wg1.public
chmod 600 /etc/wireguard/wg1.private
chmod 644 /etc/wireguard/wg1.public

# 2. Write iptables helper script
cat > /usr/local/sbin/customer-tunnels-iptables.sh <<'EOF'
#!/bin/bash
# Applies iptables rules when wg1 comes up; removes them when wg1 goes down.
# Called from PostUp/PostDown in /etc/wireguard/wg1.conf.

set -e

case "$1" in
  up)
    # Block customer→internal infra
    iptables -A FORWARD -i wg1 -o wg0 -j DROP
    iptables -A FORWARD -i wg0 -o wg1 -j DROP
    # Block customer→customer
    iptables -A FORWARD -i wg1 -o wg1 -j DROP
    # Allow only SIP+RTP into the cloud
    iptables -A INPUT -i wg1 -p udp --dport 5060 -j ACCEPT
    iptables -A INPUT -i wg1 -p udp --dport 5080 -j ACCEPT
    iptables -A INPUT -i wg1 -p udp --dport 10000:20000 -j ACCEPT
    # Drop everything else from wg1
    iptables -A INPUT -i wg1 -j DROP
    ;;
  down)
    iptables -D FORWARD -i wg1 -o wg0 -j DROP || true
    iptables -D FORWARD -i wg0 -o wg1 -j DROP || true
    iptables -D FORWARD -i wg1 -o wg1 -j DROP || true
    iptables -D INPUT -i wg1 -p udp --dport 5060 -j ACCEPT || true
    iptables -D INPUT -i wg1 -p udp --dport 5080 -j ACCEPT || true
    iptables -D INPUT -i wg1 -p udp --dport 10000:20000 -j ACCEPT || true
    iptables -D INPUT -i wg1 -j DROP || true
    ;;
esac
EOF
chmod 755 /usr/local/sbin/customer-tunnels-iptables.sh

# 3. Initial empty wg1.conf (no peers yet — AstraPBX will populate via wireguardGenerator)
cat > /etc/wireguard/wg1.conf <<EOF
[Interface]
Address = 10.20.0.1/16
ListenPort = 51821
PrivateKey = $(cat /etc/wireguard/wg1.private)
PostUp = /usr/local/sbin/customer-tunnels-iptables.sh up
PostDown = /usr/local/sbin/customer-tunnels-iptables.sh down
EOF
chmod 600 /etc/wireguard/wg1.conf

# 4. Open UDP 51821 in ufw
ufw allow 51821/udp comment 'WireGuard customer tunnels'

# 5. Enable and start the service
systemctl enable --now wg-quick@wg1

# 6. Verify
wg show wg1
ip link show wg1
iptables -L FORWARD -v | grep wg1

This bootstrap is packaged as scripts/setup/wg1-bootstrap.sh in the monorepo and is run by the deploy script during initial feature rollout, with an explicit --dry-run option for safety.

Editor UI¶

A new tab on the Org Settings page (editor/app/dashboard/[orgId]/settings/page.tsx):

Org Settings
├── Organization (existing)
├── Asterisk Configuration (existing)
├── Session (existing)
└── Network Tunnels (NEW)
     ├── List of active tunnels for this org
     │   └── For each: name, subnet, last handshake, status, [⋮ actions]
     ├── [+ Add Tunnel] button → modal:
     │     - Name (default "astradial-<orgshortname>")
     │     - Customer pubkey (textarea — paste from GDMS)
     │     - Notes (optional)
     │     [Submit]
     └── [View customer config] action → modal:
           - Renders the [Peer] block for the customer to paste in GDMS
           - "Copy to clipboard" button
           - Pre-shared key shown with reveal/hide toggle

UI calls the new API via pbxCustomerTunnels.list(), pbxCustomerTunnels.create(), etc. — added to the existing lib/pbx/client.ts.

Staging-first rollout plan¶

Step	What	Where	Reversible?
1	Run bootstrap on staging VPS	`94.136.188.221`	Yes — remove wg1 service + uninstall iptables script
2	Merge feature branch (DB migration + backend) to `staging` branch	GitHub PR	Yes — revert PR
3	CI deploys backend to staging VPS	auto	—
4	Run migration on staging DB	`npx sequelize-cli db:migrate`	Yes — `db:migrate:undo`
5	Merge frontend feature branch to `staging`	GitHub PR	Yes — revert PR
6	E2E test: create a test tunnel via UI for a test org, configure a test client (e.g., a Linux box with `wireguard-tools`), verify tunnel establishes	Staging	—
7	E2E test: simulate a tunnel-IP SIP registration arrives at staging Asterisk, verify endpoint match works	Staging	—
8	Run for 24h, monitor `wg show wg1`, monitor staging Asterisk logs for unexpected drops	Staging	—
9	Bootstrap prod VPS	`147.93.168.216`	Yes — same removal procedure
10	Merge `staging` → `main`	GitHub PR	Yes — revert PR
11	CI deploys to prod, runs migration on prod DB	auto	Yes — `db:migrate:undo` on prod
12	Onboard V7 as the first real customer via the UI	Editor on prod	Yes — disable tunnel + revert phones to direct cloud registration

No prod changes until Steps 1–8 are clean on staging.

Production rollout — actual execution log¶

The order in the table above was theoretical. Actual prod rollout used a slightly different order to avoid a sequelize.sync() race (prod's server.js line 6223 still calls sync() at boot — fixed in PR #138 but not yet on main). Running migrations BEFORE the code merge means tables exist before any sync() runs, avoiding MariaDB 11 FK collision (1061) that bit us in PRs #125 and #131.

Stage	What	Result
1	Backup prod DB	`/root/pre-customer-tunnels-2026-05-12T115633Z/pbx_api_db.sql.gz`, 290 KB, 30 tables, gzip verified
2	SCP migration files to `/opt/astrapbx/database/migrations/`, run `npx sequelize-cli db:migrate`	Both migrated cleanly (0.239s + 0.080s). `customer_tunnels` + `tunnel_metrics` exist, empty, SequelizeMeta updated. `astrapbx` pm2 process untouched (no restart).
3	Merge `staging` → `main` (PR #140, 28 commits, +7.2k LOC)	CI green: `Deploy API to production` 25s, `Check API routes` ✓. Editor deploy queued (self-hosted runner backlog, not blocking). `pm2 reload astrapbx` graceful — PID rotated, uptime fresh, zero dropped requests. `WireGuard status poller started (60s interval)` confirmed at startup. wg-poller logs `[wg-poller] cycle failed: Command failed: wg show wg1 dump` once per minute as expected (wg1 absent until Stage 4). CDR poller resumed at last ID 4246 (PR #139 fix verified). AMI + ARI connections re-established. `/health` 200. `/api/v1/customer-tunnels` returns 401 unauthorized (route mounted, auth middleware firing). DB unchanged: both tables still empty. `CLOUD_PUBLIC_IP` set on prod `.env` before merge so reload picked it up.
4	Run `wg1-bootstrap.sh` on prod VPS	`--check` and `--dry-run` clean (port 51821 free, 10.20.0.0/16 unrouted, wg0 present, all 7 pre-flight PASS). Live run exit 0 at 17:45:28 IST: keypair generated, helper installed at `/usr/local/sbin/customer-tunnels-iptables.sh` (mode 755), `wg1.conf` written (mode 600, interface block + 0 peers), `ufw allow 51821/udp` added, `wg-quick@wg1` enabled+active. Verification: `wg show wg1` returns 0 peers / port 51821, `ip a wg1` shows UP at 10.20.0.1/16, iptables FORWARD has 3 DROP rules (wg1↔wg0 + wg1↔wg1), iptables INPUT has 3 ACCEPT (5060/5080/10000-20000) + 1 catch-all DROP, syslog `wg1-iptables` logged all 7 rule additions + "up: complete". wg-poller transitioned silently — error log mtime frozen at 17:44:52, no new failures after wg1 came up. wg0 untouched (NUC peer still handshaking ~20s fresh). Asterisk untouched (0 channels during run). Server public key: `0Dfkqmj3UFLCN4mmG+Cp2j7VfP4J75iOyA+AZUxKQng=` (paste into customer router peer config). Transcript: `/var/log/wg1-bootstrap-20260512-174526.log` on prod.
5	~~Set `CLOUD_PUBLIC_IP=147.93.168.216` in prod `.env`~~	DONE (folded into Stage 3 pre-flight)
6	Onboard V7 as first real customer via Editor UI	FULLY WORKING after PR #146 + PR #148. Tunnel V7_Tirupathur created (subnet `10.20.0.0/30`, customer pubkey `ylkqY4S7ahWWAD3L2m10dv2r+TRflPupWxCwfZ51hAI=`, customer_lan_cidr `192.168.0.0/24` set via Editor "Edit Tunnel" dialog). Reception phone (ext 09) registers via tunnel from `192.168.0.76` → `10.20.0.1:5080`, RTT ~172ms, active call confirmed (`Endpoint: org_moijhj2l__09/09 Ringing 1 of inf`). Endpoint roamed `103.197.113.158:33252` → `120.60.105.158:51820` mid-session (WG roaming verified ✓). Failover policy "BSNL-Primary-Rail-Backup" configured in GDMS Internet Source. Mac softphone (`org_mo8vbv60__1003` + `org_moijhj2l__01`) registered via public path on UDP transport — TCP transport failed qualify (transport mismatch with V7 endpoints which use `transport-udp`).
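
Stage 3 mentions the status poller running `wg show wg1 dump` once a minute. For readers unfamiliar with that output, the sketch below parses it, assuming the tab-separated layout described in the wg(8) man page: one interface line, then one line per peer with public key, preshared key, endpoint, allowed IPs, latest handshake, rx bytes, tx bytes and keepalive. The function name and returned field names are hypothetical and are not taken from the poller's source.

```javascript
'use strict';

// Parses the output of `wg show <iface> dump` into one object per peer.
function parseWgDump(stdout) {
  const lines = stdout.split('\n').filter((line) => line.length > 0);
  // lines[0] describes the interface itself (includes the private key), so skip it.
  return lines.slice(1).map((line) => {
    // Second field is the preshared key; deliberately not captured.
    const [publicKey, , endpoint, allowedIps, handshake, rx, tx] = line.split('\t');
    return {
      publicKey,
      endpoint: endpoint === '(none)' ? null : endpoint,
      allowedIps: allowedIps === '(none)' ? [] : allowedIps.split(','),
      latestHandshake: Number(handshake), // unix seconds, 0 if never
      rxBytes: Number(rx),
      txBytes: Number(tx),
    };
  });
}
```

Dropping the key fields at parse time keeps secrets out of anything the poller might later log or store.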

Architectural issues found during V7 onboarding (all resolved)¶

#	Issue	Severity	Resolution
1	Per-customer `cloud_tunnel_ip` varies by /30 but wg1 only binds 10.20.0.1	P0 — would have blocked customer #2	PR #145 — Editor customer-config now always returns `10.20.0.1` regardless of the customer's /30
2	Customer-config recommended `147.93.168.216/32` in AllowedIPs causing GWN7002 routing-loop rejection	P1 — blocked any Grandstream-router customer	PR #144 — `cloud_routed_ips` defaults to empty; only the tunnel-side IP is added
3	No SNAT/MASQUERADE on customer side — server-side WG cryptokey routing rejected packets from customer LAN	P0 — blocked all phone registrations via tunnel	PR #146 — new `customer_lan_cidr` field, validated server-side, expanded into server peer's AllowedIPs; Editor "Edit Tunnel" UI to set it post-create
4	`wg syncconf` updates AllowedIPs but NOT kernel routing table — responses to customer LAN went out eth0 instead of wg1	P0 — found mid-session, manually unblocked via `ip route add 192.168.0.0/24 dev wg1`	PR #148 — `syncCustomerLanRoutes` in the applier auto-manages kernel routes on every tunnel apply (add/remove with proper diff, idempotent, defense-in-depth against shell injection)

Softphone gotchas (operator-level)¶

Discovered during V7 onboarding. Worth checking first when a softphone fails to register or shows "qualify failed":

SIP Transport mismatch. Astradial's PJSIP endpoints (per-org PJSIP confs) are configured with transport=transport-udp. If a softphone client defaults to TCP (Telephone.app on macOS does this on first install), the REGISTER may succeed but qualify (OPTIONS keepalive) fails with no clear error — endpoint shows Unavailable even though the contact is in the AOR. Fix: set SIP Transport to UDP in the softphone's account settings.
fail2ban bans after repeated wrong-auth attempts. Three consecutive failed auths trip asterisk-auth jail and the source IP gets blocked at iptables (BEFORE Asterisk sees subsequent packets). Symptom: registration just goes silent, no Asterisk log. Check: ssh root@147.93.168.216 'fail2ban-client status asterisk-auth'. Unban: fail2ban-client set asterisk-auth unbanip <ip>.
NAT keep-alive for inbound calls. Softphones behind CGNAT/NAT need to send periodic keep-alive packets (every 25-30s) so the NAT mapping stays open and Asterisk's OPTIONS qualify round-trips. Most apps have a "NAT Keep-Alive" setting. Without it: registration succeeds but Contact Status shows NonQual with -nan RTT, and inbound calls never ring.
Per-extension credentials are org-scoped. SIP User ID = <ext> (e.g., 09); Authentication ID = org_<org-prefix>__<ext> (e.g., org_moijhj2l__09). If the softphone uses just 09 as auth, Asterisk's PJSIP can't find a matching endpoint and replies No matching endpoint found (logged as 401 in PJSIP logger, but with empty AOR match).
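
The credential format in the last item can be expressed as a pair of helpers. These are hypothetical illustrations of the naming rule only; they assume the org prefix itself contains no underscore, as in the examples above.

```javascript
'use strict';

// Builds the Authentication ID a softphone must present, e.g. org_moijhj2l__09.
function sipAuthId(orgPrefix, ext) {
  return `org_${orgPrefix}__${ext}`;
}

// Splits an Authentication ID back into its parts; returns null if it does not match.
function parseSipAuthId(authId) {
  const m = /^org_([^_]+)__(.+)$/.exec(authId);
  return m ? { orgPrefix: m[1], ext: m[2] } : null;
}
```

A bare extension such as `09` fails `parseSipAuthId`, which is the same mismatch that makes Asterisk report no matching endpoint.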

Operations — route inspection and reboot behavior¶

After PR #148, kernel routes for customer LANs are auto-managed by the applier. Two things to know:

How routes flow at runtime:

Editor → POST/PATCH/DELETE /customer-tunnels → applyWg1Config(), which:
1. Renders wg1.conf (peers include customer_lan_cidr in AllowedIPs)
2. Atomically writes it and runs wg syncconf wg1 <(wg-quick strip wg1) (cryptokey layer)
3. Calls syncCustomerLanRoutes(), which diffs desired vs current routes via ip -4 route show dev wg1, then runs ip route add/del to converge

The result is returned to the caller as apply.route_sync.{added, removed, unchanged, errors} and surfaced in the Editor UI as a warning toast if any errors occurred.
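
The diff step of the route sync can be pictured as a set comparison. The sketch below is not the code from PR #148; `diffRoutes` and its field names are assumptions, and it expects the caller to have already excluded the kernel-managed 10.20.0.0/16 route from `current`.

```javascript
'use strict';

// Compares the customer LAN CIDRs we want routed via wg1 with what is installed now.
function diffRoutes(desired, current) {
  const want = new Set(desired);
  const have = new Set(current);
  return {
    toAdd: [...want].filter((cidr) => !have.has(cidr)),      // needs `ip route add`
    toRemove: [...have].filter((cidr) => !want.has(cidr)),   // needs `ip route del`
    unchanged: [...want].filter((cidr) => have.has(cidr)),   // already converged
  };
}
```

Because the result depends only on the two sets, running the sync twice in a row produces empty `toAdd` and `toRemove` lists the second time, which is what makes the operation idempotent.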

Inspect routes manually:

ssh root@147.93.168.216 'ip -4 route show dev wg1'
# Expect:
#   10.20.0.0/16 proto kernel scope link src 10.20.0.1   ← wg-quick manages this
#   192.168.0.0/24 scope link                            ← V7 LAN, syncCustomerLanRoutes manages this
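
When scripting this check, the same output can be reduced to the list of routes the applier owns. A minimal sketch, assuming (as in the expected output above) that the wg-quick route is the one marked `proto kernel`; the function name is hypothetical.

```javascript
'use strict';

// Extracts destination CIDRs from `ip -4 route show dev wg1`,
// skipping kernel-installed routes such as the 10.20.0.0/16 interface route.
function parseManagedRoutes(stdout) {
  return stdout
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !/\bproto kernel\b/.test(line))
    .map((line) => line.split(/\s+/)[0]);
}
```

For the healthy V7 state shown above this returns a single entry, `192.168.0.0/24`; an empty list means the customer LAN route is missing.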

Reboot behavior:

ip route add is kernel runtime state (not persisted to disk).
On reboot, wg-quick@wg1.service starts before astrapbx and reads /etc/wireguard/wg1.conf.
For each peer, wg-quick installs routes for the AllowedIPs entries, so 192.168.0.0/24 gets re-added automatically.
Persistence is therefore at the wg1.conf level (which we DO write to disk via the applier's atomic-write protocol); the kernel routing table is rebuilt at boot from that source of truth.

Recovery for V7 if route somehow goes missing without a reboot:

# Any tunnel mutation triggers route-sync; simplest is a no-op PATCH:
curl -X PATCH https://devpbx.astradial.com/api/v1/customer-tunnels/<V7-ID> \
  -H "Authorization: Bearer <token>" -H "Content-Type: application/json" \
  -d '{"notes":"trigger route resync"}'
# Or manually:
ssh root@147.93.168.216 'ip route add 192.168.0.0/24 dev wg1'

Diagnostic findings from V7 session¶

WG kernel statistics confirmed cryptokey routing rejection: cat /sys/class/net/wg1/statistics/rx_errors was 48 after V7 phone tried to register, while rx_dropped stayed 0 → confirms encrypted packets arrived but failed source-IP validation.
tcpdump on eth0 udp port 51821 showed 656-byte packets from 120.60.105.158 (V7's CGNAT IP) — these are encrypted SIP REGISTER attempts that get dropped post-decryption.
tcpdump on wg1 showed 0 packets matching the SIP filter — confirming nothing made it past the WG layer.
WG endpoint roaming verified: V7's apparent source IP changed from 103.197.113.158:33252 (CGNAT'd, ephemeral port) to 120.60.105.158:51820 (different CGNAT pool, listen port) mid-session without any tunnel disruption.
Asterisk PJSIP transports bind 0.0.0.0:5080 (UDP+TCP) and 0.0.0.0:5060 (UDP) — so destination-side wouldn't be the bottleneck if SNAT were correct.

Bug fixes shipped during V7 onboarding session¶

PR	Bug	Fix
#142	`subnetAllocator.js` destructured `Sequelize` from `models` registry — but registry doesn't expose it. POST /customer-tunnels 500'd with `TypeError: Cannot destructure property 'Op' of 'Sequelize' as it is undefined`.	Import `Op` directly from `sequelize` package. Added 4 regression tests exercising DB-aware code paths with real-shape mock.
#143	`CustomerTunnel.scope('withSecrets')` produced SQL with `preshared_key` listed twice (default scope's `SELECT *` already had it, plus the scope's `include: ['preshared_key']` added it again). mariadb driver rejected with `Error in results, duplicate field name preshared_key`. Caused applier to fail mid-create → tunnel marked `status=disabled`.	Change `withSecrets` to `attributes: { exclude: [] }` (clears default exclusion) instead of trying to re-include.

Interim workaround applied for V7¶

After PR #143 landed, V7's customer_tunnels row had status=disabled from the earlier failed apply. Recovery:

ssh root@147.93.168.216 'mariadb -uroot pbx_api_db -e "UPDATE customer_tunnels SET status=\"active\" WHERE name=\"V7_Tirupathur\";"'
ssh root@147.93.168.216 'cd /opt/astrapbx && node -e "const { applyWg1Config } = require(\"./src/services/network/wireguardApplier\"); applyWg1Config({ models: require(\"./src/models\") }).then(r => { console.log(JSON.stringify(r, null, 2)); process.exit(0); }).catch(e => { console.error(e.message); process.exit(1); });"'

This manual recovery should be turned into a proper recovery endpoint (e.g., POST /:id/retry-apply) in a future PR.

Migration verification on prod (run after Stage 2):

SHOW TABLES LIKE 'customer_tunnels';     -- expects 1 row
SHOW TABLES LIKE 'tunnel_metrics';       -- expects 1 row
SELECT COUNT(*) FROM customer_tunnels;   -- expects 0
SELECT COUNT(*) FROM tunnel_metrics;     -- expects 0
SELECT name FROM SequelizeMeta
  WHERE name LIKE '2026051212%' OR name LIKE '2026051220%';  -- expects 2 rows

Indexes confirmed on prod: customer_tunnels_org_name_unique (unique compound org_id, name), customer_tunnels_status, tunnel_subnet (unique), tunnel_metrics_tunnel_snapshot (compound tunnel_id, snapshot_at), tunnel_metrics_snapshot_at. FK auto-indexes covered by the compound unique on org_id first column — exactly the pattern that survived the MariaDB 11 gotcha.

V7 onboarding playbook (after feature ships)¶

1. Ops opens editor.astradial.com → V7 org → Settings → Network Tunnels
2. Click [+ Add Tunnel]
3. Name: "astradial-cloud" (default)
4. Paste V7's WireGuard public key (from GDMS — generated when V7 IT created the WG entry on GWN7002)
5. Submit
   → System allocates 10.20.7.0/30
   → System generates PSK
   → System writes peer block to wg1.conf and reloads
   → System updates V7's PJSIP endpoint match list to include 10.20.7.2/32 (via generator regen)
   → Returns success + customer-config modal
6. Click [View customer config], copy the [Peer] block
7. In GDMS → V7's network → Settings → VPN → WireGuard → Add (or Setup Wizard):
   - Interface: BSNL
   - Local IP: 10.20.7.2/30
   - Paste server config including Endpoint = 147.93.168.216:51821
   - Save & Apply
8. On cloud: `wg show wg1` should show V7's peer with a fresh handshake within ~10s
9. Verify V7's phones now register from 10.20.7.2 (visible in `pjsip show contacts`)

Observed failure modes¶

Dynamic public IP → intermittent "person at extension N is not available" + Grandstream BLF turning amber¶

Investigated 2026-05-24 against V7 Tirupathur. Observed symptom: operators on Grandstream IP phones inside V7's office saw the BLF light for a neighbouring extension turn amber/orange, and any inbound call to that extension played "person at extension N is not available" — both intermittent, both clearing on their own within a minute or two.

Root cause: V7's ISP hands them a dynamic public IP that gets reassigned multiple times per day. Each reassignment forces WireGuard to re-handshake from the new public IP, and during the recovery window (until the next keepalive cycle or REGISTER refresh hits cloud Asterisk through the re-established tunnel), the phones are unreachable from cloud's perspective.

Evidence from tunnel_metrics for V7 over 7 days:

Distinct WireGuard endpoint IPs:  6  (59.93.241.16, 117.251.44.75, 117.251.40.215,
                                       120.60.103.165, 182.60.16.181, 182.60.27.117)
IP transitions:                    18
Polls per day:                  ~1440 (every minute)
Average dwell per IP:           ~9 hours (some <1 h, longest ~24 h)
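
The figures above can be reproduced from raw tunnel_metrics rows. A minimal sketch, assuming rows carry `endpoint_ip` and `snapshot_at` and are sorted by `snapshot_at` ascending; `summariseEndpointChurn` is a hypothetical helper name.

```javascript
// Sketch: summarise endpoint-IP churn from tunnel_metrics-style rows.
// Assumes rows are sorted by snapshot_at ascending.
function summariseEndpointChurn(rows) {
  const distinct = new Set();
  let transitions = 0;
  let prev = null;
  for (const row of rows) {
    distinct.add(row.endpoint_ip);
    if (prev !== null && row.endpoint_ip !== prev) transitions++;
    prev = row.endpoint_ip;
  }
  const spanHours = rows.length > 1
    ? (new Date(rows[rows.length - 1].snapshot_at) - new Date(rows[0].snapshot_at)) / 3600000
    : 0;
  // N transitions split the observed span into N + 1 dwell periods.
  const avgDwellHours = spanHours / (transitions + 1);
  return { distinctIps: distinct.size, transitions, spanHours, avgDwellHours };
}

// Plain arithmetic for the V7 figures: 7 days, 18 transitions.
const v7AvgDwellHours = (7 * 24) / (18 + 1);
console.log(v7AvgDwellHours.toFixed(1)); // 8.8
```

18 transitions over 168 hours give 19 dwell periods of about 8.8 hours each, which is where the ~9 hour average comes from.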

The two operator-visible symptoms have the same root cause but different surfaces:

Surface	Mechanism
Grandstream BLF goes amber	The BLF subscription gets a SIP NOTIFY when Asterisk's `DEVSTATE` for the watched extension flips to `UNAVAILABLE`. Grandstream firmware renders that state as amber (or off, depending on model) instead of green.
"Person at extension N is not available"	Same `DEVSTATE=UNAVAILABLE` flag drives the dialplan's three-way branching from PR #297. The "offline" branch falls into `failover` → if no failover is configured, the announcement plays. See User Failover.

So both signals are driven by the same DEVSTATE flag — they're not a coincidence, they're the same underlying state surfacing in two places.

Why phones inside a tunnel still go UNAVAILABLE¶

The customer tunnel runs between the site router (e.g. Grandstream GWN7002 at V7) and the cloud VPS. The phones themselves sit behind that router on the customer LAN, and they REGISTER to cloud Asterisk through the tunnel. When the tunnel is disrupted:

Cloud Asterisk's pjsip qualify ping (every 60s, 3s timeout) fails to reach the phone over the broken tunnel route.
Cloud marks the contact Unreachable, then the endpoint Unavailable.
DEVSTATE for that extension flips to UNAVAILABLE, a BLF NOTIFY goes out to subscribers (the other phones in the office watching this extension), and their BLF lights go amber.
Any inbound call lands on the "unreachable" branch of the dialplan — plays the "not available" announcement when no failover is configured.

Once WireGuard re-handshakes from the new public IP (driven by persistent_keepalive=25 on the customer side, or by the phone's own SIP REGISTER refresh kicking traffic through the tunnel), routing recovers; qualify pings succeed on the next 60s cycle; DEVSTATE flips back to NOT_INUSE; BLF returns to green.
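
The qualify timing explains why the symptom is intermittent and clears within a minute or two. A rough model, under stated assumptions: qualify pings fire on a fixed 60s tick, a single failed ping (after the 3s timeout) is enough to mark the endpoint unavailable, and a single successful ping restores it. None of this is taken from the Asterisk configuration; `unavailableWindow` is a hypothetical helper for reasoning only.

```javascript
// Rough model of when cloud Asterisk sees an extension as UNAVAILABLE.
// Assumptions (not confirmed by the source): fixed 60s qualify ticks,
// one failed ping marks the endpoint unavailable, one success restores it.
const QUALIFY_INTERVAL_S = 60;
const QUALIFY_TIMEOUT_S = 3;

function unavailableWindow(outageStartS, outageEndS, firstQualifyS = 0) {
  // First qualify tick at or after time t.
  const tickAtOrAfter = (t) =>
    firstQualifyS +
    Math.ceil((t - firstQualifyS) / QUALIFY_INTERVAL_S) * QUALIFY_INTERVAL_S;

  const firstFailedTick = tickAtOrAfter(outageStartS);
  // Outage fell entirely between two pings: the cloud never notices.
  if (firstFailedTick >= outageEndS) return null;

  const markedUnavailableAt = firstFailedTick + QUALIFY_TIMEOUT_S;
  const recoveredAt = tickAtOrAfter(outageEndS);
  return {
    markedUnavailableAt,
    recoveredAt,
    durationS: recoveredAt - markedUnavailableAt,
  };
}

console.log(unavailableWindow(70, 130)); // { markedUnavailableAt: 123, recoveredAt: 180, durationS: 57 }
console.log(unavailableWindow(70, 110)); // null
```

Under this model a 60s tunnel disruption shows up as roughly a minute of amber BLF, while a 40s disruption that falls between two pings is never noticed, which matches the intermittent pattern operators reported.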

Fixes, ranked by impact¶

#	Fix	Effort	Impact
1	Get a static public IP from the customer's ISP	Operator-led conversation with customer	Eliminates the root cause entirely
2	Configure user failover on the most-called extensions so inbound calls route to a colleague during the unreachable window instead of dead-ending	~5 min per user in the editor	Masks the symptom for callers; doesn't change BLF
3	Lower SIP REGISTER expiry on the phones to 60–90s (Grandstream default is often 3600s). Faster re-REGISTER kicks the tunnel re-handshake sooner after an IP change	~5 min per phone in GS web UI	Cuts the unreachable window roughly in half
4	Lower WireGuard `persistent_keepalive` on the customer router from 25s → 15s	Operator on the customer router	Marginal — re-handshake already triggered by REGISTER traffic

If a customer can't get a static IP, fix #2 is the most operationally important — it stops the caller-facing announcement that operators perceive as a service failure. Fix #1 + #2 together is the ideal posture.

How to detect this pattern on prod¶

ssh root@147.93.168.216 'cd /opt/astrapbx && node -e "
const { Sequelize } = require(\"sequelize\");
require(\"dotenv\").config();
const sq = new Sequelize(process.env.DB_NAME, process.env.DB_USER, process.env.DB_PASSWORD, { host: process.env.DB_HOST, dialect: \"mariadb\", logging: false });
(async()=>{
  const TID = \"<tunnel-id>\";
  const all = await sq.query(\"SELECT endpoint_ip, snapshot_at FROM tunnel_metrics WHERE tunnel_id = :t AND snapshot_at > DATE_SUB(NOW(), INTERVAL 7 DAY) ORDER BY snapshot_at ASC\", { replacements: { t: TID }, type: sq.QueryTypes.SELECT });
  let prev = null, flips = 0;
  for (const r of all) { if (prev !== null && r.endpoint_ip !== prev) flips++; prev = r.endpoint_ip; }
  console.log(\"IP transitions in last 7d:\", flips, \"polls:\", all.length);
  await sq.close();
})();
"'

Replace <tunnel-id> with the customer's customer_tunnels.id. >0 transitions per day strongly indicates this pattern; >5/day means the customer's ISP is reassigning the IP frequently enough that operators will see the symptom multiple times per shift.
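
The thresholds above can be applied to the script's output directly. A small sketch; the verdict labels and the `classifyChurn` name are illustrative, not something the API returns.

```javascript
// Sketch: turn a transition count into a per-day rate and a verdict,
// using the >0/day and >5/day thresholds from the text.
// The verdict labels are illustrative names only.
function classifyChurn(transitions, windowDays = 7) {
  const perDay = transitions / windowDays;
  let verdict;
  if (perDay > 5) verdict = "frequent";       // symptom seen multiple times per shift
  else if (perDay > 0) verdict = "dynamic-ip"; // pattern present
  else verdict = "stable";
  return { perDay, verdict };
}

console.log(classifyChurn(18)); // V7: about 2.6 transitions/day, "dynamic-ip"
```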

Storage and retention (the "Bytes Rx / Tx (3d)" UI column)¶

The Network Tunnels section in the editor shows a Bytes Rx / Tx (3d) column. This is a frequent source of confusion — operators see large numbers (V7 shows ~378 MB / ~390 MB on 2026-05-24) and wonder if we're hoarding gigabytes of metrics on disk.

We're not. The displayed number is network traffic that crossed the tunnel in the last 3 days, NOT disk usage.

What the number is¶

Rx (received) = bytes the cloud server received from the customer LAN through the tunnel (voice RTP from the customer's phones, SIP signaling, qualify pings, etc.)
Tx (sent) = bytes the cloud sent to the customer LAN through the tunnel (caller RTP, BLF NOTIFY, qualify responses, etc.)
(3d) = a rolling delta computed server-side: (latest cumulative counter) - (cumulative counter 3 days ago)

The cumulative counter itself comes from the Linux WireGuard kernel module (wg show wg1), which counts every UDP byte that crosses the interface. We poll it every 60s and snapshot it into tunnel_metrics. The "3d" delta is the difference between the latest snapshot and the snapshot from 3 days ago: a sliding window that moves forward with every poll, not a counter that resets every 3 days. The actual (3d) window cap is enforced at the API layer (MAX_USAGE_WINDOW_DAYS) so callers can never request a window longer than what retention can supply.
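
A minimal sketch of that delta, assuming snapshots carry `snapshot_at`, `bytes_received` and `bytes_sent` and are sorted oldest first. The `maxWindowDays` parameter stands in for MAX_USAGE_WINDOW_DAYS, whose actual value is not stated here, and clamping negative deltas to zero (for a counter that restarted) is an assumption, not confirmed behaviour of the API.

```javascript
// Sketch: rolling usage delta from cumulative-counter snapshots.
// maxWindowDays stands in for MAX_USAGE_WINDOW_DAYS (value not given in the doc).
function usageDelta(snapshots, windowDays, nowMs, maxWindowDays) {
  const days = Math.min(windowDays, maxWindowDays);
  const cutoffMs = nowMs - days * 24 * 60 * 60 * 1000;
  const inWindow = snapshots.filter(
    (s) => new Date(s.snapshot_at).getTime() >= cutoffMs
  );
  if (inWindow.length < 2) return { rx: 0, tx: 0 };
  const first = inWindow[0];
  const last = inWindow[inWindow.length - 1];
  // Assumption: a negative delta means the kernel counter restarted; report 0.
  return {
    rx: Math.max(0, Number(last.bytes_received) - Number(first.bytes_received)),
    tx: Math.max(0, Number(last.bytes_sent) - Number(first.bytes_sent)),
  };
}

const DAY_MS = 24 * 60 * 60 * 1000;
const sample = [
  { snapshot_at: 0 * DAY_MS, bytes_received: 100, bytes_sent: 50 },
  { snapshot_at: 2 * DAY_MS, bytes_received: 500, bytes_sent: 300 },
  { snapshot_at: 4 * DAY_MS, bytes_received: 900, bytes_sent: 700 },
];
console.log(usageDelta(sample, 3, 4 * DAY_MS, 3)); // { rx: 400, tx: 400 }
```

The day-0 snapshot falls outside the 3-day window, so only the growth between day 2 and day 4 is reported.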

What we actually store on disk¶

api/src/services/network/wireguardStatusPoller.js:

Constant	Value	Purpose
`DEFAULT_RETENTION_DAYS`	`7`	Age cutoff — `tunnel_metrics` rows older than 7 days are deleted
`DEFAULT_RETENTION_PRUNE_INTERVAL_MS`	`60 * 60 * 1000` (1 hour)	How often the age + per-org-cap prune runs
per-row size	~150 bytes	snapshot_at, latest_handshake_at, endpoint_ip, endpoint_port, bytes_received, bytes_sent, peer_count_total + timestamps

Per tunnel: 7 days × 1440 polls/day × ~150 bytes ≈ ~1.5 MB on disk. For an instance with 11 tunnels, the total tunnel_metrics table is ~17 MB — negligible.
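
The same arithmetic as a quick sketch, using the retention and poll rate from the table and the approximate 150-byte row size; `estimateMetricsBytes` is an illustrative name.

```javascript
// Sketch: back-of-envelope size of tunnel_metrics.
const RETENTION_DAYS = 7;       // DEFAULT_RETENTION_DAYS
const POLLS_PER_DAY = 24 * 60;  // one poll every 60s
const APPROX_ROW_BYTES = 150;   // rough per-row estimate from the table above

function estimateMetricsBytes(tunnelCount) {
  return tunnelCount * RETENTION_DAYS * POLLS_PER_DAY * APPROX_ROW_BYTES;
}

console.log(estimateMetricsBytes(1));  // 1512000  (about 1.5 MB)
console.log(estimateMetricsBytes(11)); // 16632000 (about 17 MB)
```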

A second prune (_enforcePerOrgCap()) caps each org's total tunnel_metrics footprint at maxBytesPerOrg (default 10 MB / org) and deletes the oldest rows for any org that exceeds it. This is a belt-and-suspenders backstop behind the age prune in case polling ever runs faster than expected.

How to verify the prune is running¶

ssh root@147.93.168.216 'cd /opt/astrapbx && node -e "
const { Sequelize } = require(\"sequelize\");
require(\"dotenv\").config();
const sq = new Sequelize(process.env.DB_NAME, process.env.DB_USER, process.env.DB_PASSWORD, { host: process.env.DB_HOST, dialect: \"mariadb\", logging: false });
(async()=>{
  const r = await sq.query(\"SELECT COUNT(*) AS n, MIN(snapshot_at) AS oldest, MAX(snapshot_at) AS newest FROM tunnel_metrics\", { type: sq.QueryTypes.SELECT });
  console.log(r[0]);
  await sq.close();
})();
"'

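oldest should sit roughly 7 days back. Because the prune runs hourly, anything between 7 days and about 7 days + 1 hour is normal (a more recent oldest just means no tunnel has been polled for a full 7 days yet). If it's older than that, the prune isn't running; check pm2 logs for [wg-poller] error lines.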
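
The check can be expressed as a small predicate over the `oldest` value. A sketch only: the 5-minute slack and the `pruneLooksHealthy` name are assumptions added for illustration.

```javascript
// Sketch: decide whether the age prune appears to be running, given the
// oldest snapshot_at in tunnel_metrics.
// slackMs is an assumed tolerance, not a value from the codebase.
function pruneLooksHealthy(
  oldestSnapshotAt,
  now = new Date(),
  retentionDays = 7,               // DEFAULT_RETENTION_DAYS
  pruneIntervalMs = 60 * 60 * 1000, // DEFAULT_RETENTION_PRUNE_INTERVAL_MS
  slackMs = 5 * 60 * 1000
) {
  const ageMs = now.getTime() - new Date(oldestSnapshotAt).getTime();
  const limitMs = retentionDays * 24 * 60 * 60 * 1000 + pruneIntervalMs + slackMs;
  return ageMs <= limitMs;
}

const checkedAt = new Date("2026-05-24T12:00:00Z");
console.log(pruneLooksHealthy("2026-05-17T11:30:00Z", checkedAt)); // true  (7d 30m old)
console.log(pruneLooksHealthy("2026-05-16T12:00:00Z", checkedAt)); // false (8d old)
```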

TL;DR¶

Number	What it is	Where it lives
378 MB Rx + 390 MB Tx (3-day delta, in the UI)	Voice / SIP traffic that flowed through the tunnel	Linux WireGuard kernel counter (live, read on demand)
~1.5 MB / tunnel	Snapshots of that counter over 7 days for the delta calculation	`tunnel_metrics` table on prod MariaDB

The big number is bandwidth, not storage. Retention is implemented and verified — tunnel_metrics never grows past ~7 days × poll rate per tunnel.

Operations¶

Daily verification¶

ssh root@147.93.168.216 'wg show wg1'

Look for: each active customer has a handshake within the last few minutes.
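
For a scripted version of this check, `wg show wg1 latest-handshakes` prints one line per peer: the public key, a tab, and the Unix timestamp of the last handshake (0 if there has never been one). A minimal parser sketch; the 180s threshold is an assumed reading of "the last few minutes", and the peer keys below are placeholders.

```javascript
// Sketch: flag peers whose last handshake is older than maxAgeS.
// Input is the text output of `wg show wg1 latest-handshakes`.
// maxAgeS = 180 is an assumed threshold, not an Astradial setting.
function findStalePeers(latestHandshakesOutput, nowEpochS, maxAgeS = 180) {
  return latestHandshakesOutput
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean)
    .map((line) => {
      const [publicKey, ts] = line.split(/\s+/);
      const handshake = Number(ts);
      // A timestamp of 0 means the peer has never completed a handshake.
      const ageS = handshake === 0 ? Infinity : nowEpochS - handshake;
      return { publicKey, ageS };
    })
    .filter((peer) => peer.ageS > maxAgeS);
}

const sampleOutput = [
  "PLACEHOLDER_KEY_A=\t1000000",
  "PLACEHOLDER_KEY_B=\t999000",
  "PLACEHOLDER_KEY_C=\t0",
].join("\n");

console.log(findStalePeers(sampleOutput, 1000060));
// Peer A (60s old) passes; B (1060s old) and C (never) are reported.
```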

Rolling key for a customer (compromise scenario)¶

Editor → V7 → Network Tunnels → Disable existing tunnel
Delete (after operator confirms)
Customer regenerates WG keys on their side (new pubkey)
Operator creates a new tunnel with the new pubkey
New tunnel comes up. The subnet is the same only if the allocator can reuse the previously revoked slot (30d elapsed since revocation); otherwise the tunnel gets a fresh one

Customer subnet exhaustion¶

Capacity is 16,384 /30s under 10.20.0.0/16. We're nowhere near this. If reached: expand to 10.21.0.0/16 etc. — generator and allocator support multiple pool CIDRs.
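
The capacity figure is plain prefix arithmetic. A short sketch; the sequential slot numbering in `subnetForSlot` is an assumption made for illustration and may not match how the real allocator orders slots.

```javascript
// Sketch: how many /30 subnets fit in a pool, and which /30 a slot maps to.
function slash30Capacity(poolPrefixLength) {
  return 2 ** (30 - poolPrefixLength);
}

// Assumption: slots are numbered sequentially from the start of 10.20.0.0/16.
function subnetForSlot(slot, secondOctet = 20) {
  if (!Number.isInteger(slot) || slot < 0 || slot >= slash30Capacity(16)) {
    throw new Error("slot out of range for a /16 pool: " + slot);
  }
  const offset = slot * 4; // each /30 spans 4 addresses
  return `10.${secondOctet}.${Math.floor(offset / 256)}.${offset % 256}/30`;
}

console.log(slash30Capacity(16)); // 16384
console.log(subnetForSlot(448));  // 10.20.7.0/30
```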

Rollback the feature entirely¶

Pre-deploy DB snapshot (taken before any prod schema change):

Item	Value
Path on prod	`/root/pre-customer-tunnels-2026-05-12T115633Z/pbx_api_db.sql.gz`
Off-box copy	`~/AstradialBackups/pre-customer-tunnels-2026-05-12T115633Z/` (Hari's MacBook)
Size	290 KB gzipped
Tables captured	30 (matches live `information_schema.tables` count)
Verified	gzip integrity OK; `-- Dump completed on 2026-05-12 17:26:33` footer present; 30× `CREATE TABLE`
Flags used	`--single-transaction --quick --skip-lock-tables --routines --triggers --events`
Retain until	2026-08-12 (3 months)

Steps

Revert merge to main on astradial-platform (via GitHub PR)
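npx sequelize-cli db:migrate:undo on prod (and staging), run twice: each invocation reverts only the most recent migration, so the first run undoes tunnel_metrics and the second undoes customer_tunnels (assuming these are still the two latest migrations)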
systemctl stop wg-quick@wg1 on both VPSes
Remove /etc/wireguard/wg1.conf, /etc/wireguard/wg1.private, /etc/wireguard/wg1.public, /usr/local/sbin/customer-tunnels-iptables.sh
ufw delete allow 51821/udp
Any active customer tunnels revert to direct-cloud registration (their phones already have this as a fallback)

If migration rollback fails — restore from snapshot:

ssh root@147.93.168.216 'pm2 stop astrapbx workflow-engine pipecat-flow editor'
ssh root@147.93.168.216 'zcat /root/pre-customer-tunnels-2026-05-12T115633Z/pbx_api_db.sql.gz | mariadb -uroot pbx_api_db'
ssh root@147.93.168.216 'pm2 restart astrapbx workflow-engine pipecat-flow editor'

RTO estimate: ~30 seconds for restore on a 290 KB dump + ~10 s for pm2 cycle = under 1 minute downtime.

V7 — Network Architecture & Resilience — the customer scenario that drove this feature
NUC WireGuard (existing wg0) — the pattern this extends
Multi-Tenant Architecture — org isolation model
Network & Security — overall network topology
Org Management — where this fits in the broader admin UX