V7 — Network Architecture & Resilience Plan¶

Companion to v7-setup.md. Captures the real network architecture at V7 (different from the original target), the multi-WAN + CGNAT challenges that came out of debugging, and the resilience plan (phone-side + WireGuard tunnel).

Read this before doing anything network-related at V7. The original v7-setup.md describes the architectural target; this doc describes the architectural reality.

Status¶

Area	State
Multi-WAN architecture identified	✅ BSNL (port 3) + Rail/RailTel (port 4) via GWN7002
CGNAT confirmed on both ISPs	✅ V7 WAN IPs are private (192.168.x.x); public IPs are carrier-controlled
Phone-side resilience settings	🟡 Applied on ext 108 (192.168.0.69) — pending rollout to other 17
TCP transport attempt	❌ Tried, broke calls, reverted to UDP. Root cause likely endpoint config bound to UDP transport
Server-side AOR tuning	⚠️ Deferred — should go via dialplanGenerator + per-org profile, not per-org config edit
WireGuard tunnel plan	📋 Designed, not yet implemented
GDMS Cloud remote access	✅ `gdms.cloud/gwn` — V7 org accessible remotely
BSNL number strategy	✅ Decided: keep with CFA (`2108065978007#`) forwarding to Tata DID
Public number	✅ `+918065978007` (Tata DID, already routes end-to-end)
Grandstream + BSNL meeting	🚫 Cancel — no longer needed under new architecture

Architectural reality (what's actually deployed)¶

Network topology¶

              ISP-A: BSNL fibre
                 │ CGNAT: public IP managed by carrier (e.g. 59.93.255.93)
                 │ WAN IP (private): 192.168.101.43
                 ▼
              GWN7002 router (port 3)
                                                             ┌─→ LAN 192.168.0.0/24
              GWN7002 router (port 4)                        │   ├── UCM6301 (.60)
                 ▲                                           │   ├── GRP2636 front desk (.76, ext 09)
                 │ WAN IP (private): 192.168.1.33            │   ├── GHP621 hotel-room phones (.62–.81)
                 │ CGNAT: public IP (e.g. 103.197.113.158)   │   ├── GRP2602P common-area phones
              ISP-B: RailTel (Indian Railways)              ─┘   ├── 2× GWN7802 switches
                                                                 └── 6× WiFi APs (V7 HOTEL + THIRUPATHI BHIMAS SSIDs)

What "IP changing" actually means at V7¶

There is no single dynamic IP rotating. There are two ISPs, each with their own CGNAT public IP, and the GWN7002 fails over between them. When we saw V7 appear at 103.197.113.158 then 59.93.255.93:

103.197.113.158 ≈ Rail CGNAT public IP
59.93.255.93 ≈ BSNL CGNAT public IP

Plus, each carrier may rotate their CGNAT pool IPs independently. So V7's public-facing IP can change for two distinct reasons:

GWN7002 fails over between BSNL and Rail (multi-WAN behavior)
The carrier rotates the CGNAT pool (carrier-side, V7 has zero control)

Both can happen multiple times per day. This is why phone-side keepalive is necessary but not sufficient on its own.
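To tell the two causes apart when reading a history of observed public IPs, a small helper can label each address by carrier and count the transitions. This is only a sketch: the function names are hypothetical, and the `59.93.` / `103.197.` prefixes are assumptions taken from the two addresses listed above, which may stop holding if a carrier moves V7 to a different pool.

```shell
# Label a public IP by carrier. Prefixes are assumptions taken from the two
# addresses observed so far; a carrier can move V7 to another pool at any time.
wan_for_ip() {
  case "$1" in
    59.93.*)   echo BSNL ;;
    103.197.*) echo Rail ;;
    *)         echo unknown ;;
  esac
}

# Read observed public IPs (one per line, oldest first) and count how many
# changes were WAN failovers (carrier changed) versus CGNAT pool rotations
# (same carrier, different IP).
classify_changes() {
  local prev_ip="" prev_wan="" ip wan failovers=0 rotations=0
  while IFS= read -r ip; do
    wan="$(wan_for_ip "$ip")"
    if [ -n "$prev_ip" ] && [ "$ip" != "$prev_ip" ]; then
      if [ "$wan" != "$prev_wan" ]; then
        failovers=$((failovers + 1))
      else
        rotations=$((rotations + 1))
      fi
    fi
    prev_ip="$ip"
    prev_wan="$wan"
  done
  echo "failovers=$failovers rotations=$rotations"
}
```

Feeding it the source IPs seen for one extension over a day gives a rough idea of whether failover or pool rotation dominates.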

SIP/PBX architectural drift from v7-setup.md target¶

The v7-setup.md "Architecture target" section describes the UCM6301 hosting all extensions internally with one SIP trunk to cloud. In reality, almost all hotel phones register directly to cloud, bypassing the UCM as a trunk bridge.

Evidence (live pjsip show endpoints output):

Multiple extensions (04, 09, 102, 103, 105, 106, 108, 109, 110, 111) register individually to devsip.astradial.com:5080
Each phone arrives from a different NAT'd port (random high port from carrier CGNAT)
The UCM trunk (org_moijhj2l_trunk1777732151626) exists and is registered, but is NOT used as the trunk for these phones
Some extensions had accumulated multiple stale contacts from old CGNAT IPs (zombie contacts)

Result: 18+ phones × WAN problems instead of 1 trunk × WAN problem. Each phone independently fights CGNAT pinholes, multi-WAN failover, and IP rotation.

Phone inventory (from GDMS Zero Config)¶

24 Grandstream devices discovered, 0 currently registered to UCM (all bypass it):

Model	Count	Purpose	Firmware
GHP621	~14	Hotel-room hospitality phones (basic, no network-change detection)	1.0.1.75
GRP2602P	4	Reception / restaurant / kitchen / common-area phones	1.0.5.55, 1.0.7.64
GRP2636	1	Front desk, ext 09 (MAC `000EC4F5E975` at `192.168.0.76`)	1.0.13.31

Per-extension IP mapping (V7-confirmed):

Ext	User	LAN IP
01	Restaurant	(TBD — update needed)
02	Ocloud CAFE	192.168.0.81
03	Kitchen	192.168.0.62
04	Purchase	192.168.0.73
05	Housekeeping	192.168.0.78
07	MD	192.168.0.80
09	Reception	192.168.0.76
101	Rooms	192.168.0.63
102	Rooms	192.168.0.77
103	Rooms	192.168.0.64
104	Rooms	192.168.0.65
105	Rooms	192.168.0.66
106	Rooms	192.168.0.67
107	Rooms	192.168.0.68
108	Rooms	192.168.0.69
109	Rooms	192.168.0.70
110	Rooms	192.168.0.71
111	Rooms	192.168.0.72

All phones use static IPs (no DHCP) on the LAN. Gateway is 192.168.0.60 (UCM) — but phones connect through it as L3 only; UCM does not handle their SIP registration.

Resilience approach — three layers¶

Layer 1: Phone-side keepalive (immediate, no prod changes)¶

Applied on each phone via web admin (or rolled out via UCM Zero Config). The settings target fast recovery from CGNAT rotation / WAN failover by keeping NAT pinholes alive and re-registering quickly.

Account 1 → SIP Settings → General:

Setting	Value	Why
REGISTER Expiration	30	Re-register every 30s for fast recovery on IP change
Re-Register before Expiration	10	Proactive refresh 10s before expiry
Registration Retry Wait Time	5	Retry every 5s on failure (was 20s)
Enable OPTIONS Keep-Alive	✅ Enable	Holds NAT pinhole open + fast failure detection
OPTIONS Keep-Alive Interval	15	Probes faster than most CGNAT timeouts (~30s)
OPTIONS Keep-Alive Max Retries	3	Default fine
SUBSCRIBE for Registration	OFF	Not needed, adds noise without IP-change benefit
SIP Transport	UDP	TCP attempted, calls broke — see below

Phone Settings → Basic Settings:

Setting	Value
Keep-Alive Interval	15
STUN Server	`stun.l.google.com:19302`
Use Random Port	✅

Effect: After any WAN-side IP change, the phone re-registers from the new public IP within ~15–30 seconds. Outbound calls work immediately; inbound calls work after first OPTIONS round-trip completes.

Limitation without server-side complement: stale contacts from prior IP accumulate on cloud (max_contacts > 1 by default). Asterisk may still try the old contact first for ~60s after a flip → intermittent inbound failures during that window.

Layer 2: Server-side AOR tuning (deferred to per-org profile system)¶

The natural complement to phone-side keepalive would be:

; in V7's AOR sections
minimum_expiration = 30
maximum_expiration = 30
qualify_frequency = 30
max_contacts = 1
remove_existing = yes

Decision: Do NOT apply per-org by direct config-file edit. Instead, build a resilience-profile system in dialplanGenerator.js:

Profile types: default / dynamic-ip / multi-wan / mobile-friendly
Per-org assignment stored in DB (or JSON during transition)
Editor UI to flip profile per org
Regeneration triggers reload

This avoids the "can't version-control per-org hand edits" problem and lets battery-sensitive orgs opt out. Not built yet. V7 runs without Layer 2 for now; accepts the ~30–60s zombie-contact window after each WAN flip.
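As an illustration of the profile idea only (not the dialplanGenerator.js implementation, which does not exist yet), the mapping from a profile name to generated AOR options could look like the shell sketch below. The function name is hypothetical, and attaching the five AOR values listed earlier to the `multi-wan` profile is an assumption; this doc does not define values for the other profiles, so the sketch refuses them rather than guessing.

```shell
# Emit the AOR option overrides for a resilience profile.
# Assumption: the five values from this section belong to "multi-wan".
aor_options_for_profile() {
  case "$1" in
    default)
      ;;  # no overrides; whatever the generator emits today stays as-is
    multi-wan)
      printf '%s\n' \
        'minimum_expiration = 30' \
        'maximum_expiration = 30' \
        'qualify_frequency = 30' \
        'max_contacts = 1' \
        'remove_existing = yes'
      ;;
    dynamic-ip|mobile-friendly)
      echo "profile '$1' is named in the plan but has no values yet" >&2
      return 2
      ;;
    *)
      echo "unknown profile: $1" >&2
      return 1
      ;;
  esac
}
```

Keeping the values in one generated mapping is what makes them version-controllable, which is the point of deferring Layer 2 rather than hand-editing V7's config.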

Layer 3: WireGuard tunnel (planned)¶

Eliminates CGNAT entirely from the SIP path by giving cloud Asterisk a stable view of V7 at a fixed tunnel IP.

Architecture¶

GWN7002 (V7) ──WireGuard tunnel (over BSNL WAN)──→ Astradial Cloud
  tunnel IP: 10.30.7.2                              tunnel IP: 10.30.7.1
  subnet:    10.30.7.0/30                           UDP port:  51820

Cloud Asterisk sees V7 SIP traffic arriving from 10.30.7.2 ALWAYS,
regardless of which ISP CGNAT IP V7 is currently behind.

Why this works despite single-WAN binding limitation¶

GDMS WireGuard config only allows binding to ONE WAN (BSNL or Rail — no "Any"/"Auto"). The tunnel must pick one. But WireGuard is identity-based, not IP-based:

Failure scenario	Tunnel behavior
BSNL's CGNAT public IP rotates	✅ Tunnel survives — WG only checks crypto identity
BSNL has packet loss / slowdown	✅ Tunnel survives — WG retries
BSNL goes briefly down (<10s)	⚠️ Tunnel pauses, reconnects
BSNL completely down for minutes	❌ Tunnel dies. Need fallback.

For the BSNL-fully-down case: phones use dual SIP accounts — Account 1 via tunnel, Account 2 direct to cloud over whichever WAN is up. When BSNL dies, Account 2 takes over via Rail.

Tunnel subnet plan¶

Existing WireGuard subnets at Astradial:

NUC tunnel: 10.10.10.0/24 (cloud .1, NUC .2), already in use

V7 allocation: 10.30.7.0/30 point-to-point (4 IPs, only 2 usable)

Cloud peer: 10.30.7.1
V7 peer: 10.30.7.2

Pattern for future customer tunnels: 10.30.N.0/30 where N is a per-customer identifier.
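The allocation pattern can be written down as a tiny helper so that future tunnels follow the same convention. This is a sketch with a hypothetical function name; it assumes N is a single octet (0–255) and that the cloud always takes `.1` and the customer `.2`, as in the V7 allocation.

```shell
# Print the /30 plan for customer identifier N, following 10.30.N.0/30.
tunnel_plan() {
  local n="$1"
  case "$n" in
    ''|*[!0-9]*|0[0-9]*)
      echo "usage: tunnel_plan <0-255>" >&2
      return 1
      ;;
  esac
  if [ "$n" -gt 255 ]; then
    echo "customer id must fit in one octet (0-255)" >&2
    return 1
  fi
  printf 'subnet=10.30.%s.0/30 cloud=10.30.%s.1 peer=10.30.%s.2\n' "$n" "$n" "$n"
}

tunnel_plan 7
# prints: subnet=10.30.7.0/30 cloud=10.30.7.1 peer=10.30.7.2
```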

Cloud-side setup steps (NOT YET EXECUTED — requires explicit "yes proceed")¶

Step 0 — Read existing config (read-only)

ssh root@147.93.168.216 'cat /etc/wireguard/wg0.conf'
ssh root@147.93.168.216 'wg show'
# In-kernel WireGuard sockets typically carry no process name, so match on the listen port
ssh root@147.93.168.216 'ss -unl | grep 51820'

Step 1 — Generate keys for V7 peer

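ssh root@147.93.168.216
cd /etc/wireguard
umask 077
# Pre-shared key: goes into both the cloud peer block and the GDMS peer form
wg genpsk > v7_preshared.key
# Keypair: only needed if the router does not keep its GDMS-generated keys
wg genkey | tee v7_private.key | wg pubkey > v7_public.key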

Step 2 — Backup + add V7 peer

cp /etc/wireguard/wg0.conf /etc/wireguard/wg0.conf.bak-$(date +%F)-pre-v7

Add to /etc/wireguard/wg0.conf:

# === V7 (VSEVEN HOTELS) peer ===
[Peer]
PublicKey = <V7_public_key_from_GDMS>
PresharedKey = <generated_psk>
AllowedIPs = 10.30.7.2/32
PersistentKeepalive = 25
# No Endpoint — V7 is behind CGNAT and initiates the connection to us

Step 3 — Hot-reload

wg syncconf wg0 <(wg-quick strip wg0)
wg show wg0

Step 4 — Update V7 endpoints in /etc/asterisk/pjsip_vseven_hotels.conf

Backup first
Adjust match lines to include 10.30.7.0/30
Reload with asterisk -rx "pjsip reload"

Rollback (if needed)

cp /etc/wireguard/wg0.conf.bak-<date>-pre-v7 /etc/wireguard/wg0.conf
wg syncconf wg0 <(wg-quick strip wg0)
# Similar restore for pjsip config
asterisk -rx "pjsip reload"

Router-side setup (via GDMS)¶

GDMS → Settings → VPN → WireGuard® → Add (or use Setup Wizard)

Form values:

Field	Value
Name	astradial-cloud
Status	Enable (after full setup)
Interface	BSNL (only single-WAN binding available)
Listening Port	51820
Local IP / Mask	10.30.7.2 / 30
Private Key	(pre-generated by GDMS, leave alone)
Public Key	Copy this (needed for the cloud's peer block)
MTU	1420

Peer block (separate step after Save):

Field	Value
Peer Public Key	(cloud's WireGuard public key, get from wg show wg0)
Pre-Shared Key	(from cloud's v7_preshared.key)
Allowed IPs	10.30.7.1/32, 147.93.168.216/32
Endpoint	147.93.168.216:51820
Persistent Keepalive	25

Investigation captured during planning¶

GDMS WireGuard config form was inspected. Key findings:

Keys are auto-generated by GDMS (no need to generate them manually router-side)
Interface (WAN binding) only offers BSNL or Rail, with no "Any" or "Auto" option
Peer/endpoint configuration is NOT visible in the initial Add form; it appears in a separate step or via Setup Wizard
Setup Wizard offers four protocols: OpenVPN, WireGuard, IPSec (Site-to-Site), PPTP
IPSec Site-to-Site explicitly advertises "auto-rebuild on WAN IP change", which is worth considering as an alternative if the WireGuard implementation hits limits; our cloud already runs WG, so default to WG

Layer 4: UCM bridge architecture (deferred, not chosen for this round)¶

The "ideal" architecture from v7-setup.md was UCM as the trunk bridge: all 18 phones register locally to UCM (over LAN — no NAT, no CGNAT issues), UCM has one trunk to cloud. This means 1 trunk × WAN problem instead of 18.

Considered and deferred for this iteration because:

Requires creating extensions on UCM for all 18 phones (currently 0 registered)
Requires Zero Config template rollout
Trunk-to-trunk gating issue (from v7-setup.md) needs Grandstream-installer resolution
Phone-side fix (Layer 1) covers ~80% of the problem at lower effort

Will re-evaluate if WireGuard tunnel + phone-side keepalive isn't enough.

Number strategy (decided)¶

Question	Decision	Rationale
Primary public number	`+918065978007` (Tata DID via Astradial)	Already routes end-to-end via cloud; bypasses all BSNL FXO/SIP-trunk complexity
What to do with `04175295093` (BSNL printed number)	Keep with CFA — `2108065978007#` activated on BSNL line	Preserves printed-number recognition for guests/banners; forwards to working path
Pending decisions	Local Number Portability (port `04175295093` → Tata) — future consideration	Eliminates BSNL dependency over time

What we stopped doing¶

The original v7-setup.md had several pending items that became moot with the new architecture:

❌ Grandstream installer meeting about UCM trunk-to-trunk gating — no longer needed
❌ BSNL engineer meeting about SIP-over-Ethernet upgrade — no longer needed
❌ DBC ONT admin password hunt — no longer needed (ONT not in voice path)
❌ UCM BSNL-AstraDial inbound route debugging — replaced by CFA + Tata DID flow

Cancel/repurpose external meetings.

TCP transport attempt (failed — keep notes for future)¶

Switched ext 108 from UDP to TCP. Registration succeeded but outbound calls failed. After revert to UDP, calls work normally.

Root cause not investigated yet, but most likely: V7's PJSIP endpoint config in /etc/asterisk/pjsip_vseven_hotels.conf has explicit transport = transport-udp. When phone registers over TCP, the contact stores ;transport=TCP, but Asterisk uses the endpoint's bound transport for outbound INVITE → transport/contact mismatch → INVITE never reaches phone.

To investigate later if we want TCP:

1. Read endpoint config in /etc/asterisk/pjsip_vseven_hotels.conf
2. Remove explicit transport binding OR add a separate TCP transport endpoint
3. Test with one phone first
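For the first investigation step, a read-only helper can list which endpoint sections carry an explicit transport binding. This is a sketch under assumptions: the function name is hypothetical, and it expects the usual pjsip.conf layout of `[section]` headers followed by `key = value` lines with `;` comments. It does not resolve values inherited from templates; a template section that sets both `type` and `transport` is simply listed under its own name.

```shell
# List "section transport" for every endpoint section that sets transport.
# $1 = path to a pjsip config file (read-only).
bound_transports() {
  awk '
    function emit() {
      if (sect_type == "endpoint" && transport != "") print section, transport
    }
    /^\[/ {
      emit()
      section = $0
      gsub(/^\[|\].*$/, "", section)
      sect_type = ""; transport = ""
      next
    }
    {
      line = $0
      sub(/;.*/, "", line)              # drop trailing comments
      eq = index(line, "=")
      if (eq == 0) next
      key = substr(line, 1, eq - 1); val = substr(line, eq + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (key == "type") sect_type = val
      if (key == "transport") transport = val
    }
    END { emit() }
  ' "$1"
}
```

Run on the cloud as `bound_transports /etc/asterisk/pjsip_vseven_hotels.conf`; any extension printed with `transport-udp` would be consistent with the suspected root cause.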

Cloud confirmed to listen on TCP/5080:

asterisk -rx "pjsip show transports"
# Transport: transport-tcp tcp 0.0.0.0:5080  ✅

Remote-access tooling¶

Tool	Purpose	URL / access
GDMS Cloud	Remote GWN7002 / UCM / phone management	`https://www.gdms.cloud/gwn` — V7 organization
GDMS — Devices	View / config router, phones, switches, APs	Same
GDMS — Settings → Internet	WAN priority, failover mode, health graphs	Same
GDMS — Settings → VPN	WireGuard / IPSec / OpenVPN setup	Same
GDMS — Settings → Firewall & Security	SIP ALG (look in Advanced Security Settings)	Same
Astradial Cloud Asterisk	SIP server, CDR, recording	`ssh root@147.93.168.216`

Operational notes¶

Verifying phone registrations¶

ssh root@147.93.168.216 'asterisk -rx "pjsip show contacts" | grep moijhj2l'

Look for:

Each ext should have ONE contact (multiple = zombies from prior WAN flips)
Status should be Avail (not Unavail)
Transport should be UDP (TCP didn't work)
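The one-contact-per-extension check can be automated with a small filter over the contacts listing. This is a sketch with a hypothetical function name; it assumes each contact line has the form `Contact:  <aor>/<sip-uri>  <hash>  <status>  <rtt>`, so confirm that against the live output before relying on it.

```shell
# Read "pjsip show contacts" output on stdin and print every AOR that has
# more than one contact, with its contact count.
zombie_contacts() {
  awk '
    $1 == "Contact:" && $2 ~ /\/sip:/ {
      split($2, parts, "/")
      count[parts[1]]++
    }
    END {
      for (aor in count) if (count[aor] > 1) print aor, count[aor]
    }
  ' | sort
}

# Usage (from a workstation):
#   ssh root@147.93.168.216 'asterisk -rx "pjsip show contacts"' \
#     | grep moijhj2l | zombie_contacts
```

Empty output means no extension currently holds stale contacts.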

After WireGuard goes live, verifying tunnel¶

ssh root@147.93.168.216 'wg show wg0'
# Should show V7 peer with recent handshake and traffic
ssh root@147.93.168.216 'ping -c 3 10.30.7.2'
# Should succeed once tunnel is up
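"Recent handshake" can be made concrete by reading the machine-readable `wg show <interface> dump` output, where each peer line is tab-separated and its fifth field is the latest handshake as a Unix timestamp (0 if there has never been one). The helper below is a sketch with a hypothetical name; it matches the peer by an exact allowed-ips string, which is assumed to be `10.30.7.2/32` as in the planned peer block.

```shell
# Read "wg show <iface> dump" on stdin and print the age in seconds of the
# latest handshake for the peer whose allowed-ips equals $1.
# $2 = current epoch seconds (defaults to now; overridable for testing).
# Exits non-zero if no such peer is present.
peer_handshake_age() {
  local allowed="$1" now="${2:-$(date +%s)}"
  awk -F'\t' -v want="$allowed" -v now="$now" '
    NF == 8 && $4 == want {
      if ($5 == 0) print "never"
      else print now - $5
      found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage:
#   ssh root@147.93.168.216 'wg show wg0 dump' | peer_handshake_age 10.30.7.2/32
```

With a persistent keepalive of 25 s on the tunnel, an age of more than a few minutes most likely means the tunnel is down rather than idle.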

Watching WAN flips¶

On GDMS → Settings → Internet — the WAN Health graph shows green/red segments over the last 12h. Frequent red segments on either WAN = unstable line, worth investigating from V7 IT side.

Incident: dual-WAN packet loss caused by the GWN7002 itself (not the ISPs)¶

Symptom: choppy/degraded calls; GDMS WAN Health showed both BSNL and Rail degrading and recovering at the same times — "when one is down both are down".

Diagnosis (from the cloud over wg1, after ssh root@147.93.168.216):

# Call-rate ping — 1pps ping LIES; loss only shows at ~50pps
ping -i 0.02 -c 500 -W1 192.168.0.1      # router: 14% loss, RTT spikes to 318ms
ping -i 1 -c 20 192.168.0.1              # same moment, 1pps: 0% loss
wg show wg1 endpoints                     # which ISP carries the tunnel (59.93.x=BSNL, 103.197.x=Rail)
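To make the 1pps-versus-call-rate comparison repeatable, the loss figure can be pulled out of ping's summary line. This is a sketch with hypothetical function names; it assumes the Linux iputils summary format (`... N% packet loss ...`) and root privileges for the 0.02 s interval.

```shell
# Extract the packet-loss percentage from ping output on stdin.
loss_pct() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Run ping against a host and print only the loss percentage.
# $1 = host, $2 = interval in seconds, $3 = packet count
ping_loss() {
  ping -i "$2" -c "$3" -W1 "$1" | loss_pct
}

# Usage (as root on the cloud, over the tunnel):
#   ping_loss 192.168.0.1 0.02 500   # call-rate probe
#   ping_loss 192.168.0.1 1 20       # 1pps probe, for comparison
```

A large gap between the two figures points at rate-dependent loss of the kind seen in this incident, which a 1pps check alone would miss.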

Key discriminators that ruled the ISPs out:

Loss occurred while WAN throughput was only ~1–2 Mbps (Device → Usage) — not bandwidth saturation.
Both WANs degraded simultaneously — two independent carriers don't fail in lockstep.
GDMS Device → Info showed load average ~3.5 sustained (dual-core box = ~175%) and CPU 93°C.

Root cause: the GWN7002 in a degraded state — CPU pinned at ~93°C for 18+ h (thermal throttling / accumulated flow-table state) dropping packets on both WANs in software.

Fix: reboot the router from GDMS (~2 min outage; all phones re-register on their own, verify with pjsip show contacts | grep moijhj2l). Loss went 14% → 0% at call rate.

Prevention / follow-ups:

Rack cooling — ~94°C even at 6 min uptime is environmental; the degraded state will recur until fixed.
Load average on the GWN7002 is a red herring. Post-reboot the box showed load ~3.4 with only 28% actual CPU (System Info via SSH CLI) — the load is D-state waiters from the flow engine, not CPU contention. Watch CPU % and temperature, not load average.
SIP ALG was already off; Hardware Acceleration already on (Configuration tab).
GDMS cloud QoS does not support the GWN7002 (GWN7062E/ET only) — QoS lives in the local UI.
Firmware: "New Firmware —" in Organization → Upgrade means already on latest; don't chase upgrades.

Router SSH/CLI access (learned during this incident):

GDMS → Devices → Router → Configuration tab → "Device Password" pushes the local admin password (no lockout risk; the web UI allows only a few attempts before a 15-min lock).
GDMS → Devices → Router → Debug tab → "SSH Remote Access" enables SSH; it auto-disables after 48 h.
ssh admin@192.168.0.1 (reachable from the cloud over wg1) lands in a menu-driven CLI — no busybox shell, no top. Real CPU%/memory/temp: Overview → System Info. Per-WAN ping/traceroute, ARP cache, link-trace and PoE diagnostics: Maintenance → System Diagnostics.

Open questions / pending items¶

Confirm whether GDMS WireGuard form has a separate Peer-configuration step (probably after Save, or via Setup Wizard) — needed to complete the tunnel
Identify the more stable WAN between BSNL and Rail from GDMS Internet → WAN Health history (12h+ data)
Decide whether to upgrade to dual-tunnel (Layer 3 Option B) if single-tunnel BSNL outages become painful
Build the per-org resilience profile system in dialplanGenerator.js (Layer 2)
SIP ALG state on GWN7002 (Firewall → Advanced Security Settings): resolved. It was found already off during the dual-WAN packet-loss incident, so no change is needed
Roll Layer 1 phone-side settings from ext 108 to other 17 phones (manual or Zero Config template)
Update v7-setup.md status board to reflect the deprecated BSNL-FXO path and new Tata-DID-primary path

Decisions log¶

Decision	Outcome	Date
Drop BSNL FXO integration; use Tata DID `+918065978007` as primary number	Done conceptually; CFA activation pending V7 IT	This iteration
Keep BSNL line active with CFA forwarding to Tata DID	Preserves printed number with zero work	This iteration
Cancel Grandstream + BSNL meeting	No longer needed under new architecture	This iteration
Defer UCM bridge architecture; use direct cloud registration	Lower effort; phone-side keepalive + WireGuard cover most issues	This iteration
Apply phone-side keepalive resilience settings, starting with ext 108	In progress	This iteration
Don't apply per-org server-side AOR edits; build profile system in generator instead	Avoids "can't version-control per-org hand edits"	This iteration
Use WireGuard for V7 tunnel (matches existing infra) over IPSec	WG keys are easier to manage; cloud already has WG service	This iteration
Single-tunnel on BSNL with dual-account fallback (not dual-tunnel)	Single tunnel covers ~99% of cases at lower complexity	This iteration
TCP transport attempt → revert to UDP	Endpoint config likely UDP-bound; investigation deferred	This iteration

V7 — Master setup info — Status board, credentials, equipment, contacts (the original master doc; treats V7 as single-WAN)
V7 — Meeting Prep Messages — Archived; the Grandstream/BSNL meeting is cancelled
V7 — Meeting Brief (Grandstream + BSNL) — Archived; superseded by Tata-DID-primary architecture
Fail2Ban Runbook — V7's CIDRs are whitelisted there
Troubleshooting — Error 55 (customer IP change → fail2ban storm) covers a closely related issue