SPA504G + pfSense NAT traversal configuration

Environment:

  • FreeSWITCH + FusionPBX hosted on a VPS with a public IP, so no NAT at the server end.
  • 3 local offices, each with multiple Cisco SPA504G phones (provisioned from FreeSWITCH) behind a pfSense router/firewall on a VDSL connection.
  • IPv4 only.

The system is quite new, having launched in production in June 2019 with current software/firmware on all components. Initially each phone was configured with SIP TCP port 5060 and 120s expiry. However, after about 2 weeks the server experienced critically high CPU and continuous failed registrations. These eventually hammered the CPU so badly that the server would drop registrations and lose its connection to the SIP trunk, and we saw poor call quality, dropped calls, the list goes on. I've tried both AWS and Vultr as VPS providers to host the server, with the same result. I also tried doubling the processors & memory: same result.

When looking at sngrep ^REG during the CPU spikes, phones were trying to register with a new port every second or two - the sngrep message count would just increment relentlessly. Not on all phones at once, but quite a few (I have about 30 in total).
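To put numbers on that pattern, here is a minimal Python sketch that tallies REGISTER attempts and distinct source ports per extension. It assumes the (extension, source port) pairs have already been pulled out of a capture by some other means; the function names and the sample values are made up for illustration and are not real data from my system.

```python
from collections import defaultdict

def summarise_registers(events):
    """Tally REGISTERs per extension.

    events: iterable of (extension, source_port) tuples, one per REGISTER seen.
    Returns {extension: (register_count, distinct_source_ports)}.
    """
    summary = defaultdict(lambda: {"count": 0, "ports": set()})
    for extension, source_port in events:
        entry = summary[extension]
        entry["count"] += 1
        entry["ports"].add(source_port)
    return {ext: (e["count"], len(e["ports"])) for ext, e in summary.items()}

def flag_port_churn(summary, min_ports=3):
    """Return extensions seen registering from at least min_ports different ports."""
    return sorted(ext for ext, (_, ports) in summary.items() if ports >= min_ports)

# Hypothetical sample: x311 keeps changing source port, x312 does not.
sample = [
    ("311", 40001), ("311", 40002), ("311", 40003),
    ("312", 5060), ("312", 5060),
]
result = summarise_registers(sample)
churning = flag_port_churn(result)
print(result, churning)
```

The min_ports threshold is arbitrary; the point is only to separate phones whose source port keeps changing from those that re-register from a stable port.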

We reconfigured the phones so every line had its own TCP port (5111, 5112, 5113, etc.). This stabilised the system for 1.5 to 2 weeks then the same problems, CPU spike etc., came back without any obvious trigger.

We reconfigured the phones so that, in addition to keeping the unique ports, they would use SIP UDP with an expiry of 800s. This has mostly stabilised the system but I still see the odd unexpected sngrep message 401 Unauthorized in phone registrations. I would like to further stabilise the system with an improved configuration, if possible.
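Worth noting that a 401 is also a normal step in SIP digest authentication: the phone sends a REGISTER without credentials, the server challenges with 401, and the phone retries with an Authorization header. So only a 401 that answers an already-authenticated REGISTER looks suspicious. A rough Python sketch of that distinction follows, assuming the messages of one registration have been reduced to (kind, has_authorization) pairs. That representation is an assumption for illustration, and the sketch ignores the stale-nonce case, where a 401 after an authenticated REGISTER can also be legitimate.

```python
def count_unexpected_401(messages):
    """Count 401 responses that answer a REGISTER which already carried credentials.

    messages: list of (kind, has_authorization) tuples in time order, where kind is
    "REGISTER", "401" or "200". has_authorization is only meaningful for REGISTER.
    """
    unexpected = 0
    last_register_had_auth = False
    for kind, has_authorization in messages:
        if kind == "REGISTER":
            last_register_had_auth = has_authorization
        elif kind == "401" and last_register_had_auth:
            unexpected += 1
    return unexpected

# Normal challenge/response exchange (hypothetical).
normal = [("REGISTER", False), ("401", False), ("REGISTER", True), ("200", False)]
# Credentials sent but rejected again (hypothetical).
rejected = [("REGISTER", False), ("401", False), ("REGISTER", True), ("401", False)]

normal_count = count_unexpected_401(normal)
rejected_count = count_unexpected_401(rejected)
print(normal_count, rejected_count)
```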

The following settings are in use:

  • STUN server: not currently used
  • NDLB-broken-auth-hash true
  • NDLB-force-rport true
  • NDLB-received-in-nat-reg-contact true
  • Settings as recommended by DigitalDaz in Cisco Tips and Tricks forum (RTP Packet Size 0.020, Handle VIA rport = Insert VIA rport = Yes)

My question: does anyone out there have a stable configuration they are willing to share for this specific combination of FreeSWITCH + multiple Cisco SPA5xxG phones behind pfSense with a DSL connection? Other suggestions welcome!
 

DigitalDaz

Administrator
Staff member
When it is doing this bad stuff, you really should be running sngrep at server level and capturing the traffic to see if there is anything obvious. You shouldn't have to mess with individual port allocation on the phones etc.

pfSense is also cool in that you can take packet captures from its Diagnostics menu, so you could capture what's going on at the pfSense end too and see the exact exchange between the two.
 
CPU spikes caused by registrations are really weird. I'm currently migrating to Fusion, but on our old Asterisk systems we have several hundred SPA5XX phones registering at 300s intervals via UDP and we barely move the needle on CPU (relating to registrations). This is all done on bare metal, though.

DigitalDaz is correct, though: all that port manipulation should be largely unnecessary. I've also found that with the Cisco phones, rport is needed when registering via UDP but not when registering via TCP (not sure why).

On my dev SPA phones all my SIP NAT parameters are set to NO (NAT mapping is enabled per line) and my reg is at 120 s via TCP. My dev setup is on a VM and is NATed at both the server and client side via pfSense.

I'd look at all the usual high-CPU troubleshooting techniques to make sure it's absolutely related to registrations. I'm also wondering if running on a shared VPS might be part of the problem. Is there any way to run on your own metal?

Hope that helps..
 
Hey thanks guys, that's helpful. My problem: software background not networking or voip; I've been learning both over the past few years but sngrep & packet captures remain compelling but elusive. I'd be interested in engaging the right expertise for hourly $$ to help me dive into this.

It was observing sngrep that led me to believe it was registration-related. I grabbed some screenshots at the time - see attached. This was back on TCP before moving to unique ports (all lines were set to TCP 5060 120s). In these screenshots I have focussed on two extensions registered to a single phone, x311 & x312. You will note the number of registration attempts by x311 is huge - 825 vs 18 for x312 - and I could see it increment every 1-2 seconds. Inside the session you can see clear differences although I hesitate to draw conclusions due to my lack of knowledge. Many of my phones did not display these problems, despite being same model, same firmware, same config.

Attachments: sngrep REG.png, sngrep REG x311a.png, sngrep REG x311b.png, sngrep REG x312.png

In terms of whether VPS is an issue: that's why I moved from AWS to Vultr. Granted they could share the same underlying architecture. The voip consultant I've been using to help me get set up has a larger FusionPBX running on Vultr with no problems. He believes my problems are unique to my combination of phones and router. But I'd love to figure it out. I completely agree that CPU should not be doing this!!!! But both the AWS CPU graph and the 'top' utility showed CPU well over 100% when this issue was at its worst - honestly it was horrific (my business is highly dependent on our phones). Looking at my prod server right now, in the middle of our business day with 3 active calls, CPU is 2.5%. Moving to bare metal is probably not an option in the short term.

Let me tell you my plan to work around this. I think this is related to NAT traversal though it's only a hunch. Therefore if I create a permanent site to site VPN between my offices and my Vultr network and point the phones to the Fusion internal IP, in theory my problems should disappear. I already have site to site OpenVPN between my offices and I've managed to spin up a pfsense server on Vultr so I just need to refresh my noggin and get it all working.

Anyway let me know if any of that sparks anything, or if you're interested in doing some paid interactive diagnostics. I'm in Australia so timezone differences can be inconvenient.
 
It's not really NAT; it's just what they call the keep-alive settings on that particular phone. I was thinking that if they're not getting a 200 OK back, maybe the phone has a bad or closed-off NAT state on the pfSense box. The keep-alive is really more for UDP than TCP, though.
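The reasoning about UDP can be sketched as simple arithmetic: a firewall state for a UDP flow only stays open if something crosses it more often than the firewall's idle timeout. The Python below is a minimal illustration of that check. The 60 s timeout and the 15 s keep-alive are placeholders picked for the example, not values read from pfSense or the SPA504G, so check the real settings before relying on it; the 800 s figure is the registration expiry mentioned earlier in the thread.

```python
def nat_mapping_survives(refresh_interval_s, nat_udp_timeout_s, margin_s=5):
    """Return True if traffic repeats often enough to keep a UDP NAT state open.

    refresh_interval_s: seconds between packets from the phone (re-REGISTER or keep-alive).
    nat_udp_timeout_s:  idle timeout of the firewall's UDP state (assumed, check your box).
    margin_s:           safety margin for jitter and retransmits (arbitrary).
    """
    return refresh_interval_s + margin_s <= nat_udp_timeout_s

ASSUMED_NAT_UDP_TIMEOUT_S = 60  # placeholder, not a measured pfSense value

# Relying on the 800 s registration expiry alone.
registration_only = nat_mapping_survives(800, ASSUMED_NAT_UDP_TIMEOUT_S)
# Adding a hypothetical 15 s keep-alive from the phone.
with_keepalive = nat_mapping_survives(15, ASSUMED_NAT_UDP_TIMEOUT_S)

print(registration_only, with_keepalive)
```

Under those assumed numbers the registration interval alone would let the state expire between refreshes, which is why the keep-alive matters more on UDP than on a long-lived TCP connection.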