Confusing CPU Stats

Incubugs · Nov 4, 2021

Just a quick question, if I may.

We use FusionPBX as a hosted platform, with many tenants running across several nodes. There are no issues, but I was trying to establish the resources it is currently using so I can estimate the maximum tenants / handsets per node.

On the dashboard it says it is using 0.74% CPU, sometimes rising to 1 or 2%. That's fine, plenty of resources left. However, when I look at Status / System Status it says that freeswitch is using 9.6% and fail2ban is using 2.5%, so which is right? Is the system status basically saying that freeswitch is using 9% of the 0.7 percent in use? We run these machines as VMs, and the VM host says it is using 2% of the VM CPU, so I'm just trying to establish which is correct.

Many Thanks
K

Adrian Fretwell · Nov 4, 2021

FusionPBX works out CPU usage by summing the percentage CPU used by each running process:

Code:

ps -A -o pcpu

It then counts the number of CPU cores:

Code:

grep -P '^processor' /proc/cpuinfo

Then it simply divides the sum of all the CPU usage by the number of CPU cores.
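As a rough sketch of that calculation (not the actual FusionPBX code), the two commands can be combined with awk. The helper name avg_cpu is made up for illustration, and the worked example uses the 92.6% total and 12 cores quoted further down this thread.

```shell
# avg_cpu is a hypothetical helper: it reads one %CPU value per line on stdin,
# sums them, and divides by the core count passed as the first argument.
avg_cpu() {
    awk -v cores="$1" '{ sum += $1 } END { if (cores > 0) printf "%.1f\n", sum / cores }'
}

# Live use on a Linux host (the trailing "=" suppresses the ps header line):
#   ps -A -o pcpu= | avg_cpu "$(grep -c '^processor' /proc/cpuinfo)"

# Worked example: 92.6% total process CPU spread over 12 cores.
printf '92.6\n' | avg_cpu 12
```

The worked example prints 7.7, which is the kind of figure the dashboard shows.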

It is a reasonable measure, but it does not show problem areas, such as one core being too busy because of a single-threaded process.

I use the top command within the VM to keep an eye on each CPU. top -i is useful because it hides idle processes. When top is running, hit the "1" key and it will show each CPU, screenshot below:

If you add up all the totals it comes to 92.6. This is basically the 93% you see being consumed by freeswitch. If you then divide this by 12 you get 7.7, which is the average usage across all the CPU cores and pretty much what FusionPBX reports.

What we need to look out for is any single core going into the high 90s.
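A minimal sketch of how that per-core check could be scripted, assuming mpstat from the sysstat package is installed. The helper name hot_cores and the sample figures are made up for illustration. It relies on %idle being the last column of the "Average:" lines.

```shell
# hot_cores is a hypothetical helper: it reads `mpstat -P ALL` output on stdin
# and prints every core whose busy time (100 - %idle) is at or above a limit.
hot_cores() {
    awk -v limit="${1:-90}" '
        $1 == "Average:" && $2 ~ /^[0-9]+$/ {
            busy = 100 - $NF          # %idle is the last column
            if (busy >= limit) printf "cpu%s %.1f\n", $2, busy
        }'
}

# Live use, five one-second samples:
#   LC_ALL=C mpstat -P ALL 1 5 | hot_cores 90

# Made-up sample output for a 2-core box, to show what gets flagged.
hot_cores 90 <<'EOF'
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   48.50    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   50.50
Average:       0    2.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   97.50
Average:       1   95.00    0.00    1.50    0.00    0.00    0.00    0.00    0.00    0.00    3.50
EOF
```

In the sample only cpu1 is reported, even though the average across both cores looks comfortable.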

I hope that helps.

Incubugs · Nov 4, 2021

Ahh, I see. So my CPU figure sitting at around 0.7 would mean I should be able to put a lot more traffic through this VM, as freeswitch is at 20%, nowhere near 90.

Incubugs · Nov 4, 2021

In fact, you may be able to advise what you think the maximum tenants / users would be.

I run a VM with 16 processor cores (Proxmox is the hypervisor). It's the only VM on the machine and has 60 GB of RAM and a 1 Gb leased circuit as the connection. I currently have 150 handsets on this node and it's sitting at 0.7 to 1% CPU as reported via the dashboard. I was thinking this node should be able to handle 1000 handsets. Each tenant has their own SIP trunk registered (not a global trunk) to make billing easier, so I understand that all those registered SIP trunks are going to use more resources than normal, but I was hoping for 1000 handsets per node. What's your thinking?

Adrian Fretwell · Nov 4, 2021

It is a tough question to answer. Many people struggle with this because there is no "rule of thumb"; each individual's situation is different. Not all customers are created equal. What I mean is some will just make calls, but others may have lots of BLFs etc., and this makes a big difference to how many handsets can go on a box.

I know there was a demonstration at Cluecon of a machine running 1 sofia profile and handling 1000 cps with 30,000 concurrent calls, but the pure call numbers are not the significant factor.

I'm sure your box will take a lot more traffic. The weakest link as I see it is mod_sofia. This is what handles SIP for FreeSWITCH. Simply handling an INVITE for a call is no big deal, but if it also has to generate XML and send BLF notifications to a hundred other phones to say the INVITE is happening, then it starts to become a big deal. So unless the handsets are all configured exactly the same, and all the users make exactly the same number of calls at exactly the same time intervals, it is impossible to say how many handsets any given system can handle.

But it's not all bad news, we just have to monitor the system as we add more handsets. One thing you can do is take a packet capture for 10 to 15 minutes when your system is busy, then in Wireshark look at the SIP statistics. Look out for retries, where a handset has sent a message to your box and then had to send it again because it did not get a response within the message timeout period. This can be an indication that your SIP profile is too busy and is having trouble processing all the SIP messages (do remember that packets can also get lost in the network).
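The same check can be roughed out from the command line with tcpdump and tshark instead of the Wireshark GUI. This is only a sketch: the interface name, port, capture file name and the helper sip_resend_pct are assumptions for illustration, and the counts in the example run are made up. It uses Wireshark's sip.resend display filter field to pick out resent messages.

```shell
# sip_resend_pct is a hypothetical helper: given the total number of SIP
# messages and the number of resent ones, print the resend percentage.
sip_resend_pct() {
    awk -v total="$1" -v resent="$2" 'BEGIN {
        if (total > 0) printf "%.1f\n", 100 * resent / total
        else print "0.0"
    }'
}

# Capture during a busy period (eth0, port 5060 and busy.pcap are assumptions):
#   tcpdump -i eth0 -w busy.pcap udp port 5060
# Count all SIP messages and the ones marked as resent:
#   total=$(tshark -r busy.pcap -Y 'sip' | wc -l)
#   resent=$(tshark -r busy.pcap -Y 'sip.resend == 1' | wc -l)
#   sip_resend_pct "$total" "$resent"

# Example with made-up counts: 180 resends out of 12000 messages.
sip_resend_pct 12000 180
```

A figure that climbs as handsets are added is the thing to watch, rather than any particular absolute value.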

You can add more SIP Profiles to share the load if required. The screenshot with top above is on a box running about 800 endpoints on a single SIP Profile.

My personal preference is to split my customers over a number of boxes. This way a performance issue with one box will not affect all of my customers, and if a box goes down it doesn't take out all the customers. I currently run about 800 endpoints per VM and the VMs are running nowhere near their full capacity.

Some of the measures I also look at, apart from statistics in Wireshark are:

cat /proc/net/udp
mpstat -P ALL 1
netstat -lunp
ls /proc/$(pgrep freeswitch)/task/

I also periodically look at netstat -su and check the RcvbufErrors counter.
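If netstat is not installed, the same counter can be read from /proc/net/snmp on Linux. A minimal sketch, with udp_rcvbuf_errors as a made-up helper name. It looks the column up by name from the header line rather than assuming a fixed position.

```shell
# udp_rcvbuf_errors is a hypothetical helper: it reads /proc/net/snmp format
# on stdin and prints the UDP RcvbufErrors counter.
udp_rcvbuf_errors() {
    awk '$1 == "Udp:" {
            if (!seen) {
                # First "Udp:" line is the header: remember each column name.
                for (i = 2; i <= NF; i++) col[$i] = i
                seen = 1
            } else if ("RcvbufErrors" in col) {
                # Second "Udp:" line holds the values.
                print $(col["RcvbufErrors"])
            }
         }'
}

# Print the current value when running on a Linux host.
if [ -r /proc/net/snmp ]; then
    udp_rcvbuf_errors < /proc/net/snmp
fi
```

The counter is cumulative since boot, so it is the change between two readings taken during a busy period that matters.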

I wrote up a cheat sheet somewhere, I'm sure I posted it on here but I can't find it right now. I'll add it to this thread if I find it.

EDIT: I think it is expected that you would have a SIP proxy in front of FreeSWITCH once you hit a certain volume of traffic. OpenSIPS, for example, can take some of the registration and BLF load away from mod_sofia.

Incubugs · Nov 4, 2021

So that's what I was thinking, 800 to 1000 handsets should be OK on a node this size. BLF shouldn't really be an issue, as each tenant obviously only sees their own BLF. I guess I will just need to monitor the servers and see what happens. If I add extra SIP profiles, how would that be configured? Would each tenant need assigning to a particular SIP profile? These VMs have a public IP, not internal, so no NAT.

Adrian Fretwell · Nov 4, 2021

BLF is a real issue, I have one customer with over 200 endpoints all with groups of BLF. The BLF NOTIFY messages account for most of the SIP traffic for that tenant, way more than calls! Also when each phone registers, sofia has to send out loads of subscription notifications, all adding to system load.

Adding SIP profiles is relatively easy, they just need a unique IP/Port configuration and you need to let your firewall and fail2ban know about the IP/Port configuration. I never split a domain across profiles but there is no reason why you shouldn't. If you do, I think you need to enable "Dial String" in the default settings.

bcmike · Nov 4, 2021

Just a tip, a really great tool for monitoring and collecting machine data over time is LibreNMS. It will collect every statistic known to man and graph it for you, and it will also allow you to set up alerts tied to any statistic.

Adrian Fretwell · Nov 4, 2021

@bcmike I had not heard of LibreNMS, I will take a look. Thank you for the link.
