URGENT 100% CPU

Status
Not open for further replies.

bcmike

Active Member
Jun 7, 2018
326
54
28
53
Hi,

I have two FusionPBX machines pegged at 100% CPU and all services are down. Where should I start?
 

hfoster

Active Member
Jan 28, 2019
677
80
28
34
A quick tail of /var/log/freeswitch.log usually helps too. I've seen less experienced people build some incredible loops I didn't think were possible.
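
One way to spot that kind of loop is to count which messages repeat most often near the end of the log. The following is only a rough sketch: the timestamp pattern and the UUID masking are assumptions about the log format, and `top_repeats` and `normalize` are hypothetical helper names, not part of FreeSWITCH or FusionPBX.

```python
import re
from collections import Counter, deque

# Assumed format: lines start with "YYYY-MM-DD HH:MM:SS.ffffff".
# Adjust this pattern if your log lines look different.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d+)?\s*")
# Call UUIDs differ on every call and would hide repeats, so mask them.
UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def normalize(line):
    """Drop the leading timestamp and mask UUIDs so repeated messages collapse."""
    line = TIMESTAMP.sub("", line.strip())
    return UUID.sub("<uuid>", line)


def top_repeats(lines, last_n=5000, top=5):
    """Return the most common normalized messages among the last `last_n` lines."""
    tail = deque(lines, maxlen=last_n)  # keeps only the final last_n lines
    counts = Counter(normalize(l) for l in tail if l.strip())
    return counts.most_common(top)
```

To try it, pass an open file, for example `top_repeats(open("/var/log/freeswitch.log", errors="replace"))`. A message that shows up thousands of times in the last few thousand lines is a reasonable place to start looking for a dialplan loop.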
 

bcmike

Hi everyone,

We went through a lot of troubleshooting steps (logs, network, I/O latency, etc.) and decided to migrate the machines to a different server within the Proxmox cluster. This seems to have calmed down the CPU issue, but it's still rather mysterious why CPU usage went so high in the first place. Most of our customers had degraded service yesterday, as we only migrated after hours. Of course, we have to wait and see what happens under load today.

The whole server was running high CPU, and even after we migrated all the VMs off, it was still sitting at 25% and bursting to 95%. However, this morning the original server is running at 2% CPU. We were leaning towards a hardware fault, but I guess the proof will be today under load.
 

bcmike

For anyone following this thread who is running on Proxmox: our leading theory is now that disk I/O starved the VMs of CPU. We're not sure yet, but it's explained in this thread from the Proxmox forums, and it looks like it can be mitigated but not fixed. We definitely observed this when restoring backups to the new host.

Thread: https://forum.proxmox.com/threads/kvm-guests-freeze-hung-tasks-during-backup-restore-migrate.34362/

We still don't understand what disk I/O event occurred, but I'm guessing stuck replication or something like that, as a reboot of the stack did not help.
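
For anyone who wants to check whether a host is waiting on disk, a rough sketch like the one below reads the aggregate `cpu` line from `/proc/stat` twice and reports the share of time spent in iowait. It assumes a Linux host (as Proxmox is); the function names are made up for illustration, and a high iowait figure only hints at I/O pressure rather than proving this particular bug.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so only the
    # first eight fields go into the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return iowait percent."""
    def read():
        with open(path) as fh:
            return parse_cpu_line(fh.readline())
    first = read()
    time.sleep(interval)
    return iowait_percent(first, read())
```

Calling `sample_iowait(5)` on the Proxmox host during a backup or restore and comparing it with a quiet period should give a feel for whether disk waits line up with the CPU spikes.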
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
3,044
565
113
If it is the same thing that hit me, you are on a relatively recent version of Proxmox. You need to upgrade, as it will hit you again. It's the version of qemu, apparently, but it has been fixed. After I upgraded, it had no further problems. The problem started for me during a backup and killed the whole server.
 

bcmike

We moved the VMs to different servers and they stabilized. Funny thing is, even after we migrated all the VMs, the host was still stuck at 40% CPU, but after about four hours it too stabilized and is now at 2% utilization.
Thanks for the insight! When you say backup, do you mean a FusionPBX backup or a Proxmox VM backup? If it's a Fusion backup, that would track with what we saw; however, our version of Proxmox was a little older: 5.4-3. I moved most of the machines to 6.4-6, but one remains on another machine that's still on the old version. Do you happen to remember what your versions were?
 

bcmike

Sorry for all the questions, but it seems to track with what happened to us. The usual backup was scheduled in cron daily for 6:45 AM, and everything went sideways around 7:30 AM. I've stopped all backups, replication, etc. until I know for sure what happened. When did it happen to you?

I'm hoping this qemu bug doesn't show up again because it could be a deal breaker.
 