CPU utilization with kvm / vhost, differences 3.14 / 4.4 / 4.6

From: Patrick Schaaf
Date: Wed Jul 27 2016 - 12:13:44 EST


Hi,

I'm stumped by a weird development in measured CPU utilization when testing an
upgrade path from 3.14.70 to 4.4.14.

I'm running a HA (active/standby) pair of firewall/loadbalancer VMs on
identical hardware (two 4-core Xeon E5420 CPUs per host). The OS on the host
and in the VM is identical - openSUSE 13.1 userland, qemu 1.6.2 with KVM, and
kernels self-built from vanilla sources. Inside the VM I make pretty heavy use
of ipset, iptables, and ipvs. The traffic level is around 100 Mbit/s, mostly
ordinary web traffic, translating to around 10 kpps (i.e. roughly
100e6 / 8 / 10000 ~= 1250 bytes per packet on average).
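
The guest NICs are plain tap + virtio-net with in-kernel vhost. For reference,
the relevant part of such a qemu invocation looks roughly like this - a
sketch with illustrative names (tap0, the image path, sizes), not a paste from
the production setup:

    qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
        -drive file=/path/to/vm.img,if=virtio \
        -netdev tap,id=net0,ifname=tap0,script=no,downscript=no,vhost=on \
        -device virtio-net-pci,netdev=net0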

For the last X months I have been running this on 3.14.x kernels, currently
3.14.70. As that is nearing its end of support, I'm aiming for an upgrade to
4.4.x, testing with 4.4.14.

For testing, I keep the kernel _within_ the VM fixed at 3.14.70, and upgrade
only the host kernel of one of the two machines - first to 4.4.14, and then,
due to the weirdness I'll describe next, to 4.6.4.

What I see, totally unexpectedly, is a severe variation in the system and IRQ
time measured on the host, and to a lesser degree inside the VM.

With 3.14.70, the host shows 0.6 cores of system time and 0.4 cores of IRQ time.

With 4.4.14, the same host shows 2.3 cores of system time and 0.4 cores of IRQ time.

With 4.6.4, it is back at 0.6 cores system and 0.4 cores IRQ, while the guest
(accounted as user time on the host) drops from the roughly 1 core seen on the
previous two kernels to about 0.6 cores (which I certainly wouldn't complain
about).

But my desired target kernel, 4.4.14, clearly uses about 1.5 cores more on the
same load... (all other indicators and measurements I have show that the load
served was essentially constant across all tested configurations).
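
For anyone wanting to compare: figures like these can be read straight off the
usual counters, e.g. with mpstat from sysstat, or from /proc/stat deltas -
"0.6 cores system" means %sys summed over all CPUs comes to roughly 60. Note
that the guest's vhost-<pid> kernel threads are accounted as host system time:

    # per-CPU breakdown of %sys, %irq and %soft, 5-second intervals
    mpstat -P ALL 5

    # or the raw counters: USER_HZ ticks per mode since boot
    grep '^cpu ' /proc/stat

    # CPU use of the guest's vhost kernel threads
    ps -eLo pcpu,pid,comm | grep vhost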

Some details on the networking setup (invariant across the tested kernels; a
rough configuration sketch follows this list):
* the host bonds 4 NICs, half on on-board BNX2 BCM5708 hardware, the other
half on a PCIe Intel 82571EB card. The bond mode is LACP.
* the LACP bond is then a member of an ordinary software bridge, which also
has the tap interface to the VM attached. VLAN filtering is active on the
bridge.
* two bridge VLANs are separately broken out as members of a second-layer
bridge with an extra tap interface to my VM. Don't ask why :) but one of these
carries about half of the traffic.
* within the VM, I have another bridge with the VLANs on top and macvlan
sprinkled in (a keepalived VRRP setup on several legs).
* host/VM networking is virtio, of course.
* I had to disable TSO / GSO / UFO on the tap interfaces to my VM (already
some time ago, identically in all tests described here) to alleviate severe
performance regressions. A different story, mentioned just for completeness.
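
To make the host-side wiring concrete, here is roughly how such a setup can be
built with iproute2 - a from-memory, simplified sketch with placeholder names
(eth0..eth3, bond0, br0, tap0) and an arbitrary VLAN ID, not a paste of the
actual scripts; the second-layer bridge is omitted:

    # LACP bond over the four physical NICs
    ip link add bond0 type bond mode 802.3ad
    for nic in eth0 eth1 eth2 eth3; do
        ip link set $nic down
        ip link set $nic master bond0
    done
    ip link set bond0 up

    # software bridge with VLAN filtering; bond and the VM's tap as ports
    ip link add br0 type bridge
    echo 1 > /sys/class/net/br0/bridge/vlan_filtering
    ip link set bond0 master br0
    ip tuntap add dev tap0 mode tap
    ip link set tap0 master br0
    bridge vlan add dev tap0 vid 100

    # offloads disabled on the tap towards the VM (see above)
    ethtool -K tap0 tso off gso off ufo off
    ip link set tap0 up
    ip link set br0 up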

Regarding the host hardware: I actually have a third system, software-
identical but with some more cores, and with BNX2 BCM5719 NICs only. The
4.4.14-needs-lots-more-system-time symptom was practically the same there.

To end this tale, let me note that I have NO operational problems with the
4.4.14 test kernel, as far as that can be known within a few hours of testing.
All production metrics (and I have lots of them) are fine - except for that
system time usage on the host...

Anybody got a clue what may be happening?

I'm a bit reluctant to jump to 4.6.x or newer kernels, as I somehow like the
concept of long-term stable kernels... :)


best regards
Patrick