Re: Temporary KVM guest hangs connected to KSM and NUMA balancer

From: Sean Christopherson
Date: Thu Jan 11 2024 - 11:00:19 EST


On Thu, Jan 04, 2024, Friedrich Weber wrote:
> Hi,
>
> some of our (Proxmox VE) users have been reporting [1] that guests
> occasionally become unresponsive with high CPU usage for some time
> (varying between ~1 and more than 60 seconds). After that time, the
> guests come back and continue running fine. Windows guests seem most
> affected (not responding to pings during the hang, RDP sessions time
> out). But we also got reports about Linux guests. The issue was not present
> while we shipped a 5.15-based host kernel and was first reported after we
> rolled out a kernel based on 6.2. The reports seem to concern NUMA
> hosts only. Users reported that the issue becomes easier to trigger the
> more memory is assigned to the guests. Setting mitigations=off was
> reported to alleviate (but not eliminate) the issue. The issue seems to
> disappear after disabling KSM.
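>
> (For reference, a sketch of how KSM can be disabled via the kernel's
> KSM sysfs interface; writing 2 to the run file stops ksmd and unmerges
> all currently shared pages:
>
>   echo 2 > /sys/kernel/mm/ksm/run
>   cat /sys/kernel/mm/ksm/pages_shared   # should drop back to 0
> )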
>
> We can reproduce the issue with a Windows guest on a NUMA host, though
> only occasionally and not very reliably. Using a bpftrace script like
> [7], we found that the hangs correlate with long-running invocations of
> `task_numa_work` (more than 500ms), suggesting a connection to the NUMA
> balancer. Indeed, we can't reproduce the issue after disabling the NUMA
> balancer with `echo 0 > /proc/sys/kernel/numa_balancing` [2], and a user
> confirmed that this fixes the issue for them [3].
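>
> (A minimal sketch along the lines of [7]; the 500ms threshold and the
> output format here are illustrative:
>
>   bpftrace -e '
>   kprobe:task_numa_work { @start[tid] = nsecs; }
>   kretprobe:task_numa_work /@start[tid]/ {
>       $dur = nsecs - @start[tid];
>       if ($dur > 500000000) {
>           printf("task_numa_work ran %d ms (comm %s)\n",
>               $dur / 1000000, comm);
>       }
>       delete(@start[tid]);
>   }'
> )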
>
> Since the Windows reproducer is not very stable, we tried to find a
> Linux guest reproducer and have found one (described below [0]) that
> triggers a very similar (hopefully the same) issue. The reproducer also
> triggers the hangs if the host runs current Linux 6.7-rc8
> (610a9b8f). A kernel bisect points to the following as the commit
> introducing the issue:
>
> f47e5bbb ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and
> mmu_notifier unmap")
>
> which is why I cc'ed Sean and Paolo. Because of the possible KSM
> connection I cc'ed Andrew and linux-mm.
>
> Indeed, on f47e5bbb~1 = a80ced6e ("KVM: SVM: fix panic on out-of-bounds
> guest IRQ") the reproducer does not trigger the hang, while on f47e5bbb
> it does.
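>
> (Sketch of the final verification, assuming a mainline checkout, with a
> kernel build and a reproducer run after each step:
>
>   git checkout f47e5bbb~1   # = a80ced6e: reproducer does not hang
>   git checkout f47e5bbb     # reproducer hangs
> )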
>
> Currently I don't know enough about the KVM/KSM/NUMA balancer code to
> tell how the patch might trigger these issues. Any ideas on whom we
> could ask about this, or how we could debug this further, would be
> greatly appreciated!

This is a known issue. It's mostly a KVM bug[1][2] (fix posted[3]), but I suspect
that a bug in the dynamic preemption model logic[4] is also contributing to the
behavior by causing KVM to yield on preempt models where it really shouldn't.

[1] https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@xxxxxxxxxxxxxxxxxxxxxxxxx
[2] https://lore.kernel.org/all/bug-218259-28872@xxxxxxxxxxxxxxxxxxxxxxxxx%2F
[3] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@xxxxxxxxxx
[4] https://lore.kernel.org/all/20240110214723.695930-1-seanjc@xxxxxxxxxx
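
For reference, with CONFIG_PREEMPT_DYNAMIC and a mounted debugfs, the active
preemption model can be inspected (and switched) at runtime; the current
model is shown in parentheses:

  cat /sys/kernel/debug/sched/preempt
  # e.g.: none voluntary (full)
  echo voluntary > /sys/kernel/debug/sched/preempt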