Re: Temporary KVM guest hangs connected to KSM and NUMA balancer

From: Friedrich Weber
Date: Tue Jan 16 2024 - 10:37:38 EST


Hi Sean,

On 11/01/2024 17:00, Sean Christopherson wrote:
> This is a known issue. It's mostly a KVM bug[...] (fix posted[...]), but I suspect
> that a bug in the dynamic preemption model logic[...] is also contributing to the
> behavior by causing KVM to yield on preempt models where it really shouldn't.

I have now tried the following variants, each applied on top of 6.7 (0dd3ee31):

* [1], the initial patch series mentioned in the bug report ("[PATCH 0/2]
KVM: Pre-check mmu_notifier retry on x86")
* [2], its v2 that you linked above ("[PATCH v2] KVM: x86/mmu: Retry
fault before acquiring mmu_lock if mapping is changing")
* [3], the scheduler patch you linked above ("[PATCH] sched/core: Drop
spinlocks on contention iff kernel is preemptible")
* both [2] & [3]

My kernel is PREEMPT_DYNAMIC and, according to
/sys/kernel/debug/sched/preempt, defaults to preempt=voluntary. For case
[3], I additionally tried manually switching to preempt=full.
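
For reference, a minimal sketch of how the active preemption model can be
read and switched through that debugfs file. It assumes the file lists all
models on one line with the active one in parentheses (for example
"none (voluntary) full"), and that debugfs is mounted and the script runs as
root. The helper names parse_preempt and set_preempt are hypothetical, made
up for this illustration.

```python
import re
from pathlib import Path

# Path as mentioned above; needs debugfs mounted and root privileges.
PREEMPT_PATH = Path("/sys/kernel/debug/sched/preempt")


def parse_preempt(text):
    """Return (current_mode, all_modes) for text like 'none (voluntary) full'.

    Assumption: the active model is the one wrapped in parentheses.
    """
    current = None
    modes = []
    for token in text.split():
        match = re.fullmatch(r"\((\w+)\)", token)
        if match:
            current = match.group(1)
            modes.append(current)
        else:
            modes.append(token)
    return current, modes


def set_preempt(mode, path=PREEMPT_PATH):
    """Switch the preemption model, refusing names the kernel does not list."""
    _, modes = parse_preempt(path.read_text())
    if mode not in modes:
        raise ValueError(f"unsupported preempt mode {mode!r}, have {modes}")
    path.write_text(mode)
```

Calling parse_preempt(PREEMPT_PATH.read_text()) shows the current model, and
set_preempt("full") corresponds to the manual switch described above. The
parsing is kept separate from the file access so it can be checked without
touching a live system.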

Provided I did not mess up, I get the following results for the
reproducer I posted:

* [1] (the initial patch series): no hangs
* [2] (its v2): hangs
* [3] (the scheduler patch) with preempt=voluntary: no hangs
* [3] (the scheduler patch) with preempt=full: hangs
* [2] & [3]: no hangs

So it seems like:

* [1] (the initial patch series) fixes the hangs, which is consistent
with the feedback in the bug report [4].
* But weirdly, its v2 [2] does not fix the hangs.
* As long as I stay with preempt=voluntary, [3] (the scheduler patch)
alone is already enough to fix the hangs in my case -- this I did not
expect :)

Does this make sense to you? Happy to double-check or run more tests if
anything seems off.

Best wishes,

Friedrich

[1] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@xxxxxxxxxx/
[2] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@xxxxxxxxxx/
[3] https://lore.kernel.org/all/20240110214723.695930-1-seanjc@xxxxxxxxxx/
[4] https://bugzilla.kernel.org/show_bug.cgi?id=218259#c6