Re: Temporary KVM guest hangs connected to KSM and NUMA balancer
From: Friedrich Weber
Date: Tue Jan 16 2024 - 10:37:38 EST
Hi Sean,
On 11/01/2024 17:00, Sean Christopherson wrote:
> This is a known issue. It's mostly a KVM bug[...] (fix posted[...]), but I suspect
> that a bug in the dynamic preemption model logic[...] is also contributing to the
> behavior by causing KVM to yield on preempt models where it really shouldn't.
I tried the following variants now, each applied on top of 6.7 (0dd3ee31):
* [1], the initial patch series mentioned in the bug report ("[PATCH 0/2]
KVM: Pre-check mmu_notifier retry on x86")
* [2], its v2 that you linked above ("[PATCH v2] KVM: x86/mmu: Retry
fault before acquiring mmu_lock if mapping is changing")
* [3], the scheduler patch you linked above ("[PATCH] sched/core: Drop
spinlocks on contention iff kernel is preemptible")
* both [2] & [3]
My kernel is PREEMPT_DYNAMIC and, according to
/sys/kernel/debug/sched/preempt, defaults to preempt=voluntary. For case
[3], I additionally tried manually switching to preempt=full.
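For reference, a minimal sketch of how the active preemption mode can be
checked and switched from a script. It assumes debugfs is mounted at
/sys/kernel/debug, that the script runs as root, and that the file lists
all modes with the active one in parentheses (e.g. "none (voluntary) full").
The helper names are hypothetical and not part of any kernel tooling.

```python
import re
from pathlib import Path

# Path mentioned above; assumes debugfs is mounted at /sys/kernel/debug.
PREEMPT_PATH = Path("/sys/kernel/debug/sched/preempt")


def active_preempt_mode(text: str) -> str:
    """Return the mode marked with parentheses in the file content.

    Assumed format: all modes on one line, active one parenthesized,
    e.g. "none (voluntary) full".
    """
    match = re.search(r"\((\w+)\)", text)
    if match is None:
        raise ValueError(f"no active mode marked in {text!r}")
    return match.group(1)


def read_preempt_mode(path: Path = PREEMPT_PATH) -> str:
    """Read the debugfs file and return the active mode (needs root)."""
    return active_preempt_mode(path.read_text())


def set_preempt_mode(mode: str, path: Path = PREEMPT_PATH) -> None:
    """Switch the preemption mode by writing its name (needs root)."""
    path.write_text(mode)


if __name__ == "__main__":
    print("before:", read_preempt_mode())
    set_preempt_mode("full")
    print("after: ", read_preempt_mode())
```

Reading the mode back after the write confirms that the switch took
effect before the reproducer is started.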
Provided I did not mess up, I get the following results for the
reproducer I posted:
* [1] (the initial patch series): no hangs
* [2] (its v2): hangs
* [3] (the scheduler patch) with preempt=voluntary: no hangs
* [3] (the scheduler patch) with preempt=full: hangs
* [2] & [3]: no hangs
So it seems like:
* [1] (the initial patch series) fixes the hangs, which is consistent
with the feedback in the bug report [4].
* But weirdly, its v2 [2] on its own does not fix the hangs.
* As long as I stay with preempt=voluntary, [3] (the scheduler patch)
alone is already enough to fix the hangs in my case -- this I did not
expect :)
Does this make sense to you? Happy to double-check or run more tests if
anything seems off.
Best wishes,
Friedrich
[1] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@xxxxxxxxxx/
[2] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@xxxxxxxxxx/
[3] https://lore.kernel.org/all/20240110214723.695930-1-seanjc@xxxxxxxxxx/
[4] https://bugzilla.kernel.org/show_bug.cgi?id=218259#c6