Re: Temporary KVM guest hangs connected to KSM and NUMA balancer

From: Friedrich Weber
Date: Wed Jan 17 2024 - 08:09:45 EST


On 16/01/2024 18:20, Sean Christopherson wrote:
>> Does this make sense to you? Happy to double-check or run more tests if
>> anything seems off.
>
> Ha! It took me a few minutes to realize what went sideways with v2. KVM has an
> in-flight change that switches from host virtual addresses (HVA) to guest physical
> frame numbers (GFN) for the retry check, commit 8569992d64b8 ("KVM: Use gfn instead
> of hva for mmu_notifier_retry").
>
> That commit is in the KVM pull request for 6.8, and so v2 is based on top of a
> branch that contains said commit. But for better or worse (probably worse), the
> switch from HVA=>GFN didn't change the _names_ of mmu_invalidate_range_{start,end},
> only the type. So v2 applies and compiles cleanly on 6.7, but it's subtly broken
> because checking for a GFN match against an HVA range is all but guaranteed to get
> false negatives.

Oof, that's nifty, good catch! I'll pay more attention to the
base-commit when testing next time. :)
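
For anyone following along, here is a toy sketch of the failure mode as I
understand it. It is not the actual KVM code: the helper
`in_invalidate_range()`, the 4 KiB page size, and all addresses are made up
for illustration. The point is only that a range check compiles fine when
both sides are plain integers, even if one side is a GFN and the other an HVA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: is 'val' inside the half-open range [start, end)? */
static bool in_invalidate_range(uint64_t val, uint64_t start, uint64_t end)
{
	assert(start < end);
	return val >= start && val < end;
}

/* Made-up example values, assuming 4 KiB pages. */
#define TOY_HVA_START	0x7f0000000000ULL	/* host virtual address */
#define TOY_HVA_END	0x7f0000200000ULL	/* 2 MiB later */
#define TOY_GFN_START	0x100000ULL		/* guest frame at GPA 4 GiB */
#define TOY_GFN_END	0x100200ULL		/* 512 frames later */
#define TOY_FAULT_GFN	0x100000ULL		/* faulting guest frame */

/*
 * Mixed units: the range is tracked as HVAs, but the caller passes a GFN.
 * The GFN is numerically far below the HVA range, so the check misses an
 * invalidation that is in fact in progress (a false negative).
 */
static bool toy_check_mixed_units(void)
{
	return in_invalidate_range(TOY_FAULT_GFN, TOY_HVA_START, TOY_HVA_END);
}

/* Matching units: the range is tracked as GFNs, so the check hits. */
static bool toy_check_matching_units(void)
{
	return in_invalidate_range(TOY_FAULT_GFN, TOY_GFN_START, TOY_GFN_END);
}
```

With these values, `toy_check_mixed_units()` reports no overlap while
`toy_check_matching_units()` does, which matches the "applies and compiles
cleanly, but subtly broken" behavior described above.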

> If you can try v2 on top of `git://git.kernel.org/pub/scm/virt/kvm/kvm.git next`,
> that would be helpful to confirm that I didn't screw up something else.

Pulled that repository and can confirm:

* 1c6d984f ("x86/kvm: Do not try to disable kvmclock if it was not
enabled", current `next`): reproducer hangs
* v2 [1] ("KVM: x86/mmu: Retry fault before acquiring mmu_lock if
mapping is changing") applied on top of 1c6d984f: no hangs anymore

If I understand the discussion on [1] correctly, there might be a v3 --
if so, I'll happily test that too.

> Thanks very much for reporting back! I'm pretty sure we would have missed the
> semantic conflict when backporting the fix to 6.7 and earlier, i.e. you likely
> saved us from another round of bug reports for various stable trees.

Sure! Thanks a lot for taking a look at this!

Best wishes,

Friedrich

[1] https://lore.kernel.org/all/20240110012045.505046-1-seanjc@xxxxxxxxxx/