Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration

From: Yan Zhao
Date: Mon Aug 14 2023 - 22:22:47 EST


On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> > > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA. It's too much of a one-off,
> > > and losing the batching of invalidations makes me nervous. As Bibo points out,
> > > the behavior will vary based on the workload, VM configuration, etc.
> > >
> > > There's also a *very* subtle change, in that the notification will be sent while
> > > holding the PMD/PTE lock. Taking KVM's mmu_lock under that is *probably* ok, but
> > > I'm not exactly 100% confident on that. And the only reason there isn't a more
> > KVM's mmu_lock is an rwlock, which is a variant of spinlock,
> > so I think taking it while holding the PMD/PTE lock is ok.
> > Actually, we have already done that for the .change_pte() notification:
> >
> > kvm_mmu_notifier_change_pte() takes KVM's mmu_lock for write,
> > while the PTE lock is held when set_pte_at_notify() --> .change_pte() is sent to the handlers.
>
> .change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed
> to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM
> doesn't use a sleepable lock.
.numa_protect() in patch 4 is also sent in a context where it's not allowed to
fail, and there's no sleepable lock in KVM's handler :)
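
To be concrete, the shape I mean is roughly the below (a sketch only: the
signature is approximated from patch 4 and the unmap details are elided, so
the real handler may differ):

/*
 * Sketch: KVM's side only ever takes mmu_lock, which on x86 is an rwlock
 * (a spinning lock), so being invoked with the PMD/PTE lock held cannot
 * run into a sleepable lock.
 */
static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);

	write_lock(&kvm->mmu_lock);
	/* zap the SPTEs mapping [start, end) here, details elided */
	write_unlock(&kvm->mmu_lock);
}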


> As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail
> because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier
> events. For NUMA balancing, canceling the protection change might be ok, but for
> everything else, failure is not an option. So unfortunately, my idea won't work
> as-is.
>
> We might get away with deferring just the change_prot_numa() case, but I don't
> think that's worth the mess/complexity.
Yes, I also think deferring mmu_notifier_invalidate_range_start() won't work.
One possible approach is to send the successful .numa_protect() notifications in a batch,
but I admit it will introduce complexity.
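
Roughly what I have in mind inside change_pte_range(), purely as an untested
sketch (numa_protect_pte() and mmu_notifier_numa_protect() are placeholder
names, not the exact helpers from this series):

	/*
	 * Sketch of the batching idea (untested, helper names are placeholders):
	 * remember the span of PTEs that were actually flipped to PROT_NONE and
	 * send one .numa_protect() notification for the whole span after the
	 * PTE loop, instead of one notification per PTE.
	 */
	unsigned long batch_start = 0, batch_end = 0;

	for (; addr != end; pte++, addr += PAGE_SIZE) {
		if (!numa_protect_pte(vma, addr, pte))	/* placeholder */
			continue;

		if (!batch_start)
			batch_start = addr;
		batch_end = addr + PAGE_SIZE;
	}

	if (batch_start)
		mmu_notifier_numa_protect(vma->vm_mm, batch_start, batch_end);	/* placeholder */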

>
> I would much rather tell userspace to disable migrate-on-fault for KVM guest
> mappings (mbind() should work?) for these types of setups, or to disable NUMA
> balancing entirely. If the user really cares about performance of their VM(s),
> vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if
> the VM spans multiple nodes, a virtual NUMA topology should be given to the guest.
> At that point, NUMA balancing is likely to do more harm than good.
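
As a data point for the mbind() route, something like the below should keep
automatic NUMA balancing off the guest memory VMA (a userspace sketch;
guest_mem/guest_size are placeholders for wherever the VMM maps guest memory,
and error handling beyond perror() is omitted):

#include <numaif.h>	/* mbind(), MPOL_BIND; link with -lnuma */
#include <stdio.h>
#include <stddef.h>

/*
 * Bind the guest memory mapping (guest_mem/guest_size are placeholders) to
 * node 0. Giving the VMA an explicit memory policy this way should also keep
 * NUMA balancing's change_prot_numa() away from it.
 */
static int pin_guest_memory_to_node0(void *guest_mem, size_t guest_size)
{
	unsigned long nodemask = 1UL << 0;	/* node 0 only */

	if (mbind(guest_mem, guest_size, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, MPOL_MF_STRICT)) {
		perror("mbind");
		return -1;
	}
	return 0;
}

Disabling NUMA balancing entirely is of course just writing 0 to
/proc/sys/kernel/numa_balancing.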
>
> > > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> > > won't yield if there's mmu_lock contention.
> > Yes, KVM won't yield and reschedule on KVM mmu_lock contention, so it's fine.
> >
> > > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> > > I think we can achieve what you want without losing batching, and without changing
> > > secondary MMUs.
> > >
> > > Rather than muck with the notification types and add a one-off for NUMA, just
> > > defer the notification until a present PMD/PTE is actually going to be modified.
> > > It's not the prettiest, but other than the locking, I don't think it has any chance
> > > of regressing other workloads/configurations.
> > >
> > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > >
> > > Compile tested only.
> >
> > I don't find a matching end for each
> > mmu_notifier_invalidate_range_start_nonblock().
>
> It pairs with the existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>
> 	if (range.start)
> 		mmu_notifier_invalidate_range_end(&range);
No, it doesn't work for the mmu_notifier_invalidate_range_start() sent in change_pte_range()
if we only want the range to include pages that were successfully set to PROT_NONE.

> --
> From: Sean Christopherson <seanjc@xxxxxxxxxx>
> Date: Mon, 14 Aug 2023 08:59:12 -0700
> Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if
> mapping is changing
>
> Retry page faults without acquiring mmu_lock if the resolved hva is covered
> by an active invalidation. Contending for mmu_lock is especially
> problematic on preemptible kernels as the invalidation task will yield the
> lock (see rwlock_needbreak()), delay the in-progress invalidation, and
> ultimately increase the latency of resolving the page fault. And in the
> worst case scenario, yielding will be accompanied by a remote TLB flush,
> e.g. if the invalidation covers a large range of memory and vCPUs are
> accessing addresses that were already zapped.
>
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
>
> Reported-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@xxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 3 +++
>  include/linux/kvm_host.h | 8 +++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9e4cd8b4a202..f29718a16211 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>  
> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> +		return RET_PF_RETRY;
> +
This can effectively reduce remote flush IPIs a lot!
One nit: maybe an rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
and kvm->mmu_invalidate_range_end.
Otherwise, I'm somewhat worried about constant false positives and retries.
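i.e., something along these lines for the lockless pre-check path (sketch
only, untested):

	/*
	 * When called without mmu_lock (the new pre-check in kvm_faultin_pfn()),
	 * read the range bounds once so the compiler can't re-read them while
	 * the invalidation side is updating them, which could otherwise yield
	 * inconsistent bounds and spurious retries.
	 */
	if (unlikely(kvm->mmu_invalidate_in_progress) &&
	    hva >= READ_ONCE(kvm->mmu_invalidate_range_start) &&
	    hva < READ_ONCE(kvm->mmu_invalidate_range_end))
		return 1;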


>  	return RET_PF_CONTINUE;
>  }
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cb86108c624d..f41d4fe61a06 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  					   unsigned long mmu_seq,
>  					   unsigned long hva)
>  {
> -	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
>  	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
> @@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	    hva >= kvm->mmu_invalidate_range_start &&
>  	    hva < kvm->mmu_invalidate_range_end)
>  		return 1;
> +
> +	/*
> +	 * Note the lack of a memory barrier! The caller *must* hold mmu_lock
> +	 * to avoid false negatives and/or false positives (less concerning).
> +	 * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking
> +	 * for an in-progress invalidation to avoid contending mmu_lock.
> +	 */
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
>
> base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc
> --
>