Re: [PATCH v4 12/12] KVM: x86/mmu: convert kvm_zap_gfn_range() to use shared mmu_lock in TDP MMU

From: Sean Christopherson
Date: Fri Aug 25 2023 - 17:35:44 EST


On Fri, Jul 14, 2023, Yan Zhao wrote:
> Convert kvm_zap_gfn_range() from holding mmu_lock for write to holding
> it for read in the TDP MMU, and allow zapping of non-leaf SPTEs of level
> <= 1G. TLB flushes are executed/requested within tdp_mmu_zap_spte_atomic(),
> guarded by the RCU lock.
>
> GFN zap can be super slow if mmu_lock is held for write when there is
> contention. In the worst case, huge numbers of CPU cycles are spent
> yielding GFN by GFN, i.e. the loop of "check and flush tlb -> drop rcu
> lock -> drop mmu_lock -> cpu_relax() -> take mmu_lock -> take rcu lock"
> is entered for every GFN.
> Contention can come either from concurrent zaps holding mmu_lock for
> write or from tdp_mmu_map() holding mmu_lock for read.
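
For reference, the yield cycle described above is essentially
tdp_mmu_iter_cond_resched(); a trimmed-down sketch of its shape (not the
verbatim kernel code) looks like:

	/*
	 * Sketch of tdp_mmu_iter_cond_resched(): flush if needed, drop RCU
	 * and mmu_lock, reschedule, then retake both.  The real helper also
	 * guarantees forward progress and marks iter->yielded so the caller
	 * restarts the iterator.
	 */
	static bool tdp_mmu_iter_cond_resched_sketch(struct kvm *kvm,
						     struct tdp_iter *iter,
						     bool flush, bool shared)
	{
		if (!need_resched() && !rwlock_needbreak(&kvm->mmu_lock))
			return false;

		if (flush)
			kvm_flush_remote_tlbs(kvm);

		rcu_read_unlock();

		if (shared)
			cond_resched_rwlock_read(&kvm->mmu_lock);
		else
			cond_resched_rwlock_write(&kvm->mmu_lock);

		rcu_read_lock();

		return true;
	}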

The lock contention should go away with a pre-check[*], correct? That's a more
complete solution too, in that it also avoids lock contention for the shadow MMU,
which presumably suffers the same problem (I don't see anything that would prevent
it from yielding).
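
To be concrete about what I mean by a pre-check: something along the lines
of the below, where KVM bails before ever taking mmu_lock for write if
nothing in the range is mapped.  Illustrative sketch only, not the code
from [*]; kvm_tdp_mmu_range_has_present_spte() is a made-up helper, and
the iterator signatures are approximately those of
arch/x86/kvm/mmu/tdp_mmu.c as of this series.

	static bool kvm_tdp_mmu_range_has_present_spte(struct kvm *kvm, int as_id,
						       gfn_t start, gfn_t end)
	{
		struct kvm_mmu_page *root;
		struct tdp_iter iter;
		bool present = false;

		lockdep_assert_held_read(&kvm->mmu_lock);

		rcu_read_lock();
		for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id, true) {
			tdp_root_for_each_pte(iter, root, start, end) {
				/* Any present SPTE, leaf or not, may need zapping. */
				if (is_shadow_present_pte(iter.old_spte)) {
					present = true;
					break;
				}
			}
			/*
			 * Keep iterating roots even after a hit; the yield-safe
			 * iterator holds a reference to the current root and
			 * needs to run to completion to drop it.  A real version
			 * would also need to yield on long walks.
			 */
		}
		rcu_read_unlock();

		return present;
	}

The shadow MMU would want the same treatment, which is a big part of why I
think the pre-check is the more complete fix.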

If we do want to zap with mmu_lock held for read, I think we should convert
kvm_tdp_mmu_zap_leafs() and all its callers to run under read, because unless I'm
missing something, the rules are the same regardless of _why_ KVM is zapping, e.g.
the zap needs to be protected by mmu_invalidate_in_progress, which ensures no other
tasks will race to install SPTEs that are supposed to be zapped.
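
Very roughly, I'd expect the guts of the conversion to look like this
(sketch only, as if in tdp_mmu.c, with root iteration and the can_yield
plumbing elided).  Note that tdp_mmu_zap_spte_atomic() already exists for
exactly this kind of shared-lock zapping, and it performs the remote TLB
flush itself:

	static void tdp_mmu_zap_leafs_shared(struct kvm *kvm,
					     struct kvm_mmu_page *root,
					     gfn_t start, gfn_t end)
	{
		struct tdp_iter iter;

		lockdep_assert_held_read(&kvm->mmu_lock);

		rcu_read_lock();
		tdp_root_for_each_leaf_pte(iter, root, start, end) {
	retry:
			if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
				continue;

			/*
			 * A failed atomic zap means another task modified the
			 * SPTE first; iter.old_spte has been refreshed, so
			 * just retry.
			 */
			if (tdp_mmu_zap_spte_atomic(kvm, &iter))
				goto retry;
		}
		rcu_read_unlock();
	}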

If you post a version of this patch that converts kvm_tdp_mmu_zap_leafs(), please
post it as a standalone patch. At a glance it doesn't have any dependencies on the
MTRR changes, and I don't want this type of change buried at the end of a series
that is for a fairly niche setup. This needs a lot of scrutiny to make sure zapping
under read really is safe.

[*] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@xxxxxxxxxx

> After converting to hold mmu_lock for read, less contention will be
> detected, and retaking mmu_lock for read is also faster. There's no need
> to flush TLBs before dropping mmu_lock on contention, as SPTEs have been
> zapped atomically and TLBs are flushed, or flushes requested, immediately
> under the RCU lock.
> In order to reduce the TLB flush count, non-leaf SPTEs at or below the 1G
> level are allowed to be zapped if their ranges are fully covered by the
> GFN zap range.
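
The "fully covered" check presumably boils down to something like the
below (again illustrative only, spte_range_fully_covered() is a made-up
name):

	static bool spte_range_fully_covered(struct tdp_iter *iter,
					     gfn_t zap_start, gfn_t zap_end)
	{
		gfn_t spte_start = iter->gfn;
		gfn_t spte_end = iter->gfn + KVM_PAGES_PER_HPAGE(iter->level);

		/* Per the changelog, cap direct zaps of non-leaf SPTEs at 1G. */
		if (iter->level > PG_LEVEL_1G)
			return false;

		return zap_start <= spte_start && zap_end >= spte_end;
	}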
>
> Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> ---