Re: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

From: Sean Christopherson
Date: Thu Jan 18 2024 - 12:22:57 EST


On Fri, Jan 19, 2024, Xu Yilun wrote:
> On Tue, Jan 09, 2024 at 05:20:45PM -0800, Sean Christopherson wrote:
> > Retry page faults without acquiring mmu_lock if the resolved gfn is covered
> > by an active invalidation. Contending for mmu_lock is especially
> > problematic on preemptible kernels as the mmu_notifier invalidation task
> > will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
>
> Is it possible fault-in task avoids contending mmu_lock by using _trylock()?
> Like:
>
> while (!read_trylock(&vcpu->kvm->mmu_lock))
> cpu_relax();
>
> if (is_page_fault_stale(vcpu, fault))
> goto out_unlock;
>
> r = kvm_tdp_mmu_map(vcpu, fault);
>
> out_unlock:
> read_unlock(&vcpu->kvm->mmu_lock)

It's definitely possible, but the downsides far outweigh any potential benefits.

Doing trylock means the CPU is NOT put into the queue for acquiring the lock,
which means that forward progress isn't guaranteed. E.g. in a pathological
scenario (and by "pathological", I mean if NUMA balancing or KSM is active ;-)),
it's entirely possible for a near-endless stream of mmu_lock writers to be in
the queue, thus preventing the vCPU from acquiring mmu_lock in a timely manner.

And hacking the page fault path to bypass KVM's lock contention detection would
be a very willful, deliberate violation of the intent of the MMU's yielding logic
for preemptible kernels.

That said, I would love to revisit KVM's use of rwlock_needbreak(), at least in
the TDP MMU. As evidenced by this rash of issues, it's not at all obvious that
yielding on mmu_lock contention is *ever* a net positive for KVM, or even for the
kernel. The shadow MMU is probably a different story since it can't parallelize
page faults with other operations, e.g. yielding in kvm_zap_obsolete_pages() to
allow vCPUs to make forward progress is probably a net positive.

But AFAIK, no one has done any testing to prove that yielding on contention in
the TDP MMU is actually a good thing. I'm 99% certain the only reason the TDP
MMU yields on contention is because the shadow MMU yields on contention, i.e.
I'm confident that no one ever did performance testing to shadow that there is
any benefit whatsoever to yielding mmu_lock in the TDP MMU.

> > invalidation, and ultimately increase the latency of resolving the page
> > fault. And in the worst case scenario, yielding will be accompanied by a
> > remote TLB flush, e.g. if the invalidation covers a large range of memory
> > and vCPUs are accessing addresses that were already zapped.
>
> This case covers all usage of mmu_invalidate_retry_gfn(), is it? Should
> we also consider vmx_set_apic_access_page_addr()?

Nah, reloading the APIC access page is an very, very infrequent operation. And
post-boot (of the VM), the page will only be reloaded if it's migrated by the
primary MMU, i.e. after a relevant mmu_notifier operation. And the APIC access
page is a one-off allocation, i.e. it has its own VMA of PAGE_SIZE, so even if
vmx_set_apic_access_page_addr() does manage to trigger contention with a relevant
mmu_notifier event, e.g. races ahead of page migration completion, the odds of it
forcing KVM's mmu_notifier handler to yield mmu_lock are low (maybe impossible?)
because the invaldation event shouldn't be spanning multipl pages.