Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use

From: Sean Christopherson
Date: Thu Jul 22 2021 - 15:06:55 EST


+Ben

On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > I am more inclined to fix this by just tracking if we hold the srcu
> > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > and then kvm_request_apicv_update can use this to drop the srcu
> > > lock when needed.
> >
> > The entire approach of dynamically adding/removing the memslot seems doomed to
> > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > and again on re-enable.
> >
> > Rather than pile on more gunk, what about special casing the APIC access page
> > memslot in try_async_pf()? E.g. zap the GFN in avic_update_access_page() when
> > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> > also give us a good excuse to rename try_async_pf() :-)
> >
> > If lack of MMIO caching is a performance problem, an alternative solution would
> > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > clear their cache.
> >
> > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > less awful than the current memslot+SRCU mess.
>
> Hi!
>
> I am testing your approach and it actually works very well! I can't seem to break it.
>
> Could you explain why do I need to do something with kvm_{inc,dec}_notifier_count()) ?

Glad you asked, there's one more change needed. kvm_zap_gfn_range() currently
takes mmu_lock for read, but it needs to take mmu_lock for write for this case
(more way below).

The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy. The
whole thing is a grey area because KVM is trying to ensure it honors the guest's
UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
same transition. So in practice there's likely no observable bug, but it also
means that taking mmu_lock for read is likely pointless, because for things to
work the guest has to serialize all running vCPUs.

Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()? It would
effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
operate under the mmu read lock"); see attached patch. And we could even bump
the notifier count in that helper, e.g. on top of the attached:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b607e8763aa2..7174058e982b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)

write_lock(&kvm->mmu_lock);

+ kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
+
if (kvm_memslots_have_rmaps(kvm)) {
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
slots = __kvm_memslots(kvm, i);
@@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
if (flush)
kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);

+ kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
+
write_unlock(&kvm->mmu_lock);
}




Back to Maxim's original question...

Elevating mmu_notifier_count and bumping mmu_notifier_seq will will handle the case
where APICv is being disabled while a different vCPU is concurrently faulting in a
new mapping for the APIC page. E.g. it handles this race:

vCPU0 vCPU1
apic_access_memslot_enabled = true;
#NPF on APIC
apic_access_memslot_enabled==true, proceed with #NPF
apic_access_memslot_enabled = false
kvm_zap_gfn_range(APIC);
__direct_map(APIC)

mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate



The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
to bail and resume the guest without fixing the #NPF. After acquiring mmu_lock,
vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
to be called, or just finised) and/or a modified mmu_notifier_seq (after the
count was decremented).

This is why kvm_zap_gfn_range() needs to take mmu_lock for write. If it's allowed
to run in parallel with the page fault handler, there's no guarantee that the
correct apic_access_memslot_enabled will be observed.

if (is_tdp_mmu_fault)
read_lock(&vcpu->kvm->mmu_lock);
else
write_lock(&vcpu->kvm->mmu_lock);

if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
goto out_unlock;

if (is_tdp_mmu_fault)
r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
pfn, prefault);
else
r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
prefault, is_tdp);