Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress

From: Sean Christopherson
Date: Wed Jun 07 2023 - 13:54:04 EST


On Thu, Jun 08, 2023, Yu Zhang wrote:
> Thanks again! One more thing that bothers me when reading the mmu notifier,
> is about the TLB flush request. After the APIC access page is reloaded, the
> TLB will be flushed (a single-context EPT invalidation on not-so-outdated
> CPUs) in vmx_set_apic_access_page_addr(). But the mmu notifier will send the
> KVM_REQ_TLB_FLUSH as well, by kvm_mmu_notifier_invalidate_range_start() ->
> __kvm_handle_hva_range(), therefore causing the vCPU to trigger another TLB
> flush - normally a global EPT invalidation I guess.

Yes.

> But, is this necessary?

Flushing when KVM zaps SPTEs is definitely necessary. But the flush in
vmx_set_apic_access_page_addr() *should* be redundant.

> Could we try to return false in kvm_unmap_gfn_range() to indicate no more
> flush is needed, if the range to be unmapped falls within guest APIC base,
> and leaving the TLB invalidation work to vmx_set_apic_access_page_addr()?

No, because vmx_flush_tlb_current(), a.k.a. KVM_REQ_TLB_FLUSH_CURRENT, flushes
only the current root, i.e. on the current EP4TA. kvm_unmap_gfn_range() isn't
tied to a single vCPU and so needs to flush all roots. We could in theory more
precisely track which roots needs to be flushed, but in practice it's highly
unlikely to matter as there is typically only one "main" root when TDP (EPT) is
in use. In other words, KVM could avoid unnecessarily flushing entries for other
roots, but it would incur non-trivial complexity, and the probability of the
precise flushing having a measurable impact on guest performance is quite low, at
least outside of nested scenarios.

But as above, flushing in vmx_set_apic_access_page_addr() shouldn't be necessary.
If there were SPTEs, then KVM would already have zapped and flushed. If there
weren't SPTEs, then it should have been impossible for the guest to have valid
TLB entries. KVM needs to flush when VIRTUALIZE_APIC_ACCESSES is toggled on, as
the CPU could have non-vAPIC TLB entries, but that's handled by vmx_set_virtual_apic_mode().

I'll send a follow-up patch to drop the flush from vmx_set_apic_access_page_addr(),
I don't *think* I'm missing an edge case...