Re: [PATCH v4 3/4] KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()

From: Yan Zhao
Date: Tue Feb 20 2024 - 02:25:09 EST


On Mon, Feb 19, 2024 at 11:44:54AM -0800, Sean Christopherson wrote:
> +Jim
>
> On Mon, Feb 19, 2024, Yan Zhao wrote:
> > On Fri, Feb 09, 2024 at 02:28:57PM -0800, Sean Christopherson wrote:
> > > @@ -4406,6 +4379,37 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > > smp_rmb();
> > >
> > > + if (!slot)
> > > + goto faultin_pfn;
> > > +
> > > + /*
> > > + * Retry the page fault if the gfn hit a memslot that is being deleted
> > > + * or moved. This ensures any existing SPTEs for the old memslot will
> > > + * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> > > + */
> > > + if (slot->flags & KVM_MEMSLOT_INVALID)
> > > + return RET_PF_RETRY;
> > > +
> > > + if (!kvm_is_visible_memslot(slot)) {
> > > + /* Don't expose KVM's internal memslots to L2. */
> > > + if (is_guest_mode(vcpu)) {
> > > + fault->slot = NULL;
> > > + fault->pfn = KVM_PFN_NOSLOT;
> > > + fault->map_writable = false;
> > > + return RET_PF_CONTINUE;
> > Call kvm_handle_noslot_fault() to replace returning RET_PF_CONTINUE?
>
> Oof. Yes. But there is a pre-existing bug here too, though it's very theoretical
> and unlikely to ever cause problems.
>
> If KVM is using TDP, but L1 is using shadow paging for L2, then routing through
> kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO, and create an
> MMIO SPTE. Creating an MMIO SPTE is ok, but only because kvm_mmu_page_role.guest_mode
> ensure KVM uses different roots for L1 vs. L2. But mmio_gfn will remain valid,
> and could (quite theoretically) cause KVM to incorrectly treat an L1 access to
> the private TSS or identity mapped page tables as MMIO.
Why would KVM treat L1 access to the private TSS and identity mapped page
tables as MMIO even though mmio_gfn is valid?
It looks that (for Intel platform) EPT for L1 will only install normal SPTEs
(non-MMIO SPTEs) for the two private slots, so there would not have EPT
misconfiguration and would not go to emulation path incorrectly.
Am I missing something?

> Furthermore, this check doesn't actually prevent exposing KVM's internal memslots
> to L2, it simply forces KVM to emulate the access. In most cases, that will trigger
> MMIO, amusingly due to filling arch.mmio_gfn. And vcpu_is_mmio_gpa() always
> treats APIC accesses as MMIO, so those are fine. But the private TSS and identity
> mapped page tables could go either way (MMIO or access the private memslot's backing
> memory).
Yes, this is also my question when reading that part of code.
mmio_gen mismatching may lead to accessing the backing memory directly.

> We could "fix" the issue by forcing MMIO emulation for L2 access to all internal
> memslots, not just to the APIC. But I think that's actually less correct than
> letting L2 access the private TSS and indentity mapped page tables (not to mention
> that I can't imagine anyone cares what KVM does in this case). From L1's perspective,
> there is R/W memory at those memslot, the memory just happens to be initialized
> with non-zero data, and I don't see a good argument for hiding that memory from L2.
> Making the memory disappear is far more magical than the memory existing in the
> first place.
>
> The APIC access page is special because KVM _must_ emulate the access to do the
> right thing. And despite what commit 3a2936dedd20 ("kvm: mmu: Don't expose private
> memslots to L2") said, it's not just when L1 is accelerating L2's virtual APIC,
> it's just as important (likely *more* imporant* for correctness when L1 is passing
> through its own APIC to L2.
>
> Unless I'm missing someting, I think it makes sense to throw in the below before
> moving code around.
>
> Ouch, and looking at this further, patch 1 introduced a bug (technically) by caching
> fault->slot; in this case KVM will unnecessarily check mmu_notifiers. That's
> obviously a very benign bug, as a false positive just means an unnecessary retry,
> but yikes.
>
Patch 3 & 4 removed the bug immediately :)

> --
> Subject: [PATCH] KVM: x86/mmu: Don't force emulation of L2 accesses to
> non-APIC internal slots
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
> arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 488f522f09c6..4ce824cec5b9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4341,8 +4341,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> return RET_PF_RETRY;
>
> - if (!kvm_is_visible_memslot(slot)) {
> - /* Don't expose private memslots to L2. */
> + if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
> + /*
> + * Don't map L1's APIC access page into L2, KVM doesn't support
> + * using APICv/AVIC to accelerate L2 accesses to L1's APIC,
> + * i.e. the access needs to be emulated. Emulating access to
> + * L1's APIC is also correct if L1 is accelerating L2's own
> + * virtual APIC, but for some reason L1 also maps _L1's_ APIC
> + * into L2. Note, vcpu_is_mmio_gpa() always treats access to
> + * the APIC as MMIO. Allow an MMIO SPTE to be created, as KVM
> + * uses different roots for L1 vs. L2, i.e. there is no danger
> + * of breaking APICv/AVIC for L1.
> + */
> if (is_guest_mode(vcpu)) {
> fault->slot = NULL;
> fault->pfn = KVM_PFN_NOSLOT;
Checking fault->is_private before calling kvm_handle_noslot_fault()?
And do we need a centralized check of fault->is_private in kvm_mmu_do_page_fault()
before returning RET_PF_EMULATE?

> @@ -4355,8 +4365,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> * MMIO SPTE. That way the cache doesn't need to be purged
> * when the AVIC is re-enabled.
> */
> - if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> - !kvm_apicv_activated(vcpu->kvm))
> + if (!kvm_apicv_activated(vcpu->kvm))
> return RET_PF_EMULATE;
Otherwise, here also needs a checking of fault->is_private?
Maybe also for where RET_PF_EMULATE is returned when page_fault_handle_page_track()
is true (though I know it's always false for TDX).

> }
>
>
> base-commit: ec98c2c1a07fb341ba2230eab9a31065d12d9de6
> --