Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot

From: Gavin Shan
Date: Mon Jun 12 2023 - 21:07:40 EST

Next message: Huang, Ying: "Re: [PATCH] mm: compaction: skip memory hole rapidly when isolating migratable pages"
Previous message: Ming Lei: "Re: [RFC PATCH 0/4] nvme-tcp: fix hung issues for deleting"
In reply to: Sean Christopherson: "Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot"
Next in thread: Sean Christopherson: "Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Sean,

On 6/13/23 12:41 AM, Sean Christopherson wrote:

On Thu, Jun 08, 2023, Gavin Shan wrote:

We run into guest hang in edk2 firmware when KSM is kept as running
on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
pflash device (TYPE_PFLASH_CFI01) during the operation for sector
erasing or buffered write. The status is returned by reading the
memory region of the pflash device and the read request should
have been forwarded to QEMU and emulated by it. Unfortunately, the
read request is covered by an illegal stage2 mapping when the guest
hang issue occurs. The read request is completed with QEMU bypassed and
wrong status is fetched.

The illegal stage2 mapping is populated due to same page mering by
KSM at (C) even the associated memory slot has been marked as invalid
at (B).

CPU-A CPU-B
----- -----
ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
kvm_vm_ioctl_set_memory_region
kvm_set_memory_region
__kvm_set_memory_region
kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
kvm_invalidate_memslot
kvm_copy_memslot
kvm_replace_memslot
kvm_swap_active_memslots (A)
kvm_arch_flush_shadow_memslot (B)
same page merging by KSM
kvm_mmu_notifier_change_pte
kvm_handle_hva_range
__kvm_handle_hva_range (C)

Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping. Without the illegal stage2 mapping, the read
request for the pflash's status is forwarded to QEMU and emulated by
it. The correct pflash's status can be returned from QEMU to break
the infinite wait in edk2 firmware.

Cc: stable@xxxxxxxxxxxxxxx # v5.13+
Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")

This Fixes isn't correct. That change only affected x86, which doesn't have this
bug. And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
notifier callbacks"), arm64 did NOT skip invalid slots

slots = kvm_memslots(kvm);

/* we only care about the pages that the guest sees */
kvm_for_each_memslot(memslot, slots) {
unsigned long hva_start, hva_end;
gfn_t gpa;

hva_start = max(start, memslot->userspace_addr);
hva_end = min(end, memslot->userspace_addr +
(memslot->npages << PAGE_SHIFT));
if (hva_start >= hva_end)
continue;

gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
}

#define kvm_for_each_memslot(memslot, slots) \
for (memslot = &slots->memslots[0]; \
memslot < slots->memslots + slots->used_slots; memslot++) \
if (WARN_ON_ONCE(!memslot->npages)) { \
} else

Unless I'm missing something, this goes all the way back to initial arm64 support
added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").

The fixes tag was sorted out based on 'git-bisect', not static code analysis. I
agree it should be d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") from
the code. From the 'git-bisect', we found the issue disappears when the head is
commit 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code").
And yes, the fixes tag should be cd4c71835228 ("KVM: arm64: Convert to the gfn-based
MMU notifier callbacks").

I'm declined to fix the issue only for ARM64, more details are given below. If we're
going to limit the issue to ARM64 and fix it for ARM64 only, the fixes tag should be
the one as you pointed. Lets correct it in next revision with:

Cc: stable@xxxxxxxxxxxxxxx # v3.9+
Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")

Reported-by: Shuai Hu <hshuai@xxxxxxxxxx>
Reported-by: Zhenyu Zhang <zhenyzha@xxxxxxxxxx>
Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
---
virt/kvm/kvm_main.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..7f81a3a209b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
unsigned long hva_start, hva_end;
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+ if (slot->flags & KVM_MEMSLOT_INVALID)
+ continue;

Skipping the memslot will lead to use-after-free. E.g. if an invalidate_range_start()
comes along between installing the invalid slot and zapping SPTEs, KVM will
return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
references to the range.

I.e.

kvm_copy_memslot(invalid_slot, old);
invalid_slot->flags |= KVM_MEMSLOT_INVALID;
kvm_replace_memslot(kvm, old, invalid_slot);

/*
* Activate the slot that is now marked INVALID, but don't propagate
* the slot to the now inactive slots. The slot is either going to be
* deleted or recreated as a new slot.
*/
kvm_swap_active_memslots(kvm, old->as_id);

==> invalidate_range_start()

/*
* From this point no new shadow pages pointing to a deleted, or moved,
* memslot will be created. Validation of sp->gfn happens in:
* - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
* - kvm_is_visible_gfn (mmu_check_root)
*/
kvm_arch_flush_shadow_memslot(kvm, old);

And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
unmap the prior SPTE. Of course, that can't happen in the current code base
because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
count is elevated in .change_pte()" for details). That's also why this doesn't
show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
SPTE is present, which is never true due to the invalidation.

Right, those architectural dependencies are really something I worried about.
It's safe to skip the invalid memory slots for ARM64, but it may be unsafe to
do so for other architectures. You've listed the potential risks to do so for
x86. It might be risky with PowerPC's reverse mapping stuff either. I didn't
look into the code for the left architectures. In order to escape from the
architectural conflicts, I would move the check and skip the invalid memory
slot in arch/arm64/kvm/mmu.c::kvm_set_spte_gfn(), something like below. In
this way, the guest hang issue in ARM64 guest is fixed. We may have similar
issue on other architectures, but we can figure out the fix when we have to.
Sean, please let me know if you're happy with this?

arch/arm64/kvm/mmu.c::kvm_set_spte_gfn()
----------------------------------------

bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_pfn_t pfn = pte_pfn(range->pte);

if (!kvm->arch.mmu.pgt)
return false;

/*
* The memory slot can become invalid temporarily or permanently
* when it's being moved or deleted. Avoid the stage2 mapping so
* that all the read and write requests to the region of the memory
* slot can be forwarded to VMM and emulated there.
*/
if (range->slot->flags & KVM_MEMSLOT_INVALID)
return false;

WARN_ON(range->end - range->start != 1);

:
}

I'd honestly love to just delete the change_pte() callback, but my opinion is more
than a bit biased since we don't use KSM. Assuming we keep change_pte(), the best
option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
memslot, but with a sanity check and comment explaining the dependency on there
being no SPTEs due to the invalidation. E.g.

It's good idea, but I'm afraid other architectures like PowerPC will still be
affected. So I would like to limit this issue to ARM64 and fix it in its
kvm_set_spte_gfn() variant, as above. One question about "we don't use KSM":
could you please share more information about this? I'm blindly guessing you're
saying KSM isn't used in google cloud?

[...]

Thanks,
Gavin

Next message: Huang, Ying: "Re: [PATCH] mm: compaction: skip memory hole rapidly when isolating migratable pages"
Previous message: Ming Lei: "Re: [RFC PATCH 0/4] nvme-tcp: fix hung issues for deleting"
In reply to: Sean Christopherson: "Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot"
Next in thread: Sean Christopherson: "Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]