Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot

From: Sean Christopherson
Date: Mon Jun 12 2023 - 10:41:26 EST


On Thu, Jun 08, 2023, Gavin Shan wrote:
> We run into guest hang in edk2 firmware when KSM is kept as running
> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
> erasing or buffered write. The status is returned by reading the
> memory region of the pflash device and the read request should
> have been forwarded to QEMU and emulated by it. Unfortunately, the
> read request is covered by an illegal stage2 mapping when the guest
> hang issue occurs. The read request is completed with QEMU bypassed and
> wrong status is fetched.
>
> The illegal stage2 mapping is populated due to same page mering by
> KSM at (C) even the associated memory slot has been marked as invalid
> at (B).
>
> CPU-A CPU-B
> ----- -----
> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
> kvm_vm_ioctl_set_memory_region
> kvm_set_memory_region
> __kvm_set_memory_region
> kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
> kvm_invalidate_memslot
> kvm_copy_memslot
> kvm_replace_memslot
> kvm_swap_active_memslots (A)
> kvm_arch_flush_shadow_memslot (B)
> same page merging by KSM
> kvm_mmu_notifier_change_pte
> kvm_handle_hva_range
> __kvm_handle_hva_range (C)
>
> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> illegal stage2 mapping. Without the illegal stage2 mapping, the read
> request for the pflash's status is forwarded to QEMU and emulated by
> it. The correct pflash's status can be returned from QEMU to break
> the infinite wait in edk2 firmware.
>
> Cc: stable@xxxxxxxxxxxxxxx # v5.13+
> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")

This Fixes isn't correct. That change only affected x86, which doesn't have this
bug. And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
notifier callbacks"), arm64 did NOT skip invalid slots

slots = kvm_memslots(kvm);

/* we only care about the pages that the guest sees */
kvm_for_each_memslot(memslot, slots) {
unsigned long hva_start, hva_end;
gfn_t gpa;

hva_start = max(start, memslot->userspace_addr);
hva_end = min(end, memslot->userspace_addr +
(memslot->npages << PAGE_SHIFT));
if (hva_start >= hva_end)
continue;

gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
}

#define kvm_for_each_memslot(memslot, slots) \
for (memslot = &slots->memslots[0]; \
memslot < slots->memslots + slots->used_slots; memslot++) \
if (WARN_ON_ONCE(!memslot->npages)) { \
} else

Unless I'm missing something, this goes all the way back to initial arm64 support
added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").

> Reported-by: Shuai Hu <hshuai@xxxxxxxxxx>
> Reported-by: Zhenyu Zhang <zhenyzha@xxxxxxxxxx>
> Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
> ---
> virt/kvm/kvm_main.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 479802a892d4..7f81a3a209b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> unsigned long hva_start, hva_end;
>
> slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> + if (slot->flags & KVM_MEMSLOT_INVALID)
> + continue;

Skipping the memslot will lead to use-after-free. E.g. if an invalidate_range_start()
comes along between installing the invalid slot and zapping SPTEs, KVM will
return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
references to the range.

I.e.

kvm_copy_memslot(invalid_slot, old);
invalid_slot->flags |= KVM_MEMSLOT_INVALID;
kvm_replace_memslot(kvm, old, invalid_slot);

/*
* Activate the slot that is now marked INVALID, but don't propagate
* the slot to the now inactive slots. The slot is either going to be
* deleted or recreated as a new slot.
*/
kvm_swap_active_memslots(kvm, old->as_id);


==> invalidate_range_start()

/*
* From this point no new shadow pages pointing to a deleted, or moved,
* memslot will be created. Validation of sp->gfn happens in:
* - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
* - kvm_is_visible_gfn (mmu_check_root)
*/
kvm_arch_flush_shadow_memslot(kvm, old);

And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
unmap the prior SPTE. Of course, that can't happen in the current code base
because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
count is elevated in .change_pte()" for details). That's also why this doesn't
show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
SPTE is present, which is never true due to the invalidation.

I'd honestly love to just delete the change_pte() callback, but my opinion is more
than a bit biased since we don't use KSM. Assuming we keep change_pte(), the best
option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
memslot, but with a sanity check and comment explaining the dependency on there
being no SPTEs due to the invalidation. E.g.

---
virt/kvm/kvm_main.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..b4987b49fac3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -686,6 +686,24 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn

return __kvm_handle_hva_range(kvm, &range);
}
+
+static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ /*
+ * Skipping invalid memslots is correct if and only change_pte() is
+ * surrounded by invalidate_range_{start,end}(), which is currently
+ * guaranteed by the primary MMU. If that ever changes, KVM needs to
+ * unmap the memslot instead of skipping the memslot to ensure that KVM
+ * doesn't hold references to the old PFN.
+ */
+ WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+
+ if (range->slot->flags & KVM_MEMSLOT_INVALID)
+ return false;
+
+ return kvm_set_spte_gfn(kvm, range);
+}
+
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
@@ -707,7 +725,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
return;

- kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
+ kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
}

void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,

base-commit: 94cdeebd82111d7b7da5bd4da053eed9e0f65d72
--