Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot

From: Gavin Shan
Date: Thu Jun 08 2023 - 19:19:28 EST


Hi Marc,

[Cc Andrea/David/Peter Xu]

On 6/9/23 12:31 AM, Marc Zyngier wrote:
On Thu, 08 Jun 2023 10:03:48 +0100,
Gavin Shan <gshan@xxxxxxxxxx> wrote:

We run into a guest hang in the edk2 firmware when KSM is running on
the host. The edk2 firmware waits for status 0x80 from QEMU's pflash
device (TYPE_PFLASH_CFI01) during a sector-erase or buffered-write
operation. The status is returned by reading the memory region of the
pflash device, and the read request should be forwarded to QEMU and
emulated by it. Unfortunately, the read request is covered by an
illegal stage2 mapping when the guest hang occurs. The read request is
completed with QEMU bypassed and a wrong status is fetched.
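To make the failure mode concrete, below is a minimal, compile-only
sketch of the firmware-side wait. The MMIO base address and the helper
names are invented for the example and aren't taken from the edk2
sources:

#include <stdint.h>

#define PFLASH_MMIO_BASE     0x04000000UL  /* hypothetical flash base */
#define PFLASH_STATUS_READY  0x80          /* "device ready" status   */

/*
 * With a correct stage2 setup, this read traps to the host and is
 * emulated by QEMU's TYPE_PFLASH_CFI01 device, which returns 0x80 once
 * the sector erase or buffered write has completed. With the illegal
 * stage2 mapping in place, the read is satisfied from stale guest
 * memory and QEMU is bypassed entirely.
 */
static inline uint8_t pflash_read_status(void)
{
	return *(volatile uint8_t *)PFLASH_MMIO_BASE;
}

static void pflash_wait_ready(void)
{
	/* Spins forever if the read never reaches QEMU. */
	while (!(pflash_read_status() & PFLASH_STATUS_READY))
		;
}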

The illegal stage2 mapping is populated due to same-page merging by
KSM at (C), even though the associated memory slot has been marked as
invalid at (B).

CPU-A                                              CPU-B
-----                                              -----
ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
kvm_vm_ioctl_set_memory_region
  kvm_set_memory_region
    __kvm_set_memory_region
      kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
        kvm_invalidate_memslot
          kvm_copy_memslot
          kvm_replace_memslot
          kvm_swap_active_memslots        (A)
          kvm_arch_flush_shadow_memslot   (B)
                                                   same page merging by KSM
                                                   kvm_mmu_notifier_change_pte
                                                     kvm_handle_hva_range
                                                       __kvm_handle_hva_range  (C)

Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping. Without the illegal stage2 mapping, the read
request for the pflash status is forwarded to QEMU and emulated by it,
so the correct status can be returned from QEMU to break the infinite
wait in the edk2 firmware.

Huh, nice one :-(.


Yeah, it's a sneaky one :)


Cc: stable@xxxxxxxxxxxxxxx # v5.13+
Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
Reported-by: Shuai Hu <hshuai@xxxxxxxxxx>
Reported-by: Zhenyu Zhang <zhenyzha@xxxxxxxxxx>
Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
---
virt/kvm/kvm_main.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..7f81a3a209b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+			if (slot->flags & KVM_MEMSLOT_INVALID)
+				continue;
+
 			hva_start = max(range->start, slot->userspace_addr);
 			hva_end = min(range->end, slot->userspace_addr +
 						  (slot->npages << PAGE_SHIFT));
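Condensed into a stand-alone user-space model (simplified types and a
linear walk instead of KVM's interval tree; only the skip is meant to
mirror the patch), the effect of the new check looks like this:

#include <stdio.h>

#define KVM_MEMSLOT_INVALID	(1u << 16)	/* same value KVM uses */
#define PAGE_SHIFT		12

struct memslot {
	unsigned long userspace_addr;
	unsigned long npages;
	unsigned int flags;
};

/*
 * Model of __kvm_handle_hva_range(): walk the slots overlapping
 * [start, end) and pretend to invoke the arch handler on each. The
 * KVM_MEMSLOT_INVALID check is the one added by the patch; without it,
 * a slot under deletion would still reach the handler and could have a
 * stage2 mapping installed again.
 */
static void handle_hva_range(struct memslot *slots, int nr,
			     unsigned long start, unsigned long end)
{
	for (int i = 0; i < nr; i++) {
		struct memslot *slot = &slots[i];
		unsigned long slot_end = slot->userspace_addr +
					 (slot->npages << PAGE_SHIFT);

		if (slot->flags & KVM_MEMSLOT_INVALID)
			continue;	/* the fix: skip invalid slots */

		if (start < slot_end && end > slot->userspace_addr)
			printf("handler runs on slot %d\n", i);
	}
}

int main(void)
{
	struct memslot slots[] = {
		{ 0x40000000UL, 16, 0 },
		{ 0x50000000UL, 16, KVM_MEMSLOT_INVALID }, /* being deleted */
	};

	/* Only slot 0 is handled; the invalidated slot is skipped. */
	handle_hva_range(slots, 2, 0x40000000UL, 0x60000000UL);
	return 0;
}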

I don't immediately see what makes it safer. If we're not holding one
of slots_{,arch_}lock in the notifier, we can still race against the
update, can't we? I don't think holding the srcu lock helps us here.


A deadlock would be triggered if we held @slots_lock in
__kvm_handle_hva_range() with a call site like the one below. There is
a similar reason why we can't hold @slots_arch_lock in the function.

CPU-A                                  CPU-B
-----                                  -----
invalidate_range_start()
  kvm->mn_active_invalidate_count++
  __kvm_handle_hva_range
                                       ioctl(kvm, KVM_SET_USER_MEMORY_REGION)  // Delete
                                       kvm_vm_ioctl_set_memory_region
                                         mutex_lock(&kvm->slots_lock);
                                         __kvm_set_memory_region
                                           kvm_set_memslot(..., KVM_MR_DELETE)
                                             kvm_invalidate_memslot
                                               kvm_replace_memslot
                                               kvm_swap_active_memslots    // infinite wait until
                                         mutex_unlock(&kvm->slots_lock);   // kvm->mn_active_invalidate_count is 0
invalidate_range_end()
  __kvm_handle_hva_range
    mutex_lock(&kvm->slots_lock);      // lock taken by CPU-B
    mutex_unlock(&kvm->slots_lock);
  --kvm->mn_active_invalidate_count
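The same cycle can be reproduced with a plain mutex in user space.
This is an assumed simplification rather than the kernel's locking;
the variable names only mirror the KVM fields for readability (build
with -pthread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t slots_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int mn_active_invalidate_count;
static atomic_bool range_start_done;

/* CPU-A: the MMU notifier path. */
static void *cpu_a(void *unused)
{
	/* invalidate_range_start(): bump the count, no slots_lock. */
	atomic_fetch_add(&mn_active_invalidate_count, 1);
	atomic_store(&range_start_done, true);

	sleep(1);	/* let CPU-B take slots_lock and start waiting */

	/*
	 * invalidate_range_end(): needs slots_lock before the count can
	 * be dropped -- but CPU-B already holds it.
	 */
	if (pthread_mutex_trylock(&slots_lock) != 0) {
		printf("CPU-A: blocked on slots_lock -> deadlock cycle\n");
		return NULL;	/* a real mutex_lock() would hang here */
	}
	atomic_fetch_sub(&mn_active_invalidate_count, 1);
	pthread_mutex_unlock(&slots_lock);
	return NULL;
}

/* CPU-B: KVM_SET_USER_MEMORY_REGION deleting a memory slot. */
static void *cpu_b(void *unused)
{
	while (!atomic_load(&range_start_done))
		;	/* wait until the invalidation is in flight */

	pthread_mutex_lock(&slots_lock);
	/*
	 * kvm_swap_active_memslots(): waits for the count to reach zero
	 * while still holding slots_lock.
	 */
	while (atomic_load(&mn_active_invalidate_count)) {
		printf("CPU-B: waiting for mn_active_invalidate_count == 0\n");
		sleep(2);
		break;	/* bail out so the demo terminates */
	}
	pthread_mutex_unlock(&slots_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, cpu_a, NULL);
	pthread_create(&b, NULL, cpu_b, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}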

change_pte() is always surrounded by invalidate_range_start() and
invalidate_range_end(). It means kvm->mn_active_invalidate_count is
always larger than zero when change_pte() is called. Under this
condition (kvm->mn_active_invalidate_count > 0), the swap between the
inactive and active memory slots by kvm_swap_active_memslots() can't
be done. So there are two possible cases for a memory slot when
change_pte() is called: (a) the slot has been marked as
KVM_MEMSLOT_INVALID in the active memory slots by
kvm_invalidate_memslot(), which is called before
invalidate_range_start(); (b) the slot has been deleted from the
active memory slots. We're only concerned with (a) when the active
memory slots are iterated in __kvm_handle_hva_range().

static void kvm_mmu_notifier_change_pte(...)
{
	:
	/*
	 * .change_pte() must be surrounded by .invalidate_range_{start,end}().
	 * If mmu_invalidate_in_progress is zero, then no in-progress
	 * invalidations, including this one, found a relevant memslot at
	 * start(); rechecking memslots here is unnecessary. Note, a false
	 * positive (count elevated by a different invalidation) is sub-optimal
	 * but functionally ok.
	 */
	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
		return;
	:
}


The srcu lock in __kvm_handle_hva_range() prevents the swap of the
active and inactive memory slots by kvm_swap_active_memslots(). For
this particular case, it's not relevant because the swap between the
inactive and active memory slots has already been done once, before
invalidate_range_start() is called.
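The blocking property itself is easy to see in a toy model. Below, a
hand-rolled reader count stands in for the SRCU read-side section;
nothing here is kernel code (build with -pthread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int srcu_readers;	/* stand-in for srcu_read_lock() */

static void *reader(void *unused)
{
	atomic_fetch_add(&srcu_readers, 1);	/* srcu_read_lock()   */
	sleep(1);	/* ... walking the current memory slots ...   */
	atomic_fetch_sub(&srcu_readers, 1);	/* srcu_read_unlock() */
	return NULL;
}

static void *swapper(void *unused)
{
	/*
	 * kvm_swap_active_memslots(): publish the new memory slots,
	 * then wait for every pre-existing reader to leave, like
	 * synchronize_srcu() does.
	 */
	printf("swapper: new memslots published\n");
	while (atomic_load(&srcu_readers))
		;	/* grace period: blocked while readers exist */
	printf("swapper: grace period over, old memslots freed\n");
	return NULL;
}

int main(void)
{
	pthread_t r, s;

	pthread_create(&r, NULL, reader, NULL);
	usleep(100 * 1000);	/* let the reader get inside first */
	pthread_create(&s, NULL, swapper, NULL);
	pthread_join(r, NULL);
	pthread_join(s, NULL);
	return 0;
}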

Thanks,
Gavin