Re: [PATCH] KVM: move memslot invalidation later than possible failures

From: Sean Christopherson
Date: Wed Nov 09 2022 - 19:33:37 EST


On Wed, Nov 09, 2022, Yan Zhao wrote:
> On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote:
> > On Tue, Nov 08, 2022, Yan Zhao wrote:
> > > For memslot delete and move, kvm_invalidate_memslot() is required before
> > > the real changes committed.
> > > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> > > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> > > arch x86.
> > > And according to the definition in kvm_page_track_notifier_node, users can
> > > drop write-protection for the pages in the memory slot on receiving
> > > .track_flush_slot.
> >
> > Ugh, that's a terrible API. The entire page track API is a mess, e.g. KVMGT is
> > forced to grab its own references to KVM and also needs to manually acquire/release
> > mmu_lock in some flows but not others.
> >
> > Anyways, this is a flaw in the page track API that should be fixed. Flushing a
> > slot should not be overloaded to imply "this slot is gone", it should be a flush
> > command, no more, no less.
> hmm, but KVM also registers to the page track
> "node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;"
> And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow
> page tables) is zapped. And during the zap, unaccount_shadowed() will drop the
> write protection. But KVM is ok because the dropped write-protection can be
> rebuilt during rebuilding the shadow page table.
> So, for .track_flush_slot, it's expected that "users can drop write-protection
> for the pages in the memory slot", right?

No. KVM isn't actually dropping write-protection, because for the internal KVM
case, KVM obliterates all of its page tables.

> > AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
> > of adding another hook that's invoked when the memory region change is committed.
> >
> Do you mean adding a new hook in page track, e.g. .track_slot_change?
> Then right before committing slot changes, call this interface to notify slot
> DELETE/MOVE?
>
> Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to
> support below usecase:
> 1. KVM pre-populated a memslot
> 2. memslot MOVE happened
> 3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails).
> So also add a new interface for the MOVE failure?
>
> > That would allow KVMGT to fix another bug. If a memory region is moved and the
> > new region partially overlaps the old region, KVMGT technically probably wants to
> > retain its write-protection scheme. Though that's probably not worth supporting,
> > might be better to say "don't move memory regions if KVMGT is enabled", because
> > AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
> > broken for years and no one noticed).
> >
> So, could we disable MOVE in KVM at all?

Ideally, yes. Practically? Dunno. It's difficult to know whether or not there
are users out there.

> > Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
> > region is moved to, maybe we can get away with rejecting MOVE if an external
> > page tracker cares about the slot in question.
> >
> > > However, if kvm_prepare_memory_region() fails, the later
> > > kvm_activate_memslot() will only swap back the original slot, leaving
> > > previous write protection not recovered.
> > >
> > > This may not be a problem for kvm itself as a page tracker user, but may
> > > cause problem to other page tracker users, e.g. kvmgt, whose
> > > write-protected pages are removed from the write-protected list and not
> > > added back.
> > >
> > > So call kvm_prepare_memory_region first for meta data preparation before
> > > the slot invalidation so as to avoid failure and recovery.
> >
> > IIUC, this is purely a theoretical bug and practically speaking can't be a problem
> > in practice, at least not without completely breaking KVMGT.
> >
> > On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
> > protected VM case is the only scenario where DELETE can fail on any architecture).
> > The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
> > kvm_arch_prepare_memory_region().
> But as long as with current code sequence, we can't relying on that
> kvm_prepare_memory_region() will never fail for DELETE.
> Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future
> possible broken.

Agreed, I just don't want to touch the common memslots code unless it's necessary.

> > And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
> > max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
> > the old memslot was not, and so KVM needs to allocate new metadata due to the new
> > memslot needed larger arrays.
> kvm_prepare_memory_region() can also fail for MOVE during live migration when
> memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap().
> and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if
> kvm_alloc_memslot_metadata() fails due to -ENOMEM.
> BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not,
> and so KVM needs to allocate new metadata due to the new memslot needed
> larger arrays".
> >
> > As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
> > that supports KVMGT and moves relevant memory regions, let alove does something
> > that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.
> >
> > Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
> > the existing dirty bitmap but doesn't shift the dirty bits. This is likely
> Do you mean, for the new slot in MOVE, the new dirty bitmap should be
> cleared? Otherwise, why shift is required, given mem->userspace_addr and npages
> are not changed and dirty_bitmap is indexed using rel_gfn
> (rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap
> size to BITS_PER_LONG.

Oh, you're right. I forgot that userspace would also be doing a gfn-relative
calculation in the ridiculously theoretically scenario that a memslot is moved
while it's being dirty-logged.

> > another combination that KVM can simply reject.
> >
> > Assuming this is indeed purely theoretical, that should be called out in the
> > changelog. Or as above, simply disallow MOVE in this case, which probably has
> > my vote.
> >
> Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered
> in upstream QEMU.
>
> <...>
>
> > I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
> > correct. It might be, but there are so many subtleties in this code that I don't
> > want to change this ordering unless absolutely necessary, or at least not without
> > an in-depth analysis and a pile of tests. E.g. it's possible one or more
> > architectures relies on unmapping, flushing, and invalidating the old region
> > prior to preparing the new region, e.g. to reuse arch memslot data.
> yes. what about just moving kvm_arch_flush_shadow_memslot() and
> kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()?

I'm not dead set against tweaking the memslots flow, but I don't want to do so to
fix this extremely theoretical bug. In other words, I want to fix this by giving
KVM-GT a more appropriate hook, not by shuffling common KVM code to fudge around
a poor KVM x86 API.

And if KVM-GT moves away from track_flush_slot(), we can delete the hook entirely
after cleaning up clean up another pile of ugly: KVM always registers a page-track
notifier because it relies on the track_flush_slot() call to invoke
kvm_mmu_invalidate_zap_pages_in_memslot(), even when KVM isn't tracking anything.
I'll send patches for this; if/when both land, track_flush_slot() can be deleted
on top.