Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

From: Sean Christopherson
Date: Thu Aug 17 2023 - 13:54:31 EST


On Thu, Aug 17, 2023, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 11:18:03AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 08, 2023, Yan Zhao wrote:
> > > This series optimizes the KVM mmu_notifier .change_pte() handler in the x86 TDP
> > > MMU (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and prefetching
> > > the notified new PFN into SPTEs directly in the handler.
> > >
> > > As in [1], .change_pte() has been dead code on x86 for 10+ years.
> > > Patch 1 drops the dead code in the x86 TDP MMU to save CPU cycles and to
> > > prepare for the optimization in patch 2.
> >
> > If we're going to officially kill the long-dead attempt at optimizing KSM, I'd
> > strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all
> > architectures and remove it from mmu_notifiers. The only reason I haven't proposed
> > such patches is because I didn't want it to backfire and lead to someone trying
> > to resurrect the optimizations for KSM.
> >
> > > Patch 2 optimizes the TDP MMU's .change_pte() handler to prefetch SPTEs
> > > directly in the handler, using the PFN info passed to .change_pte(), so that
> > > each vCPU write that triggers .change_pte() doesn't have to undergo two
> > > VM-Exits and two TDP page faults.
> >
> > IMO, prefaulting guest memory as writable is better handled by userspace, e.g. by
> > using QEMU's prealloc option. It's more coarse grained, but at a minimum it's
> > sufficient for improving guest boot time, e.g. by preallocating memory below 4GiB.
> >
> > And we can do even better, e.g. by providing a KVM ioctl() to allow userspace to
> > prefault memory not just into the primary MMU, but also into KVM's MMU. Such an
> > ioctl() is basically mandatory for TDX, we just need to morph the support being
> > added by TDX into a generic ioctl()[*]
> >
> > Prefaulting guest memory as writable into the primary MMU should be able to achieve
> > far better performance than hooking .change_pte(), as it will avoid the mmu_notifier
> > invalidation, e.g. won't trigger taking mmu_lock for write and the resulting remote
> > TLB flush(es). And a KVM ioctl() to prefault into KVM's MMU should eliminate page
> > fault VM-Exits entirely.
> >
> > Explicit prefaulting isn't perfect, but IMO the value added by prefetching in
> > .change_pte() isn't enough to justify carrying the hook and the code in KVM.
> >
> > [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@xxxxxxxxxx
> Hi Sean,
> As I didn't describe the full picture of patch 2 well in the cover letter,
> may I ask you to take a look at patch 2 to see if you like it? (in case you
> have only read the cover letter)

I read patch two; I replied to the cover letter because I wanted to discuss the two
patches together, since implementing the CoW optimization effectively means
dropping the long-dead KSM optimization.

> What I observed is that each vCPU write to a CoW page in the primary MMU
> leads to two TDP page faults.
> Then, I just update the secondary MMU during the first TDP page fault
> to avoid the second one.
> It's not a blind prefetch (I checked the vCPU to ensure it's triggered
> by a vCPU operation as much as possible)

Yes, that's part of the complexity I don't like.

> and it can benefit guests that don't explicitly request prefaulting memory as
> writable.

Yes, I'm arguing that the benefit isn't significant, and that the use cases it
might benefit aren't things people care about optimizing.

I'm very skeptical that shaving those 8000 VM-Exits will translate to a meaningful
reduction in guest boot time, let alone scale beyond very specific scenarios and
configurations, which, again, are likely suboptimal in nature. Actually, they most
definitely are suboptimal, because the fact that this provides any benefit
whatsoever means that either your VM isn't being backed with hugepages, or it's
being backed with THP and transparent_hugepage/use_zero_page is enabled (and thus
is generating CoW behavior).

Enabling THP or using HugeTLB (which again can be done on a subset of guest memory)
will have a far, far bigger impact on guest performance. Ditto for disabling
use of the huge zero_page when backing VMs with THP (any page touched by the guest
is all but guaranteed to be written sooner rather than later, so using the zero_page
doesn't make a whole lot of sense).

E.g. a single CoW operation will take mmu_lock for write three times:
invalidate_range_start(), change_pte(), and invalidate_range_end(), not to mention
the THP zero_page CoW will first fault-in a read-only mapping, then split that
mapping, and then do CoW on the 4KiB PTEs, which is *really* suboptimal.
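
Roughly (a simplified sketch, not the actual mm/memory.c code; the function name
and parameters below are illustrative, and the exact mmu_notifier_range_init()
signature varies across kernel versions), the write-protect CoW path in
wp_page_copy() does:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Illustrative sketch of the notifier sequence a single CoW fault in
 * wp_page_copy() generates; each of the three notifier calls below lands
 * in a KVM mmu_notifier callback that takes mmu_lock for write.
 */
static void cow_fault_notifier_sketch(struct vm_area_struct *vma,
				      unsigned long addr, pte_t *ptep,
				      pte_t new_pte)
{
	struct mmu_notifier_range range;

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
				addr & PAGE_MASK,
				(addr & PAGE_MASK) + PAGE_SIZE);

	/* #1: .invalidate_range_start() -> mmu_lock for write */
	mmu_notifier_invalidate_range_start(&range);

	/* #2: set_pte_at_notify() fires .change_pte() -> mmu_lock for write */
	set_pte_at_notify(vma->vm_mm, addr, ptep, new_pte);

	/* #3: .invalidate_range_end() -> mmu_lock for write */
	mmu_notifier_invalidate_range_end(&range);
}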

Actually, I don't even completely understand how you're seeing CoW behavior in
the first place. No sane guest should blindly read (or execute) uninitialized
memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK
QEMU doesn't support Hyper-V's enlightenment that lets the guest assume memory has
been zeroed by the hypervisor. If KSM is to blame, then my answer is to turn off
KSM, because turning on KSM is antithetical to guest performance (not to mention
that KSM is wildly insecure for the guest, especially given the number of speculative
execution attacks these days).

If there's something else going on, i.e. if your VM really is somehow generating
reads before writes, and if we really want to optimize use cases that can't use
hugepages for whatever reason, I would much prefer to do something like add a
memslot flag to state that the memslot should *always* be mapped writable. Because
outside of setups that use KSM, the only reason I can think of to not map memory
writable straightaway is if userspace somehow knows the guest isn't going to write
that memory.

If it weren't for KSM, and if it wouldn't potentially be a breaking change, I
would even go so far as to say that KVM should always map writable memslots as
writable in the guest.

E.g. minus the uAPI, this is a lot simpler to implement and maintain.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..6c4640483881 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2727,10 +2727,14 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 		return KVM_PFN_NOSLOT;
 	}
 
-	/* Do not map writable pfn in the readonly memslot. */
-	if (writable && memslot_is_readonly(slot)) {
-		*writable = false;
-		writable = NULL;
+	if (writable) {
+		if (memslot_is_readonly(slot)) {
+			*writable = false;
+			writable = NULL;
+		} else if (memslot_is_always_writable(slot)) {
+			*writable = true;
+			write_fault = true;
+		}
 	}
 
 	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
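
memslot_is_always_writable() obviously doesn't exist; purely as a hypothetical
sketch (the flag name and bit below are made up), it could be a new memslot flag
checked the same way memslot_is_readonly() checks KVM_MEM_READONLY:

/* Hypothetical memslot flag, not an existing uAPI; name and bit are made up. */
#define KVM_MEM_ALWAYS_WRITABLE	(1UL << 2)

static bool memslot_is_always_writable(const struct kvm_memory_slot *slot)
{
	return slot->flags & KVM_MEM_ALWAYS_WRITABLE;
}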


And FWIW, removing .change_pte() entirely, even without any other optimizations,
will also benefit those guests, as it will remove a source of mmu_lock contention
along with all of the overhead of invoking callbacks, walking memslots, etc. And
removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks,
i.e. callbacks when memory for the VMM takes a CoW fault.

So yeah, unless I'm misunderstanding the bigger picture, the more I look at this,
the more I'm against it.