Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

From: Yan Zhao
Date: Wed Aug 16 2023 - 20:28:16 EST


On Wed, Aug 16, 2023 at 11:18:03AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Yan Zhao wrote:
> > This series optimizes the KVM mmu notifier .change_pte() handler in x86 TDP MMU
> > (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code and prefetching
> > notified new PFN into SPTEs directly in the handler.
> >
> > As in [1], .change_pte() has been dead code on x86 for 10+ years.
> > Patch 1 drops the dead code in x86 TDP MMU to save CPU cycles and prepare
> > for optimization in TDP MMU in patch 2.
>
> If we're going to officially kill the long-dead attempt at optimizing KSM, I'd
> strongly prefer to rip out .change_pte() entirely, i.e. kill it off in all
> architectures and remove it from mmu_notifiers. The only reason I haven't proposed
> such patches is because I didn't want it to backfire and lead to someone trying
> to resurrect the optimizations for KSM.
>
> > Patch 2 optimizes TDP MMU's .change_pte() handler to prefetch SPTEs directly
> > in the handler, using the PFN info carried by .change_pte(), so that each
> > vCPU write that triggers .change_pte() no longer has to go through two
> > VM-Exits and two TDP page faults.
>
> IMO, prefaulting guest memory as writable is better handled by userspace, e.g. by
> using QEMU's prealloc option. It's more coarse grained, but at a minimum it's
> sufficient for improving guest boot time, e.g. by preallocating memory below 4GiB.
>
> And we can do even better, e.g. by providing a KVM ioctl() to allow userspace to
> prefault memory not just into the primary MMU, but also into KVM's MMU. Such an
> ioctl() is basically mandatory for TDX, we just need to morph the support being
> added by TDX into a generic ioctl()[*]
>
> Prefaulting guest memory as writable into the primary MMU should be able to achieve
> far better performance than hooking .change_pte(), as it will avoid the mmu_notifier
> invalidation, e.g. won't trigger taking mmu_lock for write and the resulting remote
> TLB flush(es). And a KVM ioctl() to prefault into KVM's MMU should eliminate page
> fault VM-Exits entirely.
>
> Explicit prefaulting isn't perfect, but IMO the value added by prefetching in
> .change_pte() isn't enough to justify carrying the hook and the code in KVM.
>
> [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@xxxxxxxxxx

Hi Sean,
I didn't describe patch 2 well in the cover letter. If you have only read
the cover letter so far, could you take a look at patch 2 itself and see
whether you like it?
What I observed is that each vCPU write to a COW page in the primary MMU
leads to two TDP page faults.
So I update the secondary MMU during the first TDP page fault to avoid
the second one.
It's not a blind prefetch (I check the vCPU to ensure, as far as possible,
that the update is triggered by a vCPU operation), and it can benefit guests
that don't explicitly request that memory be prefaulted as writable.
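
To make the two-fault sequence concrete, here is a toy model of it. This is
only a sketch and not KVM code: the Page class, guest_write() and the
prefetch_in_change_pte flag are hypothetical names, and the model assumes that
without the prefetch the SPTE is simply dropped when the primary MMU breaks
COW, so the vCPU has to fault once more to map the new PFN.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    host_pfn: int             # PFN mapped in the primary MMU
    host_writable: bool       # False while the page is still COW-shared
    spte_pfn: Optional[int]   # PFN mapped in the secondary MMU, None if zapped
    spte_writable: bool = False


def guest_write(page: Page, prefetch_in_change_pte: bool) -> int:
    """Return how many TDP page faults a single guest write takes (toy model)."""
    faults = 0
    # The write can only complete once the SPTE maps the current host PFN
    # as writable.
    while not (page.spte_pfn == page.host_pfn and page.spte_writable):
        faults += 1
        if not page.host_writable:
            # Fault on a COW page: the primary MMU copies the page, which is
            # the point where .change_pte() would be called with the new PFN.
            page.host_pfn += 1000  # hypothetical new PFN after the copy
            page.host_writable = True
            if prefetch_in_change_pte:
                # Modelled patch 2 behaviour: install the new PFN right away.
                page.spte_pfn, page.spte_writable = page.host_pfn, True
            else:
                # Assumed current behaviour: the SPTE is gone after the
                # notifier, so the vCPU must fault again.
                page.spte_pfn, page.spte_writable = None, False
        else:
            # Primary MMU is already writable: just install the SPTE.
            page.spte_pfn, page.spte_writable = page.host_pfn, True
    return faults
```

In this model, guest_write(Page(1, False, 1), False) counts two faults for
the write, while guest_write(Page(1, False, 1), True) counts one.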


Thanks
Yan