Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

From: Yan Zhao
Date: Fri Aug 18 2023 - 06:45:53 EST


On Thu, Aug 17, 2023 at 10:53:25AM -0700, Sean Christopherson wrote:
...

> > What I observed is that each vCPU write to a COW page in primary MMU
> > will lead to twice TDP page faults.
> > Then, I just update the secondary MMU during the first TDP page fault
> > to avoid the second one.
> > It's not a blind prefetch (I checked the vCPU to ensure it's triggered
> > by a vCPU operation as much as possible)
>
> Yes, that's part of the complexity I don't like.
>
> > and it can benefit guests who doesn't explicitly request a prefault memory as
> > write.
>
> Yes, I'm arguing that the benefit isn't significant, and that the use cases it
> might benefit aren't things people care about optimizing.
>
> I'm very skeptical that shaving those 8000 VM-Exits will translate to a meaningful
> reduction in guest boot time, let alone scale beyond very specific scenarios and
> configurations, which again, are likely suboptimal in nature. Actually, they most
> definitely are suboptimal, because the fact that this provides any benefit
> whatsoever means that either your VM isn't being backed with hugepages, or it's
> being backed with THP and transparent_hugepage/use_zero_page is enabled (and thus
> is generating CoW behavior).
Yes, it's being backed with THP and transparent_hugepage/use_zero_page is enabled.

>
> Enabling THP or using HugeTLB (which again can be done on a subset of guest memory)
> will have a far, far bigger impact on guest performance. Ditto for disabling
> using the huge zero_page when backing VMs with THP (any page touched by the guest
> is all but guaranteed to be written sooner than later, so using the zero_page
> doesn't make a whole lot of sense).
>
> E.g. a single CoW operation will take mmu_lock for write three times:
> invalidate_range_start(), change_pte(), and invalidate_range_end(), not to mention
> the THP zero_page CoW will first fault-in a read-only mapping, then split that
> mapping, and then do CoW on the 4KiB PTEs, which is *really* suboptimal.
>
> Actually, I don't even completely understand how you're seeing CoW behavior in
> the first place. No sane guest should blindly read (or execute) uninitialized
> memory. IIUC, you're not running a Windows guest, and even if you are, AFAIK
> QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> been zeroed by the hypervisor. If KSM is to blame, then my answer it to turn off
> KSM, because turning on KSM is antithetical to guest performance (not to mention
> that KSM is wildly insecure for the guest, especially given the number of speculative
> execution attacks these days).
I'm running a linux guest.
KSM is not turned on both in guest and host.
Both guest and host have turned on transparent huge page.

The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
which will cause
(1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
mapped.
(2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
which will fail

The guest then writes the GFN.
This step will trigger (huge pmd split for huge page case) and .change_pte().

My guest is surely a sane guest. But currently I can't find out why
certain pages are read before write.
Will return back to you the reason after figuring it out after my long vacation.

>
> If there's something else going on, i.e. if your VM really is somehow generating
> reads before writes, and if we really want to optimize use cases that can't use
> hugepages for whatever reason, I would much prefer to do something like add a
> memslot flag to state that the memslot should *always* be mapped writable. Because
Will check if this flag is necessary after figuring out the reason.

> outside of setups that use KSM, the only reason I can think of to not map memory
> writable straightaway is if userspace somehow knows the guest isn't going to write
> that memory.
>
> If it weren't for KSM, and if it wouldn't potentially be a breaking change, I
> would even go so far as to say that KVM should always map writable memslots as
> writable in the guest.
>
> E.g. minus the uAPI, this is a lot simpler to implement and maintain.
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dfbaafbe3a00..6c4640483881 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2727,10 +2727,14 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> return KVM_PFN_NOSLOT;
> }
>
> - /* Do not map writable pfn in the readonly memslot. */
> - if (writable && memslot_is_readonly(slot)) {
> - *writable = false;
> - writable = NULL;
> + if (writable) {
> + if (memslot_is_readonly(slot)) {
> + *writable = false;
> + writable = NULL;
> + } else if (memslot_is_always_writable(slot)) {
> + *writable = true;
> + write_fault = true;
> + }
> }
>
> return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
>
>

> And FWIW, removing .change_pte() entirely, even without any other optimizations,
> will also benefit those guests, as it will remove a source of mmu_lock contention
> along with all of the overhead of invoking callbacks, walking memslots, etc. And
> removing .change_pte() will benefit *all* guests by eliminating unrelated callbacks,
> i.e. callbacks when memory for the VMM takes a CoW fault.
>
If with above "always write_fault = true" solution, I think it's better.

> So yeah, unless I'm misunderstanding the bigger picture, the more I look at this,
> the more I'm against it.