[PATCH 2/2] KVM: x86/mmu: prefetch SPTE directly in x86 TDP MMU's change_pte() handler

From: Yan Zhao
Date: Tue Aug 08 2023 - 11:57:14 EST


Optimize the TDP MMU's .change_pte() handler to prefetch SPTEs directly in
the handler, using the PFN info passed to .change_pte(), so that a vCPU
write that triggers .change_pte() no longer has to take two VMExits and two
TDP page faults.

When there's a running vCPU on the current pCPU, .change_pte() is probably
caused by a vCPU write to a guest page that was previously faulted in by a
vCPU read.

The detailed sequence is as below:
1. The vCPU reads a guest page. Though the page is in a RW memslot, both
the primary MMU and KVM's secondary MMU map it with read-only PTEs
during the page fault.
2. The vCPU writes to this guest page.
3. A VMExit occurs and kvm_tdp_mmu_page_fault() calls GUP, which triggers
COW, so .invalidate_range_start(), .change_pte() and
.invalidate_range_end() are called successively.
4. kvm_tdp_mmu_page_fault() returns retry, because it always finds the
current page fault stale due to the mmu_invalidate_seq increased
in .invalidate_range_end().
5. VMExit and page fault again.
6. The writable SPTE is mapped successfully.

That is, each guest write to a COW page must trigger a VMExit and a KVM TDP
page fault twice, even though .change_pte() has already notified KVM of the
new PTE to be mapped.
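
For illustration only, the retry behavior described above can be modeled
in a few lines of user-space C. This is a toy sketch and not kernel code:
struct toy_mmu, toy_cow_notifiers() and toy_guest_write() are hypothetical
names, and the invalidate_seq field merely stands in for
mmu_invalidate_seq. It assumes the page starts out mapped read-only and
that COW runs once, during the first write fault.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, none is a KVM API. */
struct toy_mmu {
	unsigned long invalidate_seq;	/* stands in for mmu_invalidate_seq */
	bool spte_present;
	bool spte_writable;
};

/*
 * Models .invalidate_range_start(), .change_pte() and
 * .invalidate_range_end() being called back to back during COW.
 */
static void toy_cow_notifiers(struct toy_mmu *mmu, bool prefetch)
{
	mmu->spte_present = false;	/* range_start zaps the old SPTE */
	if (prefetch) {			/* change_pte installs the new SPTE */
		mmu->spte_present = true;
		mmu->spte_writable = true;
	}
	mmu->invalidate_seq++;		/* range_end bumps the sequence */
}

/* Returns how many write faults are taken before the write can proceed. */
static int toy_guest_write(struct toy_mmu *mmu, bool prefetch)
{
	bool cow_done = false;
	int faults = 0;

	while (!(mmu->spte_present && mmu->spte_writable)) {
		unsigned long seq = mmu->invalidate_seq;

		faults++;
		if (!cow_done) {
			toy_cow_notifiers(mmu, prefetch);
			cow_done = true;
		}
		if (seq != mmu->invalidate_seq)
			continue;	/* fault is stale: retry */

		/* Sequence unchanged: the fault handler may map the page. */
		mmu->spte_present = true;
		mmu->spte_writable = true;
	}
	assert(faults > 0);
	return faults;
}
```

Starting from a read-only mapping ({ 0, true, false }), the model takes
two faults when prefetch is false and one fault when prefetch is true,
which mirrors steps 3 to 6 above under these simplifying assumptions.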

Since .change_pte() is called at a point where the update is guaranteed to
succeed in the primary MMU, prefetching the new PFN directly in the
.change_pte() handler on the secondary MMU (KVM MMU) saves KVM the second
VMExit and TDP page fault.

In tests in my environment (8 vCPUs, 16G memory, no assigned devices),
8000+ (with OVMF) and 17000+ (with SeaBIOS) TDP page faults are saved
during each VM boot-up, and 44000+ TDP page faults are saved while booting
an L2 VM with 2G memory.

Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
---
arch/x86/kvm/mmu/tdp_mmu.c | 69 +++++++++++++++++++++++++++++++++++++-
1 file changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 89a1f222e823..672a1e333c92 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1243,10 +1243,77 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
*/
bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
+ struct kvm_mmu_page *root;
+ struct kvm_mmu_page *sp;
+ bool wrprot, writable;
+ struct kvm_vcpu *vcpu;
+ struct tdp_iter iter;
+ bool flush = false;
+ kvm_pfn_t pfn;
+ u64 new_spte;
+
/* Huge pages aren't expected to be modified */
WARN_ON(pte_huge(range->arg.pte) || range->start + 1 != range->end);

- return false;
+ /*
+ * Get the currently running vCPU, which make_spte() needs for the prefetch
+ * below. If there is no running vCPU, .change_pte() was probably not
+ * triggered by a vCPU write, so skip prefetching SPTEs in that case.
+ * Also only prefetch for L1 vCPUs.
+ * If the vCPU is scheduled out later, it's still all right to prefetch
+ * with the same vCPU, except that the prefetched SPTE may not be accessed
+ * immediately.
+ */
+ vcpu = kvm_get_running_vcpu();
+ if (!vcpu || vcpu->kvm != kvm || is_guest_mode(vcpu))
+ return flush;
+
+ writable = !(range->slot->flags & KVM_MEM_READONLY) && pte_write(range->arg.pte);
+ pfn = pte_pfn(range->arg.pte);
+
+ /* Do not allow rescheduling, same as kvm_tdp_mmu_handle_gfn() */
+ for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+ rcu_read_lock();
+
+ tdp_root_for_each_pte(iter, root, range->start, range->end) {
+ if (iter.level > PG_LEVEL_4K)
+ continue;
+
+ sp = sptep_to_sp(rcu_dereference(iter.sptep));
+
+ /* Build the new SPTE as a prefetched SPTE */
+ wrprot = make_spte(vcpu, sp, range->slot, ACC_ALL, iter.gfn,
+ pfn, iter.old_spte, true, true, writable,
+ &new_spte);
+ /*
+ * Do not prefetch the new PFN for a page-tracked GFN, as we want
+ * the page fault handler to be triggered later.
+ */
+ if (wrprot)
+ continue;
+
+ /*
+ * Warn if an existing SPTE is found because it must not happen:
+ * .change_pte() must be surrounded by .invalidate_range_{start,end}(),
+ * so (1) kvm_unmap_gfn_range() should have zapped the old SPTE, and
+ * (2) the page fault handler cannot install a new SPTE until
+ * .invalidate_range_end() completes.
+ *
+ * Even if the warning is hit and flush is true
+ * (which indicates a bug in the mmu notifier handler),
+ * there's no need to handle the remote TLB flush under RCU protection:
+ * the target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
+ * shadow page.
+ */
+ flush |= WARN_ON(is_shadow_present_pte(iter.old_spte));
+ tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
+
+ }
+
+ rcu_read_unlock();
+ }
+
+ return flush;
}

/*
--
2.17.1