[PATCH v2 00/30] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

From: Sean Christopherson
Date: Thu Dec 23 2021 - 17:23:49 EST


Overhaul TDP MMU's handling of zapping and TLB flushing to reduce the
number of TLB flushes, fix soft lockups and RCU stalls, avoid blocking
vCPUs for long durations while zapping paging structures, and clean up
the zapping code.

Patches 01-03 were allegedly queued when posted separately, but they haven't
shown up yet, and this series depends/conflicts on/with them, so here they
are again.

Based on kvm/queue-5.17, commit 1c4261809af0 ("KVM: SVM: include CR3 ...").

The largest cleanup is to separate the flows for zapping roots (zap
_everything_), zapping leaf SPTEs (zap guest mappings for whatever reason),
and zapping a specific SP (NX recovery). They're currently smushed into a
single zap_gfn_range(), which was a good idea at the time, but became a
mess when trying to handle the different rules, e.g. TLB flushes aren't
needed when zapping a root because KVM can safely zap a root if and only
if it's unreachable.
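
To make the "different rules" point concrete, here is a tiny userspace
model, not kernel code.  All names (vm_model, zap_leafs(),
zap_unreachable_root(), flush_remote_tlbs()) are made up for illustration
and the "SPTEs" are just an array of integers.  The only thing it tries to
show is that the leaf-zapping flow hands a "flush needed" result back to
its caller, while the flow for an unreachable root never asks for a flush.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a VM; it only counts remote TLB flushes. */
struct vm_model {
	unsigned long remote_flushes;
};

static void flush_remote_tlbs(struct vm_model *vm)
{
	vm->remote_flushes++;
}

/*
 * Zap present leaf entries in [start, end).  The root is still reachable,
 * so the caller must flush if anything was zapped.  The accumulated @flush
 * is passed in and returned so that several ranges can share one flush.
 */
static bool zap_leafs(uint64_t *sptes, size_t start, size_t end, bool flush)
{
	for (size_t i = start; i < end; i++) {
		if (!sptes[i])
			continue;
		sptes[i] = 0;
		flush = true;
	}
	return flush;
}

/*
 * Zap everything under a root that is assumed to be unreachable.  Per the
 * reasoning above, this flow does not request a TLB flush.
 */
static void zap_unreachable_root(uint64_t *sptes, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		sptes[i] = 0;
}
```

A caller in this model zaps as many ranges as it likes, threading the
flush result through, and calls flush_remote_tlbs() at most once at the
end.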

To solve the soft lockups, stalls, and vCPU performance issues:

- Defer remote TLB flushes to the caller when zapping TDP MMU shadow
  pages by relying on RCU to ensure the paging structure isn't freed
  until all vCPUs have exited the guest.

- Allow yielding when zapping TDP MMU roots in response to the root's
  last reference being put. This requires a bit of trickery to ensure
  the root is reachable via mmu_notifier, but it's not too gross.

- Zap roots in two passes to avoid holding RCU for potentially hundreds
  of seconds when zapping a guest with terabytes of memory that is backed
  entirely by 4KiB SPTEs.

- Zap defunct roots asynchronously via the common workqueue so that a
  vCPU doesn't get stuck doing the work if the vCPU happens to drop the
  last reference to a root.
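
For the two-pass idea, a rough userspace model may help.  This is a sketch
under stated assumptions, not the code in the series: the tree uses a
fanout of 4 instead of 512, the first pass is assumed to work one level
below the root, and every name (struct sp, zap_at_level(),
zap_root_two_pass(), yield_point()) is hypothetical.  It only measures how
many tables get freed between two points where the zapper could yield.

```c
#include <stdlib.h>

#define FANOUT 4	/* real x86 tables have 512 entries; small for the demo */

struct sp {
	int level;			/* 1 = lowest-level table */
	struct sp *child[FANOUT];	/* NULL = not present */
};

struct zap_stats {
	unsigned long freed;		/* tables freed in total */
	unsigned long cur_batch;	/* tables freed since the last yield point */
	unsigned long max_batch;	/* longest run without a yield point */
};

/* Build a fully populated tree whose top table sits at @level. */
static struct sp *sp_alloc_full(int level)
{
	struct sp *sp = calloc(1, sizeof(*sp));

	if (!sp)
		abort();
	sp->level = level;
	for (int i = 0; level > 1 && i < FANOUT; i++)
		sp->child[i] = sp_alloc_full(level - 1);
	return sp;
}

/* Free a whole subtree in one go, i.e. with no chance to yield. */
static void sp_free_recursive(struct sp *sp, struct zap_stats *st)
{
	for (int i = 0; i < FANOUT; i++)
		if (sp->child[i])
			sp_free_recursive(sp->child[i], st);
	free(sp);
	st->freed++;
	st->cur_batch++;
}

static void yield_point(struct zap_stats *st)
{
	if (st->cur_batch > st->max_batch)
		st->max_batch = st->cur_batch;
	st->cur_batch = 0;
}

/* Zap each present entry of every table at @zap_level, yielding in between. */
static void zap_at_level(struct sp *sp, int zap_level, struct zap_stats *st)
{
	for (int i = 0; i < FANOUT; i++) {
		if (!sp->child[i])
			continue;
		if (sp->level > zap_level) {
			zap_at_level(sp->child[i], zap_level, st);
			continue;
		}
		sp_free_recursive(sp->child[i], st);
		sp->child[i] = NULL;
		yield_point(st);
	}
}

/* Pass 1 bounds the work per entry; pass 2 frees the now-empty upper tables. */
static void zap_root_two_pass(struct sp *root, int mid_level,
			      struct zap_stats *st)
{
	zap_at_level(root, mid_level, st);
	zap_at_level(root, root->level, st);
}
```

In this model, zapping only at the root level frees an entire subtree per
root entry before it can yield, whereas the two-pass variant frees the same
number of tables in total but in much smaller batches.  The root table
itself is left for the caller to free.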

The selftest at the end allows populating a guest with the max amount of
memory allowed by the underlying architecture. The most I've tested is
~64TiB (MAXPHYADDR=46) as I don't have easy access to a system with
MAXPHYADDR=52. The selftest compiles on arm64 and s390x, but otherwise
hasn't been tested outside of x86-64. It will hopefully do something
useful as is, but there's a non-zero chance it won't get past init with
a high max memory. Running on x86 without the TDP MMU is comically slow.
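
For reference, the raw address-space arithmetic behind those numbers can
be sketched as below.  The helper names are made up for this note and are
not taken from the selftest, which may well reserve part of the address
space; this only shows what a given MAXPHYADDR can address in principle.

```c
#include <stdint.h>

/* Bytes addressable with @maxphyaddr_bits physical address bits (< 64). */
static uint64_t max_guest_phys_bytes(unsigned int maxphyaddr_bits)
{
	return UINT64_C(1) << maxphyaddr_bits;
}

/* Convert bytes to whole TiB (2^40 bytes). */
static uint64_t bytes_to_tib(uint64_t bytes)
{
	return bytes >> 40;
}

/* Number of 4KiB pages (2^12 bytes) needed to back @bytes. */
static uint64_t nr_4k_pages(uint64_t bytes)
{
	return bytes >> 12;
}
```

With MAXPHYADDR=46 this works out to 64 TiB, i.e. 2^34 4KiB pages, and
with MAXPHYADDR=52 to 4096 TiB (4 PiB).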

v2:
- Drop patches that were applied.
- Collect reviews for patches that weren't modified. [Ben]
- Abandon the idea of taking invalid roots off the list of roots.
- Add a patch to fix misleading/wrong comments with respect to KVM's
responsibilities in the "fast zap" flow, specifically that all SPTEs
must be dropped before the zap completes.
- Rework yielding in kvm_tdp_mmu_put_root() to keep the root visible
while yielding.
- Add a patch to zap roots in two passes. [Mingwei, David]
- Add a patch to asynchronously zap defunct roots.
- Add the selftest.

v1: https://lore.kernel.org/all/20211120045046.3940942-1-seanjc@xxxxxxxxxx

Sean Christopherson (30):
  KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap
    hook
  KVM: x86/mmu: Move "invalid" check out of kvm_tdp_mmu_get_root()
  KVM: x86/mmu: Zap _all_ roots when unmapping gfn range in TDP MMU
  KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots
  KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP
    MMU
  KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap
  KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush
    logic
  KVM: x86/mmu: Document that zapping invalidated roots doesn't need to
    flush
  KVM: x86/mmu: Drop unused @kvm param from kvm_tdp_mmu_get_root()
  KVM: x86/mmu: Require mmu_lock be held for write in unyielding root
    iter
  KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP
    removal
  KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier
    change_spte
  KVM: x86/mmu: Drop RCU after processing each root in MMU notifier
    hooks
  KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
  KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
  KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw
    vals
  KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
  KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU
  KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
  KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
  KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU
    resched
  KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow
    pages
  KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU
    root
  KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
  KVM: x86/mmu: Zap defunct roots via asynchronous worker
  KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils
  KVM: selftests: Split out helper to allocate guest mem via memfd
  KVM: selftests: Define cpu_relax() helpers for s390 and x86
  KVM: selftests: Add test to populate a VM with the max possible guest
    mem

arch/x86/kvm/mmu/mmu.c | 42 +-
arch/x86/kvm/mmu/mmu_internal.h | 16 +-
arch/x86/kvm/mmu/tdp_iter.c | 6 +-
arch/x86/kvm/mmu/tdp_iter.h | 15 +-
arch/x86/kvm/mmu/tdp_mmu.c | 642 ++++++++++++------
arch/x86/kvm/mmu/tdp_mmu.h | 32 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 6 +
.../selftests/kvm/include/s390x/processor.h | 8 +
.../selftests/kvm/include/x86_64/processor.h | 5 +
tools/testing/selftests/kvm/lib/kvm_util.c | 66 +-
.../selftests/kvm/max_guest_memory_test.c | 292 ++++++++
.../selftests/kvm/set_memory_region_test.c | 35 +-
14 files changed, 870 insertions(+), 299 deletions(-)
create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c

--
2.34.1.448.ga2b2bfdf31-goog