[RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

From: Yan Zhao
Date: Thu Aug 10 2023 - 05:23:37 EST


This RFC series tries to eliminate the unnecessary NUMA protection and
TLB shootdowns that automatic NUMA balancing causes in VMs with assigned
devices or VFIO mediated devices.

For VMs with assigned devices or VFIO mediated devices, all or part of
guest memory is long-term pinned.

Auto NUMA balancing periodically selects VMAs of a process and changes
their protections to PROT_NONE, even though some or all pages in the
selected ranges are long-term pinned for DMA, as is the case for VMs with
assigned devices or VFIO mediated devices.

This causes no functional problem, because NUMA migration ultimately
refuses to migrate such pages and restores the PROT_NONE PTEs. However,
it causes KVM's secondary MMU to be zapped periodically, only for
identical SPTEs to be faulted back in later, which wastes CPU cycles and
generates unnecessary TLB shootdowns.

This series first introduces a new flag, MMU_NOTIFIER_RANGE_NUMA, in
patch 1. It works together with the mmu notifier event type
MMU_NOTIFY_PROTECTION_VMA, so that a subscriber of the mmu notifier
(e.g. KVM) can tell that an invalidation event is sent specifically for
NUMA migration.
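
For illustration only, here is a minimal userspace sketch of how a
subscriber might test for such an event. It is not the code in patch 1:
the enum is a reduced model of the kernel's event types, the bit position
chosen for MMU_NOTIFIER_RANGE_NUMA is an assumption, and
range_is_numa_protection() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced userspace model of the notifier event types (not the kernel enum). */
enum mmu_notifier_event {
	MMU_NOTIFY_UNMAP = 0,
	MMU_NOTIFY_CLEAR,
	MMU_NOTIFY_PROTECTION_VMA,
	MMU_NOTIFY_PROTECTION_PAGE,
};

#define MMU_NOTIFIER_RANGE_BLOCKABLE	(1u << 0)
/* Assumption: the bit actually used by the series may differ. */
#define MMU_NOTIFIER_RANGE_NUMA		(1u << 1)

/* Model of the range descriptor handed to notifier subscribers. */
struct mmu_notifier_range {
	unsigned long start;
	unsigned long end;
	unsigned int flags;
	enum mmu_notifier_event event;
};

/*
 * Hypothetical helper: true only when the invalidation is a VMA
 * protection change that was flagged as coming from NUMA balancing.
 */
static bool range_is_numa_protection(const struct mmu_notifier_range *range)
{
	assert(range->start <= range->end);
	return range->event == MMU_NOTIFY_PROTECTION_VMA &&
	       (range->flags & MMU_NOTIFIER_RANGE_NUMA) != 0;
}
```

Checking both the event type and the flag keeps ordinary mprotect()-style
protection changes on the existing path.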

Patch 2 skips setting PROT_NONE on long-term pinned pages in the primary
MMU. This avoids the page faults introduced by NUMA protection and the
subsequent restoration of the old huge PMDs/PTEs in the primary MMU.
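
The idea behind patch 2 can be sketched in userspace as follows. This is
a simplified model, not the mm/mprotect.c change itself: struct
model_page, its maybe_dma_pinned field (a stand-in for the kernel's
maybe-DMA-pinned check) and model_numa_protect_range() are all
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state used only by this model. */
struct model_page {
	bool maybe_dma_pinned;	/* stand-in for the kernel's pin check */
	bool prot_none;		/* true once NUMA protection was applied */
};

/*
 * Apply NUMA protection to a range of pages, skipping any page that may
 * be pinned for DMA. Returns the number of pages actually protected.
 */
static size_t model_numa_protect_range(struct model_page *pages, size_t n)
{
	size_t changed = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (pages[i].maybe_dma_pinned)
			continue;	/* migration would be refused anyway */
		pages[i].prot_none = true;
		changed++;
	}
	return changed;
}
```

In this model a fully pinned range yields zero protected pages, so no
hinting faults and no PTE restoration follow from it.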

Patch 3 introduces a new mmu notifier callback, .numa_protect(), which
patch 4 calls once a page is certain to be PROT_NONE protected.

Then, in patch 5, KVM recognizes that an .invalidate_range_start()
notification is specific to NUMA balancing and does not unmap pages in
the secondary MMU until .numa_protect() arrives.
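
The interaction between patches 3-5 can be modelled roughly as below.
This is a userspace sketch under stated assumptions, not the series'
implementation: every model_* and kvm_model_* name is hypothetical, the
callback signatures are simplified, and MODEL_RANGE_NUMA merely stands in
for MMU_NOTIFIER_RANGE_NUMA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE		4096UL
#define MODEL_RANGE_NUMA	(1u << 1)  /* stand-in for MMU_NOTIFIER_RANGE_NUMA */

struct model_range {
	unsigned long start;
	unsigned long end;
	unsigned int flags;
};

struct model_notifier;

/* Simplified notifier ops: only the two callbacks discussed here. */
struct model_notifier_ops {
	void (*invalidate_range_start)(struct model_notifier *mn,
				       const struct model_range *range);
	void (*numa_protect)(struct model_notifier *mn,
			     unsigned long start, unsigned long end);
};

struct model_notifier {
	const struct model_notifier_ops *ops;
	unsigned long pages_zapped;	/* secondary-MMU pages unmapped */
};

static void kvm_model_zap(struct model_notifier *mn,
			  unsigned long start, unsigned long end)
{
	assert(start <= end);
	mn->pages_zapped += (end - start) / MODEL_PAGE_SIZE;
}

/* Subscriber side: defer the unmap when the range is flagged as NUMA. */
static void kvm_model_invalidate_range_start(struct model_notifier *mn,
					     const struct model_range *range)
{
	if (range->flags & MODEL_RANGE_NUMA)
		return;		/* wait for numa_protect() */
	kvm_model_zap(mn, range->start, range->end);
}

/* Subscriber side: unmap only what the primary MMU really protected. */
static void kvm_model_numa_protect(struct model_notifier *mn,
				   unsigned long start, unsigned long end)
{
	kvm_model_zap(mn, start, end);
}

static const struct model_notifier_ops kvm_model_ops = {
	.invalidate_range_start	= kvm_model_invalidate_range_start,
	.numa_protect		= kvm_model_numa_protect,
};

/* Primary-MMU side: notify, then report each page that was protected. */
static void model_change_prot_numa(struct model_notifier *mn,
				   const struct model_range *range,
				   const bool *pinned)
{
	size_t idx = 0;

	mn->ops->invalidate_range_start(mn, range);
	for (unsigned long addr = range->start; addr < range->end;
	     addr += MODEL_PAGE_SIZE, idx++) {
		if (pinned[idx])
			continue;	/* left untouched, nothing to report */
		mn->ops->numa_protect(mn, addr, addr + MODEL_PAGE_SIZE);
	}
}
```

With a four-page range in which three pages are pinned, this model unmaps
one page in the secondary MMU instead of four.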


Changelog:
RFC v1 --> v2:
1. Added patches 3-4 to introduce a new callback, .numa_protect().
2. Rather than having KVM duplicate the logic that checks whether a page
   is long-term pinned, let KVM rely on the new callback .numa_protect()
   to unmap pages in the secondary MMU for NUMA migration.

RFC v1:
https://lore.kernel.org/all/20230808071329.19995-1-yan.y.zhao@xxxxxxxxx/

Yan Zhao (5):
  mm/mmu_notifier: introduce a new mmu notifier flag
    MMU_NOTIFIER_RANGE_NUMA
  mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
    purpose
  mm/mmu_notifier: introduce a new callback .numa_protect
  mm/autonuma: call .numa_protect() when page is protected for NUMA
    migrate
  KVM: Unmap pages only when it's indeed protected for NUMA migration

 include/linux/mmu_notifier.h | 16 ++++++++++++++++
 mm/huge_memory.c             |  6 ++++++
 mm/mmu_notifier.c            | 18 ++++++++++++++++++
 mm/mprotect.c                | 10 +++++++++-
 virt/kvm/kvm_main.c          | 25 ++++++++++++++++++++++---
 5 files changed, 71 insertions(+), 4 deletions(-)

--
2.17.1