Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration

From: Sean Christopherson
Date: Fri Aug 11 2023 - 13:14:58 EST


On Fri, Aug 11, 2023, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> >
> > On 2023/8/11 11:45, Yan Zhao wrote:
> > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > >>> +					   struct mm_struct *mm,
> > >>> +					   unsigned long start,
> > >>> +					   unsigned long end)
> > >>> +{
> > >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > >>> +
> > >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > >>> +		return;
> > >>> +
> > >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > >>> +}
> > >> NUMA balancing will scan a wide memory range, and there will be a single
> > > Though the scanned memory range is wide, .invalidate_range_start() is sent
> > > for each 2M range.
> > yes, the range is huge page sized when changing NUMA protection during NUMA scanning.
> >
> > >
> > >> IPI notification with kvm_flush_remote_tlbs. With page-level notification,
> > >> it may bring lots of remote TLB flush IPI notifications.
> > >
> > > Hmm, for VMs with assigned devices, apparently the remote TLB flush IPIs
> > > will be reduced to 0 with this series.
> > >
> > > For VMs without assigned devices or mdev devices, I was previously also
> > > worried that there might be more IPIs.
> > > But with the current test data, there are no more remote TLB IPIs on average.
> > >
> > > The reason is below:
> > >
> > > Before this series, kvm_unmap_gfn_range() is called only once for a 2M
> > > range.

No, it's potentially called once per 1GiB range. change_pmd_range() invokes the
mmu_notifier with addr+end, where "end" is the end of the range covered by the
PUD, not the end of the current PMD. So the worst case scenario would be a
256k increase. Of course, if you have to migrate an entire 1GiB chunk of memory
then you likely have bigger problems, but still.
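
For reference, the relevant change_pmd_range() logic today looks roughly like
this (paraphrased; it's the code the patch below removes), where "end" comes
straight from change_pud_range():

	/* Paraphrased sketch of the current change_pmd_range(), not verbatim. */
	if (!range.start) {
		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
					vma->vm_mm, addr, end);
		mmu_notifier_invalidate_range_start(&range);
	}

i.e. the invalidation is keyed off the first populated PMD, but the range runs
to the end of the PUD-sized region, not just the PMD being modified.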

> > > After this series, kvm_unmap_gfn_range() is called once if the 2M range is
> > > mapped to a huge page in the primary MMU, and called at most 512 times
> > > if mapped to 4K pages in the primary MMU.
> > >
> > >
> > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > as the range is blockable, when there's contention, remote TLB IPIs
> > > can be sent page by page at 4K granularity (in tdp_mmu_iter_cond_resched())
> > I do not know much about x86; does this always happen, or only when a reschedule is needed
> Ah, sorry, I missed platforms other than x86.
> Maybe there will be a big difference on other platforms.
>
> > judging from the code? So that there will be many TLB IPIs within a single function
> Only when the MMU lock is contended. But that's not rare, because of contention
> with TDP page faults.

No? I don't see how mmu_lock contention would affect the number of calls to
kvm_flush_remote_tlbs(). If vCPUs are running and not generating faults, i.e.
not trying to access the range in question, then every zap will generate a remote
TLB flush and thus send IPIs to all running vCPUs.
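
To be explicit about the mechanics (paraphrasing virt/kvm/kvm_main.c from
memory, so double-check the details): the hva handlers return whether anything
was zapped, and the common code does one remote flush per notification when
the handler requests it, roughly:

	/* Tail of __kvm_handle_hva_range(), paraphrased. */
	if (range->flush_on_ret && ret)
		kvm_flush_remote_tlbs(kvm);

So at a minimum, every notification that zaps something ends with a remote TLB
flush, and thus IPIs, regardless of mmu_lock contention.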

> > call to kvm_unmap_gfn_range.
> >
> > > if the pages are mapped as 4K pages in the secondary MMU.
> > >
> > > With this series, on the other hand, .numa_protect() sets the range to be
> > > unblockable. So there could be fewer remote TLB IPIs when a 2M range is
> > > mapped into small PTEs in the secondary MMU.
> > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > whether the page is accessed by CPU threads across nodes.
> The .numa_protect() is called in patch 4 only when PROT_NONE is set on
> the page.

I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA. It's too much of a one-off,
and losing the batching of invalidations makes me nervous. As Bibo points out,
the behavior will vary based on the workload, VM configuration, etc.

There's also a *very* subtle change, in that the notification will be sent while
holding the PMD/PTE lock. Taking KVM's mmu_lock under that is *probably* ok, but
I'm not exactly 100% confident on that. And the only reason there isn't a more
obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
won't yield if there's mmu_lock contention.
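
For context on the may_block part (again paraphrased from memory, worth
double-checking): kvm_handle_hva_range(), which the proposed .numa_protect()
hook uses, hardcodes the non-blocking behavior, whereas the
.invalidate_range_start() path derives it from the notifier range:

	/* kvm_handle_hva_range(), paraphrased: */
	.flush_on_ret	= true,
	.may_block	= false,

	/* kvm_mmu_notifier_invalidate_range_start(), by contrast: */
	.may_block	= mmu_notifier_range_blockable(range),

I.e. the non-blocking behavior comes from KVM's side, not from the range that
the primary MMU passes in.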

However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
I think we can achieve what you want without losing batching, and without changing
secondary MMUs.

Rather than muck with the notification types and add a one-off for NUMA, just
defer the notification until a present PMD/PTE is actually going to be modified.
It's not the prettiest, but other than the locking, I don't think it has any chance
of regressing other workloads/configurations.
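
In condensed form, the idea is (the mmu_notifier_invalidate_range_end() call
at the tail of change_pmd_range() is unchanged and still keys off range.start):

	/* change_pmd_range(): range.start == 0 means "not notified yet". */
	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
				vma->vm_mm, 0, end);
	...
	/* Deferred until a present PMD/PTE is actually about to change: */
	if (!range.start) {
		range.start = addr;
		mmu_notifier_invalidate_range_start_nonblock(&range);
	}
	...
	/* Unchanged tail of change_pmd_range(): */
	if (range.start)
		mmu_notifier_invalidate_range_end(&range);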

Note, I'm assuming secondary MMUs aren't allowed to map swap entries...

Compile tested only.

From: Sean Christopherson <seanjc@xxxxxxxxxx>
Date: Fri, 11 Aug 2023 10:03:36 -0700
Subject: [PATCH] tmp

Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
---
 include/linux/huge_mm.h |  4 +++-
 include/linux/mm.h      |  3 +++
 mm/huge_memory.c        |  5 ++++-
 mm/mprotect.c           | 47 +++++++++++++++++++++++++++++------------
 4 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 20284387b841..b88316adaaad 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -7,6 +7,8 @@
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
+struct mmu_notifier_range;
+
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		   unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
-		    unsigned long cp_flags);
+		    unsigned long cp_flags, struct mmu_notifier_range *range);
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..284f61cf9c37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
 	return !!(vma->vm_flags & VM_WRITE);
 
 }
+
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+					    struct mmu_notifier_range *range);
 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 extern long change_protection(struct mmu_gather *tlb,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a71cf686e3b2..47c7712b163e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  */
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
-		    unsigned long cp_flags)
+		    unsigned long cp_flags, struct mmu_notifier_range *range)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
@@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    !toptier)
 			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
 	}
+
+	change_pmd_range_notify_secondary_mmus(addr, range);
+
 	/*
 	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
 	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d1a809167f49..f5844adbe0cb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
+		struct mmu_notifier_range *range)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb,
 				    !toptier)
 					xchg_page_access_time(page,
 						jiffies_to_msecs(jiffies));
+
+
 			}
 
+			change_pmd_range_notify_secondary_mmus(addr, range);
+
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
 		err;							\
 	})
 
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+					    struct mmu_notifier_range *range)
+{
+	if (range->start)
+		return;
+
+	VM_WARN_ON(addr >= range->end);
+	range->start = addr;
+	mmu_notifier_invalidate_range_start_nonblock(range);
+}
+
 static inline long change_pmd_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
 
-	range.start = 0;
+	/*
+	 * Defer invalidation of secondary MMUs until a PMD/PTE change is
+	 * imminent, e.g. NUMA balancing in particular can "fail" for certain
+	 * types of mappings. Initialize range.start to '0' and use it to
+	 * track whether or not the invalidation notification has been set.
+	 */
+	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+				vma->vm_mm, 0, end);
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-				MMU_NOTIFY_PROTECTION_VMA, 0,
-				vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    pgtable_split_needed(vma, cp_flags)) {
+				/*
+				 * FIXME: __split_huge_pmd() performs its own
+				 * mmu_notifier invalidation prior to clearing
+				 * the PMD, ideally all invalidations for the
+				 * range would be batched.
+				 */
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 				/*
 				 * For file-backed, the pmd could have been
@@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 					break;
 				}
 			} else {
-				ret = change_huge_pmd(tlb, vma, pmd,
-						addr, newprot, cp_flags);
+				ret = change_huge_pmd(tlb, vma, pmd, addr,
+						      newprot, cp_flags, &range);
 				if (ret) {
 					if (ret == HPAGE_PMD_NR) {
 						pages += HPAGE_PMD_NR;
@@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		}
 
 		ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
-				       cp_flags);
+				       cp_flags, &range);
 		if (ret < 0)
 			goto again;
 		pages += ret;

base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad
--