Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

From: Yan Zhao
Date: Thu Aug 10 2023 - 06:17:31 EST


On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
> > This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
> > to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
> > the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
> > event is sent for NUMA migration purpose in specific.
> >
> > Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
> > MMU to avoid NUMA protection introduced page faults and restoration of old
> > huge PMDs/PTEs in primary MMU.
> >
> > Patch 3 introduces a new mmu notifier callback .numa_protect(), which
> > will be called in patch 4 when a page is ensured to be PROT_NONE protected.
> >
> > Then in patch 5, KVM can recognize a .invalidate_range_start() notification
> > is for NUMA balancing specific and do not do the page unmap in secondary
> > MMU until .numa_protect() comes.
> >
>
> Why do we need all that, when we should simply not be applying PROT_NONE to
> pinned pages?
>
> In change_pte_range() we already have:
>
> if (is_cow_mapping(vma->vm_flags) &&
> page_count(page) != 1)
>
> Which includes both, shared and pinned pages.
Ah, right, currently in my side, I don't see any pinned pages are
outside of this condition.
But I have a question regarding to is_cow_mapping(vma->vm_flags), do we
need to allow pinned pages in !is_cow_mapping(vma->vm_flags)?

> Staring at page #2, are we still missing something similar for THPs?
Yes.

> Why is that MMU notifier thingy and touching KVM code required?
Because NUMA balancing code will firstly send .invalidate_range_start() with
event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
unconditionally, before it goes down into change_pte_range() and
change_huge_pmd() to check each page count and apply PROT_NONE.

Then current KVM will unmap all notified pages from secondary MMU
in .invalidate_range_start(), which could include pages that finally not
set to PROT_NONE in primary MMU.

For VMs with pass-through devices, though all guest pages are pinned,
KVM still periodically unmap pages in response to the
.invalidate_range_start() notification from auto NUMA balancing, which
is a waste.

So, if there's a new callback sent when pages is set to PROT_NONE for NUMA
migrate only, KVM can unmap only those pages.
As KVM still needs to unmap pages for other type of event in its handler of
.invalidate_range_start() (.i.e. kvm_mmu_notifier_invalidate_range_start()),
and MMU_NOTIFY_PROTECTION_VMA also include other reasons, so patch 1
added a range flag to help KVM not to do a blind unmap in
.invalidate_range_start(), but do it in the new .numa_protect() handler.

>
> --
> Cheers,
>
> David / dhildenb
>
>