Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

From: David Hildenbrand
Date: Fri Aug 11 2023 - 13:26:29 EST


On 10.08.23 11:50, Yan Zhao wrote:
On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
event is sent for NUMA migration purpose in specific.

Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
MMU to avoid NUMA protection introduced page faults and restoration of old
huge PMDs/PTEs in primary MMU.

Patch 3 introduces a new mmu notifier callback .numa_protect(), which
will be called in patch 4 when a page is ensured to be PROT_NONE protected.

Then in patch 5, KVM can recognize a .invalidate_range_start() notification
is for NUMA balancing specific and do not do the page unmap in secondary
MMU until .numa_protect() comes.


Why do we need all that, when we should simply not be applying PROT_NONE to
pinned pages?

In change_pte_range() we already have:

if (is_cow_mapping(vma->vm_flags) &&
page_count(page) != 1)

Which includes both, shared and pinned pages.
Ah, right, currently in my side, I don't see any pinned pages are
outside of this condition.
But I have a question regarding to is_cow_mapping(vma->vm_flags), do we
need to allow pinned pages in !is_cow_mapping(vma->vm_flags)?

One issue is that folio_maybe_pinned...() ... is unreliable as soon as your page is mapped more than 1024 times.

One might argue that we also want to exclude pages that are mapped that often. That might possibly work.


Staring at page #2, are we still missing something similar for THPs?
Yes.

Why is that MMU notifier thingy and touching KVM code required?
Because NUMA balancing code will firstly send .invalidate_range_start() with
event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
unconditionally, before it goes down into change_pte_range() and
change_huge_pmd() to check each page count and apply PROT_NONE.

Ah, okay I see, thanks. That's indeed unfortunate.


Then current KVM will unmap all notified pages from secondary MMU
in .invalidate_range_start(), which could include pages that finally not
set to PROT_NONE in primary MMU.

For VMs with pass-through devices, though all guest pages are pinned,
KVM still periodically unmap pages in response to the
.invalidate_range_start() notification from auto NUMA balancing, which
is a waste.

Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it.

--
Cheers,

David / dhildenb