Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting NUMA aware

From: Sean Christopherson
Date: Mon Aug 01 2022 - 19:56:30 EST


On Mon, Aug 01, 2022, David Matlack wrote:
> On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
> That being said, KVM currently has a gap where a guest doing a lot of
> remote memory accesses when touching memory for the first time will
> cause KVM to allocate the TDP page tables on the arguably wrong node.

Userspace can solve this by setting the NUMA policy on a VMA or shared-object
basis. E.g. create dedicated memslots for each NUMA node, then bind each of the
backing stores to the appropriate host node.
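
Completely untested sketch of what that could look like on the userspace side,
glossing over error handling and assuming anonymous memory for the backing store
(hugetlbfs or memfd would look similar), one memslot per host node:

  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <numaif.h>
  #include <linux/kvm.h>

  /* Create a memslot whose backing store is bound to a single host node. */
  static void *create_node_bound_memslot(int vm_fd, __u32 slot, int node,
					 __u64 gpa, __u64 size)
  {
	unsigned long nodemask = 1UL << node;
	struct kvm_userspace_memory_region region;
	void *mem;

	mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Bind the backing store to the desired host node. */
	mbind(mem, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);

	region = (struct kvm_userspace_memory_region) {
		.slot            = slot,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (__u64)mem,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

	return mem;
  }

Repeat for each host node, carving up the guest physical address space into
per-node chunks that line up with the guest's vNUMA topology.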

If there is a gap, e.g. a backing store we want to use doesn't properly support
mempolicy for shared mappings, then we should enhance the backing store.

> > We can improve TDP MMU eager page splitting by making
> > tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
> > huge page, allocate the new lower level page tables on the same node as the
> > huge page.
> >
> > __get_free_page() is replaced by alloc_pages_node(). This introduces two
> > functional changes.
> >
> > 1. __get_free_page() removes the __GFP_HIGHMEM gfp flag via its call to
> > __get_free_pages(). This should not be an issue, as the __GFP_HIGHMEM flag
> > is not passed in tdp_mmu_alloc_sp_for_split() anyway.
> >
> > 2. __get_free_page() calls alloc_pages(), which uses the thread's mempolicy
> > to pick the NUMA node for the allocation. After this commit, the thread's
> > mempolicy will no longer be used; the first preference will be to allocate
> > on the node where the huge page is present.
>
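
(For context, IIUC the allocation side of the proposal boils down to something
like the below; completely untested sketch, and the huge_spte plumbing is
hand-waved.)

  static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, u64 huge_spte)
  {
	/* Allocate the lower level page tables on the huge page's node. */
	int nid = page_to_nid(pfn_to_page(spte_to_pfn(huge_spte)));
	struct kvm_mmu_page *sp;
	struct page *spt_page;

	gfp |= __GFP_ZERO;

	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
	if (!sp)
		return NULL;

	spt_page = alloc_pages_node(nid, gfp, 0);
	if (!spt_page) {
		kmem_cache_free(mmu_page_header_cache, sp);
		return NULL;
	}
	sp->spt = page_address(spt_page);

	return sp;
  }
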
> It would be worth noting that userspace could change the mempolicy of
> the thread doing eager splitting to prefer allocating from the target
> NUMA node, as an alternative approach.
>
> I don't prefer the alternative though since it bleeds details from KVM
> into userspace, such as the fact that enabling dirty logging does eager
> page splitting, which allocates page tables.

As above, if userspace cares about vNUMA, then it already needs to be aware of
some KVM/kernel details. Separate memslots aren't strictly necessary, e.g.
userspace could stitch together contiguous VMAs to create a single mega-memslot,
but that seems like it'd be more work than just creating separate memslots.

And because eager page splitting for dirty logging runs with mmu_lock held for read,
userspace might also benefit from per-node memslots as it can do the splitting on
multiple tasks/CPUs.

Regardless of what we do, the behavior needs to be documented, i.e. KVM details will
bleed into userspace. E.g. if KVM is overriding the per-task NUMA policy, then
that should be documented.

> It's also unnecessary since KVM can infer an appropriate NUMA placement
> without the help of userspace, and I can't think of a reason for userspace to
> prefer a different policy.

I can't think of a reason why userspace would want to have a different policy for
the task that's enabling dirty logging, but I also can't think of a reason why
KVM should go out of its way to ignore that policy.
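
E.g. if userspace wants the split page tables to come from a specific node, the
task that enables dirty logging can already express that via its own mempolicy.
Completely untested snippet, and the enable_dirty_logging() wrapper is purely
hypothetical:

  #include <numaif.h>

  static void enable_dirty_logging_on_node(int vm_fd, int slot, int node)
  {
	unsigned long nodemask = 1UL << node;

	/*
	 * Prefer allocations from the target node; the page tables KVM
	 * allocates while eagerly splitting in this ioctl context should
	 * follow the task's policy.
	 */
	set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8);

	enable_dirty_logging(vm_fd, slot);	/* hypothetical wrapper */

	/* Restore the task's default policy. */
	set_mempolicy(MPOL_DEFAULT, NULL, 0);
  }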

IMO this is a "bug" in dirty_log_perf_test, though it's probably a good idea to
document how to effectively configure vNUMA-aware memslots.