Re: [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node

From: Ben Gardon
Date: Mon Dec 05 2022 - 13:51:40 EST


On Mon, Dec 5, 2022 at 10:47 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> Side topic, the shortlog is way, way too long. The purpose of the shortlog is
> to provide a synopsis of the change, not to describe the change in detail.
>
> I also think this patch should be 2/2, with the more generic support added along
> with the module param (or capability) in 1/2. E.g. to yield something like
>
> KVM: x86/mmu: Add a module param to make per-vCPU SPTs NUMA aware
> KVM: x86/mmu: Honor NUMA awareness for per-VM page table allocations
>
> On Mon, Dec 05, 2022, Ben Gardon wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 4736d7849c60..0554dfc55553 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
> > > static bool __read_mostly force_flush_and_sync_on_reuse;
> > > module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
> > >
> > > +static bool __read_mostly numa_aware_pagetable = true;
> > > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644);
> > > +
> >
> > I'm usually all for having module params to control things, but in
> > this case I don't think it provides much value because whether this
> > NUMA optimization is useful or not is going to depend more on VM size
> > and workload than anything else. If we wanted to make this
> > configurable, a VM capability would probably be a better mechanism so
> > that userspace could leave it off when running small,
> > non-performance-sensitive VMs
>
> Would we actually want to turn it off in this case? IIUC, @nid is just the
> preferred node, i.e. failure to allocate for the preferred @nid will result in
> falling back to other nodes, not outright failure. So the pathological worst
> case scenario would be that for a system with VMs that don't care about performance,
> all of a node's memory is allocated due to all VMs starting on that node.
>
> On the flip side, if a system had a mix of VM shapes, I think we'd want even the
> performance-insensitive VMs to be NUMA aware so that they can be sequestered on
> their own node(s), i.e. don't "steal" memory from the VMs that are performance
> sensitive and have been affined to a single node.

Yeah, the only reason to turn it off would be to save memory. As a
strawman, if you had 100 1-vCPU VMs on a 2-node system, you'd have
4000 pages allocated in caches, doing nothing.
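To spell that out (back-of-the-envelope, and the per-cache number is a
guess on my part, not something taken from the patch): if each vCPU tops
up a shadow-page cache of ~20 pages per node, that's 100 vCPUs * 2 nodes
* 20 pages = 4000 pages, i.e. roughly 16 MB of 4 KiB pages sitting idle.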

>
> > and turn it on when running large, multi-node VMs. A whole-host module
> > parameter seems overly restrictive.
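
If we did go the capability route, I'm picturing something roughly like
the below (completely untested sketch; the capability name and the
kvm->arch field are made up for illustration, not proposed here):

	/* Hypothetical case in kvm_vm_ioctl_enable_cap(): */
	case KVM_CAP_NUMA_AWARE_PAGE_TABLES:
		r = -EINVAL;
		if (cap->args[0] & ~1ULL)
			break;
		mutex_lock(&kvm->lock);
		/* Only allow toggling before any vCPUs have been created. */
		if (!kvm->created_vcpus) {
			/*
			 * Hypothetical per-VM flag that the allocation sites
			 * would check instead of the numa_aware_pagetable
			 * module param.
			 */
			kvm->arch.numa_aware_page_tables = cap->args[0];
			r = 0;
		}
		mutex_unlock(&kvm->lock);
		break;

Userspace would then opt in per VM via KVM_ENABLE_CAP rather than
flipping a host-wide knob.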