Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

From: Yan Zhao
Date: Thu Aug 17 2023 - 20:41:17 EST


On Thu, Aug 17, 2023 at 09:38:37AM +0200, David Hildenbrand wrote:
> On 17.08.23 07:05, Yan Zhao wrote:
> > On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
> > > On 8/16/23 02:49, David Hildenbrand wrote:
> > > > But do 32bit architectures even care about NUMA hinting? If not, just
> > > > ignore them ...
> > >
> > > Probably not!
> > >
> > > ...
> > > > > So, do you mean that let kernel provide a per-VMA allow/disallow
> > > > > mechanism, and
> > > > > it's up to the user space to choose between per-VMA and complex way or
> > > > > global and simpler way?
> > > >
> > > > QEMU could do either way. The question would be if a per-vma settings
> > > > makes sense for NUMA hinting.
> > >
> > > From our experience with compute on GPUs, a per-mm setting would suffice.
> > > No need to go all the way to VMA granularity.
> > >
> > After an offline internal discussion, we think a per-mm setting is also
> > enough for device passthrough in VMs.
> >
> > BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you
> > think it's of any value to providing a flag like VM_MAYDMA?
> > Auto NUMA balancing or other components can decide how to use it by
> > themselves.
>
> Short-lived DMA is not really the problem. The problem is long-term pinning.
>
> There was a discussion about letting user space similarly hint that
> long-term pinning might/will happen.
>
> Because when long-term pinning a page we have to make sure to migrate it off
> of ZONE_MOVABLE / MIGRATE_CMA.
>
> But the kernel prefers to place pages there.
>
> So with vfio in QEMU, we might preallocate memory for the guest and place it
> on ZONE_MOVABLE/MIGRATE_CMA, just so long-term pinning has to migrate all
> these fresh pages out of these areas again.
>
> So letting the kernel know about that in this context might also help.
>
Thanks! Glad to know it :)
But for the GPU case that John mentioned, the memory is not even pinned,
so maybe a VM_NO_NUMA_BALANCING flag is still needed there?
For VMs, we could hint VM_NO_NUMA_BALANCING for passthrough devices that
support IO page faults (so no pinning is needed), and VM_MAYLONGTERMDMA
to avoid misplacing pages and having to migrate them again later.
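
To make the idea concrete, here is a rough user-space sketch of how I read
the policy above. The flag bit values and the helper guest_mem_hint() are
hypothetical and for illustration only; they are not taken from any
existing kernel header, and the split between the two flags is my
assumption about how they would be used.

```c
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only. */
#define VM_NO_NUMA_BALANCING	(1UL << 0)
#define VM_MAYLONGTERMDMA	(1UL << 1)

/*
 * Hypothetical helper: which hint a VMM such as QEMU might choose for
 * guest memory, depending on the passthrough device it is backing.
 */
static unsigned long guest_mem_hint(bool has_passthrough_dev,
				    bool dev_supports_iopf)
{
	if (!has_passthrough_dev)
		return 0;	/* nothing to hint */

	if (dev_supports_iopf)
		/* Memory is not pinned, so only NUMA balancing is a concern. */
		return VM_NO_NUMA_BALANCING;

	/* Memory will be long-term pinned; tell the kernel up front. */
	return VM_MAYLONGTERMDMA;
}
```

With a single per-mm flag, the helper would collapse to one yes/no
decision for the whole process.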

Does that sound good?
Or do you think a single per-mm flag like MMF_NO_NUMA is good enough for
now?