Re: [RFC] mm: add new syscall pidfd_set_mempolicy()

From: Michal Hocko
Date: Wed Oct 12 2022 - 09:07:58 EST


On Wed 12-10-22 07:34:06, Vinicius Petrucci wrote:
> > Well, per address range operation is a completely different beast I
> > would say. External tool would need to a) understand what that range is
> > used for (e.g. stack/heap ranges, mmaped shared files like libraries or
> > private mappings) and b) be in sync with memory layout modifications
> > done by applications (e.g. that an mmap has been issued to back malloc
> > request). Quite a lot of understanding about the specific process. I
> > would say that with that intimate knowledge it is much better to be
> > part of the process and do those changes from within the process
> > itself.
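
To illustrate the point about doing it from within the process itself:
an allocator wrapper along these lines (a sketch only, error handling
mostly elided) has all the context it needs to place each object:

#include <stddef.h>
#include <sys/mman.h>
#include <numaif.h>	/* mbind(), MPOL_*; link with -lnuma */

static void *alloc_on_node(size_t len, int node)
{
	unsigned long nodemask = 1UL << node;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	/* The process knows what this range backs, so binding it to
	 * the desired node is trivial from here. */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
		  MPOL_MF_STRICT | MPOL_MF_MOVE)) {
		munmap(p, len);
		return NULL;
	}
	return p;
}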
>
> Sorry, this may be a digression, but just wanted to mention a
> particular use case from a project I recently collaborated on (to
> appear next month at IISWC 2022:
> http://www.iiswc.org/iiswc2022/index.html).
>
> We carried out a performance analysis of the latest Linux AutoNUMA
> memory tiering on graph processing applications. We noticed that hot
> pages cannot be properly identified by the reactive approach used by
> AutoNUMA due to irregular/random memory access patterns.

Yes, I can see how a reactive approach might not be the best fit.
Automatic NUMA balancing can help quite a lot where memory regions
are accessed consistently. I can imagine situations where the user space
agent can tell much better which node is the best place for the data
when the access pattern is not obvious or is hard to deduce from local
metrics.
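
Note also that for individual pages an external agent can already use
move_pages(2) today (CAP_SYS_NICE is needed to act on another
process). A rough sketch of that, with hot_pages coming from whatever
profiling the agent does:

#include <sys/types.h>
#include <stdio.h>
#include <numaif.h>	/* move_pages(); link with -lnuma */

static int migrate_hot_pages(pid_t pid, void **hot_pages,
			     unsigned long count, int target_node)
{
	int nodes[count];
	int status[count];

	for (unsigned long i = 0; i < count; i++)
		nodes[i] = target_node;	/* e.g. the DRAM node */

	if (move_pages(pid, count, hot_pages, nodes, status,
		       MPOL_MF_MOVE) < 0) {
		perror("move_pages");
		return -1;
	}
	/* status[i] is the node each page ended up on, or a negative
	 * errno for pages that could not be moved. */
	return 0;
}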

My main argument, though, is that such agents are rather specialized
and unlikely to be generic enough to serve many different processes,
so it is much easier to implement the agent as a part of the process
itself. I might be wrong in this of course, and I am also not saying
that pidfd_mbind is a completely unreasonable idea. We just need a
strong usecase before going that way.
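
Purely as a strawman, and nothing like this exists today, the
interface being discussed would presumably look something like:

/* Hypothetical: per-range, cross-process mbind via a pidfd. */
long pidfd_mbind(int pidfd, void *addr, unsigned long len,
		 int mode, const unsigned long *nodemask,
		 unsigned long maxnode, unsigned int flags);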

> Thus, as a
> POC, we implemented and evaluated a simple idea of having an external
> user-level process/agent that, based on prior profiling results of
> memory regions, could more effectively make memory chunk/object-based
> mappings (instead of page-level allocation/migration) in advance on
> either DRAM or CXL/PMEM (via mbind calls). This kind of tiering
> solution could deliver up to 2x more performance for graph analytics
> workloads. We plan to evaluate other workloads as well.
>
> Having a feature like "pidfd/process_mbind" would really simplify our
> user-level agent implementation moving forward, as right now we are
> adding an LD_PRELOAD wrapper (with a signal handler) to listen for and
> execute "mbind" requests from another process. If there's any other
> alternative solution to this already (via ptrace?), please let me
> know.
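
Just to make sure I understand the current workaround, it is roughly
something like this (hypothetical names, shared memory setup and
synchronization elided)?

#include <signal.h>
#include <numaif.h>	/* mbind(); link with -lnuma */

struct mbind_req {		/* lives in shared memory */
	void *addr;
	unsigned long len;
	int mode;
	unsigned long nodemask;
};
static struct mbind_req *req;	/* mapped in the constructor (elided) */

static void handle_mbind_req(int sig)
{
	(void)sig;
	/* mbind is a plain syscall, so calling it here is
	 * async-signal-safe; the agent filled *req and signalled us. */
	mbind(req->addr, req->len, req->mode, &req->nodemask,
	      sizeof(req->nodemask) * 8, MPOL_MF_MOVE);
}

__attribute__((constructor))
static void install_handler(void)
{
	/* map the shared request slot here (elided)... */
	signal(SIGUSR1, handle_mbind_req);
}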

userfaultfd sounds like the closest match if #PF handling under control
of an external agent is viable.
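
For reference, the fault-serving side of such an agent is reasonably
small. A rough sketch, assuming the target process has already done
the UFFDIO_API handshake, registered the region with
UFFDIO_REGISTER_MODE_MISSING and handed the fd to the agent (e.g. over
a unix socket with SCM_RIGHTS):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

static void serve_faults(int uffd, long page_size)
{
	/* Staging buffer for the contents of each resolved page. */
	void *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	for (;;) {
		struct uffd_msg msg;
		struct uffdio_copy cp;

		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* The agent decides here how to resolve the fault. */
		memset(page, 0, page_size);
		cp.dst = msg.arg.pagefault.address &
			 ~(unsigned long)(page_size - 1);
		cp.src = (unsigned long)page;
		cp.len = page_size;
		cp.mode = 0;
		ioctl(uffd, UFFDIO_COPY, &cp);
	}
}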
--
Michal Hocko
SUSE Labs