[RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

From: Gregory Price
Date: Wed Nov 08 2023 - 19:25:36 EST


This patchset implements weighted interleave and adds a new cgroup
sysfs entry: cgroup/memory.interleave_weights (excluded from root).

The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked. By default,
the il_weight of every node is 1, which preserves the existing
round-robin interleave behavior.

Interleave weights denote the number of pages that should be
allocated from the node when interleaving occurs and have a range
of 1-255. The weight of a node can never be 0; the preferred way
to prevent allocation from a node is to remove it from the cpuset
or mempolicy altogether.

For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.
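
As a rough illustration of the allocation order this implies, the
user-space Python sketch below simulates weighted round-robin over a
node set. It is a model of the described behavior, not the kernel
implementation; the function name and the dict-based weight table are
assumptions made for the example.

```python
from itertools import islice


def weighted_interleave(nodes, weights):
    """Yield node ids in weighted round-robin order, one per page.

    Hypothetical user-space model: `weights` maps node id -> il_weight,
    and nodes missing from the map fall back to the default weight of 1.
    """
    if not nodes:
        return
    while True:
        for node in nodes:
            weight = weights.get(node, 1)
            if not 1 <= weight <= 255:
                raise ValueError(f"weight for node {node} out of range 1-255")
            # Allocate `weight` consecutive pages from this node, then move on.
            for _ in range(weight):
                yield node


# First 10 page allocations with node 0 at weight 5 and node 1 at weight 3.
order = list(islice(weighted_interleave([0, 1], {0: 5, 1: 3}), 10))
print(order)
```

With an empty weight table every node falls back to weight 1, so the
same generator degenerates to plain round-robin.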

# Set node weight for node 0 to 5
echo 0:5 > /sys/fs/cgroup/user.slice/memory.interleave_weights

# Set node weight for node 1 to 3
echo 1:3 > /sys/fs/cgroup/user.slice/memory.interleave_weights

# View the currently set weights
cat /sys/fs/cgroup/user.slice/memory.interleave_weights
0:5,1:3

Weights will only be displayed for possible nodes.
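
For tooling that wants to consume the file, here is a minimal Python
sketch of a parser for the `node:weight,node:weight` text shown above.
The function name is hypothetical, and the format is assumed to be
exactly what the `cat` example prints; the 1-255 check mirrors the
range stated earlier.

```python
def parse_interleave_weights(text):
    """Parse 'node:weight,node:weight' text into a {node: weight} dict.

    Assumes the comma-separated format shown in the example output.
    """
    text = text.strip()
    if not text:
        return {}
    weights = {}
    for entry in text.split(","):
        node_str, sep, weight_str = entry.partition(":")
        if not sep:
            raise ValueError(f"malformed entry: {entry!r}")
        node, weight = int(node_str), int(weight_str)
        # Weights are documented as 1-255; 0 is never valid.
        if not 1 <= weight <= 255:
            raise ValueError(f"weight {weight} for node {node} out of range 1-255")
        weights[node] = weight
    return weights


parsed = parse_interleave_weights("0:5,1:3\n")
print(parsed)
```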

With this it becomes possible to set an interleaving strategy
that fits the available bandwidth for the devices available on
the system. An example system:

Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 1 - CXL Memory, 64GB/s BW, on Node 0 root complex

In this setup, the effective weights for a node set of [0,1]
may be [86, 14] (86% of memory on Node 0, 14% on Node 1),
or some smaller fraction thereof to encourage quicker rounds
for better overall distribution.
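
One way to arrive at such numbers is to weight each node by its share
of the total bandwidth. The Python sketch below does that for the
example system and then shows two ways to shrink the round: dividing
by the GCD, or rescaling to a smaller total. The helper names and the
`scale` parameter are assumptions for illustration; the patchset does
not compute weights automatically.

```python
from functools import reduce
from math import gcd


def bandwidth_weights(bandwidths, scale=100):
    """Map {node: bandwidth} to weights proportional to bandwidth share.

    Results are clamped to the 1-255 range allowed for il_weight.
    """
    total = sum(bandwidths.values())
    return {
        node: max(1, min(255, round(scale * bw / total)))
        for node, bw in bandwidths.items()
    }


def reduce_weights(weights):
    """Divide all weights by their GCD, keeping the ratio exact."""
    divisor = reduce(gcd, weights.values())
    return {node: w // divisor for node, w in weights.items()}


# Example system: node 0 at 400GB/s, node 1 at 64GB/s.
full = bandwidth_weights({0: 400, 1: 64})
print(full)                     # percentage-style weights
print(reduce_weights(full))     # same ratio, shorter round
print(bandwidth_weights({0: 400, 1: 64}, scale=7))  # coarser approximation
```

A smaller scale trades accuracy of the ratio for shorter rounds, so
pages from the slower node are spread more evenly through an
allocation.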

This spreads memory across devices with differing latency and
bandwidth attributes in a way that can maximize use of the
available resources.

~Gregory

=============
Version Notes:

= v4 notes

Moved interleave weights to cgroups from nodes.

Omitted them from the root cgroup for initial testing/comment, but
it seems like it may be a reasonable idea to place them there too.

== Weighted interleave

mm/mempolicy: modify interleave mempolicy to use node weights

The mempolicy MPOL_INTERLEAVE utilizes the node weights defined in
the cgroup memory.interleave_weights interface to implement weighted
interleave. Since all nodes default to a weight of 1, the original
interleave behavior is retained by default.

============
RFC History

Node based weights
By: Gregory Price
https://lore.kernel.org/linux-mm/20231031003810.4532-1-gregory.price@xxxxxxxxxxxx/

Memory-tier based weights
By: Ravi Shankar
https://lore.kernel.org/all/20230927095002.10245-1-ravis.opensrc@xxxxxxxxxx/

Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price
https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@xxxxxxxxxxxx/

Hasan Al Maruf: N:M weighting in mempolicy
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@xxxxxxxxxxx/T/

Huang, Ying's presentation in lpc22, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

===================

Gregory Price (3):
mm/memcontrol: implement memcg.interleave_weights
mm/mempolicy: implement weighted interleave
Documentation: sysfs entries for cgroup.memory.interleave_weights

Documentation/admin-guide/cgroup-v2.rst | 45 +++++
.../admin-guide/mm/numa_memory_policy.rst | 11 ++
include/linux/memcontrol.h | 31 ++++
include/linux/mempolicy.h | 3 +
mm/memcontrol.c | 172 ++++++++++++++++++
mm/mempolicy.c | 153 +++++++++++++---
6 files changed, 387 insertions(+), 28 deletions(-)

--
2.39.1