Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

From: Huang, Ying
Date: Fri Nov 10 2023 - 15:31:57 EST


Gregory Price <gourry.memverge@xxxxxxxxx> writes:

> This patchset implements weighted interleave and adds a new cgroup
> sysfs entry: cgroup/memory.interleave_weights (excluded from root).
>
> The il_weight of a node is used by mempolicy to implement weighted
> interleave when `numactl --interleave=...` is invoked. By default
> il_weight for a node is always 1, which preserves the default round
> robin interleave behavior.

IIUC, this makes it almost impossible to set the default weight of a
node based on the node's memory bandwidth information. That will make
users' lives a little harder.

If so, how about using a new memory policy mode instead, for example
MPOL_WEIGHTED_INTERLEAVE?
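
For example, a rough userspace sketch of what requesting such a mode
could look like (MPOL_WEIGHTED_INTERLEAVE and its numeric value are
hypothetical here; set_mempolicy() and the node mask handling are the
existing interface):

#include <numaif.h>		/* set_mempolicy(), link with -lnuma */
#include <stdio.h>

/* Hypothetical mode value; not part of the current UAPI. */
#define MPOL_WEIGHTED_INTERLEAVE 6

int main(void)
{
	/* Interleave across nodes 0 and 1 (one bit per node). */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	/*
	 * With a dedicated mode, per-node weights (e.g. derived from
	 * bandwidth) would only apply when this mode is requested,
	 * leaving plain MPOL_INTERLEAVE behavior untouched.
	 */
	if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
			  sizeof(nodemask) * 8) != 0) {
		perror("set_mempolicy");
		return 1;
	}
	return 0;
}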

> Interleave weights denote the number of pages that should be
> allocated from the node when interleaving occurs and have a range
> of 1-255. The weight of a node can never be 0, and instead the
> preferred way to prevent allocation is to remove the node from the
> cpuset or mempolicy altogether.
>
> For example, if a node's interleave weight is set to 5, 5 pages
> will be allocated from that node before the next node is scheduled
> for allocations.
>
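
To make the mechanics concrete, a minimal sketch of the page-count
round robin described above (the names and tables are illustrative
stand-ins, not the patchset's actual mempolicy code):

/* Give each node il_weight consecutive pages before advancing. */
static unsigned char il_weight[2] = { 5, 3 };	/* node 0 -> 5, node 1 -> 3 */
static int cur_node;				/* node currently being filled */
static unsigned int cur_count;			/* pages already taken from cur_node */

static int next_interleave_node(int nr_nodes)
{
	if (cur_count >= il_weight[cur_node]) {
		cur_node = (cur_node + 1) % nr_nodes;	/* advance the round robin */
		cur_count = 0;
	}
	cur_count++;
	return cur_node;
}

With every weight set to 1, this degenerates to the plain round robin
mentioned above.
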
> # Set node weight for node 0 to 5
> echo 0:5 > /sys/fs/cgroup/user.slice/memory.interleave_weights
>
> # Set node weight for node 1 to 3
> echo 1:3 > /sys/fs/cgroup/user.slice/memory.interleave_weights
>
> # View the currently set weights
> cat /sys/fs/cgroup/user.slice/memory.interleave_weights
> 0:5,1:3
>
> Weights will only be displayed for possible nodes.
>
> With this it becomes possible to set an interleaving strategy
> that fits the available bandwidth for the devices available on
> the system. An example system:
>
> Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> Node 1 - CXL Memory, 64GB/s BW, on Node 0 root complex
>
> In this setup, the effective weights for a node set of [0,1] may be
> [86, 14] (86% of memory on Node 0, 14% on Node 1), or some smaller
> fraction thereof to encourage quicker rounds for better overall
> distribution.
>
> This spreads memory out across devices that have different latency
> and bandwidth attributes, in a way that can maximize use of the
> available resources.
>
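
As a sanity check on those numbers: the split simply follows the
bandwidth ratio, 400 / (400 + 64) ~= 86% and 64 / 464 ~= 14%. A small
sketch of deriving reduced weights directly from bandwidth (a plain GCD
reduction for illustration, not something the patchset itself does):

#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	/* Bandwidths from the example above, in GB/s. */
	unsigned int bw[2] = { 400, 64 };
	unsigned int g = gcd(bw[0], bw[1]);	/* 16 */

	/* 400:64 reduces to 25:4 -- about the same 86%/14% split,
	 * but with much shorter interleave rounds. */
	printf("node0 weight %u, node1 weight %u\n", bw[0] / g, bw[1] / g);
	return 0;
}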

--
Best Regards,
Huang, Ying