Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave

From: Gregory Price
Date: Wed Nov 01 2023 - 12:59:15 EST


On Wed, Nov 01, 2023 at 02:45:50PM +0100, Michal Hocko wrote:
> On Tue 31-10-23 00:27:04, Gregory Price wrote:
[... snip ...]
> >
> > The downside of doing it in mempolicy is...
> > 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
> > non-trivial task. It is very "current-task" centric.
>
> True. Cpusets is the way to make it less process centric but that comes
> with its own constraints (namely which NUMA policies are supported).
>
> > 2) Barring a change to mempolicy to be sysfs friendly, the options for
> > implementing weights in the mempolicy are either a) new flag and
> > setting every weight individually in many syscalls, or b) a new
> > syscall (set_mempolicy2), which is what I demonstrated in the RFC.
>
> Yes, that would likely require a new syscall.
>
> > 3) mempolicy is also subject to cgroup nodemasks, and as a result you
> > end up with a rats nest of interactions between mempolicy nodemasks
> > changing as a result of cgroup migrations, nodes potentially coming
> > and going (hotplug under CXL), and others I'm probably forgetting.
>
> Is this really any different from what you are proposing though?
>

In only one manner: An external user can set the weight of a node that
is added later on. If it is implemented in mempolicy, then this is not
possible.

Basically consider: `numactl --interleave=all ...`

If `--weights=...`: when a node hotplug event occurs, there is no
recourse for adding a weight for the new node (it will default to 1).

Maybe the answer is "Best effort, sorry" and we don't handle that
situation. That doesn't seem entirely unreasonable.

At least with weights in the node (or cgroup, or memtier, whatever),
there is a way to set that weight outside the mempolicy context.
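
Purely as illustration (the attribute name and placement here are made
up, not what this series proposes), a per-node sysfs knob could look
roughly like this:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/numa.h>
#include <linux/sysfs.h>

/* Illustrative sketch only -- not the interface from this patch set.
 * The point is that an administrator can write a weight for any node,
 * including one that is hotplugged later, without touching any task's
 * mempolicy.
 */
static u8 node_weights[MAX_NUMNODES];

static ssize_t interleave_weight_show(struct device *dev,
				      struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%u\n", node_weights[dev->id]);
}

static ssize_t interleave_weight_store(struct device *dev,
				       struct device_attribute *attr,
				       const char *buf, size_t count)
{
	u8 weight;

	if (kstrtou8(buf, 0, &weight))
		return -EINVAL;
	node_weights[dev->id] = weight ? weight : 1;	/* 0 -> default of 1 */
	return count;
}
static DEVICE_ATTR_RW(interleave_weight);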

> > weight, or should you reset it? If a new node comes into the node
> > mask... what weight should you set? I did not have answers to these
> > questions.
>
> I am not really sure I follow you. Are you talking about cpuset
> nodemask changes or memory hotplug here.
>

Actually both, in slightly different contexts.

If the weights are implemented in mempolicy and the cpuset nodemask
changes, then the mempolicy nodemask changes with it.

If the node is removed from the system, I believe (I need to validate
this, but IIRC) the node will be removed from any registered cpusets.
That change then propagates down to mempolicy, and the node is removed
there as well.

I'm not entirely sure what happens if a node is added. The only case
where I think that is relevant is when the cpuset is empty ("all") and
the mempolicy is set to something like `--interleave=all`. In this case,
it's possible that the new node will simply get the default weight (1),
and if weights are implemented in mempolicy only, there is no recourse
for changing it.
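
As a rough sketch of what I mean (the function and array names here are
mine, purely for illustration, not code from the patch set): on a
nodemask rebind, a node that newly enters the mask can only get the
default, and nothing outside the task can fix it up afterward.

#include <stdbool.h>

#define MAX_NODES 64

/* Illustrative only: carrying per-policy weights across a nodemask
 * rebind. A node that newly enters the mask falls back to weight 1,
 * and with weights stored only in the mempolicy there is no external
 * interface to change that later.
 */
static void rebind_weights(unsigned char weights[MAX_NODES],
			   const bool in_new_mask[MAX_NODES])
{
	for (int node = 0; node < MAX_NODES; node++) {
		if (!in_new_mask[node])
			weights[node] = 0;	/* node left the mask */
		else if (!weights[node])
			weights[node] = 1;	/* hot-added node: default */
		/* else: keep whatever the task already set */
	}
}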

> > It was recommended to explore placing it in tiers instead, so I took a
> > crack at it here:
> >
> > https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@xxxxxxxxxxxx/
> >
> > This had similar issue with the idea of hotplug nodes: if you give a
> > tier a weight, and one or more of the nodes goes away/comes back... what
> > should you do with the weight? Split it up among the remaining nodes?
> > Rebalance? Etc.
>
> How is this any different from node becoming depleted? You cannot
> really expect that you get memory you are asking for and you can easily
> end up getting memory from a different node instead.
>
... snip ...
> Maybe I am missing something really crucial here but I do not see how
> this fundamentally changes anything.
>
> Memory hotremove
... snip ...
> Memory hotadd
... snip ...
> But, that requires that interleave policy nodemask is assuming future
> nodes going online and put them to the mask.
>

The difference is that the nodemask itself changes in mempolicy and
cpuset. If a node is removed entirely from the nodemask and later comes
back (through cpuset or something), "what do you do with it"?

If memory is depleted but opens up later, the interleave policy starts
working as intended again. If a node disappears and comes back... that
bit of plumbing is more complex.

So yes, the "assuming future nodes going online and put them into the
mask" case is the concern I have. A node being added to or removed from
the nodemask raises different plumbing issues than simple depletion.

If that's really not a concern and we're happy to just let it be OBO
until an actual use case for handling node hotplug for weighting comes
along, then mempolicy-based weighting alone seems more than sufficient.

> > I am not against implementing it in mempolicy (as proof: my first RFC).
> > I am simply searching for the acceptable way to implement it.
> >
> > One of the benefits of having it set as a global setting is that weights
> > can be automatically generated from HMAT/HMEM information (ACPI tables)
> > and programs already using MPOL_INTERLEAVE will have a direct benefit.
>
> Right. This is understood. My main concern is whether this outweighs
> the limitations of having a _global_ policy _only_. Historically a single
> global policy usually led to finding ways how to make that more scoped
> (usually through cgroups).
>

Maybe the answer here is to put it in cgroups + mempolicy, and not
handle hotplug? That would be an easy shift of this patch to cgroups,
and then pulling my syscall patch forward to add weights directly to
mempolicy.

I think the interleave code stays pretty much the same; the only
difference would be where the task gets the weight from:

if (policy->mode == WEIGHTED_INTERLEAVE)
	weight = pol->weight[target_node];
else
	weight = cgroups.get_weight(from_node, target_node);
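
To make that concrete, here is a minimal user-space sketch of how a
per-node weight could drive the round-robin selection (function and
variable names are mine for illustration, not the existing kernel
code):

/* Weighted round-robin selection: each node's weight is how many
 * consecutive allocations it serves before moving to the next node
 * in the mask. A weight of 0 is treated as the default of 1.
 */
static unsigned int weighted_interleave_node(const unsigned char *weights,
					     unsigned int nr_nodes,
					     unsigned long alloc_count)
{
	unsigned long total = 0, offset;
	unsigned int node;

	for (node = 0; node < nr_nodes; node++)
		total += weights[node] ? weights[node] : 1;

	offset = alloc_count % total;
	for (node = 0; node < nr_nodes; node++) {
		unsigned long w = weights[node] ? weights[node] : 1;

		if (offset < w)
			return node;
		offset -= w;
	}
	return 0;	/* not reached */
}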

~Gregory