Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

From: tj@xxxxxxxxxx
Date: Fri Nov 10 2023 - 22:05:59 EST


Hello, Gregory.

On Fri, Nov 10, 2023 at 05:29:25PM -0500, Gregory Price wrote:
> I did originally implement it this way, but note that it will either
> require some creative extension of set_mempolicy or even set_mempolicy2
> as proposed here:
>
> https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@xxxxxxxxxxxx/
>
> One of the problems to consider is task migration. If a task is
> migrated from one socket to another, for example by being moved to a new
> cgroup with a different cpuset - the weights might be completely nonsensical
> for the new allowed topology.
>
> Unfortunately mpol has no way of being changed from outside the task
> itself once it's applied, other than changing its nodemasks via cpusets.

Maybe it's time to add one?

> So one concrete use case: kubernetes might like to change cpusets or move
> tasks from one cgroup to another, or a vm might be migrated from one set
> of nodes to another (technically not mutually exclusive here). Some
> memory policy settings (like weights) may no longer apply when this
> happens, so it would be preferable to have a way to change them.

Neither covers all use cases. As you noted in your mempolicy message, if the
application wants finer-grained control, the cgroup interface isn't great. In
general, changes which are dynamically initiated by the application itself
aren't a great fit for cgroup.

I'm generally pretty wary of adding non-resource group configuration
interfaces, especially when they don't have a counterpart in the regular
per-process/thread API, for a few reasons:

1. The reason why people sometimes try to add those through cgroup is that
it seems easier to add new features through cgroup, which may be true to
some degree, but shortcuts often aren't very conducive to long term
maintainability.

2. As noted above, a cgroup-only interface often excludes a significant
portion of use cases. Not all systems enable cgroups, and programmatic
accesses from target processes / threads are coarse-grained and can be
really awkward.

3. Cgroup can be convenient when a group config change is necessary. However,
we really don't want to keep adding kernel interfaces just for changing
configs for a group of threads. For config changes which aren't high
frequency, userspace iterating the member processes and applying the
changes is usually good enough; this typically involves looping until no
new process is found. If the looping is problematic, the cgroup freezer
can be used to stop all member threads and provide atomicity.
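The iterate-until-stable pattern in point 3 can be sketched as below. This is
an illustration, not kernel code: read_procs() and apply_config() are
hypothetical stand-ins for reading cgroup.procs and for whatever per-task
call applies the setting.

```python
def apply_to_group(read_procs, apply_config):
    """Apply a config change to every member of a group.

    Re-reads the membership and configures any PID not yet handled,
    looping until a full pass finds no new member.  read_procs() and
    apply_config() are hypothetical stand-ins for reading cgroup.procs
    and issuing the per-task configuration call.
    """
    done = set()
    while True:
        new = set(read_procs()) - done
        if not new:
            return done  # membership stable: every member configured
        for pid in new:
            apply_config(pid)
        done |= new
```

If the group churns too fast for the loop to converge, freezing the cgroup
first (writing 1 to cgroup.freeze on cgroup2) stops all member threads so a
single pass suffices; thaw afterwards by writing 0.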

Thanks.

--
tejun