Re: [RFC PATCH] mm: memcontrol: memory+swap accounting for cgroup-v2

From: Shakeel Butt
Date: Tue Dec 19 2017 - 17:39:30 EST


On Tue, Dec 19, 2017 at 1:41 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello,
>
> On Tue, Dec 19, 2017 at 10:25:12AM -0800, Shakeel Butt wrote:
>> Making the runtime environment, an invariant is very critical to make
>> the management of a job easier whose instances run on different
>> clusters across the world. Some clusters might have different type of
>> swaps installed while some might not have one at all and the
>> availability of the swap can be dynamic (i.e. swap medium outage).
>>
>> So, if users want to run multiple instances of a job across multiple
>> clusters, they should be able to specify the limits of their jobs
>> irrespective of the knowledge of cluster. The best case would be they
>> just submits their jobs without any config and the system figures out
>> the right limit and enforce that. And to figure out the right limit
>> and enforcing it, the consistent memory usage history and consistent
>> memory limit enforcement is very critical.
>
> I'm having a hard time extracting anything concrete from your
> explanation on why memsw is required. Can you please ELI5 with some
> examples?
>

Suppose a user wants to run multiple instances of a specific job on
different datacenters and s/he has budget of 100MiB for each instance.
The instances are schduled on the requested datacenters and the
scheduler has set the memory limit of those instances to 100MiB. Now,
some datacenters have swap deployed, so, there, let's say, the swap
limit of those instances are set according to swap medium
availability. In this setting the user will see inconsistent memcg OOM
behavior. Some of the instances see OOMs at 100MiB usage (suppose only
anon memory) while some will see OOMs way above 100MiB due to swap.
So, the user is required to know the internal knowledge of datacenters
(like which has swap or not and swap type) and has to set the limits
accordingly and thus increase the chance of config bugs.

Also different types and sizes of swap mediums in data center will
further complicates the configuration. One datacenter might have SSD
as a swap, another might be doing swap on zram and third might be
doing swap on nvdimm. Each can have different size and can be assigned
to jobs differently. So, it is possible that the instances of the same
job might be assigned different swap limit on different datacenters.