Re: [REGRESSION] Re: [PATCH 6.1 033/219] memcg: drop kmem.limit_in_bytes

From: Michal Hocko
Date: Wed Sep 20 2023 - 07:07:18 EST


On Wed 20-09-23 12:04:48, Jeremi Piotrowski wrote:
> On 9/20/2023 10:43 AM, Michal Hocko wrote:
> > On Wed 20-09-23 01:11:01, Jeremi Piotrowski wrote:
> >> On Sun, Sep 17, 2023 at 09:12:40PM +0200, Greg Kroah-Hartman wrote:
> >>> 6.1-stable review patch. If anyone has any objections, please let me know.
> >>>
> >>> ------------------
> >>
> >> Hi Greg/Michal,
> >>
> >> This commit breaks userspace which makes it a bad commit for mainline and an
> >> even worse commit for stable.
> >>
> >> We ingested 6.1.54 into our nightly testing and found that runc fails to gather
> >> cgroup statistics (when reading kmem.limit_in_bytes). The same code is vendored
> >> into kubelet and kubelet fails to start if this operation fails. 6.1.53 is
> >> fine.
> >
> > Could you expand some more on why is the file read? It doesn't support
> > writing to it for some time so how does reading it helps in any sense?
> >
> > Anyway, I do agree that the stable backport should be reverted.
> >
>
> This file is read together with all the other memcg files. Each prefix:
>
> memory
> memory.memsw
> memory.kmem
> memory.kmem.tcp
>
> is combined with these suffixes
>
> .usage_in_bytes
> .max_usage_in_bytes
> .failcnt
> .limit_in_bytes
>
> and read, the values are then forwarded on to other components for scheduling decisions.
> You want to know the limit when checking the usage (is the usage close to the limit or not).

You know there is no kmem limit as there is no way to set it for some
time (since 5.16 - i.e. 2 years ago). I can see that users following old
kernels could have missed that though.

> Userspace tolerates MEMCG/MEMCG_KMEM being disabled, but having a single file out of the
> set missing is an anomaly. So maybe we could keep the dummy file just for the
> sake of consistency? Cgroupv1 is legacy after all.

What we had was a dummy file. It didn't allow to write any value so it
would have always reported unlimited. The reason I've decided to remove
the file was that there were other users not being able to handle the
write failure while they are just fine not having the file. So we are
effectively between a rock and hard place here. Either way something is
broken. The other SW got fixed as well but similar to your case it takes
some time to absorb the change through all 3rd party users.

> >>> Address this by wiping out the file completely and effectively get back to
> >>> pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
> >>
> >> On reads, the runc code checks for MEMCG_KMEM=n by checking
> >> kmem.usage_in_bytes.

Just one side note. Config options get renamed and their semantic
changes over time so I would just recomment to never make any
dependencies on any specific one.

> >> If it is present then runc expects the other cgroup files
> >> to be there (including kmem.limit_in_bytes). So this change is not effectively
> >> the same.
> >>
> >> Here's a link to the PR that would be needed to handle this change in userspace
> >> (not merged yet and would need to be propagated through the ecosystem):
> >>
> >> https://github.com/opencontainers/runc/pull/4018.
> >
> > Thanks. Does that mean the revert is still necessary for the Linus tree
> > or do you expect that the fix can be merged and propagated in a
> > reasonable time?
> >
>
> We can probably get runc and currently supported kubernetes versions patched in time
> before 6.6 (or the next LTS kernel) hits LTS distros.
>
> But there's still a bunch of users running cgroupv1 with unsupported kubernetes
> versions that are still taking kernel updates as they come, so this might get reported
> again next year if it stays in mainline.

I can see how 3rd party users are hard to get aligned but having a fix
available should allow them to apply it or is there any actual roadblock
for them to adapt as soon as they hit the issue?

I mean, normally I would be just fine reverting this API change because
it is disruptive but the only way to have the file available and not
break somebody is to revert 58056f77502f ("memcg, kmem: further
deprecate kmem.limit_in_bytes") as well. Or to ignore any value written
there but that sounds rather dubious. Although one could argue this
would mimic nokmem kernel option.

--
Michal Hocko
SUSE Labs