Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

From: Josh Don
Date: Tue Nov 01 2022 - 16:56:48 EST


On Tue, Nov 1, 2022 at 12:15 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Tue, Nov 01, 2022 at 12:11:30PM -0700, Josh Don wrote:
> > > Just to better understand the situation, can you give some more details on
> > > the scenarios where cgroup_mutex was in the middle of a shitshow?
> >
> > There have been a couple, I think one of the main ones has been writes
> > to cgroup.procs. cpuset modifications also show up since there's a
> > mutex there.
>
> If you can, I'd really like to learn more about the details. We've had some
> issues with the threadgroup_rwsem because it's such a big hammer but not
> necessarily with cgroup_mutex because they are only used in maintenance
> operations and never from any hot paths.
>
> Regarding threadgroup_rwsem, w/ CLONE_INTO_CGROUP (userspace support is
> still missing unfortunately), the usual workflow of creating a cgroup,
> seeding it with a process and then later shutting it down doesn't involve
> threadgroup_rwsem at all, so most of the problems should go away in the
> hopefully near future.

Maybe walking through an example would be helpful? I don't know if
there's anything super specific. Taking cgroup_mutex as an example, the
same global mutex is taken for things like cgroup mkdir and cgroup proc
attach, regardless of which part of the hierarchy is being modified.
So we end up sharing that mutex between random job threads (i.e. ones
that may be manipulating their own cgroup sub-hierarchy) and control
plane threads, which are attempting to manage root-level cgroups. Bad
things happen when cgroup_mutex (or a similar lock) is held by a random
low-priority thread that blocks: once it wakes back up, it may take
quite a while to run again (whether that low priority is due to CFS
bandwidth, sched_idle, or even just O(hundreds) of threads on a cpu),
and everyone waiting on the mutex is stuck until it does. Starving out
the control plane causes us significant issues, since that affects
machine health. cgroup manipulation is not a hot path operation, but
the control plane hits it fairly often, and at our scale those two
things combine to produce this rare problem.
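
To put rough numbers on the CFS bandwidth case, here is a minimal
back-of-the-envelope sketch. It is not kernel code: the function name
hold_wall_ms() and the model are assumptions made up for illustration
(a single holder thread that starts a period with its full quota, is
throttled for the rest of each period once the quota is used up, and
sees no other contention).

```c
#include <assert.h>

/*
 * Hypothetical model, not a kernel API: wall-clock time (ms) for a
 * lock holder to finish work_ms of CPU time inside a critical section
 * when it may run at most quota_ms out of every period_ms.
 *
 * Returns -1 for a non-positive quota or period.
 */
static long hold_wall_ms(long work_ms, long quota_ms, long period_ms)
{
	long full_periods, rem;

	if (work_ms <= 0)
		return 0;
	if (quota_ms <= 0 || period_ms <= 0)
		return -1;

	/* A single thread cannot run for longer than the period itself. */
	if (quota_ms > period_ms)
		quota_ms = period_ms;

	/* Periods in which the holder burns its whole quota and is throttled. */
	full_periods = (work_ms - 1) / quota_ms;

	/* CPU time still needed in the final period, where it finishes. */
	rem = work_ms - full_periods * quota_ms;
	assert(rem > 0 && rem <= quota_ms);

	return full_periods * period_ms + rem;
}
```

With a 100ms period and a (hypothetical) 10ms quota, 25ms of work under
the mutex gives hold_wall_ms(25, 10, 100) == 205: two throttled periods
plus the last 5ms. Any control plane thread queued on the mutex waits
that whole time, even though the critical section itself is short.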

>
> Thanks.
>
> --
> tejun