Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

From: John Stultz
Date: Wed Nov 08 2023 - 17:13:53 EST


On Wed, Nov 8, 2023 at 3:40 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
>
> On Mon, 6 Nov 2023 19:34:43 +0000 John Stultz <jstultz@xxxxxxxxxx>
> > Overview:
> > ----------
> > Proxy Execution is a generalized form of priority inheritance.
> > Classic priority inheritance works well for real-time tasks where
> > there is a straight forward priority order to how things are run.
> > But it breaks down when used between CFS or DEADLINE tasks, as
> > there are lots of parameters involved outside of just the task’s
> > nice value when selecting the next task to run (via
> > pick_next_task()). So ideally we want to imbue the mutex holder
> > with all the scheduler attributes of the blocked waiting task.
>
> [...]

Is there a reason why you trimmed the cc list?

> > The complexity from this is imposing, but currently in Android we
> > have a large number of cases where we are seeing priority
> > inversion (not unbounded, but much longer than we’d like) between
> > “foreground” and “background” SCHED_NORMAL applications. As a
> > result, currently we cannot usefully limit background activity
> > without introducing inconsistent behavior. So Proxy Execution is
> > a very attractive approach to resolving this critical issue.
>
> Given usual mutex use
>
> mutex_lock;
> less-than-a-tick level critical section;
> (unusual case for example: sleep until wakeup;)
> mutex_unlock;

So the case we see regularly is a low priority task, which may be
cpuset restricted onto a smaller, more energy efficient cpu, and cpu
share limited as well, so it only gets a small proportional fraction
of time on that little cpu.

Alternatively, you could also imagine it being a SCHED_IDLE task on a
busy system, where every once in a while the system is momentarily
idle allowing the task to briefly run.

Either way, it doesn't get to run very much, but when it does, it
calls into the kernel on a path that is serialized with a mutex. Once
it takes the mutex, even if it only needs to hold it for a short time
(as in your example above), if it gets preempted while holding the
mutex it won't be selected to run again for a while. Then when an
important task
calls into a similar kernel path, it will have to block and sleep
waiting for that mutex to be released. Unfortunately, because there
may be enough work going on in other tasks to keep the cpus busy, the
low priority task doesn't get scheduled so it cannot release the lock.
Given it is cpu share limited (or is SCHED_IDLE), depending on the
load it may not get scheduled for a relatively long time. We've
definitely seen traces where the outlier latencies are in the seconds.
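
To put rough numbers on that, here is a toy model of the scenario. It
is only a sketch and not how the scheduler actually behaves: the
function name, the tick granularity, and the "owner runs one tick out
of every share_period" rule are all assumptions made up for
illustration. It counts how long a waiter stays blocked when the mutex
owner is throttled, and again when the owner is allowed to run in the
waiter's place (the proxy idea).

```python
def blocked_ticks(hold_ticks, share_period, proxy=False):
    """Ticks a waiter stays blocked on a mutex held by a throttled owner.

    hold_ticks:   CPU ticks the owner still needs inside the critical
                  section (hypothetical parameter).
    share_period: the owner gets 1 tick out of every share_period ticks,
                  a crude stand-in for a cpu share limit or SCHED_IDLE.
    proxy:        if True, the owner runs every tick, as if it borrowed
                  the blocked waiter's scheduling context.
    """
    if share_period < 1:
        raise ValueError("share_period must be >= 1")

    remaining = hold_ticks
    tick = 0
    while remaining > 0:
        tick += 1
        # Without proxying, the owner only runs on its rare share ticks.
        owner_runs = proxy or (tick % share_period == 0)
        if owner_runs:
            remaining -= 1
    return tick


# Owner needs 3 ticks of work but only runs 1 tick in every 100.
print(blocked_ticks(3, 100))              # 300
print(blocked_ticks(3, 100, proxy=True))  # 3
```

The point of the model is just that the waiter's latency scales with
how rarely the owner gets to run, not with the length of the critical
section itself.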

> I think the effects of priority inversion could be safely ignored
> without sleep (because of page reclaim for instance) in the critical
> section.

I'm not sure I understand this assertion, could you clarify?

If it's helpful, I've got a simple (contrived) demo which can
reproduce an issue similar to the one I've described above, using just
file renames as the mutex protected critical section.
https://github.com/johnstultz-work/priority-inversion-demo

> Could you please elaborate a bit on the priority inversion and tip
> point to one or two cases of mutex behind the inversion?

In Android, we see it commonly with various binder mutexes or the
pstore write_pmsg() pmsg_lock, as those are taken by most apps.
But any common kernel path that takes a mutex can cause these large
latency outliers if we're trying to restrict background processes.

thanks
-john