Re: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3

From: John Stultz
Date: Tue Jun 13 2023 - 17:31:25 EST


On Tue, Jun 13, 2023 at 10:36 AM Dietmar Eggemann
<dietmar.eggemann@xxxxxxx> wrote:
>
> On 01/06/2023 07:58, John Stultz wrote:
> > After having to catch up on other work after OSPM[1], I've finally
> > gotten back to focusing on Proxy Execution and wanted to send out this
> > next iteration of the patch series for review, testing, and feedback.
> > (Many thanks to folks who provided feedback on the last revision!)
> >
> > As mentioned previously, this Proxy Execution series has a long history:
> > First described in a paper[2] by Watkins, Straub, and Niehaus, followed
> > by patches from Peter Zijlstra, extended with lots of work by Juri Lelli,
> > Valentin Schneider, and Connor O'Brien. (and thank you to Steven Rostedt
> > for providing additional details here!)
> >
> > So again, many thanks to those above, as all the credit for this series
> > really is due to them - while the mistakes are likely mine.
> >
> > Overview:
> > ----------
> > Proxy Execution is a generalized form of priority inheritance. Classic
> > priority inheritance works well for real-time tasks where there is a
> > straightforward priority order to how things are run. But it breaks
> > down when used with CFS or DEADLINE tasks, as there are lots
> > of parameters involved outside of just the task’s nice value when
> > selecting the next task to run (via pick_next_task()). So ideally we
> > want to imbue the mutex holder with all the scheduler attributes of
> > the blocked waiting task.
> >
> > Proxy Execution does this via a few changes:
> > * Keeping tasks that are blocked on a mutex *on* the runqueue
> > * Keeping additional tracking of which mutex a task is blocked on, and
> > which task holds a specific mutex.
> > * Special handling for when we select a blocked task to run, so that we
> > instead run the mutex holder.
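
As a rough standalone sketch of that extra bookkeeping (the demo_* names
below are purely illustrative, not the kernel's actual definitions): each
task records which mutex it is blocked on, and each mutex records which
task owns it.

#include <stdio.h>
#include <stddef.h>

struct demo_task;

struct demo_mutex {
        struct demo_task *owner;        /* task currently holding the lock */
};

struct demo_task {
        const char *name;
        struct demo_mutex *blocked_on;  /* mutex we're waiting on, or NULL */
};

int main(void)
{
        struct demo_task holder = { "holder", NULL };
        struct demo_mutex m = { &holder };
        struct demo_task waiter = { "waiter", &m };

        /* With both links in place, the scheduler can get from a picked
         * (but blocked) waiter to the task that can make progress. */
        printf("%s is blocked on a mutex owned by %s\n",
               waiter.name, waiter.blocked_on->owner->name);
        return 0;
}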
> >
> > The first of these is the most difficult to grasp (I do get the mental
> > friction here: blocked tasks on the *run*queue sounds like nonsense!
> > Personally I like to think of the runqueue in this model more like a
> > “task-selection queue”).
> >
> > By leaving blocked tasks on the runqueue, we allow pick_next_task() to
> > choose the task that should run next (even if it’s blocked waiting on a
> > mutex). If we do select a blocked task, we look at the task’s blocked_on
> > mutex and from there look at the mutex’s owner task. And in the simple
> > case, the task which owns the mutex is what we then choose to run,
> > allowing it to release the mutex.
> >
> > This means that instead of just tracking “curr”, the scheduler needs to
> > track both the scheduler context (what was picked and all the state used
> > for scheduling decisions), and the execution context (what we’re
> > running).
> >
> > In this way, the mutex owner is run “on behalf” of the blocked task
> > that was picked to run, essentially inheriting the scheduler context of
> > the blocked task.
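
Again only as a hedged sketch (find_exec_ctx() and the demo_* types are
made up for illustration; this is not the actual pick_next_task()/proxy()
code): the picked task supplies the scheduler context, and in the simple
case we walk blocked_on->owner until we find the execution context to
actually run.

#include <stdio.h>
#include <stddef.h>

struct demo_task;

struct demo_mutex {
        struct demo_task *owner;
};

struct demo_task {
        const char *name;
        struct demo_mutex *blocked_on;
};

/* Walk from the picked (possibly blocked) task to one that can run. */
static struct demo_task *find_exec_ctx(struct demo_task *picked)
{
        struct demo_task *t = picked;

        while (t->blocked_on)
                t = t->blocked_on->owner;
        return t;
}

int main(void)
{
        struct demo_task owner  = { "mutex-owner", NULL };
        struct demo_mutex m     = { &owner };
        struct demo_task picked = { "picked-but-blocked", &m };

        printf("scheduler context: %s, execution context: %s\n",
               picked.name, find_exec_ctx(&picked)->name);
        return 0;
}

In the real series that walk also has to bail out and handle migration
when the owner can't just run here, which is where the complicated
situations described next come from.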
> >
> > As Connor outlined in a previous submission of this patch series, this
> > raises a number of complicated situations: The mutex owner might itself
> > be blocked on another mutex, or it could be sleeping, running on a
> > different CPU, in the process of migrating between CPUs, etc.
> >
> > But the functionality provided by Proxy Execution is broadly needed: in
> > Android we have a number of cases where we are seeing priority inversion
> > (not unbounded, but longer than we’d like) between “foreground” and
> > “background” SCHED_NORMAL applications, so having a generalized solution
> > would be very useful.
> >
> > New in v4:
> > ----------
> > * Fixed deadlock that was caused by wait/wound mutexes having circular
> > blocked_on references by clearing the blocked_on pointer on the task
> > we are waking to wound/die.
>
> With this fix, I always get this when running `insmod ./test-ww_mutex.ko`
> with the default SCHED_FEAT(TTWU_QUEUE, true). I don't fully understand
> the issue yet:
>
> qemu-system-x86_64 ... -smp cores=64 -enable-kvm ...
>
> [ 21.109134] Beginning ww mutex selftests
> [ 26.397545] ------------[ cut here ]------------
> [ 26.397951] WARNING: CPU: 41 PID: 0 at kernel/sched/core.c:4126 sched_ttwu_pending+0xc5/0x120

Thanks for testing it out and sharing this!

Yeah. I've seen this as well, and I suspect this is tied to the
runqueue accounting issues that are causing the null sched entity
issue I've been chasing.

One issue seems to be the blocked_on manipulation done in the
try_to_wake_up() path. For mutex-blocked tasks, try_to_wake_up() needs
to clear blocked_on so the task can actually run and try to acquire
the mutex. This is why __mutex_lock_slowpath_common() sets blocked_on
twice. I'm seeing some trouble where try_to_wake_up() ends up
migrating the still technically blocked task (it hasn't yet acquired
the mutex, it's just waking to try to take it) back to p->wake_cpu -
but the problem is that the task has not been deactivated from its
current runqueue.
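
Roughly, the hazard is something like the toy model below (purely
illustrative, none of these names are real kernel symbols): an
enqueued-but-blocked task gets "woken" and switched to another runqueue
without first being dequeued, so one task ends up counted on two
runqueues.

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_rq {
        int nr_queued;
};

struct demo_task {
        struct demo_rq *rq;     /* runqueue the task is enqueued on, if any */
        bool blocked;           /* still waiting on a mutex */
};

static void enqueue(struct demo_task *t, struct demo_rq *rq)
{
        t->rq = rq;
        rq->nr_queued++;
}

/* Models the problematic ordering: the wakeup clears the blocked state
 * and moves the task to the waking CPU's rq, but never removes it from
 * the rq it was left on as a proxy-execution blocked task. */
static void wakeup_and_migrate(struct demo_task *t, struct demo_rq *wake_rq)
{
        t->blocked = false;
        /* missing: remove t from its old rq (t->rq->nr_queued--) first */
        t->rq = wake_rq;
        wake_rq->nr_queued++;
}

int main(void)
{
        struct demo_rq rq0 = { 0 }, rq1 = { 0 };
        struct demo_task t = { NULL, true };

        enqueue(&t, &rq0);              /* blocked task stays on its rq */
        wakeup_and_migrate(&t, &rq1);

        /* Prints "rq0: 1  rq1: 1" - a single task counted on both rqs,
         * which is the sort of rq accounting inconsistency described
         * above. */
        printf("rq0: %d  rq1: %d\n", rq0.nr_queued, rq1.nr_queued);
        return 0;
}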

The "fix" for ww_mutex circular dependencies resolution in
__ww_mutex_die() is similar, as it clears the blocked_on value and
tries to wake the task, but then ttwu often migrates the task without
deactivating it from the current rq.

Unfortunately, once the rq accounting has gone wrong, the failures
seem multi-modal, so getting enough tracing in place to understand
what's going wrong in one case is hard, because it fails differently
from run to run.

In my current tree I've extended this so we have a separate
task->blocked_on_waking flag, so we don't mess with the
task->blocked_on state in ttwu & __ww_mutex_die and can distinguish
between waking a blocked task and waking an unblocked task. However,
that in itself isn't working yet, so I'm currently deconstructing the
patch series a bit, to try to understand that logic better and to see
if we can avoid so much special-casing in the ttwu path (I'd like to
be able to also just drop mutex-blocked tasks from the runqueue - as
we do now - and have things work, so that we have fallback options if
proxy() migrations are taking too long to resolve).
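
As a very rough sketch of that direction (blocked_on_waking is the flag
from my tree, but the demo_* types, helpers, and exact semantics here are
just illustrative): the waker leaves blocked_on alone and only marks the
task as waking, so later code can tell a mutex-blocked wakeup apart from
a normal one.

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_mutex {
        int unused;
};

struct demo_task {
        struct demo_mutex *blocked_on;  /* mutex we're still waiting on */
        bool blocked_on_waking;         /* woken to go retry the acquire */
};

/* Waker side: don't clear blocked_on, just mark the task as waking. */
static void wake_blocked_task(struct demo_task *t)
{
        t->blocked_on_waking = true;
}

/* Wakeup/scheduler side: both pieces of state together identify a
 * still-blocked task that's only waking to retry the mutex acquire. */
static bool is_blocked_wakeup(const struct demo_task *t)
{
        return t->blocked_on && t->blocked_on_waking;
}

int main(void)
{
        struct demo_mutex m = { 0 };
        struct demo_task t = { &m, false };

        wake_blocked_task(&t);
        printf("blocked wakeup: %s\n", is_blocked_wakeup(&t) ? "yes" : "no");
        return 0;
}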

thanks
-john