Re: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3

From: Dietmar Eggemann
Date: Tue Jun 13 2023 - 13:36:53 EST


On 01/06/2023 07:58, John Stultz wrote:
> After having to catch up on other work after OSPM[1], I've finally
> gotten back to focusing on Proxy Execution and wanted to send out this
> next iteration of the patch series for review, testing, and feedback.
> (Many thanks to folks who provided feedback on the last revision!)
>
> As mentioned previously, this Proxy Execution series has a long history:
> it was first described in a paper[2] by Watkins, Straub, and Niehaus,
> then developed in patches from Peter Zijlstra, and extended with lots
> of work by Juri Lelli, Valentin Schneider, and Connor O'Brien. (And
> thank you to Steven Rostedt for providing additional details here!)
>
> So again, many thanks to those above, as all the credit for this series
> really is due to them - while the mistakes are likely mine.
>
> Overview:
> ----------
> Proxy Execution is a generalized form of priority inheritance. Classic
> priority inheritance works well for real-time tasks, where there is a
> straightforward priority order to how things are run. But it breaks
> down when used between CFS or DEADLINE tasks, as many parameters
> beyond just the task’s nice value are involved when selecting the
> next task to run (via pick_next_task()). So ideally we want to imbue
> the mutex holder with all the scheduler attributes of the blocked
> waiting task.
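
IIUC, the core problem is that for fair or DEADLINE waiters there is no
single value a classic rt_mutex-style boost could copy onto the owner;
the effective "priority" is a whole bundle of state. A cartoon of that
bundle (illustrative only, not the real task_struct layout):

	struct sched_attrs_cartoon {
		int policy;	/* SCHED_NORMAL, SCHED_DEADLINE, ... */
		int nice;	/* maps to a load weight for fair tasks */
		u64 vruntime;	/* fair-class virtual runtime */
		u64 deadline;	/* DEADLINE absolute deadline */
		u64 runtime;	/* DEADLINE remaining runtime budget */
	};

Since pick_next_task() consults all of this (plus cgroup hierarchy
state), donating the waiter's whole scheduling context is the natural
generalization.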
>
> Proxy Execution does this via a few changes:
> * Keeping tasks that are blocked on a mutex *on* the runqueue
> * Keeping additional tracking of which mutex a task is blocked on, and
> which task holds a specific mutex (see the sketch below).
> * Special handling for when we select a blocked task to run, so that we
> instead run the mutex holder.
>
> The first of these is the most difficult to grasp (I do get the mental
> friction here: blocked tasks on the *run*queue sounds like nonsense!
> Personally I like to think of the runqueue in this model more like a
> “task-selection queue”).
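
Concretely, I read the second and third bullets as bookkeeping along
these lines (a minimal sketch; the series' exact field names and
locking may differ):

	struct task_struct {
		/* ... */
		struct mutex	*blocked_on;	/* mutex this task is waiting on */
		raw_spinlock_t	blocked_lock;	/* protects blocked_on */
	};

The mutex side already records its owner (plus flag bits) in
mutex->owner, so from a picked-but-blocked task the scheduler can walk
task->blocked_on to the mutex and from there to the owning task.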
>
> By leaving blocked tasks on the runqueue, we allow pick_next_task() to
> choose the task that should run next (even if it’s blocked waiting on a
> mutex). If we do select a blocked task, we look at the task’s blocked_on
> mutex and from there look at the mutex’s owner task. And in the simple
> case, the task which owns the mutex is what we then choose to run,
> allowing it to release the mutex.
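
In rough pseudocode, the pick path then becomes something like the
sketch below. The helpers are illustrative stand-ins (the real
pick_next_task() takes more arguments, and the series also has to
handle owners that are sleeping, on another CPU, or mid-migration):

	static struct task_struct *pick_task_proxy(struct rq *rq)
	{
		/* Hypothetical simplified pick; may return a blocked task. */
		struct task_struct *p = pick_next_task_sketch(rq);

		/* Follow the blocked_on chain to a runnable owner. */
		while (p && p->blocked_on) {
			/* Hypothetical owner lookup helper. */
			struct task_struct *owner = mutex_owner_sketch(p->blocked_on);

			if (!owner)	/* mutex was just released */
				break;
			p = owner;	/* run the owner on behalf of the waiter */
		}
		return p;
	}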
>
> This means that instead of just tracking “curr”, the scheduler needs to
> track both the scheduler context (what was picked and all the state used
> for scheduling decisions) and the execution context (what we’re
> running).
>
> In this way, the mutex owner is run “on behalf” of the blocked task
> that was picked to run, essentially inheriting the scheduler context of
> the blocked task.
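
At the rq level I picture that split as two pointers, roughly (names
here are illustrative, not necessarily the series' naming):

	struct rq {
		/* ... */
		struct task_struct *curr;	/* execution context: what the CPU runs */
		struct task_struct *selected;	/* scheduler context: what was picked */
	};

Time accounting, vruntime updates, and preemption decisions are then
driven by the scheduler context, while the CPU actually executes the
execution context (the mutex owner) on its behalf.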
>
> As Connor outlined in a previous submission of this patch series, this
> raises a number of complicated situations: The mutex owner might itself
> be blocked on another mutex, or it could be sleeping, running on a
> different CPU, in the process of migrating between CPUs, etc.
>
> But the functionality provided by Proxy Execution is valuable: in
> Android we have a number of cases where we see priority inversion
> (not unbounded, but longer than we’d like) between “foreground” and
> “background” SCHED_NORMAL applications, so having a generalized
> solution would be very useful.
>
> New in v4:
> ------
> * Fixed a deadlock caused by wait/wound mutexes having circular
> blocked_on references, by clearing the blocked_on pointer on the task
> we are waking to wound/die.
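
My reading of that fix, as a sketch only (this is not the actual patch,
and the helper below is hypothetical): a waiter woken to wound/die will
back off, so its blocked_on link has to be severed first to break the
cycle:

	/* Hypothetical sketch, not the series' code. */
	static void ww_mutex_wake_to_die_sketch(struct task_struct *waiter)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&waiter->blocked_lock, flags);
		waiter->blocked_on = NULL;	/* break the circular chain */
		raw_spin_unlock_irqrestore(&waiter->blocked_lock, flags);

		wake_up_process(waiter);	/* waiter backs off and retries */
	}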

With this fix applied, I always get the following when running
`insmod ./test-ww_mutex.ko` with the default SCHED_FEAT(TTWU_QUEUE, true).
I don't fully understand the issue yet:

qemu-system-x86_64 ... -smp cores=64 -enable-kvm ...

[ 21.109134] Beginning ww mutex selftests
[ 26.397545] ------------[ cut here ]------------
[ 26.397951] WARNING: CPU: 41 PID: 0 at kernel/sched/core.c:4126 sched_ttwu_pending+0xc5/0x120
[ 26.398590] Modules linked in: test_ww_mutex(+)
[ 26.398916] CPU: 41 PID: 0 Comm: swapper/41 Not tainted 6.4.0-rc1-00054-gb4baf2e792df-dirty #9
[ 26.399506] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 26.400193] RIP: 0010:sched_ttwu_pending+0xc5/0x120
[ 26.400515] Code: c8 75 ba 41 c7 46 48 00 00 00 00 4c 89 f7 e8 32 b5 d4 00 41 f7 c4 00 02 00 00 74 01 fb 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b 44 8b 45 14 8b 8d 20 05 00 00 48 8d 95 18 07 00 00 48 c7 c6
[ 26.401840] RSP: 0018:ffffa31940990fc0 EFLAGS: 00010006
[ 26.402178] RAX: 0000000000000012 RBX: ffffffffffffffc8 RCX: 00000006256a6d58
[ 26.402631] RDX: 000000000001c9f4 RSI: ffff9dc5012fe180 RDI: ffffffff97320a40
[ 26.403096] RBP: ffff9dc50552d140 R08: 00000006256a6d58 R09: 0000000000000029
[ 26.403607] R10: 0000000000000000 R11: ffffa31940990ff8 R12: 0000000000000086
[ 26.404117] R13: ffffffffffffffc8 R14: ffff9dc57d86b3c0 R15: 0000000000000000
[ 26.404691] FS: 0000000000000000(0000) GS:ffff9dc57d840000(0000) knlGS:0000000000000000
[ 26.405236] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 26.405663] CR2: 00007ffeda3d7b00 CR3: 0000000013e2e003 CR4: 0000000000370ee0
[ 26.406236] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 26.406715] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 26.407219] Call Trace:
[ 26.407390] <IRQ>
[ 26.407571] __sysvec_call_function_single+0x28/0xc0
[ 26.407988] sysvec_call_function_single+0x69/0x90
[ 26.408312] </IRQ>
[ 26.408467] <TASK>
[ 26.408612] asm_sysvec_call_function_single+0x1a/0x20
[ 26.408992] RIP: 0010:default_idle+0xf/0x20
[ 26.409267] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d d3 00 40 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[ 26.410629] RSP: 0018:ffffa319401cbed8 EFLAGS: 00000252
[ 26.411073] RAX: ffff9dc57d867f80 RBX: ffff9dc5012fe180 RCX: 4000000000000000
[ 26.411625] RDX: 0000000000000001 RSI: 0000000000000087 RDI: 00000000000ed25c
[ 26.411788] ------------[ cut here ]------------

extra debug:

sched_ttwu_pending [kworker/u128:87 738] task_cpu(p)=29 cpu_of(rq)=41
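
For reference, kernel/sched/core.c:4126 here appears to be the
task_cpu() sanity check in sched_ttwu_pending(), which in mainline
reads roughly:

	llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
		if (WARN_ON_ONCE(p->on_cpu))
			smp_cond_load_acquire(&p->on_cpu, !VAL);

		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
			set_task_cpu(p, cpu_of(rq));

		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
	}

i.e. the wakee's task_cpu() (29) disagrees with the CPU whose wake list
it ended up on (41).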

[...]