[PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

From: John Stultz
Date: Mon Nov 06 2023 - 14:35:43 EST


Stabilizing this Proxy Execution series has unfortunately
continued to be a challenging task. Since the v5 release, I’ve
been focused on getting the deactivated/sleeping owner enqueuing
functionality, which I left out of v5, stabilized. I’ve managed
to rework enough to avoid the crashes previously tripped with the
locktorture & ww_mutex selftests, so I feel it’s much improved,
but I do still see some issues (blocked waitqueues and hung task
watchdogs firing) after stressing the system for many hours in a
64 cpu qemu environment (though this issue seems to be introduced
earlier in the series with proxy-migration/return-migration).

I still haven’t had time to focus on testing and debugging the
chain migration patches. So I’ve left that patch out of this
submission for now, but will try to get it included in the next
revision.

This patch series is actually coarser than what I’ve been
developing with, as there are a number of small “test” steps to
help validate behavior I changed, which would then be replaced by
the real logic afterwards. Including those here would just cause
more work for reviewers, so I’ve folded them together, but if
you’re interested you can find the fine-grained tree here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6-fine-grained

As mentioned previously, this Proxy Execution series has a long
history: First described in a paper[1] by Watkins, Straub,
Niehaus, then from patches from Peter Zijlstra, extended with
lots of work by Juri Lelli, Valentin Schneider, and Connor
O'Brien. (and thank you to Steven Rostedt for providing
additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Overview:
—----------
Proxy Execution is a generalized form of priority inheritance.
Classic priority inheritance works well for real-time tasks where
there is a straight forward priority order to how things are run.
But it breaks down when used between CFS or DEADLINE tasks, as
there are lots of parameters involved outside of just the task’s
nice value when selecting the next task to run (via
pick_next_task()). So ideally we want to imbue the mutex holder
with all the scheduler attributes of the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked
on, and which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so
that we instead run the mutex holder.

By leaving blocked tasks on the runqueue, we allow
pick_next_task() to choose the task that should run next (even if
it’s blocked waiting on a mutex). If we do select a blocked task,
we look at the task’s blocked_on mutex and from there look at the
mutex’s owner task. And in the simple case, the task which owns
the mutex is what we then choose to run, allowing it to release
the mutex.

This means that instead of just tracking “curr”, the scheduler
needs to track both the scheduler context (what was picked and
all the state used for scheduling decisions), and the execution
context (what we’re actually running).

In this way, the mutex owner is run “on behalf” of the blocked
task that was picked to run, essentially inheriting the scheduler
context of the waiting blocked task.

As Connor outlined in a previous submission of this patch series,
this raises a number of complicated situations: The mutex owner
might itself be blocked on another mutex, or it could be
sleeping, running on a different CPU, in the process of migrating
between CPUs, etc.

The complexity from this is imposing, but currently in Android we
have a large number of cases where we are seeing priority
inversion (not unbounded, but much longer than we’d like) between
“foreground” and “background” SCHED_NORMAL applications. As a
result, currently we cannot usefully limit background activity
without introducing inconsistent behavior. So Proxy Execution is
a very attractive approach to resolving this critical issue.


New in v6:
---------
* Big rework of blocked owner enqueuing, re-adding the
functionality to the series.

* Added a boot option proxy_exec= and additional logic to allow
the functionality to be boot time toggled.

* Minor tweaks to wake_q logic suggested by Waiman Long

* Minor fixups Reported-by: kernel test robot <lkp@xxxxxxxxx>

* Various cleanups, optimizations as well as fixes for bugs found in v5

Also, I know Peter didn’t like the blocked_on wrappers, so I
previously dropped them, but I found them (and the debug checks
within) crucial to working out issues in this patch series. I’m
fine to consider dropping them later if they are still
objectionable, but their utility was too high at this point to
get rid of them.


Issues still to address:
—----------
* As already mentioned, the sleeping owner handling (where we
deactivate waiting tasks and enqueue them onto a list, then
reactivate them when the owner wakes up) has been very
difficult to get right. This is in part because when we want
to activate tasks, we’re already holding a task.pi_lock and a
rq_lock, just not the locks for the task we’re activating, nor
the rq we’re enqueuing it onto. So there has to be a bit of
lock juggling to drop and acquire the right locks (in the right
order).

* Similarly as we’re often dealing with lists of tasks or chains
of tasks and mutexes, iterating across these chains of objects
can be done safely while holding the rq lock, but as these
chains can cross runqueues our ability to traverse the chains
safely is somewhat limited.

* Additionally, in the sleeping owner enqueueing as well as in
proxy return migration, we seem to be constantly working
against the proper locking order (task.pi_lock->rq lock
->mutex.waitlock -> task.blocked_lock), having to repeatedly
“swim upstream” where we need a higher order lock then what
we’re already holding. Suggestions for different approaches to
the serialization or ways to move the logic outside of where
locks are already held would be appreciated.

* Still need to validate and re-add the chain migration patches
to ensure they resolve the RT/DL load balancing invariants.

* “rq_selected()” naming. Peter doesn’t like it, but I’ve not
thought of a better name. Open to suggestions.

* As discussed at OSPM[2], I like to split pick_next_task() up
into two phases selecting and setting the next tasks, as
currently pick_next_task() assumes the returned task will be
run which results in various side-effects in sched class logic
when it’s run. I tried to take a pass at this earlier, but it’s
hairy and lower on the priority list for now.

* CFS load balancing. Blocked tasks may carry forward load (PELT)
to the lock owner's CPU, so CPU may look like it is overloaded.

* Optimization to avoid migrating blocked tasks (to preserve
optimistic mutex spinning) if the runnable lock-owner at the
end of the blocked_on chain is already running (though this is
difficult due to the limitations from locking rules needed to
traverse the blocked on chain across run queues).


Performance:
—----------
This patch series switches mutexes to use handoff mode rather
than optimistic spinning. This is a potential concern where locks
are under high contention. However, earlier performance analysis
(on both x86 and mobile devices) did not see major regressions.
That said, Chenyu did report a regression[3], which I’ll need to
look further into. I also briefly re-tested with this v5 series
and saw some average latencies grow vs v4, suggesting the changes
to return-migration (and extra locking) have some impact. With v6
the extra overhead is reduced but still not as nice as v4. I’ll
be digging more there, but my priority is still stability over
speed at this point (it’s easier to validate correctness of
optimizations if the baseline isn’t crashing).


If folks find it easier to test/tinker with, this patch series
can also be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6

Awhile back Sage Sharp had a nice blog post about types of
reviews [4], and while any review and feedback would be greatly
appreciated, those focusing on conceptual design or correctness
issues would be *especially* valued.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[2] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
[4] https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/


Cc: Joel Fernandes <joelaf@xxxxxxxxxx>
Cc: Qais Yousef <qyousef@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Zimuzo Ezeozue <zezeozue@xxxxxxxxxx>
Cc: Youssef Esmat <youssefesmat@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: "Paul E . McKenney" <paulmck@xxxxxxxxxx>
Cc: kernel-team@xxxxxxxxxxx


John Stultz (8):
locking/mutex: Removes wakeups from under mutex::wait_lock
sched: Add CONFIG_PROXY_EXEC & boot argument to enable/disable
locking/mutex: Split blocked_on logic into two states (blocked_on and
blocked_on_waking)
sched: Fix runtime accounting w/ split exec & sched contexts
sched: Split out __sched() deactivate task logic into a helper
sched: Add a very simple proxy() function
sched: Add proxy deactivate helper
sched: Handle blocked-waiter migration (and return migration)

Juri Lelli (2):
locking/mutex: make mutex::wait_lock irq safe
locking/mutex: Expose __mutex_owner()

Peter Zijlstra (8):
sched: Unify runtime accounting across classes
locking/mutex: Rework task_struct::blocked_on
locking/mutex: Add task_struct::blocked_lock to serialize changes to
the blocked_on state
locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC
sched: Split scheduler execution context
sched: Start blocked_on chain processing in proxy()
sched: Add blocked_donor link to task for smarter mutex handoffs
sched: Add deactivated (sleeping) owner handling to proxy()

Valentin Schneider (2):
locking/mutex: Add p->blocked_on wrappers for correctness checks
sched: Fix proxy/current (push,pull)ability

.../admin-guide/kernel-parameters.txt | 4 +
include/linux/sched.h | 49 +-
init/Kconfig | 7 +
init/init_task.c | 1 +
kernel/Kconfig.locks | 2 +-
kernel/fork.c | 8 +-
kernel/locking/mutex-debug.c | 9 +-
kernel/locking/mutex.c | 163 ++--
kernel/locking/mutex.h | 25 +
kernel/locking/ww_mutex.h | 72 +-
kernel/sched/core.c | 723 ++++++++++++++++--
kernel/sched/deadline.c | 50 +-
kernel/sched/fair.c | 104 ++-
kernel/sched/rt.c | 77 +-
kernel/sched/sched.h | 61 +-
kernel/sched/stop_task.c | 13 +-
16 files changed, 1117 insertions(+), 251 deletions(-)

--
2.42.0.869.gea05f2083d-goog