[PATCH v8 0/7] Preparatory changes for Proxy Execution v8

From: John Stultz
Date: Fri Feb 09 2024 - 19:23:50 EST


After sending out v7 of Proxy Execution, I got feedback that the
patch series was getting a bit unwieldy to review, and Qais
suggested I break out just the cleanups/preparatory components of
the patch series and submit them on their own in the hope we can
start to merge the less complex bits and discussion can focus on
the more complicated portions afterwards.

So for the v8 of this series, I’m only submitting those earlier
cleanup/preparatory changes here. However, work on the full series
has continued, with some nice progress on the performance front.

If you are interested, the full v8 series can be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v8-6.8-rc3
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v8-6.8-rc3

One bit of feedback I got earlier from K Prateek Nayak[1] was
that while the Proxy Execution feature didn't impact performance
in most benchmarks he tried, it did cause major regressions with
the `perf bench` test. I isolated that down to the series
disabling optimistic spinning, which had long been a performance
concern with the change.

I spent some time looking for a way to walk the chain (across
cpus) to see if the runnable owner was actually running, which
would allow us to spin and avoid migrations. However, I realized
that is actually a separate and more difficult problem (i.e.,
migration avoidance) from simple optimistic spinning (a task
trying to take a mutex where the owner is running). With Proxy
Execution, the main focus is how we effectively boost
*non-running* mutex owners using the time of the tasks blocked
on them. But when the contended mutex owner is running, we don't
need to do anything different from the existing optimistic
spinning logic.
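
To make that distinction concrete, here is a minimal standalone
sketch (stand-in types only, not the actual kernel/locking/mutex.c
code; the real check lives in mutex_optimistic_spin() and keys off
task_struct's on_cpu field):

struct task_sketch {
	int on_cpu;	/* 1 if currently running on some cpu */
};

/* Should a contending task spin rather than block? */
static int should_spin_sketch(struct task_sketch *owner)
{
	/*
	 * Owner is running: the existing optimistic spinning
	 * logic applies unchanged, so spin briefly instead of
	 * blocking and triggering proxy migrations.
	 */
	if (owner && owner->on_cpu)
		return 1;

	/*
	 * Owner is not running: this is the case Proxy Execution
	 * targets, where blocked tasks donate their time to boost
	 * the non-running owner.
	 */
	return 0;
}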

So I've dropped the patches disabling optimistic spinning, and
fixed up some further issues, as the change opened new races
that the later Proxy Execution patches weren't expecting. The
results look really good: the perf bench test that previously
regressed so badly now shows very close results (within 1.5%)
with and without Proxy Execution. This alleviates a major
concern I had with the series.

Re-enabling optimistic spinning does open the question of
whether we should allow optimistic spinning's lock-stealing
(instead of lock-handoff) when the runnable mutex owner is
running as a proxy for a more important task. Ideally we
probably want to hand the lock back up the chain that donated
the time (to avoid migrating the chain to the stealer's cpu), so
I have a patch I'm testing that special-cases that situation,
but so far in my testing I'm not seeing any large increase in
latencies from the potential unfairness of lock-stealing. I'll
be doing more investigation on this.
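
Very roughly, the special case looks like the sketch below. This
is purely illustrative: the types and the task_is_proxied() helper
are hypothetical stand-ins, not the series' actual API:

#include <stdbool.h>
#include <stddef.h>

struct task_sketch { bool on_donated_time; };
struct mutex_sketch { struct task_sketch *handoff_to; };

/* Hypothetical: was the owner running as a proxy for a blocked task? */
static bool task_is_proxied(struct task_sketch *owner)
{
	return owner->on_donated_time;
}

static void unlock_slowpath_sketch(struct mutex_sketch *lock,
				   struct task_sketch *owner,
				   struct task_sketch *top_waiter)
{
	if (task_is_proxied(owner)) {
		/*
		 * The owner ran on time donated by blocked tasks, so
		 * hand the lock back up the chain that donated it,
		 * rather than letting a spinner steal it and drag
		 * the chain over to the stealer's cpu.
		 */
		lock->handoff_to = top_waiter;
		return;
	}
	/* Normal release: optimistic spinners may still steal. */
	lock->handoff_to = NULL;
}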

New in v8:
---------
* Re-ordered patch set to only submit cleanup & preparatory
changes, leaving more complicated changes for a future
submission (suggested by Qais Yousef)

* Lots of spelling fixups and minor changes suggested by Randy
Dunlap and Metin Kaya

* The cover letter was getting long-winded with boilerplate
sections, so please see earlier versions of the patch series or
this LWN article [2] for an overview of the functionality.

* Re-enabled mutex optimistic spinning to resolve a major
performance regression in earlier versions of the series

* Fixups for races discovered with optimistic spinning

* Included a trace events patch from Metin to help us better
analyze the costs of proxying.

* Extended sched_football to use rtmutexes when
!CONFIG_SCHED_PROXY_EXEC

* Tweaked sched_football to bail if spawned players don’t check
in within 30s

* Chased down a bug uncovered by sched_football that was
preventing unconstrained RT tasks from being put on the
pushable list, causing improper task starvation and
sched_football failures.

Performance:
---------
With optimistic spinning re-enabled, the biggest concern I had
regarding performance is much relieved. However, there is still
the potential extra overhead of doing rq migrations/return
migrations for the proxy case. The big hypothesis of this series
is that the overall benefit from unblocking important tasks will
far outweigh any extra cost, but if folks see anything of
concern, I'd love to hear about it.

Issues still to address:
---------
* Enabling optimistic spinning uncovered a few interesting races
that were much harder to hit in v7. I've resolved most of
these, but there is a case I've occasionally seen where, in
try_to_wake_up(), the blocked_on_state transitions are too lax,
causing a transition to BO_RUNNABLE before we have had a
chance to evaluate for return-migration (allowing tasks to
potentially run on cpus they shouldn't). Unfortunately, in
tightening the state handling to be more careful, I've run into
cases where we don't transition to BO_RUNNABLE and effectively
lose the wakeup. I've started to suspect the conceptual overlap
between task->blocked_on_state and task->__state means I
should consider condensing the BO_BLOCKED/BO_WAKING states into
new task states (ie: TASK_MUTEX_BLOCKED, TASK_PROXY_MIGRATED),
so we can handle the state transitions more correctly.
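
For reference, the state machine in question looks roughly like
the sketch below (a simplified stand-in, not the actual
try_to_wake_up() code; the needs_return_migration flag is
hypothetical):

enum blocked_on_state {
	BO_RUNNABLE,	/* not blocked on a mutex, or fully woken */
	BO_BLOCKED,	/* blocked on a mutex, can act as a donor */
	BO_WAKING,	/* lock released, wakeup still in flight */
};

struct task_sketch {
	enum blocked_on_state blocked_on_state;
	int needs_return_migration;	/* hypothetical flag */
};

static void ttwu_blocked_sketch(struct task_sketch *p)
{
	/*
	 * The too-lax version transitions straight to BO_RUNNABLE
	 * here, which can let the task run on a cpu it shouldn't.
	 * The too-strict version can skip the transition entirely
	 * and lose the wakeup. Checking return-migration first is
	 * the intent; getting it right is the open problem above.
	 */
	if (p->blocked_on_state == BO_WAKING && p->needs_return_migration)
		return;		/* must return-migrate before running */

	p->blocked_on_state = BO_RUNNABLE;
}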

* Closer evaluation of pros/cons of doing lock handoff vs
allowing lock-stealing when proxy tasks release locks.

* The chain migration functionality needs further iterations and
better validation to ensure it truly maintains the RT/DL load
balancing invariants.

* Xuewen Yan earlier pointed out that we may see task
misplacement on EAS systems if we do return-migration based
strictly on cpu allowability. I tried an optimization to always
try to return-migrate to the wake_cpu (which was saved on
proxy-migration), but this seemed to undo a chunk of the
benefit I saw in moving return-migration back to ttwu, at least
with my prio-inversion-demo microbenchmark. I need to do some
broader performance analysis of the variations there.

* It would be nice to find an optimization to avoid migrating
blocked tasks if the runnable lock-owner at the end of the
blocked_on chain is already running (though this is difficult
due to the limitations from locking rules needed to traverse
the blocked on chain across run queues).

* CFS load balancing. There was concern that blocked tasks may
carry forward load (PELT) to the lock owner's CPU, making that
CPU look overloaded. Needs investigation.

* The sleeping owner handling (where we deactivate waiting tasks
and enqueue them onto a list, then reactivate them when the
owner wakes up) doesn't feel great. This is in part because
when we want to activate tasks, we're already holding a task's
pi_lock and a rq lock, just not the locks for the task we're
activating, nor the rq we're enqueuing it onto. So there has to
be a bit of lock juggling to drop and acquire the right locks
(in the right order). It feels like there's got to be a better
way.
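
A minimal userspace model of the juggling (pthread mutexes
standing in for the kernel locks, whose canonical ordering is
p->pi_lock before rq->lock; the helpers are stand-ins):

#include <pthread.h>

struct rq_sketch { pthread_mutex_t lock; };
struct task_sketch { pthread_mutex_t pi_lock; struct rq_sketch *rq; };

static void enqueue_sketch(struct rq_sketch *rq, struct task_sketch *p)
{
	/* model of enqueuing p onto rq */
}

/*
 * Called while holding owner->pi_lock and rq->lock, which are the
 * wrong locks for waking 'waiter': we must back out and re-acquire
 * in the canonical order (pi_lock, then rq->lock).
 */
static void activate_waiter_sketch(struct rq_sketch *rq,
				   struct task_sketch *owner,
				   struct task_sketch *waiter)
{
	pthread_mutex_unlock(&rq->lock);
	pthread_mutex_unlock(&owner->pi_lock);

	pthread_mutex_lock(&waiter->pi_lock);
	pthread_mutex_lock(&waiter->rq->lock);
	enqueue_sketch(waiter->rq, waiter);
	pthread_mutex_unlock(&waiter->rq->lock);
	pthread_mutex_unlock(&waiter->pi_lock);

	/* ...then re-take the original locks if still needed. */
	pthread_mutex_lock(&owner->pi_lock);
	pthread_mutex_lock(&rq->lock);
}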

* As discussed at OSPM[3], I'd like to split pick_next_task() up
into two phases, selecting and setting the next task, as
pick_next_task() currently assumes the returned task will be
run, which results in various side effects in sched class logic
when it's called. I took a pass at this earlier, but it's hairy
and lower on the priority list for now.
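
The shape I have in mind is roughly the following sketch
(illustrative declarations only; sched classes already have a
similar side-effect-free pick_task()/set_next_task() split for
core scheduling):

struct rq_sketch;
struct task_sketch;

/* Phase 1: choose the next task with no side effects, so the
 * result can safely be discarded if we decide to proxy-migrate. */
struct task_sketch *pick_task_sketch(struct rq_sketch *rq);

/* Phase 2: commit the choice, performing the sched-class side
 * effects only for the task that will actually run. */
void set_next_task_sketch(struct rq_sketch *rq, struct task_sketch *next);

struct task_sketch *schedule_pick_sketch(struct rq_sketch *rq)
{
	struct task_sketch *next = pick_task_sketch(rq);

	/*
	 * Because phase 1 changed no state, we could bail out here
	 * (e.g. to migrate a blocked chain) and pick again later.
	 */
	set_next_task_sketch(rq, next);
	return next;
}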

Credit/Disclaimer:
---------
As mentioned previously, this Proxy Execution series has a long
history: it was first described in a paper[4] by Watkins,
Straub, and Niehaus, then prototyped in patches from Peter
Zijlstra, and extended with lots of work by Juri Lelli, Valentin
Schneider, and Connor O'Brien. (And thank you to Steven Rostedt
for providing additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them, while the mistakes are likely
mine.

As always, feedback is greatly appreciated.

Thanks so much!
-john

[1] https://lore.kernel.org/lkml/c6787831-f659-12cb-4954-fd13a05ed590@xxxxxxx/
[2] https://lwn.net/Articles/934114/
[3] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[4] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelaf@xxxxxxxxxx>
Cc: Qais Yousef <qyousef@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Zimuzo Ezeozue <zezeozue@xxxxxxxxxx>
Cc: Youssef Esmat <youssefesmat@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
Cc: Metin Kaya <Metin.Kaya@xxxxxxx>
Cc: Xuewen Yan <xuewen.yan94@xxxxxxxxx>
Cc: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: kernel-team@xxxxxxxxxxx

Connor O'Brien (2):
sched: Add do_push_task helper
sched: Consolidate pick_*_task to task_is_pushable helper

John Stultz (2):
locking/mutex: Remove wakeups from under mutex::wait_lock
sched: Split out __schedule() deactivate task logic into a helper

Juri Lelli (2):
locking/mutex: Make mutex::wait_lock irq safe
locking/mutex: Expose __mutex_owner()

Peter Zijlstra (1):
sched: Split scheduler and execution contexts

kernel/locking/mutex.c | 60 +++++++----------
kernel/locking/mutex.h | 25 +++++++
kernel/locking/rtmutex.c | 26 +++++---
kernel/locking/rwbase_rt.c | 4 +-
kernel/locking/rwsem.c | 4 +-
kernel/locking/spinlock_rt.c | 3 +-
kernel/locking/ww_mutex.h | 49 ++++++++------
kernel/sched/core.c | 122 +++++++++++++++++++++--------------
kernel/sched/deadline.c | 53 ++++++---------
kernel/sched/fair.c | 18 +++---
kernel/sched/rt.c | 59 +++++++----------
kernel/sched/sched.h | 44 ++++++++++++-
12 files changed, 268 insertions(+), 199 deletions(-)

--
2.43.0.687.g38aa6559b0-goog