Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

From: John Stultz
Date: Wed Dec 13 2023 - 20:00:30 EST


On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy
> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1)
...
> The trend I see with hackbench is that the chain migration leads
> to a single runqueue being completely overloaded, followed by some
> amount of the idling on the entire CCX and a similar chain appearing
> on a different CPU. The trace for tip show a lot more CPUs being
> utilized.

So I reproduced a similar issue with the test (the bare metal box I
have only has 8 cores, so I didn't bother with taskset):

perf bench sched messaging -p -t -l 100000 -g 1
v6.6: 4.583s
proxy-exec-6.6: 28.842s
proxy-exec-WIP: 26.1033s

So the pre-v7 code does improve things, but not substantially.

Bisecting through the patch series to see how it regressed shows it is
not a single change:
mutex-handoff: 16.957s
blocked-waiter/return: 29.628s
ttwu return: 20.045s
blocked_donor: 25.021s

So it seems like the majority of the regression comes from switching
from optimistic spinning to mutex handoff.
That would also account for your comment about more CPUs being
utilized on tip, since there the blocked tasks keep spinning on the
cpu trying to take the mutex.
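
Just to spell out the difference I mean, the two acquisition paths
look roughly like this (a heavily simplified sketch, not the actual
__mutex_lock_common() logic; helper names are approximate or made up):

  /* tip: optimistic spin while the lock owner is running on a cpu */
  while (owner && owner_on_cpu(owner) && !need_resched())
          cpu_relax();            /* waiter stays runnable and grabs the
                                   * lock the instant it is released */

  /* proxy-exec: no spinning, just queue up and wait for the handoff */
  add_waiter(lock, current);      /* made-up helper */
  schedule();                     /* unlocker picks a waiter, hands the
                                   * lock over, and that waiter has to
                                   * be woken (and possibly migrated
                                   * back) before it can run */

With the handoff, every acquisition in this ping-pong benchmark eats a
sleep/wakeup cycle, which is likely where most of that first bisect
step's cost comes from.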

Then adding the initial blocked-waiter/return migration change hurts
things further (and this was a known issue with v5/v6).
Then the pending patch to switch back to doing return migration in
ttwu recovers a chunk of that cost.
And then the blocked_donor handoff optimization (which hands the lock
back to the donor that was boosting us, rather than to the next task
in the mutex waitlist) costs us further here.
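
For reference, the blocked_donor handoff at unlock time is roughly
this shape (a simplified sketch of the idea, not the patch itself; the
helper functions here are made up):

  struct task_struct *next = first_waiter_task(lock);  /* head of wait_list */

  if (current->blocked_donor &&
      task_is_blocked_on(current->blocked_donor, lock))
          next = current->blocked_donor;  /* hand the lock back to the
                                           * task that was donating its
                                           * context to us, even if it
                                           * queued after 'next' */
  handoff_and_wake(lock, next);

That preserves the boosting relationship, but it also bypasses the
wait_list ordering, which may be part of why it costs us in this
benchmark.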

The chain migration feature in proxy-exec-WIP doesn't seem to help or
hurt in this case.

I'll need to look closer at each of those steps to see if there's
anything too daft I'm doing.

The loss of optimistic spinning has been a long-term worry with the
patch. Unfortunately, as I mentioned in the plumbers talk, the idea
from OSPM of avoiding migration (and spinning instead) when the
runnable owner at the end of a blocked_on chain is on a cpu isn't easy
to accomplish, as we are limited in how far off the rq we can look. I
feel like I need to come up with some way to lock the entire
blocked_on chain so it can be traversed safely - as it is now, due to
the locking order
(task->pi_lock, rq_lock, mutex->wait_lock, task->blocked_on) we
can't stably traverse a task->mutex->task->mutex chain unless the
tasks are all on the same rq (and we're holding the rq_lock). So we
can only safely look one mutex owner off of the rq (when we also hold
that mutex's wait_lock).
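
Concretely, the walk I'd like to be able to do is something like this
(sketch only; get_task_blocked_on() is a stand-in for reading
p->blocked_on, and __mutex_owner() is the mutex.c internal):

  /* p = a blocked-but-selected task on this rq */
  struct mutex *m;

  while ((m = get_task_blocked_on(p))) {
          p = __mutex_owner(m);   /* only stable while holding
                                   * m->wait_lock */
          /*
           * But taking m->wait_lock here, while still holding the
           * pi_lock/rq_lock that pinned the previous link, runs
           * against the lock ordering above - so we'd have to drop
           * those locks, and then the earlier links can change
           * under us.
           */
  }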

I'm stewing on an idea for some sort of mutex holder (similar to a
runqueue) that would allow the chain of mutexes to be traversed
quickly and stably - but the many-to-one blocked_on relationship
complicates it.
Suggestions or other ideas here would be welcome, as I didn't get a
chance to really discuss it much at plumbers.
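
To give a sense of the shape I'm imagining (nothing concrete here, all
of the names are invented):

  /*
   * A container a whole blocked_on chain hangs off of, so that one
   * lock stabilizes the full traversal - much like rq->__lock does
   * for the tasks on a runqueue.
   */
  struct mutex_chain {
          raw_spinlock_t    lock;
          struct list_head  mutexes;      /* each mutex's owner is blocked
                                           * on the next mutex in the list */
  };

The trouble is that many tasks can be blocked_on the same mutex, so
the "chain" is really a tree that merges and splits as waiters come
and go, and keeping container membership coherent through those splits
is the part I haven't figured out.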

thanks
-john