Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

From: John Stultz
Date: Wed Dec 13 2023 - 14:11:54 EST


On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>
> Hello John,
>
> I may have some data that might help you debug a potential performance
> issue mentioned below.

Hey Prateek,
Thank you so much for taking the time to try this out and providing
such helpful analysis!
More below.

> On 11/7/2023 1:04 AM, John Stultz wrote:
> > [..snip..]
> >
> > Performance:
> > ------------
> > This patch series switches mutexes to use handoff mode rather
> > than optimistic spinning. This is a potential concern where locks
> > are under high contention. However, earlier performance analysis
> > (on both x86 and mobile devices) did not see major regressions.
> > That said, Chenyu did report a regression[3], which I’ll need to
> > look further into.
>
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy

This is great to hear! Thank you for providing this input!

> Execution; however, sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)

Oof. I appreciate you sharing this!

> While investigating, I tracked the runqueue length when running
> sched-messaging pinned to 1 CCX (CPUs 0-7,128-125 on my 3rd Generation
> EPYC system) using the following bpftrace script that dumps it in csv
> format:

Just so I'm following you properly on the processor you're using: cpus
0-7 and 128-125 are in the same CCX?
(I thought there were only 8 cores per CCX?)

> rqlen.bt
> ---
<snip>
> --
>
> I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
> execution (rqlen50-pe-pinned.csv) below.
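
For anyone else wanting to poke at this, a rough sketch of a bpftrace
script along these lines should produce similar per-cpu runqueue-length
samples in csv form (this is just my approximation, not necessarily
Prateek's exact rqlen.bt; it assumes a BTF-enabled kernel with
CONFIG_FAIR_GROUP_SCHED, and the 50Hz sampling rate is a guess on my
part):

profile:hz:50
{
	/* sample the current task's cfs runqueue on this cpu */
	$task = (struct task_struct *)curtask;
	$cfs = (struct cfs_rq *)$task->se.cfs_rq;
	/* timestamp (ns), sampling cpu, nr_running on that cfs_rq
	 * (with cgroups this is the task's own group queue, which is
	 * close enough for spotting an overloaded runqueue) */
	printf("%llu,%d,%u\n", nsecs, cpu, $cfs->nr_running);
}

The profile probe fires on every online cpu, so each line is tagged
with the cpu it was sampled on; redirecting stdout to a file gives the
csv.
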
>
> The trend I see with hackbench is that the chain migration leads
> to a single runqueue being completely overloaded, followed by some
> amount of idling on the entire CCX, and then a similar chain appearing
> on a different CPU. The trace for tip shows a lot more CPUs being
> utilized.
>
> Mathieu has been looking at hackbench and the effect of task migration
> on its runtime, and it appears that lowering the number of migrations
> improves hackbench performance significantly [1][2][3].
>
> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
> [2] https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@xxxxxxxxxxxx/
> [3] https://lore.kernel.org/lkml/20231019160523.1582101-1-mathieu.desnoyers@xxxxxxxxxxxx/
>
> Since migration itself is not cheap, I believe the chain migration at
> the current scale hampers performance, as sched-messaging emulates a
> worst-case scenario for proxy execution.

Hrm.
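
If chain migration really is the main culprit here, one quick way to
sanity-check the migration volume while the benchmark runs would be
something like this (again just a rough sketch on my end, counting
sched_migrate_task tracepoint hits per destination cpu):

bpftrace -e 'tracepoint:sched:sched_migrate_task
{
	/* count task migrations, keyed by destination cpu */
	@migrations[args->dest_cpu] = count();
}'

Comparing those counts between tip and proxy-exec on your setup would
give a rough idea of how much extra task movement the chain migration
is adding.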

> I'll update the thread once I have more information. I'll continue
> testing and take a closer look at the implementation.
>
> > I also briefly re-tested with this v5 series
> > and saw some average latencies grow vs v4, suggesting the changes
> > to return-migration (and extra locking) have some impact. With v6
> > the extra overhead is reduced but still not as nice as v4. I’ll
> > be digging more there, but my priority is still stability over
> > speed at this point (it’s easier to validate correctness of
> > optimizations if the baseline isn’t crashing).
> >
> >
> > If folks find it easier to test/tinker with, this patch series
> > can also be found here:
> > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
> > https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6
>
> P.S. I was using the above tree.

Ok, I've been working on getting v7 ready, which includes two main things:
1) I've reworked the return migration back into the ttwu path to avoid
the lock juggling.
2) I'm working to properly conditionalize and re-add Connor's
chain-migration feature (which, when a migration happens, pulls the full
blocked_donor list along with it).

So I'll try to reproduce your results and see if these changes help
with this particular case, and then I'll start to look more closely at
what can be done.

Again, thanks so much for your testing and analysis here; I really
appreciate your feedback! Do let me know if you find anything further!

thanks
-john