Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6

From: K Prateek Nayak
Date: Thu Dec 14 2023 - 00:15:42 EST


Hello John,

Thank you for taking a look at the report.

On 12/14/2023 12:41 AM, John Stultz wrote:
> On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>>
>> Hello John,
>>
>> I may have some data that might help you debug a potential performance
>> issue mentioned below.
>
> Hey Prateek,
> Thank you so much for taking the time to try this out and providing
> such helpful analysis!
> More below.
>
>> On 11/7/2023 1:04 AM, John Stultz wrote:
>>> [..snip..]
>>>
>>> Performance:
>>> ------------
>>> This patch series switches mutexes to use handoff mode rather
>>> than optimistic spinning. This is a potential concern where locks
>>> are under high contention. However, earlier performance analysis
>>> (on both x86 and mobile devices) did not see major regressions.
>>> That said, Chenyu did report a regression[3], which I’ll need to
>>> look further into.
>>
>> I too see this as the most notable regression. Some of the other
>> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
>> DeathStarBench) show little to no difference when running with Proxy
>
> This is great to hear! Thank you for providing this input!
>
>> Execution; however, sched-messaging sees a 10x blowup in the runtime.
>> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)
>
> Oof. I appreciate you sharing this!
>
>> While investigating, I plotted the runqueue length when running
>> sched-messaging pinned to one CCX (CPUs 0-7,128-125 on my 3rd Generation
>> EPYC system) using the following bpftrace script, which dumps it in CSV
>> format:
>
> Just so I'm following you properly on the processor you're using, cpus
> 0-7 and 128-125 are in the same CCX?
> (I thought there were only 8 cores per CCX?)

Sorry about that! It should be 0-7,128-135 (16 threads of 8 cores in the
same CCX). The pinning was added so that I only had to observe a subset of
the total CPUs, since analyzing the behavior of 40 tasks on 256 CPUs was
much harder than analyzing it on 16 CPUs :)

>
>> rqlen.bt
>> ---
> <snip>
>> --
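
Since rqlen.bt got snipped above, here is a minimal sketch of what it
does (illustrative only, not the exact script I ran; the partial struct
layout assumes a ~v6.6 kernel with CONFIG_FAIR_GROUP_SCHED and may need
adjusting for other kernels/configs):

---
#!/usr/bin/env bpftrace
/*
 * Sketch: sample each CPU's CFS runqueue length at 50 Hz and print
 * "timestamp_ns,cpu,nr_running" lines that can be redirected to a file
 * and post-processed as CSV.
 *
 * The partial-struct cast mirrors the trick used by the upstream
 * tools/runqlen.bt: only the leading fields of struct cfs_rq are
 * declared, so the offsets must match the running kernel.
 */
#include <linux/sched.h>

struct cfs_rq_partial {
        struct load_weight load;
        unsigned int nr_running;
        unsigned int h_nr_running;
};

profile:hz:50
{
        $task = (struct task_struct *)curtask;
        $cfs = (struct cfs_rq_partial *)$task->se.cfs_rq;
        printf("%llu,%d,%u\n", nsecs, cpu, $cfs->nr_running);
}
--

Running it with the benchmark and redirecting stdout gives one sample per
CPU per period, which is what the attached per-CPU timelines are built from.
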
>>
>> I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
>> execution (rqlen50-pe-pinned.csv) below.
>>
>> The trend I see with hackbench is that the chain migration leads
>> to a single runqueue being completely overloaded, followed by some
>> amount of idling on the entire CCX, and a similar chain appearing
>> on a different CPU. The trace for tip shows a lot more CPUs being
>> utilized.
>>
>> Mathieu has been looking at hackbench and the effect of task migration
>> on the runtime, and it appears that reducing migrations improves
>> hackbench performance significantly [1][2][3].
>>
>> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
>> [2] https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@xxxxxxxxxxxx/
>> [3] https://lore.kernel.org/lkml/20231019160523.1582101-1-mathieu.desnoyers@xxxxxxxxxxxx/
>>
>> Since migration itself is not cheap, I believe chain migration at
>> the current scale hampers performance, as sched-messaging
>> emulates a worst-case scenario for proxy execution.
>
> Hrm.
>
>> I'll update the thread once I have more information. I'll continue
>> testing and take a closer look at the implementation.
>>
>>> I also briefly re-tested with this v5 series
>>> and saw some average latencies grow vs v4, suggesting the changes
>>> to return-migration (and extra locking) have some impact. With v6
>>> the extra overhead is reduced but still not as nice as v4. I’ll
>>> be digging more there, but my priority is still stability over
>>> speed at this point (it’s easier to validate correctness of
>>> optimizations if the baseline isn’t crashing).
>>>
>>>
>>> If folks find it easier to test/tinker with, this patch series
>>> can also be found here:
>>> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
>>> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6
>>
>> P.S. I was using the above tree.
>
> Ok, I've been working on getting v7 ready, which includes two main things:
> 1) I've reworked the return migration back into the ttwu path to avoid
> the lock juggling
> 2) Working to properly conditionalize and re-add Connor's
> chain-migration feature (which, when a migration happens, pulls the full
> blocked_donor list with it)
>
> So I'll try to reproduce your results and see if these help any with
> this particular case, and then I'll start to look closer at what can
> be done.
>
> Again, thanks so much, I've got so much gratitude for your testing and
> analysis here. I really appreciate your feedback!
> Do let me know if you find anything further!

Sure thing! I'll keep you updated on any findings. Thank you for digging
further into this issue.

>
> thanks
> -john

--
Thanks and Regards,
Prateek