Re: [RFC PATCH v2 0/2] sched/fair migration reduction features

From: K Prateek Nayak
Date: Mon Nov 06 2023 - 02:06:47 EST


Hello Chenyu,

On 11/6/2023 11:22 AM, Chen Yu wrote:
> On 2023-10-27 at 08:57:00 +0530, K Prateek Nayak wrote:
>> Hello Mathieu,
>>
>> On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
>>> Hi,
>>>
>>> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
>>> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
>>> a hackbench workload which leaves some idle CPU time on a 192-core AMD
>>> EPYC.
>>>
>>> The main metrics which are significantly improved are:
>>>
>>> - cpu-migrations are reduced by 80%,
>>> - CPU utilization is increased by 17%.
>>>
>>> Feedback is welcome. I am especially interested to learn whether this
>>> series has positive or detrimental effects on performance of other
>>> workloads.
>>
>> I got a chance to test this series on a dual socket 3rd Generation EPYC
>> System (2 x 64C/128T). Following is a quick summary:
>>
>> - stream and ycsb-mongodb don't see any changes.
>>
>> - hackbench and DeathStarBench see a major improvement. Both are high
>> utilization workloads with CPUs being overloaded most of the time.
>> DeathStarBench is known to benefit from lower migration count. It was
>> discussed by Gautham at OSPM '23.
>>
>> - tbench, netperf, and schbench regress. The former two when the
>> system is near fully loaded, and the latter for most cases.
>
> Does it mean hackbench gets benefits when the system is overloaded, while
> tbench/netperf do not benefit when the system is underloaded?

Yup! That is what the results suggest. From what I have seen so far,
there seems to be a work-conservation aspect to hackbench: if we reduce
the time spent in the kernel, hackbench results improve. Mathieu's patch
[this one] does that by reducing the time taken to decide on the wakeup
target. There is also a second-order effect from another of Mathieu's
patches that uses the wakelist but indirectly curbs the SIS_UTIL limits
based on Aaron's observation [1], thus reducing the time spent in
select_idle_cpu().

[1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/

schbench, tbench, and netperf see faster wakeups when the client and
server are on the same LLC, so in the underloaded case, consolidating
them on one LLC (as long as there is one task per runqueue) is better
than keeping them on separate LLCs.

>
>> All these benchmarks are client-server / messenger-worker oriented and are
>> known to perform better when client-server / messenger-worker are on
>> the same CCX (LLC domain).
>
> I thought hackbench should also be of client-server mode, because hackbench has
> socket/pipe mode and exchanges data between sender/receiver.

Yes, but its N:M nature makes it slightly complicated to understand
where the cache benefits disappear and the work-conservation benefits
become more prominent.

>
> This reminds me of your proposal to provide a user hint to the scheduler
> on whether to do task consolidation vs task spreading, and could this also
> be applied to Mathieu's case? For a task or task group with the "consolidate"
> flag set, tasks would prefer to be woken up on the target/previous CPU if the
> wakee fits on that CPU. In this way we could bring the benefit without
> introducing regressions.

I think even a simple WF_SYNC check will help the tbench and netperf
case. Let me get back to you with some data on different variants of
hackbench with the latest tip.

>
> thanks,
> Chenyu

--
Thanks and Regards,
Prateek