Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

From: K Prateek Nayak
Date: Thu Jun 01 2023 - 23:12:53 EST


Hello Peter,

Thank you for taking a look at the report.

On 6/1/2023 4:43 PM, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> Sharing some initial benchmark results with the patch below.
>>
>> tl;dr
>>
>> - Hackbench starts off well but performance drops as the number of groups
>> increases.
>>
>> - schbench (old), tbench, and netperf see improvement, but there is a band
>> of outlier results when the system is fully loaded or slightly overloaded.
>>
>> - Stream and ycsb-mongodb don't mind the extra search.
>>
>> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
>> very happy.
>
> Figures :/ Every time something like this is changed someone gets to be
> sad..
>
>> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
>> running in NPS1 mode. Following is the simplified machine topology:
>
> Right, Zen3 8 cores / LLC, 64 cores total give 8 LLC per node.

Yes, correct!

>
>> ~~~~~~~~~~~~~~~~~~~~~~~
>> ~ SPECjbb - Multi-JVM ~
>> ~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> - Default Scheduler Tunables
>>
>> kernel                    max-jOPS             critical-jOPS
>> tip                        100.00%                  100.00%
>> peter-next-level            94.45% (-5.55%)          98.25% (-1.75%)
>>
>> - Modified Scheduler Tunables
>>
>> kernel                    max-jOPS             critical-jOPS
>> tip                        100.00%                  100.00%
>> peter-next-level           100.00% (0.00%)          102.41% (2.41%)
>
> I'm slightly confused, either the default or the tuned is better. Given
> it's counting ops, I'm thinking higher is more better, so isn't this an
> improvement in the tuned case?

Default is bad. I believe migrating across the LLC boundary is not that
great from a cache-efficiency perspective here. Setting the tunables
makes tasks run for longer, and in that case, being able to find an idle
CPU seems to be more beneficial.

>
>> ~~~~~~~~~~~~~~~~~~
>> ~ DeathStarBench ~
>> ~~~~~~~~~~~~~~~~~~
>>
>> Pinning    Scaling    tip          peter-next-level
>> 1 CCD      1          100.00%      100.30% (%diff:  0.30%)
>> 2 CCD      2          100.00%      100.17% (%diff:  0.17%)
>> 4 CCD      4          100.00%       99.60% (%diff: -0.40%)
>> 8 CCD      8          100.00%       92.05% (%diff: -7.95%)  *
>
> Right, so that's a definite loss.
>
>> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
>> cases but I've not tried tinkering with it yet. I'll continue
>> testing on other NPS modes which would decrease the search scope.
>> I'll also try running the same bunch of workloads on an even larger
>> 4th Generation EPYC server to see if the behavior there is similar.
>
>>> +/*
>>> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
>>> + * local LLC comes up empty.
>>> + */
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> +	struct sched_domain *parent = sd->parent;
>>> +	struct sched_group *sg;
>>> +
>>> +	/* Make sure to not cross nodes. */
>>> +	if (!parent || parent->flags & SD_NUMA)
>>> +		return -1;
>>> +
>>> +	sg = parent->groups;
>>> +	do {
>>> +		int cpu = cpumask_first(sched_group_span(sg));
>>> +		struct sched_domain *sd_child;
>>> +
>>> +		sd_child = per_cpu(sd_llc, cpu);
>>> +		if (sd_child != sd) {
>>> +			int i = select_idle_cpu(p, sd_child,
>>> +						test_idle_cores(cpu), cpu);
>
> Given how SIS_UTIL is inside select_idle_cpu() it should already be
> effective here, no?

True, but if the entire higher domain is busy, iterating over the groups
itself adds to the scheduling latency, only to fall back to the target in
the end. I'm wondering if keeping track of the largest
"sd_llc_shared->nr_idle_scan" and the corresponding group, and starting
the search from there, makes sense.
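
A rough, untested sketch of what I have in mind (the helper name and the
"largest nr_idle_scan first" ordering below are only illustrative, nothing
I have measured):

	/*
	 * Pick the group whose LLC reported the largest SIS_UTIL scan
	 * budget (sd_llc_shared->nr_idle_scan), i.e. where an idle CPU
	 * is most likely to be found, so the cross-LLC walk can start
	 * there instead of always at parent->groups.
	 *
	 * Caller holds rcu_read_lock(), as in the wakeup path.
	 */
	static struct sched_group *
	select_idle_node_start(struct sched_domain *parent)
	{
		struct sched_group *sg = parent->groups, *best = sg;
		int best_nr = -1;

		do {
			int cpu = cpumask_first(sched_group_span(sg));
			struct sched_domain_shared *sd_share =
				rcu_dereference(per_cpu(sd_llc_shared, cpu));

			if (sd_share && READ_ONCE(sd_share->nr_idle_scan) > best_nr) {
				best_nr = READ_ONCE(sd_share->nr_idle_scan);
				best = sg;
			}
			sg = sg->next;
		} while (sg != parent->groups);

		return best;
	}

select_idle_node() could then begin its do/while walk at the returned
group rather than at parent->groups, but whether the extra pass over the
groups pays for itself is exactly what needs testing.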

>
>>> +			if ((unsigned)i < nr_cpumask_bits)
>>> +				return i;
>>> +		}
>>> +
>>> +		sg = sg->next;
>>> +	} while (sg != parent->groups);
>>> +
>>> +	return -1;
>>> +}
>
> This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
> isn't too much of a bother; so perhaps something like so?
>
> (on top of tip/sched/core from just a few hours ago, as I had to 'fix'
> this patch and force pushed the thing)
>
> And yeah, random hacks and heuristics here :/ Does there happen to be
> additional topology that could aid us here? Does the CCD fabric itself
> have a distance metric we can use?
>

We do not have a CCD to CCD distance metric unfortunately :(
However, NPS modes should mitigate some of the problems of the larger
search space (at the cost of additional NUMA complexity), which I still
need to test.

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 22e0a249e0a8..f1d6ed973410 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>  	struct sched_domain *parent = sd->parent;
>  	struct sched_group *sg;
> +	int nr = 4;
>
>  	/* Make sure to not cross nodes. */
>  	if (!parent || parent->flags & SD_NUMA)
> @@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>  						test_idle_cores(cpu), cpu);
>  			if ((unsigned)i < nr_cpumask_bits)
>  				return i;
> +
> +			if (!--nr)
> +				return -1;
>  		}
>
>  		sg = sg->next;

--
Thanks and Regards,
Prateek