Re: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache

From: Darren Hart
Date: Fri Feb 11 2022 - 12:54:51 EST


On Fri, Feb 11, 2022 at 03:20:51AM +0000, Song Bao Hua (Barry Song) wrote:
>
>
> > -----Original Message-----
> > From: Darren Hart [mailto:darren@xxxxxxxxxxxxxxxxxxxxxx]
> > Sent: Friday, February 11, 2022 2:43 PM
> > To: LKML <linux-kernel@xxxxxxxxxxxxxxx>; Linux Arm
> > <linux-arm-kernel@xxxxxxxxxxxxxxxxxxx>
> > Cc: Catalin Marinas <catalin.marinas@xxxxxxx>; Will Deacon <will@xxxxxxxxxx>;
> > Peter Zijlstra <peterz@xxxxxxxxxxxxx>; Vincent Guittot
> > <vincent.guittot@xxxxxxxxxx>; Song Bao Hua (Barry Song)
> > <song.bao.hua@xxxxxxxxxxxxx>; Valentin Schneider
> > <valentin.schneider@xxxxxxx>; D . Scott Phillips
> > <scott@xxxxxxxxxxxxxxxxxxxxxx>; Ilkka Koskinen
> > <ilkka@xxxxxxxxxxxxxxxxxxxxxx>; stable@xxxxxxxxxxxxxxx
> > Subject: [PATCH] arm64: smp: Skip MC domain for SoCs without shared cache
> >
> > SoCs such as the Ampere Altra define clusters but have no shared
> > processor-side cache. As of v5.16 with CONFIG_SCHED_CLUSTER and
> > CONFIG_SCHED_MC, build_sched_domain() will BUG() with:
> >
> > BUG: arch topology borken
> > the CLS domain not a subset of the MC domain
> >
> > for each CPU (160 times for a 2 socket 80 core Altra system). The MC
> > level cpu mask is then extended to that of the CLS child, and is later
> > removed entirely as redundant.
> >
> > This change detects when all cpu_coregroup_mask weights=1 and uses an
> > alternative sched_domain_topology equivalent to the default if
> > CONFIG_SCHED_MC were disabled.
> >
> > The final resulting sched domain topology is unchanged with or without
> > CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> >
> > For CPU0:
> >
> > With CLS:
> > CLS [0-1]
> > DIE [0-79]
> > NUMA [0-159]
> >
> > Without CLS:
> > DIE [0-79]
> > NUMA [0-159]
> >
> > Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> > Cc: Will Deacon <will@xxxxxxxxxx>
> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > Cc: Barry Song <song.bao.hua@xxxxxxxxxxxxx>
> > Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
> > Cc: D. Scott Phillips <scott@xxxxxxxxxxxxxxxxxxxxxx>
> > Cc: Ilkka Koskinen <ilkka@xxxxxxxxxxxxxxxxxxxxxx>
> > Cc: <stable@xxxxxxxxxxxxxxx> # 5.16.x
> > Signed-off-by: Darren Hart <darren@xxxxxxxxxxxxxxxxxxxxxx>
>
> Hi Darren,

Hi Barry, thanks for the review.

> What kind of resources are clusters sharing on Ampere Altra?

The cluster pairs are DSU pairs (ARM DynamIQ Shared Unit). While there
is no shared L3 cache, they do share an SCU (snoop control unit) and
have a cache coherency latency advantage relative to non-DSU pairs.

The Anandtech Altra review illustrates this advantage:
https://www.anandtech.com/show/16315/the-ampere-altra-review/3

Notably, the SCHED_CLUSTER change did result in marked improvements for
some interactive workloads.

> So on Altra, cpus are not sharing LLC? Each LLC is separate
> for each cpu?

Correct. On the processor side the last level is at each cpu, and then
there is a memory side SLC (system level cache) that is shared by all
cpus.

>
> > ---
> > arch/arm64/kernel/smp.c | 32 ++++++++++++++++++++++++++++++++
> > 1 file changed, 32 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 27df5c1e6baa..0a78ac5c8830 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -715,9 +715,22 @@ void __init smp_init_cpus(void)
> > }
> > }
> >
> > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > +#ifdef CONFIG_SCHED_SMT
> > + { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > +#endif
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > + { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > +#endif
> > + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > + { NULL, },
> > +};
> > +
> > void __init smp_prepare_cpus(unsigned int max_cpus)
> > {
> > const struct cpu_operations *ops;
> > + bool use_no_mc_topology = true;
> > int err;
> > unsigned int cpu;
> > unsigned int this_cpu;
> > @@ -758,6 +771,25 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> >
> > set_cpu_present(cpu, true);
> > numa_store_cpu_info(cpu);
> > +
> > + /*
> > + * Only use no_mc topology if all cpu_coregroup_mask weights=1
> > + */
> > + if (cpumask_weight(cpu_coregroup_mask(cpu)) > 1)
> > + use_no_mc_topology = false;
>
> This seems to be wrong? If you have 5 cpus,
> Cpu0 has cpu_coregroup_mask(cpu)== 1, cpu1-4
> has cpu_coregroup_mask(cpu)== 4, for cpu0, you still
> need to remove MC, but for cpu1-4, you will need
> CLS and MC both?
>
> This flag shouldn't be global.

Please note that this patch is intended to maintain an identical final
sched domain construction for a symmetric topology with no shared
processor-side cache and with cache-advantaged clusters, and to avoid
the BUG messages, since this topology is correct for this architecture.

Using a sched topology without the MC layer describes this architecture
more accurately and does not require changes to build_sched_domain(), in
particular to its assumptions about what a valid topology is.
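
To make that assumption concrete: each level's span is expected to
contain the span of the level below it. The following is a user-space
sketch only, not kernel code. span_t, span_is_subset() and the example
masks are hypothetical names, a plain 64-bit mask stands in for struct
cpumask, and the die is scaled down to 8 CPUs so it fits in one word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: bit N set means CPU N is in the span. */
typedef uint64_t span_t;

/* True if every CPU in child is also present in parent. */
static bool span_is_subset(span_t child, span_t parent)
{
	return (child & ~parent) == 0;
}

/* Illustrative spans for CPU0 (assumed values, scaled-down die). */
static const span_t cls_span = 0x03; /* CPUs 0-1: one DSU pair */
static const span_t mc_span  = 0x01; /* CPU 0 only: no shared processor-side cache */
static const span_t die_span = 0xff; /* CPUs 0-7: the whole (scaled-down) die */
```

With these values, span_is_subset(cls_span, mc_span) is false, which is
the condition behind the "CLS domain not a subset of the MC domain"
message, while span_is_subset(cls_span, die_span) is true, so CLS
directly under DIE is a valid stack.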

The test above checks every cpu coregroup weight in order to limit the
impact of this change to this specific kind of topology. It
intentionally does not address, nor change existing behavior for, the
asymmetrical topology you describe.
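
As a rough user-space model of that check (a sketch under assumptions,
not the patch itself): weights[i] plays the role of
cpumask_weight(cpu_coregroup_mask(i)), and all_coregroups_single() and
the two example arrays are hypothetical names used only for
illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true only when every CPU is alone in its coregroup, mirroring
 * the use_no_mc_topology flag in the patch. A single weight > 1 clears it.
 */
static bool all_coregroups_single(const unsigned int *weights, size_t ncpus)
{
	bool use_no_mc_topology = true;

	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		if (weights[cpu] > 1)
			use_no_mc_topology = false;
	}

	return use_no_mc_topology;
}

/* Symmetric, no shared processor-side cache: every weight is 1. */
static const unsigned int symmetric[] = { 1, 1, 1, 1 };

/* The asymmetric example from the review: cpu0 alone, cpu1-4 sharing. */
static const unsigned int asymmetric[] = { 1, 4, 4, 4, 4 };
```

For the symmetric array the function returns true and the no-MC topology
would be selected. For the asymmetric array it returns false, so the
default topology and existing behavior are left untouched.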

I felt this was the less invasive approach and consistent with how other
architectures handle "non-default" topologies.

If there is interest in working toward a more generic topology builder,
I'd be interested in working on that too, but I think this change makes
sense in the near term.

Thanks,

>
> > + }
> > +
> > + /*
> > + * SoCs with no shared processor-side cache will have cpu_coregroup_mask
> > + * weights=1. If they also define clusters with cpu_clustergroup_mask
> > + * weights > 1, build_sched_domain() will trigger a BUG as the CLS
> > + * cpu_mask will not be a subset of MC. It will extend the MC cpu_mask
> > + * to match CLS, and later discard the MC level. Avoid the bug by using
> > + * a topology without the MC if the cpu_coregroup_mask weights=1.
> > + */
> > + if (use_no_mc_topology) {
> > + pr_info("cpu_coregroup_mask weights=1, skipping MC topology level\n");
> > + set_sched_topology(arm64_no_mc_topology);
> > }
> > }
> >
> > --
> > 2.31.1
>
>
> Thanks
> Barry
>

--
Darren Hart
Ampere Computing / OS and Kernel