Re: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

From: Valentin Schneider
Date: Tue Feb 02 2021 - 10:20:49 EST


On 01/02/21 16:38, Barry Song wrote:
> @@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct sched_domain *sd,
>
> build_balance_mask(sd, sg, mask);
> cpu = cpumask_first_and(sched_group_span(sg), mask);
> + /*
> + * for the group generated by grandchild, use the sgc of 2nd cpu
> + * because the 1st cpu might be used by another sched_group
> + */
> + if (from_grandchild && cpumask_weight(mask) > 1)
> + cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
>
> sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);

So you are getting a (hopefully) unique ID for this group span at this
given topology level (i.e. sd->private), but as I stated in that list of
issues, this creates an sgc that isn't attached to the local group of any
sched_domain, and thus never gets its capacity values updated.
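
For reference, the capacity updater only ever writes through a domain's
*local* group, which is why such an orphaned sgc stays frozen. A
paraphrased (not verbatim) sketch of update_group_capacity() in
kernel/sched/fair.c:

  /*
   * Paraphrased sketch of update_group_capacity(), kernel/sched/fair.c
   * (details elided). The only sgc ever written is sd->groups->sgc,
   * i.e. the local group's; an sgc reachable only through remote
   * groups keeps its build-time value forever.
   */
  static void update_group_capacity_sketch(struct sched_domain *sd, int cpu)
  {
          struct sched_group *sdg = sd->groups;   /* always the local group */
          unsigned long capacity = 0;

          /* ... sum the capacities of sd->child's groups ... */

          sdg->sgc->capacity = capacity;          /* the only writer */
  }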

This can actually be seen via the capacity values you're getting at build
time: the 4-5 group as seen from CPU0 is stuck at cap=2048, i.e. the
init_overlap_sched_group() placeholder (SCHED_CAPACITY_SCALE * 2), while
the same span as CPU4's domain-1 local group - whose sgc does get
updated - reads cap=1978:

> [ 0.868907] CPU0 attaching sched-domain(s):
...
> [ 0.869542] domain-2: span=0-5 level=NUMA
> [ 0.869559] groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
^^^^^^^^^^^^^^^^
> [ 0.871177] CPU4 attaching sched-domain(s):
...
> [ 0.871200] groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
> [ 0.871243] domain-1: span=4-7 level=NUMA
> [ 0.871257] groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
^^^^^^^^^^^^^^^^

IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc
of CPU4-domain-1-group4. I've done that in the below diff - this gives us
groups whose sgc is owned at a lower topology level, which will only ever
be true for non-local groups. It has the added benefit of working with
single-CPU nodes. Briefly tested on your topology and the sunfire's (via
QEMU), and I didn't get screamed at.
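
To make the rewiring concrete (the labels below are descriptive, not
kernel identifiers):

  before: CPU0 domain-2 group { span=4-5 } -> own sgc from domain-2's
          sd_data; attached to no local group, stuck at cap=2048
  after:  CPU0 domain-2 group { span=4-5 } -> sgc from CPU4's domain-1
          sd_data, shared with CPU4's domain-1 local group
          { span=4-5 cap=1978 }, which does get updated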

Before the fun police come and impound my keyboard, I'd like to point out
that we could leverage this cross-level sgc referencing hack to further
change the NUMA domains and pretty much get rid of overlapping groups
(that's what I was fumbling with in [1]).

[1]: http://lore.kernel.org/r/jhjwnw11ak2.mognet@xxxxxxx

That is, rather than building overlapping groups and fixing them up
whenever they break (node distance > 2), we could have:
- the local group being the child domain's span (as always)
- all non-local NUMA groups spanning a single node each, with the right sgc
cross-referencing (rough sketch below).
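
To be explicit about what I mean, a very rough and untested sketch - all
helpers below are invented for illustration, none of this is existing
kernel code:

  /*
   * HYPOTHETICAL sketch only - helper names are made up.
   * One group per remote node; each borrows the sgc that the node's
   * own lowest NUMA domain already owns (and keeps updated as a
   * local group there), so no overlapping spans are needed.
   */
  for_each_node(n) {
          if (node_covered_by_child(sd, n))        /* hypothetical */
                  continue;                        /* local group has it */

          sg = build_single_node_group(n, cpu);    /* hypothetical */
          sg->sgc = node_numa_level_sgc(n);        /* hypothetical */
          atomic_inc(&sg->sgc->ref);
          /* ... link sg into the group list ... */
  }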

Thoughts?

--->8---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b748999c9e11..ef43abb6b1fb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)

static void init_overlap_sched_group(struct sched_domain *sd,
struct sched_group *sg,
- int from_grandchild)
+ struct sched_domain *grandchild)
{
struct cpumask *mask = sched_domains_tmpmask2;
- struct sd_data *sdd = sd->private;
+ struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
struct cpumask *sg_span;
int cpu;

build_balance_mask(sd, sg, mask);
cpu = cpumask_first_and(sched_group_span(sg), mask);
- /*
- * for the group generated by grandchild, use the sgc of 2nd cpu
- * because the 1st cpu might be used by another sched_group
- */
- if (from_grandchild && cpumask_weight(mask) > 1)
- cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);

sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
if (atomic_inc_return(&sg->sgc->ref) == 1)
@@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)

for_each_cpu_wrap(i, span, cpu) {
struct cpumask *sg_span;
- int from_grandchild = 0;
+ bool from_grandchild = false;

if (cpumask_test_cpu(i, covered))
continue;
@@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
!cpumask_subset(sched_domain_span(sibling->child),
span)) {
sibling = sibling->child;
- from_grandchild = 1;
+ from_grandchild = true;
}

sg = build_group_from_child_sched_domain(sibling, cpu);
@@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
sg_span = sched_group_span(sg);
cpumask_or(covered, covered, sg_span);

- init_overlap_sched_group(sd, sg, from_grandchild);
+ init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);

if (!first)
first = sg;