[PATCH] sched/topology: Fix overlapping sched_group build

From: Valentin Schneider
Date: Tue Mar 24 2020 - 08:56:47 EST


Platform
========

I've been investigating an assertion failure on a D06 board (Kunpeng 920
based), which AFAIA is 2 sockets, each with 2 dies, each with 24 CPUs and
its own memory controller (4 nodes total).

The default distance table I get is:

  node   0   1   2   3
    0:  10  12  20  22
    1:  12  10  22  24
    2:  20  22  10  12
    3:  22  24  12  10

Which can be summarised as:

        2      10       2
    1 <---> 0 <---> 2 <---> 3
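
For completeness, here is a tiny standalone sketch (plain userspace C, not
kernel code) that rebuilds the table from that chain, taking the local
distance of 10 and the per-link costs 2/10/2 straight from the summary
above, just to show the two agree:

  #include <stdio.h>

  int main(void)
  {
      /* Position of each node along the chain: 1 - 0 - 2 - 3 */
      const int chain[] = { 1, 0, 2, 3 };
      /* Cost of each link between adjacent chain positions */
      const int link[] = { 2, 10, 2 };
      int dist[4][4];

      for (int i = 0; i < 4; i++) {
          for (int j = 0; j < 4; j++) {
              int a = -1, b = -1;

              /* Find the chain positions of nodes i and j */
              for (int k = 0; k < 4; k++) {
                  if (chain[k] == i) a = k;
                  if (chain[k] == j) b = k;
              }
              if (a > b) { int t = a; a = b; b = t; }

              /* Local distance plus every link walked between them */
              dist[i][j] = 10;
              for (int k = a; k < b; k++)
                  dist[i][j] += link[k];
          }
      }

      printf("node    0   1   2   3\n");
      for (int i = 0; i < 4; i++) {
          printf("  %d:", i);
          for (int j = 0; j < 4; j++)
              printf("  %2d", dist[i][j]);
          printf("\n");
      }
      return 0;
  }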

Error
=====

Using sched_debug=1, I get an assertion error for every single CPU. This is
for CPU0:

[344276.794534] CPU0 attaching sched-domain(s):
[344276.794536] domain-0: span=0-23 level=MC
[344276.794539] groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 12:{ span=12 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
[344276.794554] domain-1: span=0-47 level=NUMA
[344276.794555] groups: 0:{ span=0-23 cap=24576 }, 24:{ span=24-47 cap=24576 }
[344276.794558] domain-2: span=0-71 level=NUMA
[344276.794560] groups: 0:{ span=0-47 cap=49152 }, 48:{ span=48-95 cap=49152 }
[344276.794563] ERROR: groups don't span domain->span
[344276.799346] domain-3: span=0-95 level=NUMA
[344276.799353] groups: 0:{ span=0-71 mask=0-23 cap=73728 }, 72:{ span=48-95 mask=72-95 cap=49152 }

Root cause
==========

The NUMA distance setup is correct for node 0 (despite the deduplicating
sort not liking this distance table - that's for another day), and so are
the sched_domain spans (CPU0 PoV):

             node0       node1      node2      node3
            /^^^^^^\   /^^^^^^^\  /^^^^^^^\  /^^^^^^^\
            0 ... 23   24 ... 47  48 ... 71  72 ... 95
MC          [      ]
NUMA(<=12)  [                  ]
NUMA(<=20)  [                             ]
NUMA(<=22)  [                                        ]

The problem lies in the *groups* of that middle NUMA domain (NUMA(<=20)):
they are {[0-47], [48-95]}, IOW they include node3 when they really
shouldn't. Let me try to explain why:

[]: domain span
(): group span

CPU0 perspective:
,----
|                  node0        node1      node2      node3
|              /^^^^^^^^^^\  /^^^^^^^\  /^^^^^^^\  /^^^^^^^\
|              0 ....... 23  24 ... 47  48 ... 71  72 ... 95
| MC           [(0)...(23)]
| NUMA(<=12)   [ (0 - 23)    (24 - 47)]
`----

CPU48 perspective:
,----
|                node0      node1        node2         node3
|              /^^^^^^\  /^^^^^^^\  /^^^^^^^^^^^\  /^^^^^^^\
|              0 ... 23  24 ... 47  48 ....... 71  72 ... 95
| MC                                [(48)...(71)]
| NUMA(<=12)                        [ (48 - 71)    (72 - 95)]
`----

When we go through build_overlap_sched_groups() for CPU0's NUMA(<=20)
domain, we'll first use the span of CPU0's NUMA(<=20) child (i.e.
NUMA(<=12)) as the span of the first group, which will be (0-47). The next
CPU we iterate over is CPU48; we do the same thing and use the span of its
NUMA(<=20) child as the group span, which here will be (48-95) - despite
CPUs 72-95 being at distance 22 from CPU0, i.e. *not* within the <= 20
boundary of the domain we're currently building!
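
To make that concrete, here is a toy userspace model of that step, with
cpumasks shrunk down to node-granularity bitmasks and every name made up
for the example - it is not the kernel code path. It only illustrates the
overshoot, and what trimming with the domain span (as the patch below
does) changes:

  #include <stdio.h>

  #define NODE(n)    (1u << (n))

  int main(void)
  {
      /* CPU0's NUMA(<=20) domain spans nodes 0-2 */
      unsigned int domain_span = NODE(0) | NODE(1) | NODE(2);

      /* NUMA(<=12) child spans, as seen from each sibling's base CPU */
      unsigned int child_span_cpu0  = NODE(0) | NODE(1); /* CPU0:  nodes 0-1 */
      unsigned int child_span_cpu48 = NODE(2) | NODE(3); /* CPU48: nodes 2-3 */

      /* Current code: each group span is the sibling's child span, verbatim */
      unsigned int groups = child_span_cpu0 | child_span_cpu48;

      printf("unpatched: groups=0x%x domain=0x%x -> %s\n",
             groups, domain_span,
             (groups & ~domain_span) ? "groups overshoot the domain" : "OK");

      /* Patched: trim each group with the span of the domain being built */
      groups = (child_span_cpu0 & domain_span) |
               (child_span_cpu48 & domain_span);

      printf("patched:   groups=0x%x domain=0x%x -> %s\n",
             groups, domain_span,
             (groups & ~domain_span) ? "groups overshoot the domain" : "OK");

      return 0;
  }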

AFAICT this isn't too complicated to reproduce. For instance, a ring of 6
nodes with no link across the middle (degree(v) == 2 for all v) would
trigger the same issue, as sketched below.
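
Sketch of that reproducer, with a hypothetical metric (local distance 10,
+2 per hop around the ring): node 0's NUMA(<=14) domain spans nodes
{4, 5, 0, 1, 2}, but the group copied from node 2's NUMA(<=12) child span
is {1, 2, 3}, which drags in node 3 at distance 16:

  #include <stdio.h>

  #define NR_NODES    6

  /* Hypothetical metric: local distance 10, +2 per hop around the ring */
  static int ring_dist(int a, int b)
  {
      int d = (a - b + NR_NODES) % NR_NODES;

      if (d > NR_NODES - d)
          d = NR_NODES - d;
      return 10 + 2 * d;
  }

  int main(void)
  {
      unsigned int domain_span = 0, group_span = 0;

      /* Node 0's NUMA(<=14) domain: every node at distance <= 14 of node 0 */
      for (int n = 0; n < NR_NODES; n++)
          if (ring_dist(0, n) <= 14)
              domain_span |= 1u << n;

      /*
       * While building its groups we reach node 2 and copy the span of
       * node 2's NUMA(<=12) child: nodes {1, 2, 3}. Node 3 is at
       * distance 16 from node 0, i.e. outside the domain being built.
       */
      for (int n = 0; n < NR_NODES; n++)
          if (ring_dist(2, n) <= 12)
              group_span |= 1u << n;

      printf("domain=0x%02x group=0x%02x overshoot=0x%02x\n",
             domain_span, group_span, group_span & ~domain_span);
      return 0;
  }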

Fix
===

Sanitize the groups we get out of build_group_from_child_sched_domain()
with the span of the domain we're currently building - this ensures the
groups we build only contain CPUs that are the right distance away from the
base CPU. This also requires modifying build_balance_mask().

With the patch applied:

[ 83.177623] domain-0: span=0-23 level=MC
[ 83.177626] groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 12:{ span=12 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
[ 83.177641] domain-1: span=0-47 level=NUMA
[ 83.177642] groups: 0:{ span=0-23 cap=24576 }, 24:{ span=24-47 cap=24576 }
[ 83.177645] domain-2: span=0-71 level=NUMA
[ 83.177646] groups: 0:{ span=0-47 cap=49152 }, 48:{ span=48-71 cap=49152 }
[ 83.177649] domain-3: span=0-95 level=NUMA
[ 83.177651] groups: 0:{ span=0-71 mask=0-23 cap=73728 }, 48:{ span=48-95 cap=73728 }

Note that the modification of build_balance_mask() feels icky, but I
couldn't think of topologies it would break. AFAICT the masks for the
topology pointed out in commit 73bb059f9b8a ("sched/topology: Fix
overlapping sched_group_mask") would remain unchanged.

Signed-off-by: Valentin Schneider <valentin.schneider@xxxxxxx>
---
kernel/sched/topology.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8344757bba6e..7033b27e5162 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -866,7 +866,7 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
continue;

/* If we would not end up here, we can't continue from here */
- if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
+ if (!cpumask_subset(sg_span, sched_domain_span(sibling->child)))
continue;

cpumask_set_cpu(i, mask);
@@ -882,7 +882,9 @@ build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpuma
* statistics having the groups node local is of dubious benefit.
*/
static struct sched_group *
-build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
+build_group_from_child_sched_domain(struct sched_domain *sd,
+ int cpu,
+ const struct cpumask *level_span)
{
struct sched_group *sg;
struct cpumask *sg_span;
@@ -899,6 +901,21 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
else
cpumask_copy(sg_span, sched_domain_span(sd));

+ /*
+ * We're using sibling sched_domains to build up the groups of our NUMA
+ * domains, and those are built up from the point of view of their
+ * respective base CPU. This means that when we copy the span of said
+ * sibling->child domain, we are including CPUs that are
+ * sched_domains_numa_distance[level-1] hops away from the *sibling*
+ * base CPU, not the base CPU of the domain we are currently building
+ * for.
+ *
+ * Correct this by trimming the newly built sched_group with the current
+ * domain span which is known to only include CPUs that are the correct
+ * distance away from the current base CPU.
+ */
+ cpumask_and(sg_span, sg_span, level_span);
+
atomic_inc(&sg->ref);
return sg;
}
@@ -964,7 +981,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;

- sg = build_group_from_child_sched_domain(sibling, cpu);
+ sg = build_group_from_child_sched_domain(sibling, cpu, span);
if (!sg)
goto fail;

--
2.24.0