Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed

From: Peter Zijlstra
Date: Tue Apr 25 2017 - 08:17:40 EST


On Mon, Apr 24, 2017 at 12:11:59PM -0300, Lauro Venancio wrote:
> On 04/24/2017 10:03 AM, Peter Zijlstra wrote:
> > On Thu, Apr 20, 2017 at 04:51:43PM -0300, Lauro Ramos Venancio wrote:
> >
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index e77c93a..694e799 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -505,7 +507,11 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
> >>
> >>  	for_each_cpu(i, sg_span) {
> >>  		sibling = *per_cpu_ptr(sdd->sd, i);

> >> -		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))

> >> +		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
> >>  			continue;

Hmm, _this_ is what requires us to move the mask building into a whole
separate iteration: when we build the groups, the domains are already
constructed, so testing domain spans inline was right; sibling->groups,
however, need not be set up yet at that point.

So the moving crud around wasn't the primary fix; this is.
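
Concretely, the flow (my paraphrase of build_sched_domains(), from
memory, details elided) is:

	for_each_cpu(i, cpu_map) {
		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			/* domains for all CPUs already exist here ... */
			build_overlap_sched_groups(sd, i);
			/* ... which installs sd->groups for CPU i and
			 * calls build_group_mask() in the same pass */
		}
	}

So for CPUs not yet visited, sibling->groups is still unset while
sched_domain_span(sibling) is already valid; testing sibling->groups
inline therefore needs the separate iteration, the old test did not.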

With the fact that sched_group_cpus(sibling->groups) ==
sched_domain_span(sibling->child) (when a child exists) established in
the previous patches, could we not simplify this like the below?
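
That is, with the earlier patches the following holds for every CPU i
in the span (my restatement as a hypothetical check, not code from the
series):

	sibling = *per_cpu_ptr(sdd->sd, i);
	if (sibling->child)
		WARN_ON(!cpumask_equal(sched_group_cpus(sibling->groups),
				       sched_domain_span(sibling->child)));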

---
Subject: sched/topology: Fix overlapping sched_group_mask
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Tue Apr 25 14:00:49 CEST 2017

The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.

The current code gets it wrong, as can be readily demonstrated with a
topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []  domain 1: span 0-3 level NUMA
  []   groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

This causes group_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early, resulting in
missed load-balance opportunities.
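
For reference, the balance CPU falls straight out of this mask; quoting
group_balance_cpu() from memory (kernel/sched/topology.c in this era):

	int group_balance_cpu(struct sched_group *sg)
	{
		return cpumask_first_and(sched_group_cpus(sg),
					 sched_group_mask(sg));
	}

With the bogus (mask: 0-2) above this returns CPU0 even though the
group belongs to CPU1; with (mask: 1) it returns CPU1 as it should.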

The fixed topology looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []  domain 1: span 0-3 level NUMA
  []   groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

Debugged-by: Lauro Ramos Venancio <lvenanci@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
 kernel/sched/topology.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -495,6 +495,9 @@ enum s_alloc {
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
+ *
+ * Only CPUs that can arrive at this group should be considered to continue
+ * balancing.
  */
 static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
 {
@@ -505,7 +508,13 @@ static void build_group_mask(struct sche
 
 	for_each_cpu(i, sg_span) {
 		sibling = *per_cpu_ptr(sdd->sd, i);
-		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
+
+		/* overlap should have children; except for FORCE_SD_OVERLAP */
+		if (WARN_ON_ONCE(!sibling->child))
+			continue;
+
+		/* If we would not end up here, we can't continue from here */
+		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
 			continue;
 
 		cpumask_set_cpu(i, sched_group_mask(sg));
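
For completeness, the two tests can be modelled in userspace; a minimal
toy reconstruction (my own, not kernel code: one CPU per node, plain
bitmasks for cpumasks) of build_group_mask() over the topology above:

/* mask-demo.c -- build with: cc -o mask-demo mask-demo.c */
#include <stdio.h>

static const int dist[4][4] = {
	{ 10, 20, 30, 20 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 20, 30, 20, 10 },
};

/* span of the first NUMA domain on @cpu: nodes at distance <= 20 */
static unsigned int child_span(int cpu)
{
	unsigned int span = 0;

	for (int i = 0; i < 4; i++)
		if (dist[cpu][i] <= 20)
			span |= 1u << i;
	return span;
}

int main(void)
{
	/* CPU1's first group at domain 1 covers its child span: 0-2 */
	unsigned int sg_span = child_span(1);
	unsigned int old_mask = 0, new_mask = 0;

	for (int i = 0; i < 4; i++) {
		if (!(sg_span & (1u << i)))
			continue;
		/*
		 * Old test: is i in sched_domain_span(sibling)?  The
		 * domain-1 sibling spans all four nodes, so this always
		 * holds and the mask degenerates to sg_span.
		 */
		old_mask |= 1u << i;
		/* New test: would CPU i build this very group? */
		if (child_span(i) == sg_span)
			new_mask |= 1u << i;
	}

	printf("sg_span:  %#x\n", sg_span);	/* 0x7: CPUs 0-2 */
	printf("old mask: %#x\n", old_mask);	/* 0x7: too wide */
	printf("new mask: %#x\n", new_mask);	/* 0x2: just CPU1 */
	return 0;
}

Which prints 0x7 for the old mask versus 0x2 for the new one, matching
the (mask: 0-2) -> (mask: 1) change above.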