Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

From: Peter Zijlstra
Date: Fri Jul 18 2014 - 06:16:59 EST


On Fri, Jul 18, 2014 at 12:34:49AM -0500, Bruno Wolff III wrote:
> On Thu, Jul 17, 2014 at 14:35:02 +0200,
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> >In any case, can someone who can trigger this run with the below; its
> >'clean' for me, but supposedly you'll trigger a FAIL somewhere.
>
> I got a couple of fail messages.
>
> dmesg output is available in the bug as the following attachment:
> https://bugzilla.kernel.org/attachment.cgi?id=143361

Thanks!

[ 0.252059] __sdt_alloc: allocated f255b020 with cpus:
[ 0.252147] __sdt_alloc: allocated f255b0e0 with cpus:
[ 0.252229] __sdt_alloc: allocated f255b120 with cpus:
[ 0.252311] __sdt_alloc: allocated f255b160 with cpus:

[ 0.252395] __sdt_alloc: allocated f255b1a0 with cpus:
[ 0.252477] __sdt_alloc: allocated f255b1e0 with cpus:
[ 0.252559] __sdt_alloc: allocated f255b220 with cpus:
[ 0.252641] __sdt_alloc: allocated f255b260 with cpus:

[ 0.253013] __sdt_alloc: allocated f255b2a0 with cpus:
[ 0.253097] __sdt_alloc: allocated f255b2e0 with cpus:
[ 0.253184] __sdt_alloc: allocated f255b320 with cpus:
[ 0.253265] __sdt_alloc: allocated f255b360 with cpus:

[ 0.253354] build_sched_groups: got group f255b020 with cpus:
[ 0.253436] build_sched_groups: got group f255b120 with cpus:
[ 0.253519] build_sched_groups: got group f255b1a0 with cpus:
[ 0.253600] build_sched_groups: got group f255b2a0 with cpus:
[ 0.253681] build_sched_groups: got group f255b2e0 with cpus:

[ 0.253762] build_sched_groups: got group f255b320 with cpus:
[ 0.253843] build_sched_groups: got group f255b360 with cpus:
[ 0.254004] build_sched_groups: got group f255b0e0 with cpus:
[ 0.254087] build_sched_groups: got group f255b160 with cpus:
[ 0.254170] build_sched_groups: got group f255b1e0 with cpus:
[ 0.254252] build_sched_groups: FAIL
[ 0.254331] build_sched_groups: got group f255b1a0 with cpus: 0
[ 0.255004] build_sched_groups: FAIL
[ 0.255084] build_sched_groups: got group f255b1e0 with cpus: 1

So from previous msgs we know:

        CPU0    CPU1    CPU2    CPU3

D0      *               *               SMT
                *               *

D2      *       *       *       *       DIE


This gives us (from __sdt_alloc):

        020     0e0     120     160     SMT
        1a0     1e0     220     260     MC
        2a0     2e0     320     360     DIE

Given that you have a DIE domain, and MC is found degenerate, I'll
conclude that your machine does not have the shared L3 that is possible
for it, and is simply dual socket with 2 threads per socket.

So the domains _should_ look like:

D0      0,2      1,3      0,2      1,3
D1      0,2      1,3      0,2      1,3
D2      0,1,2,3  0,1,2,3  0,1,2,3  0,1,2,3
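
To make the expected spans concrete, here is a minimal userspace sketch (not
kernel code). It assumes the per-level masks from the table above and computes
each span the way the patch context shows build_sched_domain() doing it:
cpu_map AND tl->mask(cpu). The names level_mask, domain_span and
level_is_degenerate are made up for illustration, and the degeneracy test is a
simplification that only compares spans.

```c
#include <assert.h>

#define NR_CPUS 4

enum { SMT, MC, DIE, NR_LEVELS };

/* ASSUMED per-level masks from the table above; bit n stands for CPU n. */
static const unsigned int level_mask[NR_LEVELS][NR_CPUS] = {
	[SMT] = { 0x5, 0xa, 0x5, 0xa },	/* 0,2  1,3  0,2  1,3 */
	[MC]  = { 0x5, 0xa, 0x5, 0xa },	/* same as SMT: no shared L3 */
	[DIE] = { 0xf, 0xf, 0xf, 0xf },	/* 0-3 everywhere */
};

/* Mirrors cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu)). */
static unsigned int domain_span(int level, int cpu, unsigned int cpu_map)
{
	assert(level >= 0 && level < NR_LEVELS);
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_map & level_mask[level][cpu];
}

/*
 * Simplified: treat a level as degenerate when it spans exactly the same
 * CPUs as the level below it. This sketch ignores everything but the span.
 */
static int level_is_degenerate(int level, int cpu, unsigned int cpu_map)
{
	if (level == 0)
		return 0;
	return domain_span(level, cpu, cpu_map) ==
	       domain_span(level - 1, cpu, cpu_map);
}
```

With all four CPUs in cpu_map (0xf), MC comes out degenerate for every CPU
and DIE does not, which matches the assumption made above.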

Assuming that, and given that build_sched_groups() gets called for each
cpu, for each domain, we get:

D0g 020(0) 120(2)
D1g 1a0(0,2)
D2g 2a0(0,2)

So far so good. At this point we're in build_sched_groups() with
@cpu=0 @span=0-3 @covered=0,2 @i=0 and we're just about to start the
loop for @i=1.

1 is not set in @covered, so:

get_group(i=1, sdd, &sg)
@sd = *per_cpu_ptr(sdd->sd, 1); /* should be D2 for CPU1 */
@child = sd->child; /* should be D1 for CPU1: 1,3 */
@cpu = 1
@sg = *per_cpu_ptr(sdd->sg, 1); /* should be: 2e0 */

But instead we get 320 !?

The 2e0 group would cover 1,3, thereby increasing @covered to 0-3, and
we'd be done for CPU0. Instead things go on to return 360, more WTF!
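
For reference, the walk I expected can be sketched in plain userspace C. This
is only a model of the @covered loop described above, not the kernel code:
child_span holds the assumed D1 spans, and first_cpu, group_of and walk_groups
are illustrative names. A group is identified by the first CPU of the child
domain's span, which is what get_group() is expected to pick.

```c
#include <assert.h>

#define NR_CPUS 4

/* ASSUMED spans of each CPU's child (D1) domain; bit n stands for CPU n. */
static const unsigned int child_span[NR_CPUS] = { 0x5, 0xa, 0x5, 0xa };

/* Lowest CPU set in @mask, or NR_CPUS if the mask is empty. */
static int first_cpu(unsigned int mask)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPUS;
}

/* Model of get_group(): the group is owned by the child span's first CPU. */
static int group_of(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return first_cpu(child_span[cpu]);
}

/*
 * Walk @span like the @covered loop: for every CPU not yet covered, find its
 * group and add all CPUs of @span that map to the same group. Stores the
 * owning CPU and the member mask of each group; returns the group count.
 */
static int walk_groups(unsigned int span, int owner[NR_CPUS],
		       unsigned int members[NR_CPUS])
{
	unsigned int covered = 0;
	int i, j, n = 0;

	for (i = 0; i < NR_CPUS; i++) {
		if (!(span & (1u << i)) || (covered & (1u << i)))
			continue;

		owner[n] = group_of(i);
		members[n] = 0;

		for (j = 0; j < NR_CPUS; j++) {
			if (!(span & (1u << j)) || group_of(j) != owner[n])
				continue;
			covered |= 1u << j;
			members[n] |= 1u << j;
		}
		n++;
	}
	return n;
}
```

Calling walk_groups(0xf, owner, members) under these assumptions yields two
groups: one owned by CPU0 with members 0,2 and one owned by CPU1 with members
1,3. At the DIE level those would be 2a0 and 2e0, never 320 or 360.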

So it looks like the actual domain tree is broken, and not what we
assumed it was.

Could I bother you to run with the below instead? It should also print
out the sched domain masks so we don't need to guess about them.

(make sure you have CONFIG_SCHED_DEBUG=y otherwise it will not build)
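
In case the output format is unclear: the masks are printed as cpu lists, so
0,2 or 0-3 rather than hex. A rough userspace sketch of that formatting is
below. It is not the kernel helper; format_cpulist is a made-up name and it
only handles a single unsigned long worth of CPUs.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * Format the set bits of @mask as a comma-separated list with runs folded
 * into ranges (e.g. 0x5 -> "0,2", 0xf -> "0-3"). Returns the number of
 * characters written, not counting the terminating NUL. An entry that does
 * not fit is dropped entirely rather than emitted half-written.
 */
static int format_cpulist(char *buf, size_t len, unsigned long mask)
{
	const int nbits = (int)(sizeof(mask) * CHAR_BIT);
	size_t used = 0;
	int cpu = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	while (cpu < nbits) {
		int first, last, n;

		if (!(mask & (1UL << cpu))) {
			cpu++;
			continue;
		}

		/* Extend the run while the next bit is also set. */
		first = cpu;
		while (cpu + 1 < nbits && (mask & (1UL << (cpu + 1))))
			cpu++;
		last = cpu++;

		if (first == last)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", first);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", first, last);

		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial entry */
			break;
		}
		used += (size_t)n;
	}
	return (int)used;
}
```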

> I also booted with early printk=keepsched_debug as requested by Dietmar.

Can you make that: sched_debug ?

---
kernel/sched/core.c | 22 ++++++++++++++++++++++
lib/vsprintf.c | 5 +++++
2 files changed, 27 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599dc4aa4..4babcbbc11b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5857,6 +5857,17 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;

 		group = get_group(i, sdd, &sg);
+
+		if (!cpumask_empty(sched_group_cpus(sg)))
+			printk("%s: FAIL\n", __func__);
+
+		printk("%s: got group %p with cpus: %pc\n",
+				__func__,
+				sg,
+				sched_group_cpus(sg));
+
+		cpumask_clear(sched_group_cpus(sg));
+
 		cpumask_setall(sched_group_mask(sg));

 		for_each_cpu(j, span) {
@@ -6418,6 +6429,11 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			if (!sg)
 				return -ENOMEM;

+			printk("%s: allocated %p with cpus: %pc\n",
+					__func__,
+					sg,
+					sched_group_cpus(sg));
+
 			sg->next = sg;

 			*per_cpu_ptr(sdd->sg, j) = sg;
@@ -6474,6 +6490,12 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 	if (!sd)
 		return child;

+	printk("%s: cpu: %d level: %s cpu_map: %pc tl->mask: %pc\n",
+			__func__,
+			cpu, tl->name,
+			cpu_map,
+			tl->mask(cpu));
+
 	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {
 		sd->level = child->level + 1;
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 6fe2c84eb055..ac22c46fd6d0 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -28,6 +28,7 @@
 #include <linux/ioport.h>
 #include <linux/dcache.h>
 #include <linux/cred.h>
+#include <linux/cpumask.h>
 #include <net/addrconf.h>

 #include <asm/page.h> /* for PAGE_SIZE */
@@ -1250,6 +1251,7 @@ int kptr_restrict __read_mostly;
  * (default assumed to be phys_addr_t, passed by reference)
  * - 'd[234]' For a dentry name (optionally 2-4 last components)
  * - 'D[234]' Same as 'd' but for a struct file
+ * - 'c' For a cpumask list
  *
  * Note: The difference between 'S' and 'F' is that on ia64 and ppc64
  * function pointers are really function descriptors, which contain a
@@ -1389,6 +1391,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
 		return dentry_name(buf, end,
 				((const struct file *)ptr)->f_path.dentry,
 				spec, fmt);
+	case 'c':
+		return buf + cpulist_scnprintf(buf, end - buf, ptr);
 	}
 	spec.flags |= SMALL;
 	if (spec.field_width == -1) {
@@ -1635,6 +1639,7 @@ int format_decode(const char *fmt, struct printf_spec *spec)
  * case.
  * %*ph[CDN] a variable-length hex string with a separator (supports up to 64
  * bytes of the input)
+ * %pc print a cpumask as comma-separated list
  * %n is ignored
  *
  * ** Please update Documentation/printk-formats.txt when making changes **
