Re: NMI watchdog triggering during load_balance

From: Peter Zijlstra
Date: Sat Mar 07 2015 - 04:40:30 EST


On Fri, Mar 06, 2015 at 11:37:11AM -0700, David Ahern wrote:
> On 3/6/15 11:11 AM, Mike Galbraith wrote:
> In responding earlier today I realized that the topology is all wrong as you
> were pointing out. There should be 16 NUMA domains (4 memory controllers per
> socket and 4 sockets). There should be 8 sibling cores. I will look into why
> that is not getting set up properly and what we can do about fixing it.

So we changed the numa topology setup a while back; see commit
cb83b629bae0 ("sched/numa: Rewrite the CONFIG_NUMA sched domain
support").

> But, I do not understand how the wrong topology is causing the NMI watchdog
> to trigger. In the end there are still N domains, M groups per domain and P
> cpus per group. Doesn't the balancing walk over all of them irrespective of
> physical topology?

Not quite; so for regular load balancing only the first CPU in the
domain will iterate up.

So if you have 4 'nodes' only 4 CPUs will iterate the entire machine,
not all 1024.
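
To make that arithmetic concrete, here is a toy model in C. It is not
kernel code: the function names are made up, and it assumes the CPUs are
numbered contiguously and split evenly across the nodes. It simply counts
the CPUs that are the first CPU of their node, which in this simplified
picture are the only ones that iterate up to the machine-wide level.

```c
/*
 * Toy model only (hypothetical names, not the kernel implementation).
 * Assumes nr_cpus is a multiple of nr_nodes and that each node holds a
 * contiguous range of CPU numbers.
 */

/* Return the lowest-numbered CPU in the node that 'cpu' belongs to. */
static int toy_first_cpu_of_node(int cpu, int cpus_per_node)
{
	return (cpu / cpus_per_node) * cpus_per_node;
}

/* Count the CPUs that would balance above their own node. */
static int toy_count_top_level_balancers(int nr_cpus, int nr_nodes)
{
	int cpus_per_node = nr_cpus / nr_nodes;
	int cpu, count = 0;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu == toy_first_cpu_of_node(cpu, cpus_per_node))
			count++;
	}

	return count;
}
```

With 1024 CPUs, toy_count_top_level_balancers(1024, 4) gives 4 and
toy_count_top_level_balancers(1024, 16) gives 16; it never gives 1024
unless every CPU is its own node.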



> Call Trace:
> [000000000045dc30] double_rq_lock+0x4c/0x68
> [000000000046a23c] load_balance+0x278/0x740
> [00000000008aa178] __schedule+0x378/0x8e4
> [00000000008aab1c] schedule+0x68/0x78
> [00000000004718ac] do_exit+0x798/0x7c0
> [000000000047195c] do_group_exit+0x88/0xc0
> [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
> [000000000042cbc0] do_signal+0x70/0x5e4
> [000000000042d14c] do_notify_resume+0x18/0x50
> [00000000004049c4] __handle_signal+0xc/0x2c
>
>
> For example the stream program has 1024 threads (1 for each CPU). If you
> ctrl-c the program or wait for it to terminate that's when it trips. Other
> workloads that routinely trip it are make -j N, N some number (e.g., on a
> 256 cpu system 'make -j 128'), 10 seconds later oops stop that build, ctrl-c
> ... boom with the above stack trace.
>
> Code wise ... and this is still present in 3.18 and 3.20:
>
> schedule()
> - __schedule()
> + irqs disabled: raw_spin_lock_irq(&rq->lock);
>
> pick_next_task
> - idle_balance()

> For 2.6.39 it's the invocation of idle_balance which is triggering load
> balancing with IRQs disabled. That's when the NMI watchdog trips.

So for idle_balance() look at SD_BALANCE_NEWIDLE, only domains with that
set will get iterated.
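
Roughly, and only as a sketch: the struct, the flag value and the
function name below are invented for illustration and are not the
kernel's definitions. The newly idle CPU walks its domains from the
lowest level upwards and only does balancing work in the ones that have
the flag set.

```c
#include <stddef.h>

/* Hypothetical flag value for this sketch, not the kernel's. */
#define TOY_SD_BALANCE_NEWIDLE 0x1u

/* Minimal stand-in for a sched domain: a name, flags, and a parent. */
struct toy_domain {
	const char *name;
	unsigned int flags;
	struct toy_domain *parent;
};

/*
 * Walk from the given (lowest) domain up through its parents and count
 * the levels in which a newly idle CPU would try to pull work. Levels
 * without TOY_SD_BALANCE_NEWIDLE are skipped, not balanced.
 */
static int toy_newidle_levels(const struct toy_domain *sd)
{
	int levels = 0;

	for (; sd != NULL; sd = sd->parent) {
		if (sd->flags & TOY_SD_BALANCE_NEWIDLE)
			levels++;
	}

	return levels;
}
```

So with a chain SMT -> NODE -> NUMA where only the NUMA level lacks the
flag, the walk does balancing work at two levels and never scans across
the distant nodes.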

I suppose you could try something like the below on 3.18, which will
disable SD_BALANCE_NEWIDLE on all 'distant' nodes; but first check how
your fixed numa topology looks and whether you trigger that case at all.
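
Whether you trigger that case depends only on the node distance of the
level in question. As a stand-alone sketch of the check (the DEMO_ names
and flag values are made up for this example, and the threshold of 30 is
an assumption based on the generic default; an architecture may define
its own):

```c
/* Hypothetical flag values for this sketch, not the kernel's. */
#define DEMO_SD_BALANCE_NEWIDLE 0x01u
#define DEMO_SD_BALANCE_EXEC    0x02u
#define DEMO_SD_BALANCE_FORK    0x04u
#define DEMO_SD_WAKE_AFFINE     0x08u

/* Assumed threshold; stands in for RECLAIM_DISTANCE. */
#define DEMO_RECLAIM_DISTANCE   30

/*
 * Return the flags a NUMA level would end up with: levels whose node
 * distance is strictly greater than the threshold lose the four
 * balancing flags, all other levels keep their flags unchanged.
 */
static unsigned int demo_numa_level_flags(unsigned int flags, int distance)
{
	if (distance > DEMO_RECLAIM_DISTANCE) {
		flags &= ~(DEMO_SD_BALANCE_EXEC |
			   DEMO_SD_BALANCE_FORK |
			   DEMO_SD_BALANCE_NEWIDLE |
			   DEMO_SD_WAKE_AFFINE);
	}

	return flags;
}
```

If none of your node distances exceed the threshold, the change below is
a no-op for you.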

---
kernel/sched/core.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17141da77c6e..7fce683928fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6268,6 +6268,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
+				       SD_BALANCE_NEWIDLE |
 				       SD_WAKE_AFFINE);
 		}

