Re: NMI watchdog triggering during load_balance

From: David Ahern
Date: Fri Mar 06 2015 - 13:37:50 EST


On 3/6/15 11:11 AM, Mike Galbraith wrote:
> That was the question, _do_ you have any control, because that topology
> is toxic. I guess your reply means 'nope'.
>
>> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
>> threads per core and each cpu has 4 memory controllers.
>
> Thank god I've never met one of these, looks like the box from hell :)
>
>> If I disable SCHED_MC and CGROUP_SCHED (group scheduling) there is a
>> noticeable improvement -- watchdog does not trigger and I do not get the
>> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
>> for an idle system. Perhaps I should leave SCHED_MC on and disable
>> SCHED_SMT; I'll try that today.
>
> Well, if you disable SMT, your troubles _should_ shrink radically, as
> your box does. You should probably look at why you have CPU domains.
> You don't ever want to see that on a NUMA box.

In responding earlier today I realized that the topology is all wrong, as you were pointing out. There should be 16 NUMA domains (4 memory controllers per socket and 4 sockets). There should be 8 sibling cores. I will look into why that is not getting set up properly and what we can do about fixing it.
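
To make the arithmetic above concrete, here is a small sketch that derives the expected counts from the hardware figures in this thread. The struct and function names are made up for illustration, and the per-node figure assumes the cores are split evenly across the memory controllers, which I have not verified.

```c
#include <assert.h>
#include <stdio.h>

/* Hardware figures from this thread; all names here are hypothetical. */
struct box_topology {
	int sockets;            /* 4 physical cpus */
	int cores_per_socket;   /* 32 cores per socket */
	int threads_per_core;   /* 8 threads per core */
	int memctl_per_socket;  /* 4 memory controllers per socket */
};

/* One NUMA domain per memory controller. */
int expected_numa_nodes(const struct box_topology *t)
{
	return t->sockets * t->memctl_per_socket;
}

/* Total logical CPUs the scheduler sees. */
int expected_total_cpus(const struct box_topology *t)
{
	return t->sockets * t->cores_per_socket * t->threads_per_core;
}

/* Assumption: CPUs are spread evenly across the NUMA nodes. */
int expected_cpus_per_node(const struct box_topology *t)
{
	int nodes = expected_numa_nodes(t);

	assert(nodes > 0);
	return expected_total_cpus(t) / nodes;
}

void print_expected(const struct box_topology *t)
{
	printf("NUMA nodes:    %d\n", expected_numa_nodes(t));
	printf("logical CPUs:  %d\n", expected_total_cpus(t));
	printf("CPUs per node: %d\n", expected_cpus_per_node(t));
	printf("SMT siblings:  %d\n", t->threads_per_core);
}
```

Calling print_expected() with { 4, 32, 8, 4 } gives the numbers to compare against what the kernel actually builds.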

--

But, I do not understand how the wrong topology is causing the NMI watchdog to trigger. In the end there are still N domains, M groups per domain and P cpus per group. Doesn't the balancing walk over all of them irrespective of physical topology?

Here's another data point that jelled this morning while explaining the problem to someone: the NMI watchdog trips on a mass exit:

TPC: <_raw_spin_trylock_bh+0x38/0x100>
g0: 7fffffffffffffff g1: 00000000000000ff g2: 0000000000070f8c g3: fffe403b97891c98
g4: fffe803b963eda00 g5: 000000010036c000 g6: fffe803b84108000 g7: 0000000000000093
o0: 0000000000000fe0 o1: 0000000000000fe0 o2: ffffff0000000000 o3: 0000000000200200
o4: 0000000000a98080 o5: 0000000000000000 sp: fffe803b8410ada1 ret_pc: 00000000006800dc
RPC: <cpumask_next_and+0x44/0x6c>
l0: 0000000000e9b114 l1: 0000000000000001 l2: 0000000000000001 l3: 0000000000000005
l4: 0000000000002000 l5: fffe803b8410b990 l6: 0000000000000004 l7: 0000000000f267b0
i0: 0000000100b10700 i1: 00000000ffffffff i2: 0000000101324d80 i3: fffe803b8410b6c0
i4: 0000000000000038 i5: 0000000000000498 i6: fffe803b8410ae51 i7: 000000000045dc30
I7: <double_rq_lock+0x4c/0x68>
Call Trace:
[000000000045dc30] double_rq_lock+0x4c/0x68
[000000000046a23c] load_balance+0x278/0x740
[00000000008aa178] __schedule+0x378/0x8e4
[00000000008aab1c] schedule+0x68/0x78
[00000000004718ac] do_exit+0x798/0x7c0
[000000000047195c] do_group_exit+0x88/0xc0
[0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
[000000000042cbc0] do_signal+0x70/0x5e4
[000000000042d14c] do_notify_resume+0x18/0x50
[00000000004049c4] __handle_signal+0xc/0x2c


For example the stream program has 1024 threads (1 for each CPU). If you ctrl-c the program or wait for it to terminate, that's when it trips. Other workloads routinely trip it too, such as 'make -j N' for some number N (e.g., 'make -j 128' on a 256 cpu system): 10 seconds later, oops, stop that build, ctrl-c ... boom, with the above stack trace.
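
For anyone without stream handy, the mass exit described above can be approximated with a sketch like the following. It is not the stream benchmark and it does not take the signal path shown in the trace; it only starts a number of busy threads and releases them all at once so that they exit together. The function names are made up, and whether this trips the watchdog on a given box is untested.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static atomic_int stop_flag;

/* Each thread burns CPU until told to stop, then exits. */
static void *spin(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop_flag))
		;
	return NULL;
}

/*
 * Start nthreads spinners, let them run for run_ms milliseconds, then
 * release them all at the same moment. Returns the number of threads
 * that were joined, or -1 if the thread array could not be allocated.
 */
int mass_exit(int nthreads, long run_ms)
{
	struct timespec ts = { run_ms / 1000, (run_ms % 1000) * 1000000L };
	pthread_t *tids;
	int started = 0, joined = 0;

	if (nthreads <= 0)
		return 0;
	tids = malloc(sizeof(*tids) * (size_t)nthreads);
	if (!tids)
		return -1;

	atomic_store(&stop_flag, 0);
	for (int i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[started], NULL, spin, NULL) != 0)
			break;
		started++;
	}

	nanosleep(&ts, NULL);
	atomic_store(&stop_flag, 1);	/* everyone exits now */

	for (int i = 0; i < started; i++)
		if (pthread_join(tids[i], NULL) == 0)
			joined++;
	free(tids);
	return joined;
}

/* Number of CPUs currently online, for the one-thread-per-CPU case. */
long online_cpus(void)
{
	return sysconf(_SC_NPROCESSORS_ONLN);
}
```

Something like mass_exit((int)online_cpus(), 10000) gives one thread per CPU for 10 seconds, roughly matching the 'run, then stop everything' pattern above.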

Code wise ... and this is still present in 3.18 and 3.20:

schedule()
- __schedule()
  + irqs disabled: raw_spin_lock_irq(&rq->lock);

      pick_next_task
      - idle_balance()

  + irqs enabled:
      different task: context_switch(rq, prev, next)
                      --> finish_lock_switch eventually
      same task: raw_spin_unlock_irq(&rq->lock) or


For 2.6.39 it's the invocation of idle_balance which is triggering load balancing with IRQs disabled. That's when the NMI watchdog trips.
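
The ordering described above can be written down as a toy user-space model. This is not kernel code: the struct and the model_* functions are invented for illustration, and the model tracks only the IRQ state, nothing about how rq->lock is held or dropped inside the balancer.

```c
#include <stdbool.h>

/* Toy per-CPU state; only the IRQ flag is modeled. */
struct cpu_model {
	bool irqs_enabled;
	bool balance_ran;            /* idle balancing was attempted */
	bool balance_ran_irqs_off;   /* ... and IRQs were off at the time */
};

/* Stand-in for raw_spin_lock_irq(&rq->lock): IRQs go off. */
void model_lock_irq(struct cpu_model *c)
{
	c->irqs_enabled = false;
}

/* Stand-in for the unlock at the end of the switch: IRQs come back. */
void model_unlock_irq(struct cpu_model *c)
{
	c->irqs_enabled = true;
}

/* Stand-in for idle_balance(): just record the IRQ state it saw. */
void model_idle_balance(struct cpu_model *c)
{
	c->balance_ran = true;
	c->balance_ran_irqs_off = !c->irqs_enabled;
}

/*
 * Stand-in for __schedule(): lock with IRQs off, balance if the CPU is
 * about to go idle, then unlock. A long-running balance pass therefore
 * sits entirely inside the IRQs-off window.
 */
void model_schedule(struct cpu_model *c, bool has_runnable_task)
{
	model_lock_irq(c);
	if (!has_runnable_task)
		model_idle_balance(c);
	model_unlock_irq(c);
}
```

In this model a CPU that goes idle always records balance_ran_irqs_off, which is the window the watchdog is complaining about.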

I'll pound on 3.18 and see if I can reproduce something similar there.

David