[BUG] Boot hangs at clocksource_done_booting on large configs

From: Alex Thorlton
Date: Mon Aug 31 2015 - 14:04:40 EST


Hey guys,

We've been hitting issues with machines with 4096 or more logical cores
hanging up at clocksource_done_booting, seemingly forever. The problem
occurs somewhat intermittently, though it hits more consistently as you
increase the core count. We've observed this problem on kernels
starting at 3.10, all the way up through 4.2.

We initially thought there was some issue with the stop_machine code,
but Thomas G. cleared that up for us with a thorough explanation of the
stop_machine mechanism. After ruling out issues in stop_machine, we
started investigating possible hardware related scaling issues.

I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
out any scaling issues related to multiple concurrent reads to our
RTC's MMR. We start getting hung task timeouts during
clocksource_done_booting:

8<----
calling clocksource_done_booting+0x0/0x42 @ 1
INFO: task swapper/0:1 blocked for more than 480 seconds.
Not tainted 4.2.0-rc1-nortc+ #384
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swapper/0 D ffff881efa933b48 0 1 0 0x00000000
ffff881efa933b48 ffff881ef45e0000 ffffb5de7a130000 000000fdfa933b38
ffff881efa934000 7fffffffffffffff ffff881efa933ca0 ffffb5de7a130000
0000000000000000 ffff881efa933b68 ffffffff8158eae7 ffff881efa933b68
Call Trace:
[<ffffffff8158eae7>] schedule+0x37/0x80
[<ffffffff81591707>] schedule_timeout+0x1e7/0x270
[<ffffffff812ee9b9>] ? _find_next_bit.part.0+0x19/0x70
[<ffffffff8158fdb0>] wait_for_completion+0xa0/0x100
[<ffffffff8108d490>] ? wake_up_q+0x70/0x70
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff810fa2a0>] __stop_cpus+0x50/0x70
[<ffffffff810b8f7d>] ? console_unlock+0x2dd/0x4a0
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff810fa525>] stop_cpus+0x35/0x50
[<ffffffff810d18f0>] ? tk_setup_internals.constprop.10+0x140/0x140
[<ffffffff810fa652>] __stop_machine+0xb2/0xe0
[<ffffffff810d18f0>] ? tk_setup_internals.constprop.10+0x140/0x140
[<ffffffff810d18f0>] ? tk_setup_internals.constprop.10+0x140/0x140
[<ffffffff810fa6ae>] stop_machine+0x2e/0x50
[<ffffffff810d1fad>] timekeeping_notify+0x2d/0x50
[<ffffffff810d3d4b>] __clocksource_select+0xcb/0x120
[<ffffffff81b77cfa>] ? boot_override_clock+0x4d/0x4d
[<ffffffff81b77d2c>] clocksource_done_booting+0x32/0x42
[<ffffffff810003a9>] do_one_initcall+0x169/0x1d0
[<ffffffff81b5613c>] kernel_init_freeable+0x1bb/0x248
[<ffffffff81582560>] ? rest_init+0x80/0x80
[<ffffffff8158256e>] kernel_init+0xe/0xe0
[<ffffffff81592c5f>] ret_from_fork+0x3f/0x70
[<ffffffff81582560>] ? rest_init+0x80/0x80
---->8

After this I NMId the machine to get backtraces for the other CPUs.
Most are stuck around here:

8<----
CPU: 4867 PID: 19730 Comm: migration/4867 Tainted: G D 4.2.0-rc1-nortc+ #384
Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
task: ffffad5e798cc500 ti: ffffad5e798d4000 task.ti: ffffad5e798d4000
RIP: 0010:[<ffffffff8105e879>] [<ffffffff8105e879>] uv_handle_nmi+0x19/0x790
RSP: 0000:ffffc90020156da8 EFLAGS: 00010092
RAX: 000002fe65b95384 RBX: 000000000000002d RCX: ffffc90020162a98
RDX: 000002fe65b95384 RSI: ffffc90020156ef8 RDI: 0000000000000001
RBP: ffffc90020156e18 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000020 R12: 000002fe65b95384
R13: 0000000000000000 R14: ffffffff81a38880 R15: ffffc90020156ef8
FS: 0000000000000000(0000) GS:ffffc90020151000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000da80 CR3: 0000000001a0b000 CR4: 00000000001406e0
Stack:
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 000000000000002d 000002fe65b95384 0000000000000000
Call Trace:
<NMI>
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff81007869>] nmi_handle+0x79/0x100
[<ffffffff81007873>] ? nmi_handle+0x83/0x100
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff81007c6c>] unknown_nmi_error+0x1c/0x90
[<ffffffff81007e48>] default_do_nmi+0xd8/0xf0
[<ffffffff81007ee3>] do_nmi+0x83/0xb0
[<ffffffff815948bf>] end_repeat_nmi+0x1e/0x2e
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff810486c6>] ? read_hpet+0x16/0x20
[<ffffffff810486c6>] ? read_hpet+0x16/0x20
[<ffffffff810486c6>] ? read_hpet+0x16/0x20
<<EOE>>
<IRQ>
[<ffffffff810d1b97>] ktime_get+0x37/0xb0
[<ffffffff810d819b>] clockevents_program_event+0x4b/0x130
[<ffffffff810d8737>] tick_handle_periodic+0x57/0x70
[<ffffffff8103b46c>] local_apic_timer_interrupt+0x3c/0x70
[<ffffffff815952c1>] smp_apic_timer_interrupt+0x41/0x60
[<ffffffff815936ab>] apic_timer_interrupt+0x6b/0x70
<EOI>
[<ffffffff81097b41>] ? update_blocked_averages+0x381/0x860
[<ffffffff810f9f98>] ? multi_cpu_stop+0x98/0xf0
[<ffffffff8100272a>] ? __switch_to+0x26a/0x590
[<ffffffff810f9f00>] ? cpu_stop_should_run+0x50/0x50
[<ffffffff810fa1c8>] cpu_stopper_thread+0x88/0x110
[<ffffffff8158e4e8>] ? __schedule+0x338/0x900
[<ffffffff81085575>] smpboot_thread_fn+0x135/0x190
[<ffffffff81085440>] ? sort_range+0x30/0x30
[<ffffffff810820cb>] kthread+0xdb/0x100
[<ffffffff81081ff0>] ? kthread_create_on_node+0x180/0x180
[<ffffffff81592c5f>] ret_from_fork+0x3f/0x70
[<ffffffff81081ff0>] ? kthread_create_on_node+0x180/0x180
Code: 31 c0 e8 a0 c2 52 00 e9 39 ff ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 f7 53 48 83 ec 48 <48> 8b 1d 00 f2 fa 7e 65 44 8b 2d 70 b8 fa 7e 9c 58 0f 1f 44 00
---->8

I can provide full logs if desired.

Anyways, based on my limited understanding of what's going on here, I
tried disabling interrupts a bit further out in the cpu_stopper_thread
code, to try and avoid that apic_timer_interrupt during the clocksource
change. change_clocksource already disables interrupts, but, from the
backtraces, it appears that we're taking an interrupt after the stopper
thread wakes up, but before change_clocksource actually gets called.

I used this patch to try and avoid taking a apic_timer_interrupt at any
point during the clocksource change.:

8<---
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index fd643d8..8502521 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -417,8 +417,11 @@ static void cpu_stopper_thread(unsigned int cpu)
{
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
struct cpu_stop_work *work;
+ unsigned long flags;
int ret;

+ local_irq_save(flags);
+
repeat:
work = NULL;
spin_lock_irq(&stopper->lock);
@@ -452,6 +455,8 @@ repeat:
cpu_stop_signal_done(done, true);
goto repeat;
}
+
+ local_irq_restore(flags);
}

extern void sched_set_stop_task(int cpu, struct task_struct *stop);
--->8

This was a bit of a shot in the dark, but it has worked on every kernel
and every machine we've tried it on so far.

I'm hoping to get some input from the experts in this area, first of
all, on whether the problem I'm seeing is actually what I think it is,
and, if so, if I've solved it in the correct way.

Any input is greatly appreciated!

- Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/