Re: [RFC 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful

From: Petr Mladek
Date: Tue Jul 09 2019 - 12:05:36 EST


On Mon 2019-06-17 23:08:54, Thomas Gleixner wrote:
> On Wed, 5 Jun 2019, Petr Mladek wrote:
>
> > Hi,
> >
> > we were analyzing logs with several softlockup reports in flush_tlb_kernel_range().
> > They were confusing. Especially it was not clear whether it was deadlock,
> > livelock, or separate softlockups.
> >
> > It went out that even a simple busy loop:
> >
> > while (true)
> > cpu_relax();
> >
> > is able to produce several softlockups reports:
> >
> > [ 168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
> > [ 196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
> > [ 236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]
> >
> >
> > I tried to understand the tricky watchdog code and produced two patches
> > that would be helpful to debug the original real bug:
> >
> > 1st patch prevents restart of the watchdog from unrelated locations.
> >
> > 2nd patch helps to distinguish several possible situations by
> > regular reports.
> >
> > 3rd patch can be used for testing the problem.
> >
> >
> > The watchdog code might deserve even more clean up. Anyway, I would
> > like to hear other's opinion first.
>
> Anything which improves debugability is welcome. Unfortunately you missed
> to add an example of the output after these patches are applied.

After the 1st patch only one message is displayed. It is most likely
the original author expectation:

[ 480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]


But I think that more useful is to inform regularly that the problem
still exists. So after the 2nd patch we get:

[ 480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
[ 508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
[ 548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
[ 576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]


Note that the above lines always appear on the console because
they are printed with KERN_EMERG loglevel.

The log buffer contains more information:

[ 480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
[ 480.374624] Modules linked in: livepatch_sample(EK)
[ 480.375945] irq event stamp: 64036
[ 480.376998] hardirqs last enabled at (64035): [<ffffffffb7001ab0>] trace_hardirqs_on_thunk+0x1a/0x1c
[ 480.379363] hardirqs last disabled at (64036): [<ffffffffb7001acc>] trace_hardirqs_off_thunk+0x1a/0x1c
[ 480.381708] softirqs last enabled at (64034): [<ffffffffb7e0034e>] __do_softirq+0x34e/0x3fd
[ 480.383916] softirqs last disabled at (64027): [<ffffffffb714c430>] irq_exit+0xc0/0xd0
[ 480.385956] CPU: 2 PID: 4943 Comm: cat Kdump: loaded Tainted: G E K 5.2.0-rc2-default+ #4894
[ 480.387898] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[ 480.390030] RIP: 0010:version_proc_show+0x3b/0x80
[ 480.391428] Code: 48 89 fb 48 c7 c7 b0 54 36 b8 e8 83 8a dd ff c6 05 a1 ac 40 02 01 0f b6 05 9a ac 40 02 84 c0 74 0d f3 90 0f b6 05 8d ac 40 02 <84> c0 75 f3 65 48 8b 04 25 00 7f 01 00 48 8b 80 c8 0a 00 00 48 89
[ 480.396442] RSP: 0018:ffffa78c407ffdf8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 480.398977] RAX: 0000000000000001 RBX: ffff973e77976b60 RCX: 0000000000000000
[ 480.401441] RDX: 0000000000000000 RSI: ffff973e7bb18cd8 RDI: ffff973e7bb18cd8
[ 480.403879] RBP: ffffa78c407ffe00 R08: 0000000000000000 R09: 0000000000000000
[ 480.406331] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 480.408779] R13: ffffa78c407ffee8 R14: ffff973e72483600 R15: ffff973e77976b60
[ 480.411212] FS: 00007f49c93a4700(0000) GS:ffff973e7bb00000(0000) knlGS:0000000000000000
[ 480.413892] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 480.416004] CR2: 00007f49c920d008 CR3: 00000001378e2001 CR4: 00000000001606e0
[ 480.418456] Call Trace:
[ 480.419778] seq_read+0xe2/0x3e0
[ 480.421297] proc_reg_read+0x3e/0x60
[ 480.423122] __vfs_read+0x1b/0x40
[ 480.425014] vfs_read+0x8e/0x130
[ 480.426533] ksys_read+0xaa/0xe0
[ 480.428039] ? lockdep_hardirqs_on+0xf6/0x190
[ 480.429855] __x64_sys_read+0x1a/0x20
[ 480.431453] do_syscall_64+0x55/0x1c0
[ 480.433016] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 480.434919] RIP: 0033:0x7f49c8ed4270
[ 480.436466] Code: 0b 31 c0 48 83 c4 08 e9 be fe ff ff 48 8d 3d 7f f7 08 00 e8 32 cb 01 00 66 90 83 3d f9 54 2c 00 00 75 10 b8 00 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8c 01 00 48 89 04 24
[ 480.442231] RSP: 002b:00007ffd49795a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 480.444760] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f49c8ed4270
[ 480.447247] RDX: 0000000000020000 RSI: 00007f49c920e000 RDI: 0000000000000003
[ 480.449696] RBP: 00007f49c920e000 R08: 00000000ffffffff R09: 0000000000000000
[ 480.452168] R10: 000000007c9d4d41 R11: 0000000000000246 R12: 00007f49c920e000
[ 480.454225] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[ 508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
[ 508.374581] Modules linked in: livepatch_sample(EK)
[ 508.375866] irq event stamp: 135678
[ 508.376930] hardirqs last enabled at (135677): [<ffffffffb7001ab0>] trace_hardirqs_on_thunk+0x1a/0x1c
[ 508.379066] hardirqs last disabled at (135678): [<ffffffffb7001acc>] trace_hardirqs_off_thunk+0x1a/0x1c
[ 508.382676] softirqs last enabled at (135676): [<ffffffffb7e0034e>] __do_softirq+0x34e/0x3fd
[ 508.384705] softirqs last disabled at (135663): [<ffffffffb714c430>] irq_exit+0xc0/0xd0
[ 508.386640] CPU: 2 PID: 4943 Comm: cat Kdump: loaded Tainted: G ELK 5.2.0-rc2-default+ #4894
[ 508.388819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[ 508.391198] RIP: 0010:version_proc_show+0x3b/0x80
[ 508.392555] Code: 48 89 fb 48 c7 c7 b0 54 36 b8 e8 83 8a dd ff c6 05 a1 ac 40 02 01 0f b6 05 9a ac 40 02 84 c0 74 0d f3 90 0f b6 05 8d ac 40 02 <84> c0 75 f3 65 48 8b 04 25 00 7f 01 00 48 8b 80 c8 0a 00 00 48 89
[ 508.397427] RSP: 0018:ffffa78c407ffdf8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 508.399223] RAX: 0000000000000001 RBX: ffff973e77976b60 RCX: 0000000000000000
[ 508.401122] RDX: 0000000000000000 RSI: ffff973e7bb18cd8 RDI: ffff973e7bb18cd8
[ 508.403060] RBP: ffffa78c407ffe00 R08: 0000000000000000 R09: 0000000000000000
[ 508.405002] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 508.406898] R13: ffffa78c407ffee8 R14: ffff973e72483600 R15: ffff973e77976b60
[ 508.408662] FS: 00007f49c93a4700(0000) GS:ffff973e7bb00000(0000) knlGS:0000000000000000
[ 508.410580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 508.412099] CR2: 00007f49c920d008 CR3: 00000001378e2001 CR4: 00000000001606e0
[ 508.413861] Call Trace:
[ 508.414834] seq_read+0xe2/0x3e0
[ 508.415918] proc_reg_read+0x3e/0x60
[ 508.417094] __vfs_read+0x1b/0x40
[ 508.418216] vfs_read+0x8e/0x130
[ 508.419297] ksys_read+0xaa/0xe0
[ 508.420390] ? lockdep_hardirqs_on+0xf6/0x190
[ 508.421667] __x64_sys_read+0x1a/0x20
[ 508.422798] do_syscall_64+0x55/0x1c0
[ 508.423931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 508.425336] RIP: 0033:0x7f49c8ed4270
[ 508.426478] Code: 0b 31 c0 48 83 c4 08 e9 be fe ff ff 48 8d 3d 7f f7 08 00 e8 32 cb 01 00 66 90 83 3d f9 54 2c 00 00 75 10 b8 00 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8c 01 00 48 89 04 24
[ 508.430666] RSP: 002b:00007ffd49795a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 508.432579] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f49c8ed4270
[ 508.434360] RDX: 0000000000020000 RSI: 00007f49c920e000 RDI: 0000000000000003
[ 508.436265] RBP: 00007f49c920e000 R08: 00000000ffffffff R09: 0000000000000000
[ 508.438264] R10: 000000007c9d4d41 R11: 0000000000000246 R12: 00007f49c920e000
[ 508.440233] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff


I believe that we could afford it every 26 seconds. They are
useful to distinguish hardlock from live-lock. They are limited
because the system either recovers or it has to get rebooted
anyway.

Best Regards,
Petr