Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

From: Dave Chinner
Date: Mon Jun 12 2023 - 01:16:47 EST


On Sun, Jun 11, 2023 at 08:14:25PM -0700, Linus Torvalds wrote:
> On Sun, Jun 11, 2023 at 7:22 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > I guess the regression fix needs a regression fix....
>
> Yup.
>
> From the description of the problem, it sounds like this happens on
> real hardware, no vhost anywhere?
>
> Or maybe Darrick (who doesn't see the issue) is running on raw
> hardware, and you and Zorro are running in a virtual environment?

I'm testing inside VMs and seeing it; I can't speak for anyone else.

....

> So *maybe* this attached patch might fix it? I haven't thought very
> deeply about this, but vhost workers most definitely shouldn't call
> do_coredump(), since they are then not counted.
>
> (And again, I think we should just check that PF_IO_WORKER bit, not
> use this more complex test, but that's a separate and bigger change).
>
> Linus

> kernel/signal.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 2547fa73bde5..a1e11ee8537c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2847,6 +2847,10 @@ bool get_signal(struct ksignal *ksig)
>  		 */
>  		current->flags |= PF_SIGNALED;
>  
> +		/* vhost workers don't participate in core dumps */
> +		if ((current->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> +			goto out;
> +
>  		if (sig_kernel_coredump(signr)) {
>  			if (print_fatal_signals)
>  				print_fatal_signal(ksig->info.si_signo);
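
To make the flag test in the patch above easier to read, here is a small user-space sketch (not kernel code) that tabulates it. The flag values are assumed from include/linux/sched.h of this era, and the function names are made up for illustration. The labelling of which task type carries which bits (io_uring workers both, vhost workers only PF_USER_WORKER) is likewise an assumption rather than something stated in this thread.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Assumed values, believed to match include/linux/sched.h around v6.4.
 * Only the fact that they are two distinct bits matters here.
 */
#define PF_IO_WORKER   0x00000010u
#define PF_USER_WORKER 0x00004000u

/* The condition exactly as written in the patch: true means "goto out". */
bool patch_takes_goto_out(unsigned int flags)
{
	return (flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER;
}

/* Hypothetical helper: print one row of the truth table. */
void print_row(const char *what, unsigned int flags)
{
	printf("%-32s -> %s\n", what,
	       patch_takes_goto_out(flags) ? "goto out" : "falls through");
}

/* Hypothetical helper: tabulate the cases of interest. */
void print_table(void)
{
	print_row("neither bit (ordinary task)", 0);
	print_row("both bits (assumed io_uring)", PF_IO_WORKER | PF_USER_WORKER);
	print_row("PF_USER_WORKER only (assumed vhost)", PF_USER_WORKER);
}
```

Calling print_table() from a main() prints one row per case. As written, the condition is true for a task with neither bit set, so under these assumptions an ordinary process would take the "goto out" path as well, which may be consistent with the behaviour reported below.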


That would appear to make things worse. mkfs.xfs hung in Z state on
exit and never returned to the shell. Also, multiple processes are
livelocked like this:

Sending NMI from CPU 0 to CPUs 1-3:
NMI backtrace for cpu 2
CPU: 2 PID: 3409 Comm: pmlogger_farm Not tainted 6.4.0-rc5-dgc+ #1822
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:uprobe_deny_signal+0x5/0x90
Code: 48 c7 c1 c4 64 62 82 48 c7 c7 d1 64 62 82 e8 b2 39 ec ff e9 70 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 <55> 31 4
RSP: 0018:ffffc900023abdf0 EFLAGS: 00000202
RAX: 0000000000000004 RBX: ffff888103b127c0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000296 RDI: ffffc900023abe70
RBP: ffffc900023abe60 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: ffff88813bd2ccf0 R12: ffff888103b127c0
R13: ffffc900023abe70 R14: ffff888110413700 R15: ffff888103d26e80
FS: 00007f35497a4740(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00007ffd4ca0ce80 CR3: 000000010f7d1000 CR4: 00000000000006e0
Call Trace:
<NMI>
? show_regs+0x61/0x70
? nmi_cpu_backtrace+0x88/0xf0
? nmi_cpu_backtrace_handler+0x11/0x20
? nmi_handle+0x57/0x150
? default_do_nmi+0x49/0x240
? exc_nmi+0xf4/0x110
? end_repeat_nmi+0x16/0x31
? uprobe_deny_signal+0x5/0x90
? uprobe_deny_signal+0x5/0x90
? uprobe_deny_signal+0x5/0x90
</NMI>
<TASK>
? get_signal+0x94/0x9b0
? signal_setup_done+0x66/0x190
arch_do_signal_or_restart+0x2f/0x260
exit_to_user_mode_prepare+0x181/0x1c0
syscall_exit_to_user_mode+0x16/0x40
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0023:0xffff888103b127c0
Code: Unable to access opcode bytes at 0xffff888103b12796.
RSP: 002b:00007ffd4ca0d0ac EFLAGS: 00000202 ORIG_RAX: 000000000000003d
RAX: 0000000000000009 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00007ffd4d20bb9c RDI: 00000000ffffffff
RBP: 00007ffd4d20bb9c R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffd4d20bba0 R14: 00005604571fc380 R15: 0000000000000001
</TASK>
NMI backtrace for cpu 3
CPU: 3 PID: 3526 Comm: pmlogger_check Not tainted 6.4.0-rc5-dgc+ #1822
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:fixup_exception+0x72/0x260
Code: 14 0f 87 03 02 00 00 ff 24 d5 98 67 22 82 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 41 81 cd 00 00 00 40 4d 63 ed 4d 89 6c 24 50 <31> c0 9
RSP: 0018:ffffc9000275bb58 EFLAGS: 00000083
RAX: 000000000000000f RBX: ffffffff827d0a4c RCX: ffffffff810c5f95
RDX: 000000000000000f RSI: ffffffff827d0a4c RDI: ffffc9000275bb28
RBP: ffffc9000275bb80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000275bc78
R13: 000000000000000e R14: 000000008f5ded3f R15: 0000000000000000
FS: 00007f56a36de740(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 000000008f5ded3f CR3: 000000010dcde000 CR4: 00000000000006e0
Call Trace:
<NMI>
? show_regs+0x61/0x70
? nmi_cpu_backtrace+0x88/0xf0
? nmi_cpu_backtrace_handler+0x11/0x20
? nmi_handle+0x57/0x150
? default_do_nmi+0x49/0x240
? exc_nmi+0xf4/0x110
? end_repeat_nmi+0x16/0x31
? copy_fpstate_to_sigframe+0x1c5/0x3a0
? fixup_exception+0x72/0x260
? fixup_exception+0x72/0x260
? fixup_exception+0x72/0x260
</NMI>
<TASK>
kernelmode_fixup_or_oops+0x49/0x120
__bad_area_nosemaphore+0x15a/0x230
? __bad_area+0x57/0x80
bad_area_nosemaphore+0x16/0x20
exc_page_fault+0x323/0x880
asm_exc_page_fault+0x27/0x30
RIP: 0010:copy_fpstate_to_sigframe+0x1c5/0x3a0
Code: 45 89 bc 24 40 25 00 00 f0 41 80 64 24 01 bf e9 f5 fe ff ff be 3c 00 00 00 48 c7 c7 77 9c 5f 82 e8 00 2a 23 00 31 c0 0f 1f 00 <49> 0f 1
RSP: 0018:ffffc9000275bd28 EFLAGS: 00010246
RAX: 000000000000000e RBX: 000000008f5de7ec RCX: ffffc9000275bda8
RDX: 000000008f5ded40 RSI: 000000000000003c RDI: ffffffff825f9c77
RBP: ffffc9000275bd98 R08: ffffc9000275be30 R09: 0000000000000001
R10: 0000000000000000 R11: ffffc90000138ff8 R12: ffff8881106527c0
R13: 000000008f5deb40 R14: ffff888110654d40 R15: ffff88810a653f40
? copy_fpstate_to_sigframe+0x1c0/0x3a0
? __might_sleep+0x42/0x70
get_sigframe+0xcd/0x2b0
ia32_setup_frame+0x61/0x230
arch_do_signal_or_restart+0x1d1/0x260
exit_to_user_mode_prepare+0x181/0x1c0
irqentry_exit_to_user_mode+0x9/0x30
irqentry_exit+0x33/0x40
exc_page_fault+0x1b6/0x880
asm_exc_page_fault+0x27/0x30
RIP: 0023:0x106527c0
Code: Unable to access opcode bytes at 0x10652796.
RSP: 002b:000000008f5ded6c EFLAGS: 00010202
RAX: 000000000000000b RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00007ffd8f5df2ec RDI: 00000000ffffffff
RBP: 00007ffd8f5df2ec R08: 0000000000000000 R09: 00005558962eb526
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffd8f5df2f0 R14: 00005558962b5e60 R15: 0000000000000001
</TASK>


Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx