Re: [PATCH v2] x86,mm: print likely CPU at segfault time

From: Ingo Molnar
Date: Thu Aug 04 2022 - 16:17:13 EST



* Rik van Riel <riel@xxxxxxxxxxx> wrote:

> In a large enough fleet of computers, it is common to have a few bad CPUs.
> Those can often be identified by seeing that some commonly run kernel code,
> which runs fine everywhere else, keeps crashing on the same CPU core on one
> particular bad system.
>
> However, the failure modes in CPUs that have gone bad over the years are
> often oddly specific, and the only bad behavior seen might be segfaults
> in programs like bash, python, or various system daemons that run fine
> everywhere else.
>
> Add a printk() to show_signal_msg() to print the CPU, core, and socket
> at segfault time. This is not perfect, since the task might get rescheduled
> on another CPU between when the fault hit, and when the message is printed,
> but in practice this has been good enough to help us identify several bad
> CPU cores.
>
> segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> CC: Dave Jones <dsj@xxxxxx>
> ---
> arch/x86/mm/fault.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index fad8faa29d04..a9b93a7816f9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>  		unsigned long address, struct task_struct *tsk)
>  {
>  	const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
> +	/* This is a racy snapshot, but it's better than nothing. */
> +	int cpu = READ_ONCE(raw_smp_processor_id());
>
>  	if (!unhandled_signal(tsk, SIGSEGV))
>  		return;
> @@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>
>  	print_vma_addr(KERN_CONT " in ", regs->ip);
>
> +	/*
> +	 * Dump the likely CPU where the fatal segfault happened.
> +	 * This can help identify faulty hardware.
> +	 */
> +	printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
> +	       topology_core_id(cpu), topology_physical_package_id(cpu));

LGTM, applying this to tip:x86/mm unless someone objects.

I've added the tidbit to the changelog that this only gets printed if
show_unhandled_signals (/proc/sys/debug/exception-trace) is enabled. So
your patch in essence expands upon an existing debug printout, where
packing in as much useful information as possible is OK.
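
For anyone who wants to aggregate these lines across a fleet, here is a
minimal userspace sketch (not part of the patch) that pulls the CPU, core,
and socket numbers out of a message in the format quoted above. The
function name parse_segv_cpu() and struct segv_cpu are made up for
illustration, and the sketch assumes the suffix stays exactly
" on CPU %d (core %d, socket %d)":

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the topology fields in the log suffix. */
struct segv_cpu {
	int cpu;
	int core;
	int socket;
};

/*
 * Parse the " on CPU %d (core %d, socket %d)" suffix from one log line.
 * Returns 0 on success, -1 if the suffix is absent or malformed.
 */
int parse_segv_cpu(const char *line, struct segv_cpu *out)
{
	static const char marker[] = " on CPU ";
	const char *p = NULL;
	const char *next = strstr(line, marker);

	/* Use the last occurrence, in case the binary name contains the marker. */
	while (next) {
		p = next;
		next = strstr(next + 1, marker);
	}
	if (!p)
		return -1;

	/* All three numbers must be present for the line to count. */
	if (sscanf(p, " on CPU %d (core %d, socket %d)",
		   &out->cpu, &out->core, &out->socket) != 3)
		return -1;

	return 0;
}
```

Feeding it each matching dmesg line and counting hits per (socket, core)
pair should make a repeat offender stand out, keeping in mind that the
reported CPU is only the likely one.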

Thanks,

Ingo