Re: [PATCH] HWPOISON: add a pr_err message when forcibly send a sigbus

From: Will Deacon
Date: Wed Aug 30 2023 - 18:18:50 EST

Next message: Waiman Long: "Re: [PATCH-cgroup v7 0/6] cgroup/cpuset: Support remote partitions"
Previous message: Justin Stitt: "Re: [PATCH 2/2] ocfs2: Replace strlcpy with strscpy"
In reply to: Shuai Xue: "Re: [PATCH] HWPOISON: add a pr_err message when forcibly send a sigbus"
Next in thread: Shuai Xue: "Re: [PATCH] HWPOISON: add a pr_err message when forcibly send a sigbus"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Aug 28, 2023 at 09:41:55AM +0800, Shuai Xue wrote:
> On 2023/8/22 09:15, Shuai Xue wrote:
> > On 2023/8/21 18:50, Will Deacon wrote:
> >>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> >>> index 3fe516b32577..38e2186882bd 100644
> >>> --- a/arch/arm64/mm/fault.c
> >>> +++ b/arch/arm64/mm/fault.c
> >>> @@ -679,6 +679,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> >>>  	} else if (fault & (VM_FAULT_HWPOISON_LARGE | VM_FAULT_HWPOISON)) {
> >>>  		unsigned int lsb;
> >>>
> >>> +		pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> >>> +		       current->comm, current->pid, far);
> >>>  		lsb = PAGE_SHIFT;
> >>>  		if (fault & VM_FAULT_HWPOISON_LARGE)
> >>>  			lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
> >>
> >> Hmm, I'm not convinced by this. We have 'show_unhandled_signals' already,
> >> and there's plenty of code in memory-failure.c for handling poisoned pages
> >> reported by e.g. GHES. I don't think dumping extra messages in dmesg from
> >> the arch code really adds anything.
> >
> > I see that show_unhandled_signals() will dump the stack, but it relies on
> > /proc/sys/debug/exception-trace being set.
> >
> > Memory failure is the top issue in our production cloud, and for other
> > hyperscalers as well. We have received complaints from our operations
> > engineers and end users that processes are being inexplicably killed :(.
> > Could you please consider adding a message?

I don't have any objection to logging this stuff somehow, I'm just not
convinced that the console is the best place for that information in 2023.
Is there really nothing better?

Will
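
For illustration only (this is not something proposed in the thread): the
SIGBUS that the quoted hunk is about carries enough information for the
receiving process to report the event itself, without relying on the console.
The sketch below assumes a Linux/glibc target where BUS_MCEERR_AR,
BUS_MCEERR_AO and si_addr_lsb are available; the helper names
(mce_range_bytes, mce_kind, format_mce_report) are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdio.h>

/*
 * si_addr_lsb is the least significant valid bit of the reported address,
 * i.e. log2 of the affected range (PAGE_SHIFT for a base page, the hugepage
 * shift for VM_FAULT_HWPOISON_LARGE).
 */
static size_t mce_range_bytes(int addr_lsb)
{
	assert(addr_lsb >= 0 && (size_t)addr_lsb < sizeof(size_t) * 8);
	return (size_t)1 << addr_lsb;
}

/* Label for the hwpoison SIGBUS codes, or NULL for any other si_code. */
static const char *mce_kind(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return "action required";
	case BUS_MCEERR_AO:
		return "action optional";
	default:
		return NULL;
	}
}

/*
 * Format a one-line report into buf. Returns 1 if si describes a hwpoison
 * SIGBUS and buf was filled, 0 otherwise (buf is left untouched).
 *
 * Note: snprintf() is not async-signal-safe, so a real handler would stash
 * the siginfo fields and call this from normal context (or use sigwaitinfo()
 * or signalfd in a dedicated thread).
 */
static int format_mce_report(const siginfo_t *si, char *buf, size_t len)
{
	const char *kind = mce_kind(si->si_code);

	if (si->si_signo != SIGBUS || !kind)
		return 0;

	snprintf(buf, len, "hwpoison SIGBUS (%s) at %p, %zu bytes affected",
		 kind, si->si_addr, mce_range_bytes(si->si_addr_lsb));
	return 1;
}
```

A process would install a SIGBUS handler with sigaction() and SA_SIGINFO to
obtain the siginfo_t. Whether this is preferable to a kernel-side message is
exactly the open question above; it only helps for processes that opt in.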
