Re: [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode

From: Chen Gong
Date: Wed Sep 07 2011 - 12:47:57 EST


On 9/7/2011 9:25 PM, Borislav Petkov wrote:
On Wed, Sep 07, 2011 at 02:05:38AM -0400, Chen Gong wrote:

[..]

+ /* known AR MCACODs: */
+ MCESEV(
+ KEEP, "HT thread notices Action required: data load error",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
+ MCGMASK(MCG_STATUS_EIPV, 0)
+ ),
+ MCESEV(
+ AR, "Action required: data load error",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
+ USER
+ ),

I don't think *AR* makes sense here, because the code that follows
assumes it means a *user space* condition. If a new *AR* severity for
kernel usage is created in the future, we can't tell which one may call
"memory_failure" as below. At the least, it should carry a suffix such
as AR_USER/AR_KERN:

enum severity_level {
MCE_NO_SEVERITY,
MCE_KEEP_SEVERITY,
MCE_SOME_SEVERITY,
MCE_AO_SEVERITY,
MCE_UC_SEVERITY,
MCE_AR_USER_SEVERITY,
MCE_AR_KERN_SEVERITY,
MCE_PANIC_SEVERITY,
};

Are you saying you need action required handling for when the data load
error happens in kernel space? If so, I don't see how you can replay the
data load (assuming this is a data load from DRAM). In that case, we're
fatal and need to panic. If it is a different type of data load coming
from a lower cache level, then we could be able to recover...?

[..]


Yep, what I'm talking about is a data load error in kernel space. In
fact, I'm not sure what we can do except panic :-). IIRC, Tony once said
that in some situations the kernel can recover. If that is true, we must
distinguish these two scenarios. In the *user space* case memory_failure
can be called; in the kernel case it can't.
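To make the AR_USER/AR_KERN split concrete, here is a minimal standalone sketch of how the handler could choose an action from the proposed enum. The mce_action type and mce_pick_action() are hypothetical names used only for illustration, and mapping AR_KERN to panic is an assumption based on the discussion above, not settled behaviour:

```c
/* Severity levels as proposed above; the ordering is taken from the mail. */
enum severity_level {
	MCE_NO_SEVERITY,
	MCE_KEEP_SEVERITY,
	MCE_SOME_SEVERITY,
	MCE_AO_SEVERITY,
	MCE_UC_SEVERITY,
	MCE_AR_USER_SEVERITY,
	MCE_AR_KERN_SEVERITY,
	MCE_PANIC_SEVERITY,
};

/* Hypothetical result type: what the handler should do next. */
enum mce_action {
	MCE_ACT_NONE,           /* nothing special in this sketch */
	MCE_ACT_MEMORY_FAILURE, /* safe to try memory_failure() */
	MCE_ACT_PANIC,          /* no recovery path known */
};

/*
 * Hypothetical helper: only the user-space AR case may go to
 * memory_failure(). The kernel AR case is treated as fatal here,
 * which is an assumption until a kernel recovery path exists.
 */
static enum mce_action mce_pick_action(enum severity_level worst)
{
	switch (worst) {
	case MCE_AR_USER_SEVERITY:
		return MCE_ACT_MEMORY_FAILURE;
	case MCE_AR_KERN_SEVERITY:
	case MCE_PANIC_SEVERITY:
		return MCE_ACT_PANIC;
	default:
		return MCE_ACT_NONE;
	}
}
```

With a single MCE_AR_SEVERITY the switch could not tell the first two cases apart, which is the point of the suffix.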

+ if (worst == MCE_AR_SEVERITY) {
+ unsigned long pfn = m.addr >> PAGE_SHIFT;
+
+ pr_err("Uncorrected hardware memory error in user-access at %llx",
+ m.addr);

Couldn't printing in the MCE handler cause a deadlock? Say other CPUs
are printing something when they suddenly receive the MCE broadcast from
the Monarch CPU; when the Monarch CPU then runs the code above, does a
deadlock happen? Please correct me if I'm missing something :-)

sounds like it can happen if the other CPUs have grabbed some console
semaphore/mutex (I don't know what exactly we're using there) and the
monarch tries to grab it.
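One way to sidestep the console lock entirely is to have the handler only record the address and let a safer context do the printing later. The following is a rough userspace sketch of that idea using C11 atomics, assuming a single producer (the monarch) and a single consumer; mce_log_ring, mce_log_record() and mce_log_drain_one() are made-up names, not existing kernel interfaces:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MCE_LOG_SLOTS 8	/* illustrative size, must be a power of two */

/* Hypothetical lock-free log: one writer (handler), one reader (later context). */
struct mce_log_ring {
	uint64_t addr[MCE_LOG_SLOTS];
	atomic_size_t head;	/* next slot the handler writes */
	atomic_size_t tail;	/* next slot the drainer reads */
};

/* Called from the handler: never blocks and never takes a lock. */
static bool mce_log_record(struct mce_log_ring *r, uint64_t addr)
{
	size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == MCE_LOG_SLOTS)
		return false;	/* ring full: drop the record instead of waiting */

	r->addr[head % MCE_LOG_SLOTS] = addr;
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Called later, outside the handler, where printing is safe. */
static bool mce_log_drain_one(struct mce_log_ring *r, uint64_t *addr)
{
	size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (tail == head)
		return false;	/* nothing recorded */

	*addr = r->addr[tail % MCE_LOG_SLOTS];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

The trade-off is that a full ring loses records, which seems preferable to the monarch spinning on a lock that an interrupted CPU may be holding.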

+ if (__memory_failure(pfn, MCE_VECTOR, 0) < 0) {
+ pr_err("Memory error not recovered");
+ force_sig(SIGBUS, current);
+ } else
+ pr_err("Memory error recovered");
+ }

As you mentioned in the comment, the biggest concern is that if
__memory_failure runs too long and another MCE happens in the meantime
(assuming that MCE occurs on a sibling CPU which shares the same
banks), the 2nd MCE will crash the system. Why not defer the processing
to a safer context, such as by using user_return_notifier?

The user_return_notifier won't work, as we concluded in the last
discussion round: http://marc.info/?l=linux-kernel&m=130765542330349

AFAIR, we want to have a realtime thread dealing with that recovery
so that we exit #MC context as fast as possible. The code then should
be able to deal with a follow-up #MC. Tony, whatever happened to that
approach?
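For illustration only, the "leave #MC context fast, recover elsewhere" idea could look roughly like the userspace sketch below: the handler just parks the pfn in a slot, and a separate thread picks it up and runs the recovery. All names here (mce_recovery_slot, mce_defer_recovery(), mce_run_recovery(), demo_recover()) are hypothetical, and demo_recover() is a stand-in for the real recovery routine, not a model of __memory_failure():

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MCE_PFN_NONE UINT64_MAX	/* marker for "no recovery pending" */

/* Hypothetical hand-off slot; initialise pending to MCE_PFN_NONE. */
struct mce_recovery_slot {
	_Atomic uint64_t pending;
};

/*
 * Handler side: park the pfn and return immediately. Returns false if
 * an earlier error is still pending, i.e. a follow-up #MC arrived
 * before the worker ran, and the caller has to escalate.
 */
static bool mce_defer_recovery(struct mce_recovery_slot *s, uint64_t pfn)
{
	uint64_t expected = MCE_PFN_NONE;

	return atomic_compare_exchange_strong(&s->pending, &expected, pfn);
}

typedef int (*mce_recover_fn)(uint64_t pfn);

/*
 * Worker side: returns 0 if nothing was pending, 1 if recovery
 * succeeded, -1 if it failed (where the patch would send SIGBUS).
 */
static int mce_run_recovery(struct mce_recovery_slot *s, mce_recover_fn recover)
{
	uint64_t pfn = atomic_exchange(&s->pending, MCE_PFN_NONE);

	if (pfn == MCE_PFN_NONE)
		return 0;
	return recover(pfn) < 0 ? -1 : 1;
}

/* Stand-in recovery routine: pretends odd pfns cannot be recovered. */
static int demo_recover(uint64_t pfn)
{
	return (pfn & 1) ? -1 : 0;
}
```

A single slot keeps the handler side down to one compare-and-exchange; whether one slot is enough, and how the worker gets woken, are exactly the open questions in this thread.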

Thanks.

