Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

From: Luck, Tony
Date: Fri Apr 19 2019 - 14:26:44 EST


On Fri, Apr 19, 2019 at 02:29:11AM +0200, Borislav Petkov wrote:
> On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote:
> > On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > > Which reminds me, Tony, I think all those debugging files "pfn"
> > > and "array" and the one you add now, should all be under a
> > > CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> > > Mind adding that too pls?
> >
> > Patch below, on top of previous patch. Note that I didn't move "enable"
> > into the RAS_CEC_DEBUG code. I think it has some value even on
> > production systems.
>
> And that value is? Usecase?

Suppose that an entire device on a DIMM fails. Systems with the
right type of DIMM (X4) and a memory controller that implements
https://en.wikipedia.org/wiki/Chipkill (Intel calls this "SDDC")
can continue running ... but there will be a lot of corrected
errors from a vast range of different pages.

After fifteen or so errors Linux will trigger storm mode and
the user will see:

mce: CMCI storm detected: switching to poll mode

on the console. As we poll we'll find errors and hand them
to CEC. But because the errors come from far more than 512
distinct pages CEC will never manage to get a count above 1
before it drops the entry to make space for a new log.
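
To make that eviction argument concrete, here is a rough Python
model. It is not the kernel's CEC code, just a sketch under stated
assumptions: a 512-entry table that drops the least recently seen
page when full, and a worst-case pattern where errors cycle
round-robin across the failing pages. The function and parameter
names are made up for illustration.

```python
from collections import OrderedDict

CAPACITY = 512  # distinct pages tracked, per the figure quoted above


def max_count_seen(num_errors, num_pages, capacity=CAPACITY):
    """Return the highest per-page count ever reached in the table.

    Assumption: errors arrive round-robin over `num_pages` distinct
    pages, and a full table evicts its least recently seen entry.
    """
    table = OrderedDict()  # pfn -> corrected error count
    highest = 0
    for i in range(num_errors):
        pfn = i % num_pages  # hypothetical worst-case access pattern
        if pfn in table:
            table[pfn] += 1
            table.move_to_end(pfn)  # mark as most recently seen
        else:
            if len(table) >= capacity:
                table.popitem(last=False)  # evict least recently seen
            table[pfn] = 1
        highest = max(highest, table[pfn])
    return highest


if __name__ == "__main__":
    print(max_count_seen(100_000, 4096))  # far more pages than slots
    print(max_count_seen(100_000, 256))   # pages fit in the table
```

In this model, 4096 failing pages never push any count above 1,
while 256 pages (which fit in the table) accumulate counts normally.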

So the only indication the user gets that something is
wrong is that storm warning. The lack of a following
"storm subsided" message tells them that errors are still
happening.

This amounts to a serviceability failure ... lack of useful
diagnostics about a problem.

Now there isn't really anything better that CEC can do in
this situation. It won't help to have a bigger array. Taking
pages offline wouldn't solve the problem (though if that
did happen at least it would break the silence).

Same situation for other DRAM failure modes that affect a
wide range of pages (rank, bank, perhaps row ... though all
the errors from a single row failure might fit in the CEC array).

Allowing the user to bypass CEC (without a reboot ... cloud folks
hate to reboot their systems) would allow the sysadmin to see
what is happening (either via /dev/mcelog, or via EDAC driver).

-Tony