Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

From: Borislav Petkov
Date: Sat Apr 20 2019 - 15:05:10 EST


On Sat, Apr 20, 2019 at 11:18:46AM -0700, Cong Wang wrote:
> You didn't answer my question here, because I asked you whether
> the following change (PoC only) makes sense:

I answered it - the answer is to disable CONFIG_RAS_CEC. But let me do a
more detailed answer, maybe that'll help.

The PoC doesn't make sense.

Why?

Because if you don't return early from the notifier when the CEC has
consumed the error, you don't need the CEC at all. Ergo, you can just as
well disable it.
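
To make that concrete, here is a rough userspace model of the chain, not
the actual kernel code. All names in it (cec_handler, logger_handler,
report_error, the MODEL_* values) are made up for illustration, and the
real notifier is more involved:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return values, modelled loosely on a notifier chain. */
enum { MODEL_DONE, MODEL_STOP };

struct err_event {
	unsigned long pfn;	/* page frame the correctable error hit */
};

typedef int (*handler_fn)(const struct err_event *ev);

static bool cec_enabled = true;	/* stands in for CONFIG_RAS_CEC */
static unsigned int logged;	/* errors the userspace logger got to see */

/* First in the chain: once the CEC has the error, stop right here. */
static int cec_handler(const struct err_event *ev)
{
	(void)ev;
	return cec_enabled ? MODEL_STOP : MODEL_DONE;
}

/* Further down the chain: whatever ends up feeding a userspace agent. */
static int logger_handler(const struct err_event *ev)
{
	(void)ev;
	logged++;
	return MODEL_DONE;
}

static void report_error(const struct err_event *ev)
{
	static const handler_fn chain[] = { cec_handler, logger_handler };
	size_t i;

	assert(ev != NULL);

	for (i = 0; i < sizeof(chain) / sizeof(chain[0]); i++)
		if (chain[i](ev) == MODEL_STOP)
			break;
}
```

Take the early stop out of cec_handler() and logger_handler() sees every
single error again, which, as far as the logger is concerned, is the same
as running with cec_enabled set to false.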

Because, let me paste from a couple of mails ago what the CEC is:

"CEC is something *completely* different and its purpose is to run in
the kernel and prevent users and admins from upsetting unnecessarily
with every sporadic correctable error and just because an alpha particle
flew through their DIMMs, they all start running in headless chicken
mode, trying to RMA perfectly good hardware."

IOW, when you have the CEC enabled, you don't need to log memory errors
with a userspace agent. The CEC collects them and discards them if they
don't repeat.

If they do repeat, then it offlines the page.

Without user intervention and interference.
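
In toy form, the collect/discard/offline behavior looks something like
the sketch below. This is a userspace model under simplifying
assumptions, not the kernel implementation: the table size, the
threshold, the decay step and all the cec_model_* names are invented
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PFNS	8	/* hypothetical table size */
#define THRESHOLD	3	/* hypothetical: errors per page before offlining */

struct cec_model {
	uint64_t pfn[MAX_PFNS];
	unsigned int count[MAX_PFNS];	/* always >= 1 for a live entry */
	size_t n;			/* live entries */
	unsigned int n_offlined;	/* pages taken out of service */
};

/* Drop entry i by moving the last live entry into its slot. */
static void cec_model_del(struct cec_model *c, size_t i)
{
	c->n--;
	c->pfn[i] = c->pfn[c->n];
	c->count[i] = c->count[c->n];
}

/* Record one correctable error; returns true if the page got offlined. */
static bool cec_model_add(struct cec_model *c, uint64_t pfn)
{
	size_t i;

	assert(c->n <= MAX_PFNS);

	for (i = 0; i < c->n; i++) {
		if (c->pfn[i] != pfn)
			continue;
		if (++c->count[i] < THRESHOLD)
			return false;
		/* The error keeps repeating: offline the page. */
		c->n_offlined++;
		cec_model_del(c, i);
		return true;
	}

	/* Table full: this toy model simply drops the error. */
	if (c->n == MAX_PFNS)
		return false;

	c->pfn[c->n] = pfn;
	c->count[c->n] = 1;
	c->n++;
	return false;
}

/* Periodic decay: errors which do not repeat age out of the table. */
static void cec_model_decay(struct cec_model *c)
{
	size_t i = 0;

	while (i < c->n) {
		if (--c->count[i] > 0)
			i++;
		else
			cec_model_del(c, i);
	}
}
```

A single error followed by a decay pass leaves nothing behind, while
THRESHOLD errors on the same page get it offlined, all without anything
being reported to userspace.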

Now, if you still want to know how many errors and where they happened
and when they happened and yadda yadda, you *disable* the CEC.

I hope this makes more sense now.

> I knew disabling it could cure the problem from the beginning, please
> save your own time by not repeating things we both already knew. :)
>
> Once again, I still don't think it is the right answer, which is also why I
> keep finding different solutions.

This is where you come in and say "it is not the right answer
because..." and give your arguments why. I gave mine a couple of times
already. I never said this functionality is cast in stone the way it is
but there has to be a *good* *reason* why it needs to be changed. I.e.,
basic kernel development. People come with ideas and they *justify* those
ideas with arguments why they're better.

> I know you disagree, but you never explain why you disagree,

You're kidding, right?

https://lkml.kernel.org/r/20190419002645.GA559@xxxxxxx

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.