Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core
From: Borislav Petkov
Date: Wed Sep 09 2020 - 08:21:07 EST
On Tue, Sep 01, 2020 at 04:20:54PM +0000, Shiju Jose wrote:
> CPU CEC reuses only the infrastructure of the CEC; the logic used in
> the CEC for CE count storage, CE count calculation and page isolation
> is specific to memory pages and does not seem reusable for CPU CEs.
Oh, because it saves the reported error's PFN and you want to save

	[CPU num | error count]

?
Well, you can easily change that by extending the existing CEC to have a
different storage format for CPU errors, i.e., use a different ce_array
which gets passed to the functions anyway.
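
To make the storage-format idea concrete, here is a rough userspace sketch
of what a per-CPU element could look like. It is loosely modelled on the
way the memory CEC packs a PFN together with decay and count bits into a
single u64. Everything in it is an assumption for illustration only: the
names (cpu_ce_array, cpu_ce_add, cpu_ce_decay), the field widths and the
simplified decay pass are hypothetical and are not taken from cec.c or
from the patch under review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element layout: [ CPU num | decay | count ] in one u64. */
#define CPU_DECAY_BITS	2
#define CPU_COUNT_BITS	10
#define CPU_DECAY_MASK	((1ULL << CPU_DECAY_BITS) - 1)
#define CPU_COUNT_MASK	((1ULL << CPU_COUNT_BITS) - 1)
#define CPU_SHIFT	(CPU_DECAY_BITS + CPU_COUNT_BITS)
#define CPU_MAX_ELEMS	64

/* Hypothetical counterpart of a ce_array, holding CPU elements. */
struct cpu_ce_array {
	uint64_t elems[CPU_MAX_ELEMS];
	size_t n;		/* number of elements in use */
	uint64_t threshold;	/* count at which the caller should act */
};

static inline uint64_t elem_cpu(uint64_t e)
{
	return e >> CPU_SHIFT;
}

static inline uint64_t elem_count(uint64_t e)
{
	return e & CPU_COUNT_MASK;
}

/*
 * Record one correctable error for @cpu.
 * Returns 1 if the count has reached the threshold, 0 if not,
 * and -1 if the array is full and the CPU is not tracked yet.
 */
static int cpu_ce_add(struct cpu_ce_array *ca, unsigned int cpu)
{
	size_t i;

	assert(ca->threshold <= CPU_COUNT_MASK);

	for (i = 0; i < ca->n; i++)
		if (elem_cpu(ca->elems[i]) == cpu)
			break;

	if (i == ca->n) {
		if (ca->n == CPU_MAX_ELEMS)
			return -1;
		ca->elems[ca->n++] = (uint64_t)cpu << CPU_SHIFT;
	}

	/* A fresh hit resets the decay field to its maximum. */
	ca->elems[i] |= CPU_DECAY_MASK << CPU_COUNT_BITS;

	/* Saturate so the count never carries into the decay bits. */
	if (elem_count(ca->elems[i]) < CPU_COUNT_MASK)
		ca->elems[i]++;

	return elem_count(ca->elems[i]) >= ca->threshold;
}

/* Simplified periodic pass: age every element, drop the expired ones. */
static void cpu_ce_decay(struct cpu_ce_array *ca)
{
	size_t i = 0;

	while (i < ca->n) {
		uint64_t decay = (ca->elems[i] >> CPU_COUNT_BITS) & CPU_DECAY_MASK;

		if (decay <= 1) {
			/* Expired: overwrite with the last element. */
			ca->elems[i] = ca->elems[--ca->n];
			continue;
		}
		ca->elems[i] -= 1ULL << CPU_COUNT_BITS;
		i++;
	}
}
```

A caller would invoke cpu_ce_add() for each reported CE and act only when
it returns 1. The threshold and the interval between cpu_ce_decay() calls
are then per-array settings, so they can differ from the ones used for
memory errors without duplicating the counting logic.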
> Also the values set for the parameters such as threshold, time period
> for the memory errors and CPU errors would be different.
And your implementation with sliding windows is so totally different
that it warrants the duplication of the code? I don't think so.
You can use the current CEC to do exactly what you wanna do, with the
decaying and so on.
Because all you wanna do is count the errors a CPU triggered.
However, a CPU can trigger a *lot* of different types of errors.
You're putting them all in the same basket by doing:

	else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM))
		/* add to CEC */

and only for correctable.
What type of errors get reported in CPER_SEC_PROC_ARM?
If they're all lumped together and if some functional unit generates a
lot of errors, instead of disabling that unit only, you'll go and remove
the whole CPU?
Doesn't make a whole lot of sense to me.
How about you define what exactly you're trying to solve, maybe give an
example of a real issue someone is encountering and you're trying to
address? Because there was never a necessity so far to disable CPUs on
x86 due to correctable errors. Why is that needed on ARM?
> Thus extending cec.c to support CPU CEs would mean adding CPU CEC
> specific code for storing the error count, isolation etc., which I
> thought would make the code less tidy and less readable unless we find
> more reusable logic.
Depends on how you design it.
But with what I'm seeing so far, I'm still sceptical this is needed at
all.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette