Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

From: Borislav Petkov
Date: Fri Apr 02 2021 - 13:07:39 EST


On Fri, Apr 02, 2021 at 06:00:42PM +0200, William Roche wrote:
> Corrected Errors are not the best indicators for a failing DIMM

In the OS, errors reported through different mechanisms is all we have.

> For the moment we will have the CE MCE handled my the MCE_HANDLED_CEC
> aware notifiers only when a page is off-lined, like it used to be.
>
> Can we start with that small fix ?

Sure but do two variables pls - an "err" one which catches the
function's retval and a "ret" one which ce_add_elem() itself returns so
that there's no confusion like it was before:

---
diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index ddecf25b5dd4..b926c679cdaf 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -312,8 +312,8 @@ static bool sanity_check(struct ce_array *ca)
static int cec_add_elem(u64 pfn)
{
struct ce_array *ca = &ce_arr;
+ int count, err, ret = 0;
unsigned int to = 0;
- int count, ret = 0;

/*
* We can be called very early on the identify_cpu() path where we are
@@ -330,8 +330,8 @@ static int cec_add_elem(u64 pfn)
if (ca->n == MAX_ELEMS)
WARN_ON(!del_lru_elem_unlocked(ca));

- ret = find_elem(ca, pfn, &to);
- if (ret < 0) {
+ err = find_elem(ca, pfn, &to);
+ if (err < 0) {
/*
* Shift range [to-end] to make room for one more element.
*/

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette