Re: [PATCH 3/6] x86/mce: Add support for new MCA_SYND register

From: Borislav Petkov
Date: Fri Jul 08 2016 - 06:15:11 EST


On Fri, Jul 08, 2016 at 11:46:53AM +0200, Ingo Molnar wrote:
> I'm not sure I can parse that: how can a reported error have bits corrupted?

No, it is about the actual bits in memory the ECC error is generated
for. So, for example, if an ECC error reports that memory location X had
some bit flips, the syndrome value which gets reported together with
that same ECC error shows which actual bits have flipped.

Here's an example from the AMD BKDG, maybe that'll make it clearer:

http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf

Go to page 246, there it says this:

"For example, assume the ECC syndrome is 03EAh. First search row EAh
for the complete syndrome. Since it is not found, search row 03h for
the complete syndrome. It is found in column 9h, so symbol 9h has the
error. Since the error bitmask indicates value 3h (0011b), bits 0 and 1
within that symbol are corrupted. Symbol 9h maps to bits 72-79, so the
corrupted bits are 72 and 73 of the line."

So you basically search the table of x8 ECC correctable syndromes, first
in row EAh (the second syndrome byte) and, if you don't find the complete
syndrome there, you search row 03h (the first syndrome byte) for it.

It is in column 9 and that means symbol 9. There are 16 data symbols,
one for each byte in a 128-bit DRAM word, plus 3 special symbols for the
ECC bits.

The row number 3h is also the error bitmask, so bits 0 and 1 are the
ones which are corrupted.
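
If it helps, here is roughly what that lookup looks like in C. This is
only a sketch: decode_syndrome(), struct synd_hit and the flat table
layout (256 rows, one per error bitmask, ncols columns, one per symbol)
are names and assumptions I made up for illustration, and the table
contents have to come from the BKDG, they are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of a table lookup; symbol is -1 if the syndrome was not found. */
struct synd_hit {
	int symbol;       /* column index == symbol number */
	uint8_t bitmask;  /* row index == error bitmask within that symbol */
};

/*
 * table: hypothetical 256-row array with ncols entries per row, holding
 * the complete 16-bit syndromes (NOT the real BKDG table).
 */
static struct synd_hit decode_syndrome(const uint16_t *table, size_t ncols,
				       uint16_t synd)
{
	/* Search the row named by the low byte first, then the high byte. */
	const uint8_t rows[2] = { (uint8_t)(synd & 0xff), (uint8_t)(synd >> 8) };
	struct synd_hit hit = { -1, 0 };

	for (int i = 0; i < 2; i++) {
		const uint16_t *row = table + (size_t)rows[i] * ncols;

		for (size_t col = 0; col < ncols; col++) {
			if (row[col] == synd) {
				hit.symbol = (int)col;
				hit.bitmask = rows[i];
				return hit;
			}
		}
	}
	return hit;
}
```

With a table that has 03EAh stored in row 03h, column 9h, calling this
with syndrome 03EAh misses in row EAh and then hits in row 03h, giving
symbol 9 and bitmask 3h, same as the BKDG example.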

Which means, when you look at the value in DRAM at the address the error
was reported for, you need to go to symbol 9. That's 9*8 = 72, i.e.
bits 72-79, and the first 2 bits in that byte are bits 72 and 73.
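
The same arithmetic as a small C sketch, assuming 8-bit symbols where
symbol N covers bits N*8 to N*8+7 of the line, as in the example.
corrupted_bits() is a made-up helper name, not something from the
kernel:

```c
#include <stdint.h>

/*
 * Fill bits[] with the line-relative positions of the corrupted bits and
 * return how many there are (at most 8). Assumes 8-bit symbols, so
 * symbol N starts at bit N*8.
 */
static int corrupted_bits(unsigned int symbol, uint8_t bitmask,
			  unsigned int bits[8])
{
	int n = 0;

	for (unsigned int b = 0; b < 8; b++) {
		if (bitmask & (1u << b))
			bits[n++] = symbol * 8 + b;
	}
	return n;
}
```

For symbol 9 and bitmask 3h this returns 2 and fills in 72 and 73.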

So if you want to correct them, you simply flip them as the syndrome
tells you that those 2 are corrupted.
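
In C the flip is a single XOR. Again only a sketch under assumptions:
correct_symbol() is a hypothetical helper, and I'm assuming the 128-bit
word is held as 16 bytes with line[N] being symbol N and bit 0 the least
significant bit of that byte. The ECC symbols are not part of the data
bytes, so they are left alone here.

```c
#include <stdint.h>

/*
 * line: the 16 data bytes of one 128-bit DRAM word; line[N] is assumed
 * to hold symbol N.
 */
static void correct_symbol(uint8_t line[16], unsigned int symbol,
			   uint8_t bitmask)
{
	/* Only data symbols 0..15 live in line[]; skip anything else. */
	if (symbol < 16)
		line[symbol] ^= bitmask;
}
```

So correct_symbol(line, 9, 0x03) flips bits 72 and 73 of the line and
touches nothing else.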

Ok?

See how easy it is :-)))

> I'm fine with an add-on patch that adds a good explanation for all
> this to the code.

How about we point to that section in the BKDG? I think it is written
pretty understandably for a technical document and the example makes it
even more explicit.

:-)

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.