Re: AMD EPYC 25 (19h): Hardware Error: Machine Check: 0 Bank 17: d42040000000011b

From: Borislav Petkov
Date: Fri Apr 14 2023 - 06:24:14 EST


On Fri, Apr 14, 2023 at 11:26:27AM +0200, Paul Menzel wrote:
> It says “no action required”,

Yes, it means you had a single bit flip in some DIMM and it got
corrected by the ECC so you don't need to do anything.
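
A minimal sketch of how the status word from the subject line breaks
down, assuming the bit layout in arch/x86/include/asm/mce.h plus AMD's
CECC/UECC bits (46 and 45). The names STATUS_BITS, decode_status and
is_corrected are made up for illustration; this is not the kernel's own
decoder.

```python
# Selected MCA_STATUS bit positions. Assumption: layout as in Linux's
# arch/x86/include/asm/mce.h, with AMD's CECC/UECC bits added.
STATUS_BITS = {
    "VAL": 63,       # register holds a valid error
    "OVER": 62,      # overflow: a previous error was lost
    "UC": 61,        # uncorrected error
    "EN": 60,        # error reporting enabled
    "MISCV": 59,     # MISC register valid
    "ADDRV": 58,     # ADDR register valid
    "PCC": 57,       # processor context corrupt
    "SYNDV": 53,     # AMD: SYND register valid
    "CECC": 46,      # AMD: corrected by ECC
    "UECC": 45,      # AMD: uncorrectable ECC error
    "DEFERRED": 44,  # AMD: deferred error
    "POISON": 43,    # AMD: poisoned data consumed
}


def decode_status(status):
    """Return the set of flag names whose bit is set in the status word."""
    return {name for name, bit in STATUS_BITS.items() if (status >> bit) & 1}


def is_corrected(status):
    """True for a valid error that is neither uncorrected nor deferred."""
    flags = decode_status(status)
    return "VAL" in flags and "UC" not in flags and "DEFERRED" not in flags


status = 0xD42040000000011B  # value from the subject line
print(sorted(decode_status(status)), "corrected:", is_corrected(status))
```

For d42040000000011b this reports VAL, OVER, EN, ADDRV, SYNDV and CECC
set with UC clear, which is consistent with a corrected ECC error. OVER
being set suggests at least one earlier error in that bank was
overwritten before it was read.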

> but out of 14 identical servers with the same workload, this is the
> only one that has shown this error three times.

Or you could enable CONFIG_RAS_CEC and not see those errors anymore.

It all depends: a DIMM could be producing correctable errors for a long
time before going bad. If ever. If you don't want to risk whatever
you're running on that machine by a DIMM *potentially* going bad, sure,
you can replace it. That's a budget call. :)

> Maybe the DIMM at bank 17 should just be replaced.

Bank 17 is the CPU MCA bank which reports the error - not a DIMM bank.
In order to pinpoint the location, you should have amd64_edac loaded so
that it decodes the error to a specific DIMM. You could try loading
that module and injecting all the errors you have to see what it says.
That should work too, as the error signature has everything needed for
decoding, AFAICT.

But Yazen can chime in here if I'm off.
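
As a complementary check (not the injection approach described above),
once amd64_edac is loaded the EDAC sysfs tree exposes per-DIMM corrected
error counters. The sketch below assumes the usual layout under
/sys/devices/system/edac/mc with dimm_ce_count and dimm_label files; the
function name is hypothetical, and the counters only cover errors
logged while the driver was loaded.

```python
from pathlib import Path


def dimms_with_corrected_errors(edac_root="/sys/devices/system/edac/mc"):
    """List (controller, dimm, label, ce_count) for DIMMs with ce_count > 0.

    Assumes the EDAC sysfs layout mc*/dimm*/dimm_ce_count; returns an
    empty list if the directory or the files are not present.
    """
    hits = []
    for ce_file in sorted(Path(edac_root).glob("mc*/dimm*/dimm_ce_count")):
        count = int(ce_file.read_text().strip())
        if count == 0:
            continue
        dimm_dir = ce_file.parent
        label_file = dimm_dir / "dimm_label"
        # The label is set by firmware/userspace and may be missing or empty.
        label = label_file.read_text().strip() if label_file.exists() else ""
        hits.append((dimm_dir.parent.name, dimm_dir.name, label, count))
    return hits
```

Calling dimms_with_corrected_errors() on the affected server and
printing the result should show which DIMM, if any, has accumulated
corrected errors since the module was loaded.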

HTH.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette