RE: [PATCH 8/8] ACPI / trace: Add trace interface for eMCA driver

From: Luck, Tony
Date: Wed Oct 16 2013 - 16:47:16 EST


> Also, I suspect that, if an error happens to affect more than one DIMM
> (e. g. part of the location is not available for a given error),
> that the DIMM label will also not be properly shown.

There are a couple of cases here:

1) There are a number of DIMMs behind some flaky h/w that introduces errors
which then get blamed on each of those DIMMs.

All we can do here is statistical correlation ... each error is reported independently,
and it is up to some entity to notice the higher level topology connection. There is enough
information in the UEFI error record to do that (assuming that BIOS filled out the
necessary fields).
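
A rough sketch of that kind of correlation, in Python. The record layout here
is a hypothetical simplification (just node/card/module), and treating "card"
as the shared parent is an assumption; which fields really describe the common
h/w depends on the platform and on what BIOS filled in.

```python
from collections import defaultdict
from typing import NamedTuple


class MemError(NamedTuple):
    # Hypothetical, simplified view of one reported error. A real UEFI
    # record carries more fields, and BIOS may leave some of them unset.
    node: int
    card: int
    module: int


def suspect_shared_hw(errors, min_dimms=2):
    """Map each (node, card) parent to its DIMMs when several of them report errors."""
    dimms_by_parent = defaultdict(set)
    for e in errors:
        # Each error arrives independently; group by the assumed common parent.
        dimms_by_parent[(e.node, e.card)].add(e.module)
    # Keep only parents where errors are spread over min_dimms or more DIMMs.
    return {p: m for p, m in dimms_by_parent.items() if len(m) >= min_dimms}


errors = [
    MemError(node=0, card=1, module=0),
    MemError(node=0, card=1, module=1),
    MemError(node=0, card=1, module=0),
    MemError(node=1, card=0, module=2),
]
suspects = suspect_shared_hw(errors)
print(suspects)
```

A parent showing up here is only a hint that the h/w in front of those DIMMs
deserves a look; it does not prove the DIMMs themselves are fine.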

2) There is a single reported error that spans more than one DIMM.

This can happen with a UC error in a pair of lock-step DIMMs. Since the error is UC
we know that two (or more) bits are bad. But we have no way to tell whether the
bad bits came from the same DIMM, or one bit from each (because we don't know
which bits are bad - if we knew that, we could fix them :-). The eMCA case should
log two subsections in this case - one for each of the lockstep DIMMs involved. A user
seeing this should probably just replace both DIMMs to be safe. If they wanted to
diagnose further they should swap DIMMs around so this pair is no longer lockstepped
and see if they start seeing correctable errors from each of the split pair - or if the UC
errors move with one or the other of the DIMMs.
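
One way to write down the reading of that swap experiment, as a small Python
sketch. The function name, the DIMM labels and the result strings are all made
up for illustration, and anything that does not match the two outcomes
described above is simply reported as inconclusive.

```python
def interpret_split(pair, ce_from, uc_from):
    """Interpret errors seen after a lock-step pair has been split up.

    pair    -- labels of the two DIMMs that used to be lockstepped
    ce_from -- DIMMs now reporting correctable errors
    uc_from -- DIMMs now reporting uncorrectable errors
    """
    pair = set(pair)
    uc = pair & set(uc_from)
    ce = pair & set(ce_from)
    if len(uc) == 1:
        # The UC errors moved with exactly one DIMM of the old pair.
        return "multi-bit fault follows " + next(iter(uc))
    if not uc and ce == pair:
        # Each DIMM on its own now shows only correctable errors.
        return "one bad bit in each DIMM"
    return "inconclusive"


print(interpret_split(("DIMM_A", "DIMM_B"), ce_from={"DIMM_A", "DIMM_B"}, uc_from=set()))
print(interpret_split(("DIMM_A", "DIMM_B"), ce_from=set(), uc_from={"DIMM_B"}))
```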

-Tony