Re: Opteron ECC/ChipKill error

From: Borislav Petkov
Date: Tue Feb 08 2011 - 08:49:24 EST


On Tue, Feb 08, 2011 at 02:30:11PM +0100, martin f krafft wrote:
> Dear list,
>
> I just got to see the following message on my Opteron server:
>
> kernel: [810137.744689] Northbridge Error, node 1
> kernel: [810137.756250] ECC/ChipKill ECC error.
> kernel: [810137.766975] EDAC amd64 MC1: CE ERROR_ADDRESS= 0x26bdd40f0
> kernel: [810137.766982] EDAC MC1: CE page 0x26bdd4, offset 0xf0, grain 0, syndrome 0xe1e2, row 6, channel 1, label "": amd64_edac
>
> Is there any way to deduce from these data the actual
> culprit/component to replace?

It is a correctable (CE) DRAM ECC error on one of the DIMMs on your
node 1. If it is a single occurrence, I wouldn't worry yet; I'd monitor
whether the error rate for the same row (row 6) starts increasing.
Reseating the DIMMs sometimes helps, too.
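
As a rough aid for that monitoring, here is a minimal Python sketch. It
parses the "EDAC MCn: CE page ..." line in the format quoted above and
rebuilds the physical address from the page and offset fields. The
helper names (parse_edac_ce, read_ce_count) are hypothetical, the 4 KiB
page size is an assumption, and the sysfs path assumes the csrow-based
EDAC layout (mcN/csrowM/ce_count) is present on your kernel; adjust it
if your system exposes something different.

```python
import re
from pathlib import Path

# Matches the "EDAC MCn: CE page ..." line format quoted in this thread.
# Other kernel versions may word the line differently (assumption).
EDAC_CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-fA-F]+), "
    r"offset 0x(?P<offset>[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_edac_ce(line):
    """Return the fields of an EDAC CE log line, or None if it does not match."""
    m = EDAC_CE_RE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        "syndrome": int(m.group("syndrome"), 16),
        # Physical address = page frame number shifted up, plus the offset.
        "address": (page << PAGE_SHIFT) | offset,
    }


def read_ce_count(mc, row, sysfs_root="/sys/devices/system/edac/mc"):
    """Read the correctable-error counter for one csrow, or None if unavailable."""
    path = Path(sysfs_root) / f"mc{mc}" / f"csrow{row}" / "ce_count"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

Feeding it the second EDAC line from the report gives address
0x26bdd40f0, which agrees with the ERROR_ADDRESS the driver printed.
Calling read_ce_count(1, 6) periodically and comparing the values would
show whether row 6 keeps accumulating errors.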

Can you send your dmesg please?

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632