Re: noisy edac

From: Gunther Mayer
Date: Mon Jan 30 2006 - 19:06:21 EST


Dave Peterson wrote:

On Monday 30 January 2006 15:44, Gunther Mayer wrote:


For each individual type of error that is specific to a particular
low-level chipset driver (e752x, amd76x, etc.) there could be an entry
in the appropriate part of the sysfs hierarchy under the given chipset
driver. This entry could have several settings that the user may choose


from such as { ignore, syslog, panic }. For the implementation, there


could be a generic piece of code in the core EDAC module that a chipset
driver calls into. The generic code would do the dirty work of creating
the sysfs entries (and destroying them when the chipset module is
unloading). How does this sound?


Over-Engineered.



Do you have an alternate suggestion?


Just printk() the exact driver specific low-level error, even if non-fatal.

Single non-fatal errors just show your system recovers correctly.

Multiple (e.g. noisy) non-fatal are either an indication of a serious problem
(e.g. after how many corrected ECC errors on the same address in which
time interval will you replace your dimm? How many S-ATA CRC-errors
will indicate marginal bad cabling? )
or it shows the problem needs to be root analyzed. But don't disable the
messages as this will only hide the real problem.

Concerning Non-Fatal PCI Express errors, the error cause registers need
to be printed in case of error, too (see Intel Chipset Specifications)

-
Gunther


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/