Re: noisy edac

From: Dave Peterson
Date: Mon Jan 30 2006 - 19:51:36 EST


On Monday 30 January 2006 16:02, Gunther Mayer wrote:
> Just printk() the exact driver specific low-level error, even if non-fatal.
>
> Single non-fatal errors just show your system recovers correctly.
>
> Multiple (e.g. noisy) non-fatal are either an indication of a serious
> problem (e.g. after how many corrected ECC errors on the same address
> in which time interval will you replace your dimm? How many S-ATA
> CRC-errors will indicate marginal bad cabling? )
> or it shows the problem needs to be root analyzed. But don't disable the
> messages as this will only hide the real problem.
>
> Concerning Non-Fatal PCI Express errors, the error cause registers need
> to be printed in case of error, too (see Intel Chipset Specifications)

I agree that in general, the kernel should not be silent when errors are
detected. However, for a particular type of error, it may be that the
user is aware of the error (because (s)he has seen the messages), the user
has determined the root cause, and it turns out that the error is benign.
Therefore the user may want to suppress further messages of this type to
avoid cluttering the logs. If you don't provide that option to the user,
then it can be viewed as hardcoding a certain aspect of sysadmin policy
into the kernel.

Depending on the particular type of error, it may be appropriate to just
offer the user two options: either printk() or be silent. For other types
of errors, it may make sense to give the user more than two options (for
instance ignore, printk(), or panic()). I think developers of chipset
drivers can make this decision individually for each type of error.
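
As a rough illustration of the per-error-type choice described above, here
is a minimal userspace C sketch. It is not the EDAC interface: every name in
it (err_policy, err_type, set_policy, handle_error) is hypothetical, and which
error type gets two options versus three is an assumption made only for the
example. The driver author declares which policies each error type offers,
and the user may pick only among those.

```c
#include <assert.h>
#include <stdio.h>

/* All names here are hypothetical; this is not the EDAC interface. */
enum err_policy { ERR_IGNORE, ERR_LOG, ERR_PANIC };
enum err_type { ERR_PCI_PARITY, ERR_CORRECTED_ECC, ERR_TYPE_COUNT };

#define ALLOW(p) (1u << (p))

/* Policies the driver author offers for each error type (assumed split:
 * one type with two options, one type with three). */
static const unsigned allowed[ERR_TYPE_COUNT] = {
    [ERR_PCI_PARITY]    = ALLOW(ERR_IGNORE) | ALLOW(ERR_LOG),
    [ERR_CORRECTED_ECC] = ALLOW(ERR_IGNORE) | ALLOW(ERR_LOG) | ALLOW(ERR_PANIC),
};

/* Active setting per error type; the default is to report, not stay silent. */
static enum err_policy active[ERR_TYPE_COUNT] = {
    [ERR_PCI_PARITY]    = ERR_LOG,
    [ERR_CORRECTED_ECC] = ERR_LOG,
};

/* Change the policy for one error type. Returns 0 on success, or -1 if the
 * driver does not offer that policy for this type. */
int set_policy(enum err_type type, enum err_policy p)
{
    assert(type < ERR_TYPE_COUNT);
    if (!(allowed[type] & ALLOW(p)))
        return -1;
    active[type] = p;
    return 0;
}

/* Apply the configured policy to a detected error; return the action taken. */
enum err_policy handle_error(enum err_type type, const char *detail)
{
    assert(type < ERR_TYPE_COUNT);
    switch (active[type]) {
    case ERR_IGNORE:
        break;
    case ERR_LOG:
        fprintf(stderr, "error type %d: %s\n", (int)type, detail);
        break;
    case ERR_PANIC:
        /* A kernel driver would call panic() here; the sketch only reports. */
        fprintf(stderr, "PANIC: error type %d: %s\n", (int)type, detail);
        break;
    }
    return active[type];
}
```

In a real driver the setting would presumably be exposed to the sysadmin
through some runtime knob rather than a function call; the sketch leaves that
part out.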