Re: [RFC] Linux memory error handling

From: Maciej W. Rozycki
Date: Thu Jun 16 2005 - 14:43:44 EST


On Wed, 15 Jun 2005, Russ Anderson wrote:

> > This is highly undesirable if the same interrupt is used for MBEs. A
> > page that causes an excessive number of SBEs should rather be removed from
> > the available pool instead.
>
> As a practical point I think you are right that if there are enough
> SBEs to cause a performance hit, migrating the data to a different
> physical page would be a prudent thing to do. But that functionality
> hasn't been implemented yet.

There's another point actually -- if a memory location (here meaning a
single entity covered by ECC; usually a 32-bit word or a 64-bit
doubleword) is causing SBEs excessively, then a bit there probably
somewhat less reliable due to wear, damage, etc. But the normal memory
error statistics apply to other bits at that location as usually, so now
the probability of an MBE is greater. So you'd better keep your data
away.

> That may not always be the right setting for all customers.
> One possible way to deal with that would be to have different
> threshold settings for logging and page migration. That would
> provide flexibility.

Certainly.

> > Logging should probably take recent events
> > into account anyway and take care of not overloading the system, e.g. by
> > keeping only statistical data instead of detailed information about each
> > event under load.
>
> That's what the SBE thresholding does. It avoids overloading the
> system by switching from interrupt mode to periodic polling
> mode, where detailed information can get dropped.

But as I mentioned you have to be careful not to switch the MBE interrupt
off as a side effect. Actually the overhead for handling the interrupt
shouldn't be that high, unless the error triggers in the handler itself.

> > Note we have some infrastructure for that in the MIPS port -- we kill the
> > triggering process, but we don't mark the problematic memory page as
> > unusable (which is an area for improvement).
>
> Mips has some nice features when it comes to error recovery.

Starting with synchronous reporting when possible. :-)

Maciej
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/