Re: Hot plug vs. reliability

From: Russ Anderson
Date: Thu May 27 2004 - 11:09:51 EST

Next message: Maciej W. Rozycki: "Re: [patch] io-apic cleanup #3, BK-curr"
Previous message: Arthur Perry: "Re: GART error 11 (fwd)"
In reply to: Matthias Fouquet-Lapar: "Re: Hot plug vs. reliability"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Zoltan Menyhart wrote:
>
> We cannot remove safely failing memory / CPUs. In most of the cases
> it is too late.

This is a key point. To get the most value out of hot plug (as
a reliability feature) the system must be able to detect and
"ride through" component failures. Conversly, if the system
crashes on the first component failure, the ability to hot remove
the broken component has little value.

For example, memory hot-plug has the most value if the system can
"ride through" a memory uncorrectable, isolate the bad memory
(ie not re-use the page with the bad DIMM cells), shoot the application
that hit the uncorrectable (or better yet, have some checkpoint/restart
mechanism to avoid killing the application), migrate data off
the physical DIMM (etc) to get the system to the point that the
bad DIMM can be physically replaced, and re-integrate the new memory.

My point is that a key part of the whole hot plug story is the
ability to detect and ride thought the initial errors that would
prompt someone to want to replace the component. And without that
part the significant effort to do the rest of the pieces has significantly
less value.

> We (in the OS) can see some corrected CPU, memory, I/O
> and platform errors. Yet the OS has not got and should not have the
> knowledge when a component is "enough bad". I think it is the firmware
> that has all the information about the details of the HW events.
> Do you know of some firmware services which can say something like:
> "hey, remove the component X otherwise your MTBF will drop by 95 %..." ?

The difficulty with predictive analysis is determining the exact
indicator of a potential failure. Many times the first indication
is a fatal error that crashes the system (which is why error recovery
to "ride through" failures is so important). Other errors, such
as memory singlebits, may (or may not) increase the probability of
failure, but does is increase the probability enough to warrent
a service action? (Service actions have costs, too.)

A technical difficulty with predictive analysis is that each component
has a different failure characteristics and the failure charicteristics
can change with spacific technologies. For example, smaller die
technologies can increase the soft failure rates. And by the
time the long term failure characteristics are fully understood
the technology is obsolete. :-(

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@xxxxxxx
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Maciej W. Rozycki: "Re: [patch] io-apic cleanup #3, BK-curr"
Previous message: Arthur Perry: "Re: GART error 11 (fwd)"
In reply to: Matthias Fouquet-Lapar: "Re: Hot plug vs. reliability"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]