Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled

From: Ingo Molnar
Date: Fri Jun 12 2009 - 12:49:11 EST



* H. Peter Anvin <hpa@xxxxxxxxx> wrote:

> Ingo Molnar wrote:
> >
> > So i think hwpoison simply does not affect our ability to get
> > log messages out - but it sure allows crappier hardware to be
> > used. Am i wrong about that for some reason?
>
> Crappy hardware isn't the kind of hardware that is likely to have
> the hwpoison features, just like crappy hardware generally doesn't
> even have ECC -- or even basic parity checking (I personally think
> non-ECC memory should be considered a crime against humanity in
> this day and age.)
>
> You're making the fundamental assumption that failover and
> hardware replacement are relatively cheap and fast operations. In
> high reliability applications, of course, failover is always an
> option -- it *HAS* to be an option -- but that doesn't mean that
> hardware replacement is cheap, fast or even possible -- and now
> you've blown your failover option.
>
> These kinds of features are used when extremely high reliability
> is required, think for example a telco core router. A page error
> may have happened due to stray radiation or through power supply
> glitches (which happen even in the best of systems), but if they
> are a pattern, a box needs to be replaced. *How quickly* a box
> can be taken out of service and replaced can vary greatly, and its
> urgency depends on patterns; furthermore, in the meantime the
> device has to work the best it can.
>
> Consider, for example, a control computer on the Hubble Space
> Telescope -- the only way to replace it is by space shuttle, and
> you can safely guarantee that *that* won't happen in a heartbeat.
> On the new Herschel Space Observatory, not even the space shuttle
> can help: if the computers die, *or* if bad data gets fed to its
> control system, the spacecraft is lost. As such, it's of
> paramount importance for the computers to (a) continue to provide
> service at the level the hardware is capable of doing, (b) as
> accurately as possible continually assess and report that level of
> service, and (c) not allow a failure to pass undetected. A lot of
> failures are simple one-time events (especially in space, a
> high-rad environment), others reflect decaying hardware but can be
> isolated (e.g. a RAM cell which has developed a short circuit, or
> a CPU core which has a damaged ALU), while others yet reflect a
> general ill health of the system that cannot be recovered.
>
> What these kinds of features do is give the overall-system
> designers and the administrators more options.

Ok, these arguments are pretty convincing - thanks everyone for the
detailed explanation.

Ingo