Re: [PATCH v4 0/7] x86: BSP or CPU0 online/offline

From: Ingo Molnar
Date: Wed Dec 07 2011 - 17:24:06 EST



* Luck, Tony <tony.luck@xxxxxxxxx> wrote:

> > More importantly, you generally *cannot* realistically
> > continue with a bad CPU anyway - the system will crash or
> > will show signs of corruption and you *want* a full
> > powerdown and a clean reboot.
>
> See the "Enhanced cache error reporting" section in the Intel
> Software Developers manual (section 15.4 in volume 3B of the
> latest edition). Intel provides what is probably a very early
> notification in many cases that a processor's cache is
> experiencing problems. At the time of the notification the
> system is still functioning correctly. The SDM suggests that
> when the "yellow" status is signaled you should schedule
> service "within a few weeks".
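
[For readers following along: a minimal sketch of how the "yellow"
status mentioned above could be decoded from a machine-check bank
status value. It assumes the SDM layout in which IA32_MCG_CAP bit 11
(MCG_TES_P) advertises threshold-based error status, and bits 54:53
of IA32_MCi_STATUS carry it for corrected errors. The macro and
function names are illustrative, not the kernel's, and reading the
MSRs themselves is left out.]

```c
#include <stdint.h>

/* Bit positions assumed from the SDM; all names are hypothetical. */
#define MCG_CAP_TES_P        (1ULL << 11) /* threshold status supported */
#define MCI_STATUS_VAL       (1ULL << 63) /* bank status is valid */
#define MCI_STATUS_UC        (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_TES_SHIFT 53
#define MCI_STATUS_TES_MASK  0x3ULL

enum tes_state {
	TES_NONE     = 0, /* no tracking */
	TES_GREEN    = 1, /* corrected errors below threshold */
	TES_YELLOW   = 2, /* corrected errors above threshold */
	TES_RESERVED = 3,
};

/*
 * Decode the threshold-based error status from raw register values.
 * Returns TES_NONE when the feature is not advertised, the bank is
 * not valid, or the error was uncorrected (the field only applies
 * to corrected errors).
 */
static enum tes_state decode_tes(uint64_t mcg_cap, uint64_t mci_status)
{
	if (!(mcg_cap & MCG_CAP_TES_P))
		return TES_NONE;
	if (!(mci_status & MCI_STATUS_VAL))
		return TES_NONE;
	if (mci_status & MCI_STATUS_UC)
		return TES_NONE;

	return (enum tes_state)((mci_status >> MCI_STATUS_TES_SHIFT) &
				MCI_STATUS_TES_MASK);
}
```

[A caller would pass the values read from IA32_MCG_CAP and the bank's
IA32_MCi_STATUS, and treat TES_YELLOW as the "schedule service" hint.]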

The question is, how reliably does this predict real CPU
trouble, statistically? The on-die cache might have the highest
transistor count, but it's not under nearly the same thermal
stress as the functional units.

If 90% of all hard CPU failures can be predicted that way then
it's probably useful. If it's only 20%, then not so much.

Also, it's still all theoretical until there are systems out
there where the CPU socket is physically hotpluggable. If there
are such plans in the works then sure, theory becomes reality and
then it's all useful - and then we can do these patches (and
more).

Thanks,

Ingo