RE: [PATCH v4 0/7] x86: BSP or CPU0 online/offline

From: Luck, Tony
Date: Wed Dec 07 2011 - 12:08:09 EST


> More importantly, you generally *cannot* realistically continue
> with a bad CPU anyway - the system will crash or will show signs
> of corruptions and you *want* a full powerdown and a clean
> reboot.

See the "Enhanced cache error reporting" section in the Intel
Software Developer's Manual (section 15.4 in volume 3B of the
latest edition). Intel provides what is probably a very early
notification in many cases that a processor's cache is experiencing
problems. At the time of the notification the system is still
functioning correctly. The SDM suggests that when the "yellow"
status is signaled you should schedule service "within a few weeks".
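
To make the "yellow" status concrete, here is a minimal sketch of decoding
it from a raw IA32_MCi_STATUS value. The bit positions (VAL in bit 63, UC in
bit 61, threshold-based error status in bits 54:53) are my reading of the
SDM's machine-check chapter and should be checked against the current
edition. The constant and function names are hypothetical, and the sketch
assumes the caller has already confirmed that IA32_MCG_CAP advertises
threshold-based status (MCG_TES_P, bit 11).

```python
# Bit positions are assumptions taken from the SDM's machine-check chapter;
# verify them against the current edition before relying on them.
MCI_STATUS_VAL = 1 << 63   # register holds a valid error record
MCI_STATUS_UC = 1 << 61    # error was uncorrected
TES_SHIFT = 53             # threshold-based error status, bits 54:53
TES_MASK = 0x3

TES_NAMES = {0: "no tracking", 1: "green", 2: "yellow", 3: "reserved"}


def threshold_status(mci_status):
    """Return the threshold-based status name, or None if not applicable."""
    if not mci_status & MCI_STATUS_VAL:
        return None  # nothing logged in this bank
    if mci_status & MCI_STATUS_UC:
        return None  # the field is only defined for corrected errors
    return TES_NAMES[(mci_status >> TES_SHIFT) & TES_MASK]
```

A corrected error with the field set to binary 10 would come back as
"yellow", which is the early-warning case described above.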

24x7 systems with a lot of sockets & cores, and highly paranoid
administrators, might want to take action to stop using the cores
that share the cache with problems sooner than they can schedule
downtime.
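
As a rough illustration of what "stop using the cores that share the cache"
could look like from user space, the sketch below reads the sysfs
shared_cpu_list file for one cache level and writes 0 to each sibling's
online file. The helper names are hypothetical, the cache index to use is
something the administrator would have to determine, and writing the online
files needs root. A CPU without an online file (CPU0 on kernels that cannot
offline the BSP) is skipped and reported back rather than treated as an
error.

```python
import os

CPU_SYSFS = "/sys/devices/system/cpu"


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0-3,8-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def cache_siblings(cpu, index):
    """CPUs sharing cache level `index` with `cpu`, according to sysfs."""
    path = os.path.join(CPU_SYSFS, f"cpu{cpu}", "cache",
                        f"index{index}", "shared_cpu_list")
    with open(path) as f:
        return parse_cpu_list(f.read())


def offline_cpus(cpus):
    """Write 0 to each CPU's 'online' file; return the CPUs that lack one."""
    skipped = []
    for cpu in cpus:
        path = os.path.join(CPU_SYSFS, f"cpu{cpu}", "online")
        if not os.path.exists(path):
            skipped.append(cpu)  # e.g. cpu0 where the BSP cannot go offline
            continue
        with open(path, "w") as f:
            f.write("0")
    return skipped
```

Something like offline_cpus(cache_siblings(4, 3)) would then take every CPU
sharing the index3 cache with cpu4 out of service, assuming index3 is the
cache that reported the problem.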

>- Special hardware environments that are deeply redundant and
> can warn about 'soft' failures well before hard failures
> which gives a realistic window of time for a maintenance
> hot-swap. [Such hardware actually exists, i even worked with
> an x86 one eons ago.]

So not so special any more - every Xeon since Core Duo has the
cache error reporting capability.

-Tony