RE: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.

From: Luck, Tony
Date: Fri Dec 04 2015 - 12:23:47 EST


> Frankly, I'm not sure at all and very very wary of adding *any* code
> which runs on an offlined CPU. Because *no one* does that and it hasn't
> been tested at all. So who knows what happens.
>
> What we should be doing is execute the *minimal* amount of code possible
> and get out. No counting, no per-cpu variables. No nothing.

The minimal code requires we use:

smp_processor_id() [to get our cpu number]
cpu_is_offline() [to find out the cpu is offline]

The first of those looks more dangerous in that it accesses a per-cpu variable.
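
For illustration only, here is a userspace sketch of that early exit. It is not
the patch itself: mock_smp_processor_id(), mock_cpu_is_offline(), mock_set_cpu()
and mock_do_machine_check() are hypothetical stand-ins that model the kernel's
smp_processor_id() and cpu_is_offline() with plain variables, and the counter
is just a way to observe who joined the rendezvous.

```c
#include <stdbool.h>

#define MOCK_NR_CPUS 4

/* Hypothetical model of the online mask: cpu 2 has been offlined. */
static bool mock_cpu_online[MOCK_NR_CPUS] = { true, true, false, true };

/* Hypothetical model of the per-cpu "which cpu am I" value. */
static int mock_current_cpu;

/* Number of cpus that went on to join the (modelled) rendezvous. */
static int rendezvous_participants;

/* Select which cpu the following calls pretend to run on. */
static void mock_set_cpu(int cpu)
{
	mock_current_cpu = cpu;
}

/* Stand-in for smp_processor_id(): reads the per-cpu value. */
static int mock_smp_processor_id(void)
{
	return mock_current_cpu;
}

/* Stand-in for cpu_is_offline(): true when the cpu is not in the online mask. */
static bool mock_cpu_is_offline(int cpu)
{
	return !mock_cpu_online[cpu];
}

/*
 * Model of the handler entry: an offline cpu does the two lookups and
 * gets out; an online cpu continues into the rendezvous.
 * Returns true if this cpu participated.
 */
static bool mock_do_machine_check(void)
{
	int cpu = mock_smp_processor_id();

	if (mock_cpu_is_offline(cpu))
		return false;	/* minimal path: no counting, no other state */

	rendezvous_participants++;
	return true;
}
```

With the mask above, calling mock_do_machine_check() after mock_set_cpu(2)
returns false and leaves rendezvous_participants untouched, while any online
cpu increments it.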

I don't think we need to be totally paranoid here. We know that the offline cpus
were once online and went through normal kernel initialization code. If they hadn't,
we couldn't possibly be executing this code: their CR4.MCE bit would be zero, so their
response to a machine check would have been to reset the system.
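
The CR4.MCE argument can be written down as a tiny model. This is a sketch under
assumptions, not kernel code: model_machine_check() and enum mc_response are
hypothetical names; the only hard fact used is that CR4.MCE is bit 6 of CR4.

```c
#include <stdint.h>

/* CR4 bit 6 is the Machine-Check Enable bit. */
#define MODEL_CR4_MCE (UINT64_C(1) << 6)

/* Hypothetical outcomes of a machine check on one cpu. */
enum mc_response {
	MC_RESET,	/* CR4.MCE clear: the system resets, handler never runs */
	MC_HANDLER	/* CR4.MCE set: the #MC handler is entered */
};

/* Given a cpu's CR4 value, say what a machine check would lead to. */
static enum mc_response model_machine_check(uint64_t cr4)
{
	return (cr4 & MODEL_CR4_MCE) ? MC_HANDLER : MC_RESET;
}
```

So a cpu that reaches the handler at all must have had the bit set, which only
happens during normal kernel initialization.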

> Because we have been considering offlining a core as one possible RAS
> action. So what happens is a user or a RAS agent offlines a core and
> yet, that offlined core still reports MCEs. Something's terribly wrong
> with that picture, IMO.

Agreed. It would be more pleasant if we had some way to *really* offline a cpu,
including telling the rest of the system not to send it any more broadcast events
like MCE, SMI. But the h/w guys like to give the s/w guys job security by making
these corner cases that we have to work around in s/w :-)

-Tony