Re: [PATCH] x86/mce: Don't unregister CPU hotplug notifier in error path

From: Boris Ostrovsky
Date: Fri Jun 20 2014 - 11:40:14 EST


On 06/20/2014 11:23 AM, Borislav Petkov wrote:
> On Fri, Jun 20, 2014 at 10:28:13AM -0400, Boris Ostrovsky wrote:
>> Commit 9c15a24b038f4d8da93a2bc2554731f8953a7c17 (x86/mce: Improve
>> mcheck_init_device() error handling) unregisters (or never registers)
>> MCE's hotplug notifier if an error is encountered.
> Well, mcheck_init_device() did encounter errors before that commit too,
> can you please go into detail on how exactly you're triggering this?
> Which error are you talking about exactly?

You can simulate this on baremetal by having, for example, misc_register() fail (just add 'err = -EIO' after the call). Or you can return an error right upon entry to mcheck_init_device() (I haven't tested that though).

Then, once the system has booted, do a couple of rounds of:

  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 1 > /sys/devices/system/cpu/cpu1/online

Then sit still for about 10 minutes. I don't think any activity is necessary.

You are dead now. If you are lucky you may see messages about soft lockups or RCU stalls but often nothing.

> Lemme guess: some xen special handling which baremetal doesn't need.

Only in the sense that on Xen misc_register() often fails. But any failure on baremetal will result in the same behavior.


>> Since unplugging a CPU would normally result in the notifier deleting
>> MCE timer we are now left with the timer running if a CPU is removed on
>> a system where mcheck_init_device() had failed.
>>
>> If we later hotplug this CPU back we add this timer again (in
>> mcheck_cpu_init()). Eventually the two timers start interfering with each
>> other, causing soft lockups or system hangs.
>>
>> We should leave the notifier always on and, in fact, set it up early
>> during the boot.
> We do leave it always on - we only unregister it if we've encountered an
> error.

Right. And I think we shouldn't because we leave undeleted timers.
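
To make the failure mode concrete, here is a rough userspace model of the sequence described above. It is not kernel code and not the real MCE implementation: timer_queued[], notifier_registered and the model_*() helpers are hypothetical names made up for illustration, and the "timer" is just a counter of how many times it has been queued on a CPU.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Toy state: how many times the per-CPU MCE timer is currently queued. */
static int timer_queued[NR_CPUS];
/* Hypothetical flag: is the CPU hotplug notifier still registered? */
static bool notifier_registered;

/* Stand-in for CPU bring-up, where the timer gets added. */
static void model_cpu_online(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	timer_queued[cpu]++;
}

/* Stand-in for CPU removal: only the notifier deletes the timer. */
static void model_cpu_offline(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (notifier_registered && timer_queued[cpu] > 0)
		timer_queued[cpu]--;
}

/*
 * Model boot followed by one offline/online cycle of @cpu.
 * Returns how many times the timer ends up queued on that CPU.
 */
static int model_hotplug_cycle(bool init_failed, bool unregister_on_error,
			       int cpu)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		timer_queued[i] = 0;

	notifier_registered = true;
	model_cpu_online(cpu);		/* boot-time init arms the timer */

	/* Device init fails later and the error path drops the notifier. */
	if (init_failed && unregister_on_error)
		notifier_registered = false;

	model_cpu_offline(cpu);		/* echo 0 > .../cpuN/online */
	model_cpu_online(cpu);		/* echo 1 > .../cpuN/online */

	return timer_queued[cpu];
}
```

In this model, model_hotplug_cycle(true, true, 1) leaves the timer queued twice on cpu1, while keeping the notifier registered (unregister_on_error == false) leaves it queued once.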

-boris
