Re: [PATCH RFC] x86/cpu: fix intermittent lockup on poweroff

From: H. Peter Anvin
Date: Tue Apr 25 2023 - 20:11:39 EST


On April 25, 2023 3:29:49 PM PDT, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>On 4/25/23 14:05, Thomas Gleixner wrote:
>> The only consequence of looking at bit 0 of some random other leaf is
>> that all CPUs which run stop_this_cpu() issue WBINVD in parallel, which
>> is slow but should not be a fatal issue.
>>
>> Tony observed this is a 50% chance to hang, which means this is a timing
>> issue.
>
>I _think_ the system in question is a dual-socket Westmere. I don't see
>any obvious errata that we could pin this on:
>
>> https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf
>
>Andi Kleen had an interesting theory. WBINVD is a pretty expensive
>operation. It's possible that it has some degenerative behavior when
>it's called on a *bunch* of CPUs all at once (which this path can do).
>If the instruction takes too long, it could trigger one of the CPU's
>internal lockup detectors and trigger a machine check. At that point,
>all hell breaks loose.
>
>I don't know the cache coherency protocol well enough to say for sure,
>but I wonder if there's a storm of cache coherency traffic as all those
>lines get written back. One of the CPUs gets starved from making enough
>forward progress and trips a CPU-internal watchdog.
>
>Andi also says that it _should_ log something in the machine check banks
>when this happens so there should be at least some kind of breadcrumb.
>
>Either way, I'm hoping this hand waving satiates tglx's morbid curiosity
>about hardware that came out from before I even worked at Intel. ;)

"Pretty expensive" doesn't really cover it. It is by far the longest time an x86 CPU can block out all outside events.