Re: [PATCH RFC] x86/cpu: fix intermittent lockup on poweroff

From: Dave Hansen
Date: Wed Apr 26 2023 - 14:16:06 EST


On 4/26/23 10:51, Tom Lendacky wrote:
>>> +    /*
>>> +     * native_stop_other_cpus() will write to @stop_cpus_count after
>>> +     * observing that it went down to zero, which will invalidate the
>>> +     * cacheline on this CPU.
>>> +     */
>>> +    atomic_dec(&stop_cpus_count);
>
> This is probably going to pull in a cache line and cause the problem the
> native_wbinvd() is trying to avoid.

Is one _more_ cacheline really the problem?

Or is having _any_ cacheline pulled in a problem? What about the text
page containing the WBINVD? How about all the page table pages that are
needed to resolve %RIP to a physical address?

What about the mds_idle_clear_cpu_buffers() code that snuck into
native_halt()?

> ffffffff810ede4c:  0f 09                 wbinvd
> ffffffff810ede4e:  8b 05 e4 3b a7 02     mov    0x2a73be4(%rip),%eax        # ffffffff83b61a38 <mds_idle_clear>
> ffffffff810ede54:  85 c0                 test   %eax,%eax
> ffffffff810ede56:  7e 07                 jle    ffffffff810ede5f <stop_this_cpu+0x9f>
> ffffffff810ede58:  0f 00 2d b1 75 13 01  verw   0x11375b1(%rip)        # ffffffff82225410 <ds.6688>
> ffffffff810ede5f:  f4                    hlt
> ffffffff810ede60:  eb ec                 jmp    ffffffff810ede4e <stop_this_cpu+0x8e>
> ffffffff810ede62:  e8 59 40 1a 00        callq  ffffffff81291ec0 <trace_hardirqs_off>
> ffffffff810ede67:  eb 85                 jmp    ffffffff810eddee <stop_this_cpu+0x2e>
> ffffffff810ede69:  0f 1f 80 00 00 00 00  nopl   0x0(%rax)
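
To make the point concrete, here is a rough user-space tally of what the loop in the listing above touches after the WBINVD. This is only a sketch: struct loop_touches and model_halt_loop() are made-up names for illustration, not kernel code, and it counts only explicit memory operands and executed instructions. It ignores instruction-fetch cachelines and page walks, which would only add to the total.

```c
#include <stddef.h>

/* Hypothetical tally, not a kernel structure. */
struct loop_touches {
	size_t data_loads;	/* instructions with an explicit memory operand */
	size_t insns;		/* instructions fetched and executed */
};

/*
 * Models the loop from stop_this_cpu+0x8e through the jmp at ...de60:
 *   mov mds_idle_clear(%rip),%eax ; test ; jle ; [verw ds(%rip)] ; hlt ; jmp
 * 'wakeups' is how many times hlt returns; the final hlt never does.
 */
static struct loop_touches model_halt_loop(int mds_idle_clear, size_t wakeups)
{
	struct loop_touches t = { 0, 0 };
	size_t passes = wakeups + 1;
	size_t i;

	for (i = 0; i < passes; i++) {
		t.data_loads++;			/* mov mds_idle_clear(%rip),%eax */
		t.insns += 3;			/* mov, test, jle */
		if (mds_idle_clear > 0) {	/* jle skips verw when <= 0 */
			t.data_loads++;		/* verw ds(%rip) */
			t.insns++;
		}
		t.insns++;			/* hlt */
		if (i < wakeups)
			t.insns++;		/* jmp back after a wakeup */
	}
	return t;
}
```

Even model_halt_loop(0, 0), i.e. mitigation off and no wakeups, reports one data load and four instructions executed after the WBINVD.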