Re: [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage

From: Borislav Petkov
Date: Tue Jun 20 2023 - 05:23:54 EST


On Thu, Jun 15, 2023 at 10:33:57PM +0200, Thomas Gleixner wrote:
> TLDR: It's a mess.

LOL. You don't need the rest of the commit message - this says it all.
:-P

> When kexec() is executed on a system with "offline" CPUs, which are parked

I'd say simply "with offlined CPUs"

> in mwait_play_dead() it can end up in a triple fault during the bootup of
> the kexec kernel or cause hard to diagnose data corruption.
>
> The reason is that kexec() eventually overwrites the previous kernels text,

kernel's

> page tables, data and stack, If it writes to the cache line which is

... and stack. If it ...

> monitored by an previously offlined CPU, MWAIT resumes execution and ends

... by a previously...

> up executing the wrong text, dereferencing overwritten page tables or
> corrupting the kexec kernels data.

Lovely.

> Cure this by bringing the offline CPUs out of MWAIT into HLT.

... offlined CPUs... :

> Write to the monitored cache line of each offline CPU, which makes MWAIT

ditto.

> resume execution. The written control word tells the offline CPUs to issue

ditto and so on.

> HLT, which does not have the MWAIT problem.
>
> That does not help, if a stray NMI, MCE or SMI hits the offline CPUs as
> those make it come out of HLT.
>
> A follow up change will put them into INIT, which protects at least against
> NMI and SMI.

...

> @@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
> (highest_subcstate - 1);
> }
>
> + /* Set up state for the kexec() hack below */
> + md->status = CPUDEAD_MWAIT_WAIT;
> + md->control = CPUDEAD_MWAIT_WAIT;
> +
> wbinvd();
>
> while (1) {
> @@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)

JFYI: that last hunk has some conflicts applying to latest tip/master.
Might need merge resolving...

> mb();
> __mwait(eax, 0);
>
> + if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> + /*
> + * Kexec is about to happen. Don't go back into mwait() as
> + * the kexec kernel might overwrite text and data including
> + * page tables and stack. So mwait() would resume when the
> + * monitor cache line is written to and then the CPU goes
> + * south due to overwritten text, page tables and stack.
> + *
> + * Note: This does _NOT_ protect against a stray MCE, NMI,
> + * SMI. They will resume execution at the instruction
> + * following the HLT instruction and run into the problem
> + * which this is trying to prevent.
> + */
> + WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
> + while(1)
> + native_halt();
> + }
> +
> cond_wakeup_cpu0();
> }
> }
>
> +/*
> + * Kick all "offline" CPUs out of mwait on kexec(). See comment in
> + * mwait_play_dead().
> + */
> +void smp_kick_mwait_play_dead(void)
> +{
> + u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;

Do you even need this newstate thing?

> + struct mwait_cpu_dead *md;
> + unsigned int cpu, i;
> +
> + for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
> + md = per_cpu_ptr(&mwait_cpu_dead, cpu);
> +
> + /* Does it sit in mwait_play_dead() ? */
> + if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
> + continue;
> +
> + /* Wait maximal 5ms */

/* Wait 5ms maximum */

> + for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
> + /* Bring it out of mwait */
> + WRITE_ONCE(md->control, newstate);
> + udelay(5);
> + }
> +
> + if (READ_ONCE(md->status) != newstate)
> + pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);

Shouldn't this be a pr_err_once thing so that it doesn't flood the
console unnecessarily?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette