Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust

From: Ashok Raj
Date: Thu Jun 15 2023 - 21:58:48 EST


Hi Thomas,

On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
>
> That wbinvd() is conditional on the SME feature bit which is read directly
> from CPUID. But that readout does not check whether the CPUID leaf is
> available or not. If it's not available the CPU will return the value of
> the highest supported leaf instead. Depending on the content the "SME" bit
> might be set or not.
>
> That's incorrect but harmless. Making the CPUID readout conditional makes
> the observed hangs go away, but it does not fix the underlying problem:
>
>     CPU0                                    CPU1
>
>      stop_other_cpus()
>        send_IPIs(REBOOT);                   stop_this_cpu()
>        while (num_online_cpus() > 1);         set_online(false);
>        proceed... -> hang
>                                               wbinvd()
>
> WBINVD is an expensive operation and if multiple CPUs issue it at the same
> time the resulting delays are even larger.
>
> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
> which causes the system to hang.
>
> This issue exists independent of WBINVD, but the delays caused by WBINVD
> make it more prominent.
>
> Make this more robust by adding a cpumask which is initialized to the
> online CPU mask before sending the IPIs and CPUs clear their bit in
> stop_this_cpu() after the WBINVD completed. Check for that cpumask to
> become empty in stop_other_cpus() instead of watching num_online_cpus().
>
> The cpumask cannot plug all holes either, but it's better than a raw
> counter and allows to restrict the NMI fallback IPI to be sent only to
> the CPUs which have not reported within the timeout window.
>
> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
> Reported-by: Tony Battersby <tonyb@xxxxxxxxxxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@xxxxxxxxxxxxxxx
> ---
> V3: Use a cpumask to make the NMI case slightly safer - Ashok
> ---
> arch/x86/include/asm/cpu.h | 2 +
> arch/x86/kernel/process.c | 23 +++++++++++++-
> arch/x86/kernel/smp.c | 71 +++++++++++++++++++++++++++++++--------------
> 3 files changed, 73 insertions(+), 23 deletions(-)
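
To check my reading of the cpumask scheme described above, here is a
minimal user-space sketch of the bookkeeping. It is not the patch code:
stop_mask_t and the stop_mask_*() helpers are hypothetical names, the
mask is a plain 64-bit integer instead of a struct cpumask, and there is
no atomicity or memory ordering, which the real code obviously needs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t stop_mask_t;

/* Bit for one CPU; cpu must be below 64. */
static stop_mask_t cpu_bit(unsigned int cpu)
{
	return UINT64_C(1) << cpu;
}

/* Before sending the IPIs: every online CPU except the sender must report. */
static stop_mask_t stop_mask_init(stop_mask_t online, unsigned int self)
{
	return online & ~cpu_bit(self);
}

/* A stopping CPU clears its bit only after its cache flush has completed. */
static stop_mask_t stop_mask_report(stop_mask_t mask, unsigned int cpu)
{
	return mask & ~cpu_bit(cpu);
}

/* The sender waits for this instead of watching an online counter. */
static bool stop_mask_empty(stop_mask_t mask)
{
	return mask == 0;
}

/*
 * After the timeout: store the CPUs that have not reported in out[]
 * (at most max entries) and return how many were stored. Only these
 * CPUs would be targeted by the NMI fallback.
 */
static unsigned int stop_mask_pending(stop_mask_t mask, unsigned int *out,
				      unsigned int max)
{
	unsigned int n = 0;

	for (unsigned int cpu = 0; cpu < 64 && n < max; cpu++)
		if (mask & cpu_bit(cpu))
			out[n++] = cpu;
	return n;
}
```

With CPUs 0-3 online and CPU0 doing the shutdown, the mask starts as 0xE.
If CPUs 1 and 3 report before the timeout, only CPU2 is left as an NMI
target.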

I tested them and they seem to work fine on my system.

It would be great if Tony could check on his setup as well.

One thought on sending NMI below.

[snip]

>
>  	/* if the REBOOT_VECTOR didn't work, try with the NMI */
> -	if (num_online_cpus() > 1) {
> +	if (!cpumask_empty(&cpus_stop_mask)) {
>  		/*
>  		 * If NMI IPI is enabled, try to register the stop handler
>  		 * and send the IPI. In any case try to wait for the other
>  		 * CPUs to stop.
>  		 */
>  		if (!smp_no_nmi_ipi && !register_stop_handler()) {
> +			u32 dm;
> +
>  			/* Sync above data before sending IRQ */
>  			wmb();
>
>  			pr_emerg("Shutting down cpus with NMI\n");
>
> -			apic_send_IPI_allbutself(NMI_VECTOR);
> +			dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> +			dm |= APIC_DM_NMI;
> +
> +			for_each_cpu(cpu, &cpus_stop_mask) {
> +				u32 apicid = apic->cpu_present_to_apicid(cpu);
> +
> +				apic_icr_write(dm, apicid);
> +				apic_wait_icr_idle();

Can we simplify this by just using apic->send_IPI(cpu, NMI_VECTOR)?
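
To make the question concrete, here is a rough user-space mock of the
loop shape I have in mind. Only the send_IPI(cpu, vector) callback shape
and NMI_VECTOR are taken from the kernel. struct mock_apic,
mock_send_IPI(), the mock_ipi_* log and send_nmi_to_pending() are
made-up names for illustration, and the stop mask is modelled as a plain
64-bit integer. I have not checked whether every APIC driver's send_IPI
callback is safe to use in this path, so treat it as a sketch only.

```c
#include <stdint.h>

#define NMI_VECTOR	0x02	/* same value the kernel uses on x86 */
#define MOCK_NR_CPUS	64	/* hypothetical limit for this mock */

/* Mock of the one callback this sketch needs. */
struct mock_apic {
	void (*send_IPI)(int cpu, int vector);
};

/* Log of (cpu, vector) pairs so the result can be inspected. */
static int mock_ipi_cpu[MOCK_NR_CPUS];
static int mock_ipi_vector[MOCK_NR_CPUS];
static int mock_ipi_count;

/* Record the IPI instead of touching any hardware. */
static void mock_send_IPI(int cpu, int vector)
{
	if (mock_ipi_count >= MOCK_NR_CPUS)
		return;
	mock_ipi_cpu[mock_ipi_count] = cpu;
	mock_ipi_vector[mock_ipi_count] = vector;
	mock_ipi_count++;
}

/* Send an NMI to each CPU whose bit is still set in the stop mask. */
static void send_nmi_to_pending(const struct mock_apic *apic,
				uint64_t stop_mask)
{
	for (int cpu = 0; cpu < MOCK_NR_CPUS; cpu++)
		if (stop_mask & (UINT64_C(1) << cpu))
			apic->send_IPI(cpu, NMI_VECTOR);
}
```

The point is that the per-CPU destination mode and APIC ID lookup would
be left to the driver callback, so the caller only walks the mask.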