Re: [PATCH] arm64: smp: smp_send_stop() and crash_smp_send_stop() should try non-NMI first

From: Doug Anderson
Date: Mon Jan 08 2024 - 19:55:00 EST


Hi,

On Thu, Dec 7, 2023 at 5:03 PM Douglas Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> When testing hard lockup handling on my sc7180-trogdor-lazor device
> with pseudo-NMI enabled, with serial console enabled and with kgdb
> disabled, I found that the stack crawls printed to the serial console
> ended up as a jumbled mess. After rebooting, the pstore-based console
> looked fine though. Also, enabling kgdb to trap the panic made the
> console look fine and avoided the mess.
>
> After a bit of tracking down, I came to the conclusion that this was
> what was happening:
> 1. The panic path was stopping all other CPUs with
>    panic_other_cpus_shutdown().
> 2. At least one of those other CPUs was in the middle of printing to
>    the serial console and holding the console port's lock, which is
>    grabbed with "irqsave". ...but since we were stopping with an NMI
>    we didn't care about the "irqsave" and interrupted anyway.
> 3. Since we stopped the CPU while it was holding the lock it would
>    never release it.
> 4. All future calls to output to the console would end up failing to
>    get the lock in qcom_geni_serial_console_write(). This isn't
>    _totally_ unexpected at panic time but it's a code path that's not
>    well tested, hard to get right, and apparently doesn't work
>    terribly well on the Qualcomm geni serial driver.
>
> It would probably be a reasonable idea to try to make the Qualcomm
> geni serial driver work better, but also it's nice not to get into
> this situation in the first place.
>
> Taking a page from what x86 appears to do in native_stop_other_cpus(),
> let's do this:
> 1. First, we'll try to stop other CPUs with a normal IPI and wait a
>    second. This gives them a chance to leave critical sections.
> 2. If CPUs fail to stop then we'll retry with an NMI, but give a much
>    lower timeout since there's no good reason for a CPU not to react
>    quickly to an NMI.
>
> This works well and avoids the corrupted console and (presumably)
> could help avoid other similar issues.
>
> In order to do this, we need to do a little re-organization of our
> IPIs since we don't have any more free IDs. We'll do what was
> suggested in previous conversations and combine "stop" and "crash
> stop". That frees up an IPI so now we can have a "stop" and "stop
> NMI".
>
> In order to do this we also need a slight change in the way we keep
> track of which CPUs still need to be stopped. We need to know
> specifically which CPUs haven't stopped yet when we fall back to NMI
> but in the "crash stop" case the "cpu_online_mask" isn't updated as
> CPUs go down. This is why that code path had an atomic of the number
> of CPUs left. We'll solve this by making the cpumask into a
> global. This has a potential memory implication--with NR_CPUS = 4096
> this is 4096/8 = 512 bytes of globals. On the upside in that same case
> we take 512 bytes off the stack which could potentially have made the
> stop code less reliable. It can be noted that the NMI backtrace code
> (lib/nmi_backtrace.c) uses the same approach and that use also
> confirms that updating the mask is safe from NMI.
>
> All of the above lets us combine the logic for "stop" and "crash stop"
> code, which appeared to have a bunch of arbitrary implementation
> differences. Possibly this could make up for some of the 512 wasted
> bytes. ;-)
>
> Aside from the above change where we try a normal IPI and then an NMI,
> the combined function has a few subtle differences:
> * In the normal smp_send_stop(), if we fail to stop one or more CPUs
>   then we won't include the current CPU (the one running
>   smp_send_stop()) in the error message.
> * In crash_smp_send_stop(), if we fail to stop some CPUs we'll print
>   the CPUs that we failed to stop instead of printing all _but_ the
>   current running CPU.
> * In crash_smp_send_stop(), we will now only print "SMP: stopping
>   secondary CPUs" if (system_state <= SYSTEM_RUNNING).
>
> Fixes: d7402513c935 ("arm64: smp: IPI_CPU_STOP and IPI_CPU_CRASH_STOP should try for NMI")
> Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
> ---
> I'm not set up to test the crash_smp_send_stop(). I made sure it
> compiled and hacked the panic() method to call it, but I haven't
> actually run kexec. Hopefully others can confirm that it's working for
> them.
>
> arch/arm64/kernel/smp.c | 115 +++++++++++++++++++---------------------
> 1 file changed, 54 insertions(+), 61 deletions(-)

The sound of crickets is overwhelming. ;-) Does anyone have any
comments here? Is this a terrible idea? Is this the best idea you've
heard all year (it's only been 8 days, so maybe)? Is this great but
the implementation is lacking (at best)? Do you hate that this waits
for 1 second and wish it waited for 1 ms? 10 ms? 100 ms? 8192 ms?
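
To make the timeout question concrete, here is a rough userspace
model of the two-phase stop. It is only a sketch and not the code from
the patch: struct sim_cpu, wait_for_stop(), stop_other_cpus(), NCPUS
and both timeout macros are names made up for illustration. The 1000 ms
value matches the "wait a second" in the description; the 10 ms NMI
timeout is an assumption standing in for "much lower".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS          8     /* hypothetical CPU count for this model */
#define IPI_TIMEOUT_MS 1000  /* "wait a second" from the description */
#define NMI_TIMEOUT_MS 10    /* assumed value for the "much lower" timeout */

/* Bit n set means simulated CPU n has not stopped yet. */
typedef uint32_t cpu_mask_t;

/* Hypothetical model of how a CPU reacts to a normal stop IPI. */
struct sim_cpu {
	unsigned int irq_latency_ms; /* time until it takes the IPI */
	bool irqs_stuck;             /* true: it never takes the IPI */
};

/*
 * Clear from *pending every CPU that stops within timeout_ms.
 * In this model an NMI always gets through, so use_nmi stops
 * every CPU that is still pending.
 */
static void wait_for_stop(const struct sim_cpu *cpus, cpu_mask_t *pending,
			  unsigned int timeout_ms, bool use_nmi)
{
	for (unsigned int cpu = 0; cpu < NCPUS; cpu++) {
		if (!(*pending & (1u << cpu)))
			continue;
		if (use_nmi || (!cpus[cpu].irqs_stuck &&
				cpus[cpu].irq_latency_ms <= timeout_ms))
			*pending &= ~(1u << cpu);
	}
}

/* Returns the mask of CPUs that only stopped after the NMI fallback. */
static cpu_mask_t stop_other_cpus(const struct sim_cpu *cpus,
				  unsigned int self)
{
	cpu_mask_t pending, needed_nmi;

	assert(self < NCPUS);
	pending = ((1u << NCPUS) - 1) & ~(1u << self);

	/* Phase 1: normal IPI, giving CPUs time to leave critical sections. */
	wait_for_stop(cpus, &pending, IPI_TIMEOUT_MS, false);
	needed_nmi = pending;

	/* Phase 2: whoever is left gets the NMI and a short timeout. */
	if (pending)
		wait_for_stop(cpus, &pending, NMI_TIMEOUT_MS, true);

	return needed_nmi;
}
```

In this model, shrinking IPI_TIMEOUT_MS only changes how many CPUs end
up in the returned mask, i.e. how many get cut off by the NMI instead
of stopping on their own.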

Aside from the weirdness of a processor being killed while holding the
console lock, it does seem beneficial to give IRQs at least a little
time to finish before killing a processor. I don't have any other
explicit examples, but I could just imagine that things might be a
little more orderly in such a case...
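
The console lock case itself can be modeled in a few lines. Again this
is a toy under stated assumptions, not driver or arch code: every name
below is hypothetical, and console_lock()/console_unlock() merely stand
in for an irqsave-style lock around a console write.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for this model. */
struct sim_cpu {
	bool irqs_masked;        /* inside an irqsave section */
	bool holds_console_lock;
	bool ipi_pending;        /* stop IPI raised but not yet taken */
	bool stopped;
};

/* Stands in for taking the console port lock with "irqsave". */
static void console_lock(struct sim_cpu *c)
{
	c->irqs_masked = true;
	c->holds_console_lock = true;
}

/* Drops the lock, unmasks IRQs, then takes any stop IPI that was waiting. */
static void console_unlock(struct sim_cpu *c)
{
	assert(c->holds_console_lock);
	c->holds_console_lock = false;
	c->irqs_masked = false;
	if (c->ipi_pending) {
		c->ipi_pending = false;
		c->stopped = true;
	}
}

/* A normal IPI is deferred for as long as IRQs are masked. */
static void send_stop_ipi(struct sim_cpu *c)
{
	if (c->irqs_masked) {
		c->ipi_pending = true;
		return;
	}
	c->stopped = true;
}

/* An NMI ignores the mask and stops the CPU wherever it is. */
static void send_stop_nmi(struct sim_cpu *c)
{
	c->stopped = true;
}

/* True if the CPU was stopped with the lock still held. */
static bool console_lock_orphaned(const struct sim_cpu *c)
{
	return c->stopped && c->holds_console_lock;
}
```

Calling console_lock() followed by send_stop_nmi() leaves
console_lock_orphaned() true, which is the jumbled-console case. With
send_stop_ipi() instead, the CPU only stops once console_unlock() has
run, so the lock is never left behind.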

-Doug