Re: [RESEND PATCH 2/2] smp: Reduce NMI traffic from CSD waiters to CSD destination.

From: Paul E. McKenney
Date: Tue May 16 2023 - 08:09:13 EST


On Tue, May 09, 2023 at 08:31:24AM +1000, Imran Khan wrote:
> On systems with hundreds of CPUs, if a few hundred or most of the CPUs
> detect a CSD hang, then all of these waiters end up sending an NMI to
> the destination CPU to dump its backtrace.
> Depending on the number of such NMIs, the destination CPU can spend
> a significant amount of time handling them, making it even more
> difficult for that CPU to address its pending CSDs in a timely manner.
> In the worst case, by the time the destination CPU has handled all of
> the above-mentioned backtrace NMIs, the CSD wait time may have elapsed
> again, so all of the waiters resend their backtrace NMIs and this
> behaviour continues in a loop.
>
> To avoid the above scenario, issue the backtrace NMI only from the
> first waiter. The other waiters for the same CSD destination can make
> use of the backtrace obtained via the first waiter's NMI.
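
The gating here reduces to a one-shot flag per destination CPU: the
first waiter that observes the hang claims the flag and sends the NMI,
and the destination re-arms the flag when it starts flushing its queue.
Below is a minimal standalone C11 sketch of that pattern, for readers
following along; the helper names are illustrative, not kernel API, and
the kernel side uses atomic_cmpxchg_acquire()/atomic_set_release()
rather than the C11 calls shown here.

	/* Standalone C11 analogue of the one-shot backtrace gate. */
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int trigger_backtrace = 1;	/* 1 == armed */

	/* Waiter side: only the first claimant wins the gate. */
	static int try_trigger_backtrace(void)
	{
		int expected = 1;

		return atomic_compare_exchange_strong_explicit(
				&trigger_backtrace, &expected, 0,
				memory_order_acquire, memory_order_relaxed);
	}

	/* Destination side: re-arm once the queue is being flushed. */
	static void rearm_backtrace(void)
	{
		atomic_store_explicit(&trigger_backtrace, 1,
				      memory_order_release);
	}

	int main(void)
	{
		/* Three "waiters" race for the gate; only one dumps. */
		for (int i = 0; i < 3; i++)
			printf("waiter %d: %s\n", i,
			       try_trigger_backtrace() ?
			       "dumps backtrace" : "skips");
		rearm_backtrace();	/* destination flushed its queue */
		printf("after re-arm: %s\n",
		       try_trigger_backtrace() ?
		       "dumps backtrace" : "skips");
		return 0;
	}

Run as-is, only the first waiter claims the gate, the others skip, and
a claim succeeds again only after the re-arm, which is exactly the
behaviour the patch relies on to bound the NMI traffic.
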
>
> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>

Reviewed-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

> ---
> kernel/smp.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index b7ccba677a0a0..a1cd21ea8b308 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -43,6 +43,8 @@ static DEFINE_PER_CPU_ALIGNED(struct call_function_data, cfd_data);
>
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
>
> +static DEFINE_PER_CPU(atomic_t, trigger_backtrace) = ATOMIC_INIT(1);
> +
> static void __flush_smp_call_function_queue(bool warn_cpu_offline);
>
> int smpcfd_prepare_cpu(unsigned int cpu)
> @@ -242,7 +244,8 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> *bug_id, !cpu_cur_csd ? "unresponsive" : "handling this request");
> }
> if (cpu >= 0) {
> - dump_cpu_task(cpu);
> + if (atomic_cmpxchg_acquire(&per_cpu(trigger_backtrace, cpu), 1, 0))
> + dump_cpu_task(cpu);
> if (!cpu_cur_csd) {
> pr_alert("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", *bug_id, raw_smp_processor_id(), cpu);
> arch_send_call_function_single_ipi(cpu);
> @@ -423,9 +426,14 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> struct llist_node *entry, *prev;
> struct llist_head *head;
> static bool warned;
> + atomic_t *tbt;
>
> lockdep_assert_irqs_disabled();
>
> + /* Allow waiters to send backtrace NMI from here onwards */
> + tbt = this_cpu_ptr(&trigger_backtrace);
> + atomic_set_release(tbt, 1);
> +
> head = this_cpu_ptr(&call_single_queue);
> entry = llist_del_all(head);
> entry = llist_reverse_order(entry);
> --
> 2.34.1
>