Re: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long

From: Imran Khan
Date: Thu Oct 05 2023 - 19:34:15 EST


Hello Paul,

On 6/10/2023 3:48 am, Paul E. McKenney wrote:
> The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> temporarily, it usually gets released in a few seconds, and sometimes
> up to one or two minutes.
>
> If the CSD lock stays stuck for more than several minutes, it never
> seems to get unstuck, and gradually more and more things in the system
> end up also getting stuck.
>
> In the latter case, we should just give up, so the system can dump out
> a little more information about what went wrong, and, with panic_on_oops
> and a kdump kernel loaded, dump a whole bunch more information about
> what might have gone wrong.
>
> Question: should this have its own panic_on_ipistall switch in
> /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
> way than via BUG_ON?
>
A dedicated panic_on_ipistall switch (set to 1 by default) looks like the better
option to me. On systems where such a delay is acceptable and the system can
eventually get back to a sane state, setting it to 0 after boot would avoid
crashing the system for apparently benign CSD hangs of long duration.

> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 8455a53465af..059f1f53fc6b 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -230,6 +230,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> }
>
> ts2 = sched_clock();
> + /* How long since we last checked for a stuck CSD lock.*/
> ts_delta = ts2 - *ts1;
> if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
> return false;
> @@ -243,9 +244,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> else
> cpux = cpu;
> cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
> + /* How long since this CSD lock was stuck. */
> + ts_delta = ts2 - ts0;
> pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
> - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
> + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
> cpu, csd->func, csd->info);
> + /*
> + * If the CSD lock is still stuck after 5 minutes, it is unlikely
> + * to become unstuck. Use a signed comparison to avoid triggering
> + * on underflows when the TSC is out of sync between sockets.
> + */
> + BUG_ON((s64)ts_delta > 300000000000LL);
Can we make this threshold a module_param (default value 5 minutes), so that it
can be tweaked to a larger or smaller value if needed?
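
To illustrate the idea, here is a rough user-space sketch (not kernel code and
not a proposed patch) of how the check could combine a panic_on_ipistall switch
with a tunable threshold. The names csd_lock_panic_ns, csd_stall_set_limit()
and csd_stall_should_panic() are made up for this example, and plain variables
stand in for the sysctl and module_param plumbing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the tunables; in the kernel these would be a sysctl and a
 * module_param. Both names below are assumptions made for illustration. */
static int panic_on_ipistall = 1;                         /* 0 = never panic */
static int64_t csd_lock_panic_ns = 300LL * 1000000000LL;  /* 5 minutes */

/* Model an administrator changing the threshold at run time. */
void csd_stall_set_limit(int64_t ns)
{
	assert(ns > 0);
	csd_lock_panic_ns = ns;
}

/* ts0: time the wait started, ts2: current time, both in nanoseconds. */
bool csd_stall_should_panic(uint64_t ts0, uint64_t ts2)
{
	/* Signed delta, as in the patch: if the clock appears to run
	 * backwards, the result is negative rather than a huge value. */
	int64_t ts_delta = (int64_t)(ts2 - ts0);

	if (!panic_on_ipistall)
		return false;
	return ts_delta > csd_lock_panic_ns;
}
```

With the defaults this behaves like the BUG_ON() above; raising the limit or
clearing panic_on_ipistall keeps the pr_alert() reporting while avoiding the
crash.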
> if (cpu_cur_csd && csd != cpu_cur_csd) {
> pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
> *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),

Thanks,
Imran