Re: [PATCH] sched/debug: avoid executing show_state and causing rcu stall warning

From: Ingo Molnar
Date: Wed Aug 03 2022 - 13:13:10 EST



* Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

>
> [ Adding Paul ]
>
> On Wed, 3 Aug 2022 09:18:45 +0800
> Liu Song <liusong@xxxxxxxxxxxxxxxxx> wrote:
>
> > From: Liu Song <liusong@xxxxxxxxxxxxxxxxx>
> >
> > If the number of CPUs is large, "sysrq_sched_debug_show" will execute for
> > a long time. Every time I execute "echo t > /proc/sysrq-trigger" on my
> > 128-core machine, the rcu stall warning will be triggered. Moreover,
> > sysrq_sched_debug_show does not need to be protected by rcu_read_lock,
> > and no rcu stall warning will appear after adjustment.
> >
> > Signed-off-by: Liu Song <liusong@xxxxxxxxxxxxxxxxx>
> > ---
> > kernel/sched/core.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 5555e49..82c117e 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -8879,11 +8879,11 @@ void show_state_filter(unsigned int state_filter)
> >  			sched_show_task(p);
> >  	}
> >
> > +	rcu_read_unlock();
> >  #ifdef CONFIG_SCHED_DEBUG
> >  	if (!state_filter)
> >  		sysrq_sched_debug_show();
>
> If this is just because sysrq_sched_debug_show() is very slow, does RCU
> have a way to "touch" it? Like the watchdogs have? That is, to tell RCU
> "Yes I know I'm taking a long time, but I'm still making forward progress,
> don't complain about me". Then the sysrq_sched_debug_show() could have:
>
>  	for_each_online_cpu(cpu) {
>  		/*
>  		 * Need to reset softlockup watchdogs on all CPUs, because
>  		 * another CPU might be blocked waiting for us to process
>  		 * an IPI or stop_machine.
>  		 */
>  		touch_nmi_watchdog();
>  		touch_all_softlockup_watchdogs();
> +		touch_rcu();
>  		print_cpu(NULL, cpu);
>  	}

I'd much rather we use the specific exclusion primitive suitable for that
sequence - in this case I suspect it should be
cpus_read_lock()/cpus_read_unlock().
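
As a rough illustration of that suggestion, here is a userspace model of the
per-CPU loop bracketed by a hotplug read lock instead of rcu_read_lock().
cpus_read_lock() and cpus_read_unlock() are real kernel APIs (declared in
<linux/cpu.h>), but the versions below are stand-in stubs so the sketch
compiles outside the kernel. The counters, NR_MODEL_CPUS, the simplified
print_cpu() signature and sysrq_sched_debug_show_model() are hypothetical
names used only here. Whether the sysrq calling context may take a sleeping
lock is an open assumption that this sketch does not settle.

```c
#include <assert.h>
#include <stdio.h>

#define NR_MODEL_CPUS 4		/* hypothetical: stands in for the online CPU mask */

static int hotplug_readers;	/* hypothetical: models the hotplug read lock count */
static int cpus_printed;	/* hypothetical: counts CPUs visited under the lock */

/* Stand-in stubs; the real functions live in the kernel. */
static void cpus_read_lock(void)   { hotplug_readers++; }
static void cpus_read_unlock(void) { hotplug_readers--; }

/* Simplified stand-in for print_cpu(); must run with hotplug excluded. */
static void print_cpu(int cpu)
{
	assert(hotplug_readers > 0);
	printf("cpu#%d\n", cpu);
	cpus_printed++;
}

/* Model of the loop: the set of CPUs cannot change while we walk it. */
static void sysrq_sched_debug_show_model(void)
{
	int cpu;

	cpus_read_lock();
	for (cpu = 0; cpu < NR_MODEL_CPUS; cpu++)
		print_cpu(cpu);
	cpus_read_unlock();
}
```

The point of the model is only the bracketing: the loop depends on the set of
online CPUs staying stable, which is what the hotplug lock guarantees, and any
task-list walk elsewhere in the sequence would still need its own protection.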

But the entire code sequence should be reviewed - do we anywhere walk task
lists that need RCU protection?

My main complaint was that we cannot just randomly drop the RCU lock with
no inspection of the underlying code.

Ingo