Re: [PATCH 0/3] rcu: Add RCU stall diagnosis information

From: Paul E. McKenney
Date: Thu Oct 20 2022 - 19:14:02 EST


On Mon, Oct 17, 2022 at 06:01:05PM +0800, Zhen Lei wrote:
> In some extreme cases, such as an I/O pressure test, CPU usage may
> reach 100%, causing an RCU stall. In this case, the printed information
> about current is not useful. Displaying the number and usage of hard
> interrupts, soft interrupts, and context switches generated within half
> of the CPU stall timeout can help us make a general judgment. In other
> cases, we can preliminarily determine whether an infinite loop occurs
> while local_irq, local_bh or preempt is disabled.
>
> Zhen Lei (3):
>   sched: Add helper kstat_cpu_softirqs_sum()
>   sched: Add helper nr_context_switches_cpu()
>   rcu: Add RCU stall diagnosis information
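
As a rough userspace sketch of the idea in the cover letter above (not
the patch itself): assume a snapshot of per-CPU counters is taken once
half of the stall timeout has elapsed, and the deltas are printed if a
stall is later reported. The struct and function names below are
hypothetical, and the real series reads kernel per-CPU statistics rather
than a plain struct.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-CPU counters the kernel tracks. */
struct cpu_counters {
	uint64_t hardirqs;	/* hard interrupts handled so far */
	uint64_t softirqs;	/* soft interrupts handled so far */
	uint64_t ctx_switches;	/* context switches so far */
};

/*
 * Events that occurred between an earlier snapshot and now.
 * Unsigned subtraction stays correct even if a counter wrapped once.
 */
struct cpu_counters counters_delta(const struct cpu_counters *now,
				   const struct cpu_counters *snap)
{
	struct cpu_counters d = {
		.hardirqs     = now->hardirqs - snap->hardirqs,
		.softirqs     = now->softirqs - snap->softirqs,
		.ctx_switches = now->ctx_switches - snap->ctx_switches,
	};

	return d;
}

/* Format the deltas for one CPU into buf; returns snprintf()'s result. */
int format_stall_delta(char *buf, size_t len, int cpu,
		       const struct cpu_counters *d)
{
	return snprintf(buf, len,
			"cpu %d: hardirqs=%llu softirqs=%llu csw=%llu",
			cpu,
			(unsigned long long)d->hardirqs,
			(unsigned long long)d->softirqs,
			(unsigned long long)d->ctx_switches);
}
```

A CPU whose deltas are all near zero has most likely been spinning with
interrupts, bottom halves, or preemption disabled, while a large softirq
delta points at softirq flooding instead.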

Interesting approach, thank you!

I have pulled this in for testing and review, having rescued it from my
spam folder.

Some questions that might come up include: (1) Can the addition of
things like cond_resched() make RCU happier with the I/O pressure test?
(2) Should there be a way to turn this off for environments with slow
consoles? (3) If this information shows heavy CPU usage, what debug
and fix approach should be used?

For an example of #1, if a CPU is flooded with softirq activity, one
might hope that the call to rcu_softirq_qs() would prevent the RCU CPU
stall warning, at least for kernels built with CONFIG_PREEMPT_RT=n.
Similarly, if there are huge numbers of context switches, one might hope
that the rcu_note_context_switch() would report a quiescent state sooner
rather than later.
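
To make the reasoning above concrete, here is a toy userspace model, not
kernel code. It assumes a grace period that waits for one quiescent
state from each CPU and declares a stall if any CPU is still outstanding
once the timeout has elapsed. All toy_* names are hypothetical;
toy_report_qs() only stands in for the kind of report that
rcu_softirq_qs() or rcu_note_context_switch() would make.

```c
#include <stdbool.h>

#define TOY_NR_CPUS 4	/* hypothetical CPU count for the model */

struct toy_gp {
	bool need_qs[TOY_NR_CPUS];	/* true until that CPU reports a QS */
	unsigned long start;		/* tick at which the grace period began */
	unsigned long timeout;		/* ticks before a stall is declared */
};

/* Begin a grace period: every CPU owes one quiescent state. */
void toy_gp_start(struct toy_gp *gp, unsigned long now, unsigned long timeout)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		gp->need_qs[cpu] = true;
	gp->start = now;
	gp->timeout = timeout;
}

/* Record a quiescent state for one CPU. */
void toy_report_qs(struct toy_gp *gp, int cpu)
{
	gp->need_qs[cpu] = false;
}

/*
 * Return the first CPU still holding up the grace period once the
 * timeout has elapsed, or -1 if there is no stall to report.
 */
int toy_stalled_cpu(const struct toy_gp *gp, unsigned long now)
{
	if (now - gp->start < gp->timeout)
		return -1;
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (gp->need_qs[cpu])
			return cpu;
	return -1;
}
```

In this model, a CPU that keeps calling toy_report_qs() never shows up
as stalled no matter how busy it is, which is why heavy softirq or
context-switch counts alongside a stall warning would be surprising.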

Thoughts?

Thanx, Paul

> include/linux/kernel_stat.h | 12 +++++++++++
> kernel/rcu/tree.h | 11 ++++++++++
> kernel/rcu/tree_stall.h | 40 +++++++++++++++++++++++++++++++++++++
> kernel/sched/core.c | 5 +++++
> 4 files changed, 68 insertions(+)
>
> --
> 2.25.1
>