Re: Need help on "Self Detected Stall on CPU"

From: Paul E. McKenney
Date: Thu Apr 30 2020 - 15:17:26 EST


On Thu, Apr 30, 2020 at 06:47:20PM +0000, Atul Kulkarni wrote:
> Dear Sir,
>
> Hope you are doing well. I have watched your various conference videos and have read technical papers.
> We are facing an issue with CPU stall on our systems and I felt like there is no one better who can guide us on how we can deal with it.
>
> I have attached logs for your reference. Towards the end I have run a couple of sysrq commands and have taken a crash dump using sysrq, which may help provide additional information.
> Could you please guide us on how we could fix this issue or identify what is going wrong here?

Let's focus on the first few lines of your console message:

[20526.345089] INFO: rcu_preempt self-detected stall on CPU
[20526.351110] 0-...: (1051 ticks this GP) idle=1fe/140000000000002/0 softirq=146268/146268 fqs=0
[20526.360163] (t=2101 jiffies g=96468 c=96467 q=2)
[20526.365535] rcu_preempt kthread starved for 2101 jiffies! g96468 c96467 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0

The last line contains the hint, namely "rcu_preempt kthread starved for
2101 jiffies!" If you don't let RCU's kernel threads run, then RCU CPU
stall warnings are expected behavior.
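
If you have many such logs to go through, a small script can pull the
interesting fields out of that last line. The following is only a sketch:
the regular expression and the function name are my own, written against
the exact line quoted above, and other kernel versions format this line
differently.

```python
import re

# Pattern written to match the "kthread starved" line quoted above.
# This is an illustrative assumption, not a stable kernel log format.
STARVED_RE = re.compile(
    r"(?P<flavor>\w+) kthread starved for (?P<jiffies>\d+) jiffies!"
    r".*?(?P<gp_state>RCU_GP_\w+)\(\d+\)"
    r" ->state=(?P<task_state>0x[0-9a-fA-F]+)"
    r" ->cpu=(?P<cpu>\d+)"
)

def parse_starved_line(line):
    """Return the fields of a 'kthread starved' line as a dict, or None."""
    m = STARVED_RE.search(line)
    if m is None:
        return None
    return {
        "flavor": m.group("flavor"),
        "starved_jiffies": int(m.group("jiffies")),
        "gp_state": m.group("gp_state"),
        "task_state": int(m.group("task_state"), 16),
        "cpu": int(m.group("cpu")),
    }

line = ("[20526.365535] rcu_preempt kthread starved for 2101 jiffies! "
        "g96468 c96467 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0")
print(parse_starved_line(line))
```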

The "RCU_GP_WAIT_FQS(3)" means that this kthread's last act was to sleep
for three jiffies. As you can see from earlier in that same line, that
was 2101 jiffies ago. The "->state=0x402" means that the scheduler
believes that this kthread is blocked, that is, not yet runnable.
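
To make the "->state=0x402" reading concrete, here is a rough sketch that
decodes the value into task-state flag names and converts the jiffies
count to seconds. The bit values are the ones I recall from
include/linux/sched.h and they have moved around between kernel versions,
so please check your own tree. The HZ value of 100 is purely an assumption
for the sake of the arithmetic; use your kernel's CONFIG_HZ.

```python
# Task-state bits as recalled from include/linux/sched.h. These values are
# version-dependent, so treat this table as an assumption to be verified.
TASK_STATE_BITS = {
    0x001: "TASK_INTERRUPTIBLE",
    0x002: "TASK_UNINTERRUPTIBLE",
    0x400: "TASK_NOLOAD",
}

def decode_task_state(state):
    """Return the names of the flags set in a task_struct ->state value."""
    if state == 0:
        return ["TASK_RUNNING"]
    names = [name for bit, name in sorted(TASK_STATE_BITS.items())
             if state & bit]
    # Report any bits this small table does not know about.
    unknown = state & ~sum(TASK_STATE_BITS)
    if unknown:
        names.append("unknown bits %#x" % unknown)
    return names

def jiffies_to_seconds(jiffies, hz):
    """Convert a jiffies count to seconds for a given CONFIG_HZ."""
    return jiffies / hz

print(decode_task_state(0x402))
print(jiffies_to_seconds(2101, 100))  # HZ=100 is an assumed value
```

Both flags describe a sleeping task, which is consistent with the kthread
waiting for a wakeup that has not arrived.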

The usual way this sort of thing happens is a timer problem, be it a
hardware configuration problem, a timer-driver bug, an interrupt-handling
problem, and so on. This sort of problem is especially common when
bringing up new hardware or when modifying timer code or when modifying
code on the interrupt/exception paths.

So the question to ask yourself is "Why is the timer wakeup not reaching
this kthread?", with special attention to changed code and new hardware.
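
One cheap first check is whether timer interrupts are still being counted
on the affected CPU. Below is a minimal sketch that compares two snapshots
of /proc/interrupts and reports per-CPU deltas for lines whose description
mentions "timer". The function names and the sample snapshots are made up
for illustration, and a zero delta is only a hint (some timer lines are
legitimately idle), not a diagnosis.

```python
def parse_interrupts(text):
    """Map each /proc/interrupts label to (per-CPU counts, description)."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        per_cpu = []
        for field in fields[:ncpus]:
            if not field.isdigit():        # some rows carry fewer counts
                break
            per_cpu.append(int(field))
        desc = " ".join(fields[len(per_cpu):])
        table[label.strip()] = (per_cpu, desc)
    return table

def timer_deltas(before, after):
    """Per-CPU count deltas for rows whose description mentions 'timer'."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    deltas = {}
    for label, (counts, desc) in new.items():
        if "timer" in desc.lower() and label in old:
            deltas[label] = [n - o for o, n in zip(old[label][0], counts)]
    return deltas

# Synthetic snapshots for illustration only, not taken from the report.
BEFORE = """\
           CPU0       CPU1
  0:         46          0   IO-APIC   2-edge      timer
  8:          5          5   IO-APIC   8-edge      rtc0
LOC:        100        200   Local timer interrupts
"""
AFTER = """\
           CPU0       CPU1
  0:         46          0   IO-APIC   2-edge      timer
  8:          6          5   IO-APIC   8-edge      rtc0
LOC:        100        250   Local timer interrupts
"""
print(timer_deltas(BEFORE, AFTER))
```

On a live system you would read /proc/interrupts twice, a second or so
apart, and pass the two strings to timer_deltas(). In the made-up data
above, the local timer count advances on CPU 1 but not on CPU 0, which is
the kind of pattern worth chasing.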

Thanx, Paul