Re: Bug report for RCU stalled warning [3.10.69]

From: Paul E. McKenney
Date: Thu Oct 12 2017 - 16:38:32 EST


[ Adding LKML on CC so that others can find this. ]

On Wed, Oct 11, 2017 at 12:21:39PM +0800, Wang YanQing wrote:
> Hi, Paul McKenney.
>
> I have received many reports of machines that stopped responding. After
> rebooting and inspecting the messages, all of them show RCU stalls, but
> I can't figure out how to fix it. I can't update the kernel, which is the
> painful point, so I need to fix it in 3.10. I have attached four messages
> that come from different CPUs and boards (so I guess it is a bug rather
> than a hardware fault). Any suggestion is welcome.

The first step is of course to report this to your distro, as they are
the ones who do the care and feeding of such old kernels. Please include
the information below in that report, as it might help your distro find
and fix the problem.

It looks like the stalled CPU is idle, and that the activity resulting
from the stall-warning message gets things going again. Callbacks are
being processed, so no OOM. But you are getting the splat every 60
seconds. The system has only two CPUs, and is x86.

If you cannot upgrade the kernel, my ability to help is limited. And the
diagnostics printed with the v3.10 CPU stall warnings are also quite
limited. However, there are some things you could try as workarounds:

1.	Check to make sure that the rcu_sched kthread is getting
	the CPU time that it needs.  Preventing this kthread from
	running would create exactly this output, assuming that
	the stall warning got it going again temporarily.

2.	It looks like the disturbance of the RCU CPU stall warning
	is getting things going again.  Try artificially providing
	this disturbance, for example, by running a usermode program
	or script that runs on each CPU in turn, then sleeps for
	(say) five seconds.

3.	If you can reconfigure your kernel, try building with
	CONFIG_RCU_FAST_NO_HZ=n.

4.	Was the system running reliably on some earlier version?
	If so, consider reverting back to that version, and include
	the version information in your report to your distro.  If
	your distro provides individual patches, you should consider
	bisecting so as to locate the offending patch.
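
Items 1 and 2 can be approximated with a small usermode script. The
following is only a rough sketch, assuming a Linux system with Python 3.3
or later (needed for os.sched_setaffinity) and a kthread whose name is
rcu_sched as in item 1. The function names, the /proc parsing, and the
choice to print accumulated CPU ticks are all illustrative assumptions.

```python
import os
import time

SLEEP_SECONDS = 5  # the "(say) five seconds" from item 2; tune as needed


def parse_stat(stat):
    """Split one /proc/<pid>/stat line into (comm, utime + stime in ticks)."""
    # comm is wrapped in parentheses and may itself contain spaces, so
    # locate it by the first "(" and the last ")".
    lpar = stat.index("(")
    rpar = stat.rindex(")")
    comm = stat[lpar + 1:rpar]
    fields = stat[rpar + 2:].split()
    # fields[0] is stat field 3 (state), so utime (field 14) and stime
    # (field 15) sit at indexes 11 and 12.
    return comm, int(fields[11]) + int(fields[12])


def kthread_cpu_ticks(name="rcu_sched"):
    """Return accumulated CPU ticks of the first task whose comm is `name`,
    or None if no such task is visible in /proc."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(os.path.join("/proc", pid, "stat")) as f:
                comm, ticks = parse_stat(f.read())
        except OSError:
            continue  # task exited, or is not readable, while scanning
        if comm == name:
            return ticks
    return None


def poke_each_cpu():
    """Run briefly on each CPU this process is allowed to use, in turn.

    Returns the list of CPUs visited. Only CPUs in the current affinity
    mask are covered, which may be fewer than all CPUs in the system.
    """
    original = os.sched_getaffinity(0)
    visited = []
    try:
        for cpu in sorted(original):
            # Restricting affinity to one CPU makes the kernel migrate
            # this process there if it is currently running elsewhere.
            os.sched_setaffinity(0, {cpu})
            visited.append(cpu)
    finally:
        os.sched_setaffinity(0, original)  # restore the starting mask
    return visited


def run(rounds=None, sleep_seconds=SLEEP_SECONDS):
    """Poke every CPU, report kthread CPU time, sleep, and repeat.

    rounds=None loops until interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        poke_each_cpu()
        print("rcu_sched CPU ticks:", kthread_cpu_ticks())
        time.sleep(sleep_seconds)
        done += 1
```

Calling run() loops until interrupted, and run(rounds=3) stops after
three passes. The tick counts are coarse, so a count that stays flat
across passes is at most a hint that the kthread is not getting CPU time.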

Good luck with it!

Thanx, Paul