Re: next: Commit 'mm: Prevent __alloc_pages_nodemask() RCU CPU stall ...' causing hang on sparc32 qemu

From: Paul E. McKenney
Date: Wed Nov 30 2016 - 16:04:00 EST


On Wed, Nov 30, 2016 at 11:21:59AM -0800, Guenter Roeck wrote:
> On Wed, Nov 30, 2016 at 04:03:33AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2016 at 02:52:11AM -0800, Guenter Roeck wrote:
> > > On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> > > >On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> > > >>On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> > > >>>On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> > > >>>>Hi Paul,
> > > >>>>
> > > >>>>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> > > >>>>The problem is only seen in SMP builds; non-SMP builds are fine.
> > > >>>>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> > > >>>>RCU CPU stall warnings"); reverting that commit fixes the problem.
> >
> > And I have dropped this patch. Michal Hocko showed me the error of
> > my ways with this patch.
> >
>
> :-)
>
> On another note, I still get RCU tracebacks in the s390 tests.
>
> BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
>
> That is caused by 'rcu: Maintain special bits at bottom of ->dynticks counter';
> if I recall correctly we had discussed that earlier.

Indeed, I had missed a dyntick counter update back on Nov 11, which meant
that some of the code was still looking at the low-order bit instead of
the next bit up. This is now fixed.
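
To illustrate the kind of mismatch described above, here is a minimal
userspace sketch, not the kernel code itself. It assumes the bottom bit
of the counter is reserved as a flag and the counter proper advances in
steps of two, so the idle/non-idle state lives in the next bit up. All
names (DYNTICK_SPECIAL_MASK, DYNTICK_CTR_INC, in_idle_old(),
in_idle_new(), dyntick_toggle()) are hypothetical and chosen only for
illustration.

```c
#include <stdbool.h>

/* Hypothetical names: bit 0 is a reserved flag, not part of the count. */
#define DYNTICK_SPECIAL_MASK 0x1UL
/* The counter proper therefore advances in steps of 2. */
#define DYNTICK_CTR_INC (DYNTICK_SPECIAL_MASK + 1)

/* Stale check: still assumes the idle/non-idle state lives in bit 0. */
static bool in_idle_old(unsigned long dynticks)
{
	return (dynticks & DYNTICK_SPECIAL_MASK) == 0;
}

/* Updated check: the state lives in the next bit up. */
static bool in_idle_new(unsigned long dynticks)
{
	return (dynticks & DYNTICK_CTR_INC) == 0;
}

/* Each idle entry or exit bumps the counter by one step. */
static unsigned long dyntick_toggle(unsigned long dynticks)
{
	return dynticks + DYNTICK_CTR_INC;
}
```

In this model, a counter value of DYNTICK_CTR_INC means "not idle", yet
the stale check reports idle because bit 0 is clear. Any code path that
was not converted would draw the wrong conclusion in just this way.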

So to get to the error message you call out above, I would need to have
improperly left the system in bh state or left irqs disabled while the
system was running normally without an oops. I am having a hard time
seeing how this patch can do that.
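
The conditions I have in mind can be modeled roughly as follows. This is
a simplified userspace sketch under my own assumptions, not the kernel's
actual check: struct ctx_model and would_warn() are hypothetical names,
and a nonzero preempt_count field stands in for "bh or preemption
disabled".

```c
#include <stdbool.h>

/* Hypothetical model of the context a sleeping function is called from. */
struct ctx_model {
	unsigned int preempt_count;	/* nonzero: bh/preemption disabled */
	bool irqs_disabled;
	bool system_running;		/* false during boot or shutdown */
	bool oops_in_progress;
};

/* Returns true if the "sleeping function called from invalid context"
 * warning would fire in this simplified model. */
static bool would_warn(const struct ctx_model *c)
{
	/* No warning unless the system is running normally without an oops. */
	if (!c->system_running || c->oops_in_progress)
		return false;
	/* Otherwise warn if sleeping is not legal in this context. */
	return c->preempt_count != 0 || c->irqs_disabled;
}
```

So in this model, the warning needs either a leaked nonzero count or
irqs left disabled, on an otherwise healthy running system.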

I would be more suspicious of f2a471ffc8a8 ("rcu: Allow boot-time use
of cond_resched_rcu_qs()").

So did you bisect or do a revert to work out which commit was the
offending one?

Thanx, Paul