Re: next: Commit 'mm: Prevent __alloc_pages_nodemask() RCU CPU stall ...' causing hang on sparc32 qemu

From: Paul E. McKenney
Date: Wed Nov 30 2016 - 20:20:00 EST


On Wed, Nov 30, 2016 at 03:18:46PM -0800, Guenter Roeck wrote:
> On Wed, Nov 30, 2016 at 01:01:52PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2016 at 11:21:59AM -0800, Guenter Roeck wrote:
> > > On Wed, Nov 30, 2016 at 04:03:33AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 30, 2016 at 02:52:11AM -0800, Guenter Roeck wrote:
> > > > > On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> > > > > >On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> > > > > >>On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> > > > > >>>On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> > > > > >>>>Hi Paul,
> > > > > >>>>
> > > > > >>>>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> > > > > >>>>The problem is only seen in SMP builds; non-SMP builds are fine.
> > > > > >>>>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> > > > > >>>>RCU CPU stall warnings"); reverting that commit fixes the problem.
> > > >
> > > > And I have dropped this patch. Michal Hocko showed me the error of
> > > > my ways with this patch.
> > > >
> > >
> > > :-)
> > >
> > > On another note, I still get RCU tracebacks in the s390 tests.
> > >
> > > BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
> > >
> > > That is caused by 'rcu: Maintain special bits at bottom of ->dynticks counter';
> > > if I recall correctly we had discussed that earlier.
> >
> > Indeed, I had missed a dyntick counter update back on Nov 11, which meant
> > that some of the code was still looking at the low-order bit instead of
> > the next bit up. This is now fixed.
> >
> > So to get to the error message you call out above, I would need to have
> > improperly left the system in bh state or left irqs disabled, while the
> > system was running normally without an oops. I am having a hard time
> > seeing how this patch can do that.
> >
> > I would be more suspicious of f2a471ffc8a8 ("rcu: Allow boot-time use
> > of cond_resched_rcu_qs()").
> >
> > So you bisected or did a revert to work out which was the offending commit?
> >
>
> My most recent bisect was with the November 10 image, so that would have missed
> any later fix. Comparing the log messages, the current message is indeed
> different. Sorry, I mixed that up; I just assumed that the problem would be
> the same without really checking. My bad.
>
> Bisect would be tricky, since the s390 image was broken for some time after
> November 10. The first time I saw the above BUG: was with next-20161128
> (which is the first build after the crash was fixed). That version did not
> include f2a471ffc8a8, so that cannot be the cause.
>
> I'll try to set up a bisect tonight, working around the crash problem.
> I'll let you know how it goes.

Whew! You had me going for a bit there. ;-)

Thanx, Paul