Re: frequent lockups in 3.18rc4

From: Chris Mason
Date: Fri Dec 05 2014 - 14:04:54 EST

On Fri, Dec 5, 2014 at 1:38 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
On Fri, Dec 5, 2014 at 9:15 AM, Dave Jones <davej@xxxxxxxxxx> wrote:

A bisect later, and I landed on a kernel that ran for a day, before
spewing NMI messages, recovering, and then..

http://codemonkey.org.uk/junk/log.txt

I have to admit I'm seeing absolutely nothing sensible in there.

Call it bad, and see if bisection ends up slowly - oh so slowly -
pointing to some direction. Because I don't think it's the hardware,
considering that apparently 3.16 is solid. And the spews themselves
are so incomprehensible that I'm not seeing any pattern what-so-ever.

I went back through all of the traces Dave has posted in this thread. This one looks like it has VM debugging on:

http://marc.info/?l=linux-kernel&m=141632237304726&w=2

Another had a function call from CONFIG_DEBUG_PAGEALLOC:

http://marc.info/?l=linux-kernel&m=141701248210949&w=2

So one idea is that, with those debug options enabled, our allocation/freeing of pages is dramatically more expensive and we're hitting a strange edge condition. Maybe we're even faulting on a read-only page from a horrible place?
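
For anyone who wants to check which of these debug options a given build actually has, here is a minimal sketch. It is not from the traces: the helper name enabled_options and the sample config text are made up for illustration, and it only assumes the usual .config line format.

```python
import re

# The two debug options that showed up in the traces discussed above.
DEBUG_OPTIONS = ("CONFIG_DEBUG_VM", "CONFIG_DEBUG_PAGEALLOC")


def enabled_options(config_text, options=DEBUG_OPTIONS):
    """Return the subset of `options` set to y or m in .config-style text."""
    found = set()
    for line in config_text.splitlines():
        # Enabled options look like CONFIG_FOO=y; disabled ones appear as
        # comments ("# CONFIG_FOO is not set") and never match here.
        m = re.match(r"^(CONFIG_\w+)=([ym])$", line.strip())
        if m and m.group(1) in options:
            found.add(m.group(1))
    # Preserve the caller's ordering of options.
    return [opt for opt in options if opt in found]


# Hypothetical sample text. A real config usually lives in
# /boot/config-<release> or /proc/config.gz, depending on the build.
sample = """\
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_SLUB=y
"""
print(enabled_options(sample))
```

Running the same check against the config of each kernel in the bisect would show whether the lockups only appear when these options are on.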

[83246.925234] end_request: I/O error, dev sda, sector 0

Ext3/4 shouldn't be doing I/O to sector zero. Is something stomping on RAM?

-chris
