Frequent oops in shrink_mmap

David desJardins (desj@google.com)
Wed, 17 Nov 1999 23:42:40 -0800


Google runs on a cluster of 2000+ Linux/Intel machines. We have a
problem on several of those machines, a symptom of which is kernel oops
messages in try_to_free_buffers, called from shrink_mmap. (The machines
which have these kernel oops messages tend to have poor performance in
general, even at times when the oops aren't being generated.)

We are currently running 2.2.7 on these machines, but I see from
searching linux-kernel traffic that the problem of kernel oops in
shrink_mmap is a long-standing one which has been reported in many
different kernel versions, up to at least 2.2.12, and has also been
reported in Linux/Alpha. I've looked at all of the past archives, and I
didn't see any definitive response to these problems.

The problem is probably related to heavy disk use and/or heavy network
traffic and/or heavy parallel processing with threads. These are the
circumstances under which we tend to see the problem, and also are
circumstances that I see correlated with past reports of this problem.

I have 56 of these oops messages from 10 different machines, all at
exactly the same location. I've attached a sample at the end of this
note. Most of them look very similar (to each other, and to other oops
messages previously posted to this list). But I would be happy to
forward all of them to anyone who can use them. I could collect even
more if that would be helpful.

It's possible that the problem is associated with or caused by a
hardware problem. Out of thousands of machines, we could easily have 10
with some specific problem. But I don't really know what kind of a
problem, or why it would manifest itself only in this particular way.
Also, the fact that the same problem occurs in Linux/Alpha makes me
suspect some sort of software flaw.

We would greatly appreciate any constructive suggestions or help with
this problem. I would be glad to send a Google T-shirt to anyone who
can help.

-- David desJardins

*** Linux s17 2.2.7 #6 Mon May 10 16:58:52 PDT 1999 i686 unknown
Nov 15 15:50:42 s17 kernel: Unable to handle kernel paging request at virtual address 40000014
Nov 15 15:50:42 s17 kernel: current->tss.cr3 = 00101000, %cr3 = 00101000
Nov 15 15:50:42 s17 kernel: *pde = 00000000
Nov 15 15:50:42 s17 kernel: Oops: 0000
Nov 15 15:50:42 s17 kernel: CPU: 0
Nov 15 15:50:42 s17 kernel: EIP: 0010:[try_to_free_buffers+18/136]
Nov 15 15:50:42 s17 kernel: EFLAGS: 00010202
Nov 15 15:50:42 s17 kernel: eax: 40000000 ebx: c029b320 ecx: 00000006 edx: 00020000
Nov 15 15:50:42 s17 kernel: esi: 40000000 edi: 40000000 ebp: c029b320 esp: c000dfa8
Nov 15 15:50:42 s17 kernel: ds: 0018 es: 0018 ss: 0018
Nov 15 15:50:42 s17 kernel: Process kswapd (pid: 4, process nr: 4, stackpage=c000d000)
Nov 15 15:50:42 s17 kernel: Stack: 00000030 c000c000 c011b386 c029b320 00000020 00000006 c01200eb 00000006
Nov 15 15:50:42 s17 kernel: 00000030 00000000 c01a1eee c000c1c1 c0120197 00000030 00000f00 c0003fb4
Nov 15 15:50:42 s17 kernel: c0106000 00009000 c01073eb 00000000 00000f00 c01cdfd8
Nov 15 15:50:42 s17 kernel: Call Trace: [shrink_mmap+218/300] [do_try_to_free_pages+35/120] [tvecs+5218/20156] [kswapd+87/200] [get_options+0/116] [kernel_thread+35/48]
Nov 15 15:50:42 s17 kernel: Code: 8b 76 14 83 78 20 00 75 06 f6 40 18 46 74 13 6a 00 e8 4c 01

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/