Re: General protection fault in kswapd in 2.0.18

Linus Torvalds (torvalds@cs.helsinki.fi)
Fri, 13 Sep 1996 14:12:58 +0300 (EET DST)


On Wed, 11 Sep 1996, Chris Adams wrote:
>
> general protection: 0000
> CPU: 0
> EIP: 0010:[<00119a10>]
> EFLAGS: 00010213
> eax: 02a67000 ebx: 00333a50 ecx: 00000006 edx: 92a67044
> esi: 000003ff edi: 00001ffe ebp: 00008000 esp: 07ff2fa4
> ds: 0018 es: 0018 fs: 0018 gs: 0018 ss: 0018
> Process kswapd (pid: 3, process nr: 3, stackpage=07ff2000)
> Stack: 00000006 00000003 00000000 00000000 02a67000 0011ceff 00000006 00000000
> 00000003 0019ccd2 00000000 00009000 0011d0a3 00000003 00000000 00000000
> 00000100 07ff6fdc 07ff320a 001092bb 00000000 0011cf5c 001b01f8
> Call Trace: [<0011ceff>] [<0011d0a3>] [<001092bb>] [<0011cf5c>]
> Code: f6 42 14 10 74 0e 0f ba 72 14 04 19 c0 0f ba 6b 18 02 19 c0
>
>
> Using `/System.map' to map addresses to symbols.
>
> >>EIP: 119a10 <shrink_mmap+74/1dc>
> Trace: 11ceff <try_to_free_page+3f/9c>
> Trace: 11d0a3 <kswapd+147/158>
> Trace: 1092bb <init+3f/264>
> Trace: 11cf5c <kswapd+0/158>
>
> Code: 119a10 <shrink_mmap+74/1dc> testb $0x10,0x14(%edx)
> Code: 119a14 <shrink_mmap+78/1dc> je 119a24 <shrink_mmap+88/1dc>
> Code: 119a16 <shrink_mmap+7a/1dc> btrl $0x4,0x14(%edx)
> Code: 119a1b <shrink_mmap+7f/1dc> sbbl %eax,%eax
> Code: 119a1d <shrink_mmap+81/1dc> btsl $0x2,0x18(%ebx)
> Code: 119a22 <shrink_mmap+86/1dc> sbbl %eax,%eax

Ok, I went through this, and it's the buffer "b_this_page" circular linked
list that has gotten corrupted.

Now, the corruption is interesting: the bad pointer in question is in %edx,
and it's 0x92a67044 (which is total nonsense for a kernel pointer). However,
the head of that circular list (and probably the correct value for the bogus
pointer) is 0x02a67000. Now, that's an interesting pattern, actually:

0x92a67044
0x02a67000

Notice the similarity? The first nibble and the last two nibbles have changed,
but other than that it looks like it could be just bit corruption. That's two
bytes with two-bit corruption, and it _might_ be due to memory problems. But
4 incorrect bits is really rather a lot, so it's hard to say. One thing that
you might want to do is to just check out the memory with some good memory
tester (no, "testing extended memory" under DOS with qemm or whatever doesn't
count ;)
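
The comparison above can be checked mechanically: XOR the two values and
count the set bits in each byte of the result. A minimal sketch in Python
(the function name flipped_bits_per_byte is made up for this illustration,
and treating 0x02a67000 as the "correct" value is only the guess made above):

```python
def flipped_bits_per_byte(bad: int, good: int, width_bytes: int = 4) -> list[int]:
    """Return the number of differing bits in each byte, most significant first."""
    diff = bad ^ good  # a set bit marks a position where the two values differ
    return [
        bin((diff >> (8 * i)) & 0xFF).count("1")
        for i in reversed(range(width_bytes))
    ]


bad = 0x92A67044   # value found in %edx
good = 0x02A67000  # list head, assumed to be the intended pointer value

print(f"{bad ^ good:08x}")               # 90000044
print(flipped_bits_per_byte(bad, good))  # [2, 0, 0, 2]
```

The top byte and the bottom byte each differ in two bits, and the two middle
bytes are untouched, which is the "two bytes with two-bit corruption" pattern.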

Does anybody on the kernel list know of a good test program that is generally
available that can be left running overnight or similar? (Actually, if you
are in a hot and humid area it's probably best left running during the day,
through the hottest hours.) I'd think that if this is a memory problem it
would show up under any reasonably good test if it can result in even
four-bit errors.

Linus