Re: Oops and painful death of box, possibly solved

Rick Franchuk (rickf@transpect.net)
Thu, 1 Apr 1999 09:43:34 -0800 (PST)


On Wed, 31 Mar 1999, Simon Kirby wrote:

> What kernel version, what compiler, and what binutils (ld -v) were used?
> 0x0000000d looks either like odd memory corruption or a broken compile.
> Did the two machines that were screwing up get the same OOPSes exactly?
> Was MTRR enabled in the config? Were all servers running the same kernel?

Well, at least one of your questions was in the message (2.2.4 and 2.2.5,
same problem). MTRR off, SMP off, compiler is what came with RH5.2 stock:

Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.7.2.3/specs
gcc version 2.7.2.3

I couldn't tell you if they were THE EXACT SAME oops. As I said, I'm some
thousand miles away, and $5.50/hr overnight system watchmen don't really give
a rat's ass outside of rebooting a machine when it goes down. =(

Problem persisted with all servers using a single kernel copied across the
machines, and individually compiled kernels on the machines.

> I doubt disabling of the IDE would affect the dentry cache in any way.
> You can boot the kernel with ide0=noprobe ide1=noprobe to stop it from
> touching IDE (or take it out of the kernel).

Strange, no? It's been running for a few days now without incident, where
before it had an MTBF of around 4-6 hours. My thought is that there's
something just slightly wonky with these particular motherboards, but I've
not kernel hacked much so my perceptions might be off base there. At the time,
I was reluctant to remove ALL IDE support, in case we wanted/needed to jam in
a CD-ROM reader for some reason (which would only require a top off and couple
of cables, not top off, cables and kernel recompile).

Perhaps it's oops-ing in dentry as a result of some damage elsewhere. Kind of
a delayed reaction thing. Again, my thoughts on this matter are mired in
near-total-ignorance of kernel internals (outside of what 'writing linux
device drivers' taught me).
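
For what it's worth, the register dump in the quoted oops is at least
consistent with a bad pointer being followed: the faulting instruction is
"movl 0x0(%ebp),%ebp", and ebp holds 0000000b, the same value reported as the
faulting virtual address. A minimal Python sketch of that cross-check follows.
The helper names (fault_address, registers, suspects) are made up for
illustration, and the parsing assumes the 2.2.x i386 oops layout shown in the
quoted message. It only flags which register matches the fault address; it
says nothing about where the bad value came from.

```python
import re

# Lines copied from the quoted oops (2.2.x i386 format assumed).
OOPS = """\
Unable to handle kernel NULL pointer dereference at virtual address 0000000b
eax: 00001960 ebx: fffffff3 ecx: 49913b2c edx: 49913fb4
esi: c020d394 edi: 00000001 ebp: 0000000b esp: c54c5f38
"""


def fault_address(text):
    """Return the faulting virtual address as an int, or None if absent."""
    m = re.search(r"virtual address ([0-9a-f]{8})", text)
    return int(m.group(1), 16) if m else None


def registers(text):
    """Map register name -> value for every 'eXX: hhhhhhhh' pair found."""
    pairs = re.findall(r"\b(e[a-z]{2}): ([0-9a-f]{8})", text)
    return {name: int(value, 16) for name, value in pairs}


def suspects(text):
    """Return the registers whose value equals the faulting address."""
    addr = fault_address(text)
    return sorted(r for r, v in registers(text).items() if v == addr)


if __name__ == "__main__":
    print(suspects(OOPS))
```

On the dump above, only ebp matches, so the load through ebp is the access
that faulted. That fits a "damage elsewhere" theory but doesn't prove it.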

> How's it going, btw? ;)

As of Monday I'm shuttling back and forth to L.A. every other week to work on
Yet Another Bigass Project (YABP)... one of the several I offered you a job
helping me with ;) S'ok though, I'll cope. I'll just stock up on juicy
fruit, 9mm Hollow Point, and cookies. Mmmmmmmm... cookies!

David still bitching people out for eating around the computers?

> On Wed, 31 Mar 1999, Rick Franchuk wrote:
>
> > Recently, I had a contractor of mine install five Intel boxes (PII-400s and
> > PII-450s) in a provider in San Jose. Although all the pieces in all the
> > machines were identical, two started producing the following oops under what
> > appeared to be moderate to heavy disk usage:
> >
> > Unable to handle kernel NULL pointer dereference at virtual address 0000000b
> > current->tss.cr3 = 012c7000, %cr3 = 012c7000
> > *pde = 00000000
> > Oops: 0000
> > CPU: 0
> > EIP: 0010:[<c012d075>]
> > EFLAGS: 00010292
> > eax: 00001960 ebx: fffffff3 ecx: 49913b2c edx: 49913fb4
> > esi: c020d394 edi: 00000001 ebp: 0000000b esp: c54c5f38
> > ds: 0018 es: 0018 ss: 0018
> > Process httpd (pid: 17592, process nr: 58, stackpage=c54c5000)
> > Stack: 00000001 c2355c00 c020d394 c301301d 874e0363 0000000e c01288b4 c2355c00
> > c54c5f80 c54c5f80 c0128ae0 c2355c00 c54c5f80 c3013000 c3013000 00000001
> > bffffbd0 c3013000 c301301d 0000000e 874e0363 c0128bc5 c3013000 00000000
> > Call Trace: [<c01288b4>] [<c0128ae0>] [<c0128bc5>] [<c0126caf>] [<c0107a40>]
> > Code: 8b 6d 00 8b 74 24 18 39 73 48 75 eb 8b 74 24 24 39 73 0c 75
> >
> > >>EIP: c012d075 <d_lookup+65/dc>
> > Trace: c01288b4 <cached_lookup+10/4c>
> > Trace: c0128ae0 <lookup_dentry+fc/1b8>
> > Trace: c0128bc5 <__namei+29/5c>
> > Trace: c0126caf <sys_newstat+13/64>
> > Trace: c0107a40 <system_call+34/38>
> > Code: c012d075 <d_lookup+65/dc> 00000000 <_EIP>: <===
> > Code: c012d075 <d_lookup+65/dc> 0: 8b 6d 00 movl 0x0(%ebp),%ebp <===
> > Code: c012d078 <d_lookup+68/dc> 3: 8b 74 24 18 movl 0x18(%esp,1),%esi
> > Code: c012d07c <d_lookup+6c/dc> 7: 39 73 48 cmpl %esi,0x48(%ebx)
> > Code: c012d07f <d_lookup+6f/dc> a: 75 eb jne c012d06c <d_lookup+5c/dc>
> > Code: c012d081 <d_lookup+71/dc> c: 8b 74 24 24 movl 0x24(%esp,1),%esi
> > Code: c012d085 <d_lookup+75/dc> 10: 39 73 0c cmpl %esi,0xc(%ebx)
> > Code: c012d088 <d_lookup+78/dc> 13: 75 00 jne c012d08a <d_lookup+7a/dc>
> >
> > A number of oopses would happen in rapid succession, followed by segfaults of
> > whatever happened to be running and 'cannot fork()' messages streaming down
> > the screen locally (I never saw them though... I'm in Vancouver, so I can't
> > detail exactly what was on the screen if it wasn't logged).
> >
> > Curiously, the machine also exhibited the following during boot up (which was
> > annoying, because the 'timeouts' involved were fairly long):
> >
> > hda: no response (status = 0xa1), resetting drive
> > hda: no response (status = 0xa1)
> > hdb: no response (status = 0xa1), resetting drive
> > hdb: no response (status = 0xa1)
> > hdc: no response (status = 0xa1), resetting drive
> > hdc: no response (status = 0xa1)
> > hdd: no response (status = 0xa1), resetting drive
> > hdd: no response (status = 0xa1)
> >
> > I have a feeling that this is significant, as once I was able to get our man
> > in Cali to completely disable all onboard IDE controllers (we run 100% SCSI
> > using Adaptec 2940UWs, but the OOPSen also flared up on an NCR53c875 we
> > decided to test), the oopses now SEEM to have totally dissolved... I'm writing
> > in hopes that it could be confirmed that this is indeed the source of the
> > error (to let me sleep sounder at night) and if it's a specific board-related
> > issue I can find out the model number so you all can avoid it. ;)

--
  __________________________________________
 |                                          |
 |  Rick Franchuk  -  TranSpecT Consulting  |
 |_______                            _______|
         \mailto:rickf@transpect.net/
          \_____ICQ_#_4435025______/

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/