Hard crash, 2.2.16pre2

From: root (root@china.patternbook.com)
Date: Mon May 08 2000 - 01:04:41 EST

[1.] One line summary of the problem:

TCP-related hard crash of 2.2.16pre2

[2.] Full description of the problem/report:

This is a system for which I've reported TCP-related crashes on kernels back
to 2.2.13. However, every piece of hardware has been replaced along the way
(one part at a time) except the Tekram SCSI controller, the SCSI hard drive,
and the CD drive, none of which seem likely suspects. The RAM is now ECC. I
saw a similar crash in 2.2.15pre20 but didn't have time to copy the screen.
That was with the kernel's 3Com driver; this latest one is with 3Com's 3c90x.
Crashes are much less frequent now: the system has lasted almost two weeks
between crashes (though it crashed twice in one day recently under
2.2.15pre17). The current crash came after less than four days. It's not
that busy a system, but it is running Apache, sendmail, bind 8, proftpd,
ipchains, and masquerading (for two other boxes), and answers on 8 outside
IPs. I have two systems elsewhere with similar software configuration and
function that have never crashed over some months, but those carry lighter
loads; one runs 2.2.13 and the other 2.2.14. The current crash is in a
different part of the TCP code than the earlier ones (which were in
tcp_keepalive), but it is still a null pointer dereference. Am I peeling an
onion here to get to the central bug?

[4.] Kernel version:

Linux version 2.2.16pre2 (root@china.patternbook.com) (gcc version
#1 Thu May 4 13:14:02 EDT 2000

[5.] Output of Oops.. [The oops page was typed in by hand, but carefully. It
would be real real nice if the kernel had greater facility in capturing
these ;-| ]

Unable to handle kernel NULL pointer dereference at virtual address 00000008
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010: [<c01796ce>]
eax: c35a8dc0 ebx: c745c640 ecx: 00000000 edx: c35a9088
esi: c745c640 edi: 00000001 ebp: c0217f48 esp: c0217f18
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, process nr: 0, stackpage=c0217000)
Stack: c745c640 c745c640 00000007 c016da92 c745c640 c745c640 c016d9d8 c01113ed
        c745c640 00000001 c0254384 00000000 c0217f60 c0117bdd 00000000 c0216000
        01bfbb3f c010a311 00000e00 c0109fe0 00000000 c0216000 00000000 c0216000
Call Trace: [<c016da92>] [<c016d9d8>] [<c01113ed>] [<c0117bdd>] [<c010a311>] [<c0109fe0>] [<c01078a9>]
            [<c0106000>] [<c01078cc>] [<c01090fc>] [<c0106000>] [<c010607b>] [<c0106000>] [<c0100175>]
Code: 83 79 08 00 75 24 8b 51 04 85 d2 74 06 8b 41 0c 89 42 0c 8b
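
The error code and fault address above can be cross-checked by hand. The sketch below is only an illustration: it decodes the low three bits of the x86 page-fault error code ("Oops: 0000") and recomputes the faulting address from the register dump, taking the first instruction bytes (83 79 08 00) to be a compare against offset 8 from %ecx, as the ksymoops decode further down reports. The helper names decode_error_code and effective_address are invented for this example and are not part of any kernel tool.

```python
# Sketch: decode an x86 page-fault error code and recompute the fault address.
# All values are copied from the oops text; the helper names are hypothetical.

def decode_error_code(code):
    """Interpret the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }

def effective_address(base, displacement):
    """Base register plus displacement, wrapped to 32 bits."""
    return (base + displacement) & 0xFFFFFFFF

oops_code = int("0000", 16)      # from "Oops: 0000"
ecx = int("00000000", 16)        # from "ecx: 00000000"

info = decode_error_code(oops_code)
fault = effective_address(ecx, 0x8)   # operand 0x8(%ecx)

print(info)
print("faulting address: %08x" % fault)
```

Under these assumptions the result is a kernel-mode read of a page that is not present, at address 00000008, which agrees with the first line of the oops: a structure pointer held in %ecx was NULL and a field 8 bytes into it was read.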

>>EIP: c01796ce <tcp_v4_unhash+72/a4>
Trace: c016da92 <net_timer+ba/13c>
Trace: c016d9d8 <net_timer+0/13c>
Trace: c01113ed <timer_bh+2e9/330>
Trace: c0117bdd <do_bottom_half+49/64>
Trace: c010a311 <do_IRQ+39/40>
Trace: c0109fe0 <common_interrupt+18/20>
Trace: c01078a9 <cpu_idle+61/70>
Trace: c0106000 <get_options+0/74>
Code: c01796ce <tcp_v4_unhash+72/a4> 00000000 <_EIP>: <===
Code: c01796ce <tcp_v4_unhash+72/a4> 0: 83 79 08 00 cmpl $0x0,0x8(%ecx) <===
Code: c01796d2 <tcp_v4_unhash+76/a4> 4: 75 24 jne c01796f8 <tcp_v4_unhash+9c/a4>
Code: c01796d4 <tcp_v4_unhash+78/a4> 6: 8b 51 04 mov 0x4(%ecx),%edx
Code: c01796d7 <tcp_v4_unhash+7b/a4> 9: 85 d2 test %edx,%edx
Code: c01796d9 <tcp_v4_unhash+7d/a4> b: 74 06 je c01796e1 <tcp_v4_unhash+85/a4>
Code: c01796db <tcp_v4_unhash+7f/a4> d: 8b 41 0c mov 0xc(%ecx),%eax
Code: c01796de <tcp_v4_unhash+82/a4> 10: 89 42 0c mov %eax,0xc(%edx)
Code: c01796e1 <tcp_v4_unhash+85/a4> 13: 8b 00 mov (%eax),%eax

Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
In swapper task - not syncing
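
For anyone reading the decoded trace, each ksymoops entry has the form "address <symbol+offset/size>", with offset and size in hex. The sketch below is a minimal parser for that form, written only to show how a function's start and end addresses follow from one entry. The regular expression and the parse_entry helper are assumptions made for this example, not part of ksymoops.

```python
import re

# Matches "c01796ce <tcp_v4_unhash+72/a4>": address, symbol, hex offset, hex size.
SYMBOL_RE = re.compile(r"([0-9a-f]{8}) <(\w+)\+([0-9a-f]+)/([0-9a-f]+)>")

def parse_entry(line):
    """Return symbol details for one ksymoops line, or None if it has no symbol."""
    match = SYMBOL_RE.search(line)
    if match is None:
        return None
    address = int(match.group(1), 16)
    offset = int(match.group(3), 16)
    size = int(match.group(4), 16)
    start = address - offset
    return {
        "symbol": match.group(2),
        "address": address,
        "start": start,          # first byte of the function
        "end": start + size,     # one past the last byte of the function
    }

eip = parse_entry(">>EIP: c01796ce <tcp_v4_unhash+72/a4>")
caller = parse_entry("Trace: c016da92 <net_timer+ba/13c>")

print("%s starts at %08x, ends at %08x" % (eip["symbol"], eip["start"], eip["end"]))
print("%s starts at %08x" % (caller["symbol"], caller["start"]))
```

Applied to the entries above, this places tcp_v4_unhash at c017965c through c0179700, and puts the start of net_timer at c016d9d8, which matches the separate "net_timer+0/13c" entry in the trace.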

[7.] Environment
[7.1.] Software (add the output of the ver_linux script here)

Linux china.patternbook.com 2.2.16pre2 #1 Thu May 4 13:14:02 EDT 2000 i586 unknown
Kernel modules 2.3.9
Gnu C
Linux C Library 2.1.1
Dynamic linker ldd (GNU libc) 2.1.1
Procps 2.0.2
Mount 2.9o
Net-tools 1.52
Console-tools 1999.03.02
Sh-utils 1.16
Modules Loaded ip_masq_raudio ip_masq_ftp 3c90x

[7.2.] Processor information (from /proc/cpuinfo):

processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 451.030
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr mce cx8 sep mtrr pge mmx 3dnow
bogomips : 897.84

[7.3.] Module information (from /proc/modules):

ip_masq_raudio 2896 0 (unused)
ip_masq_ftp 2508 0 (unused)
3c90x 22884 2 (autoclean)

[7.4.] SCSI information (from /proc/scsi/scsi)

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: IBM Model: DORS-32160 Rev: WA6A
  Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
  Vendor: NEC Model: CD-ROM DRIVE:222 Rev: 3.0i
  Type: CD-ROM ANSI SCSI revision: 02

I'm not subscribed, but will check the archive on tux for responses.
However, if anyone has any good guesses on finally ending these crashes,
direct email will be most welcome. Thanks.


