Oops on snmpd and 2.2.15pre17

From: Anders Henke (anders@schlund.de)
Date: Tue Apr 18 2000 - 12:28:37 EST


Hi,

we're using Compaq ProLiant 1850 machines (dual PIII) with Smart2 (cpqarray)
RAID arrays, 384 to 768 MB of ECC RAM and onboard tlan NICs. Because of
excessive problems with panics and oopses, similar to those desj@google.com
described in January, we updated our hardware's firmware and BIOS to the
current releases, verified the settings, kicked out
tcp_retransmit_try_collapse, included kmsgdump and upgraded to 2.2.15pre17.
For three days none of the 45 machines crashed; over the following two days
four machines went down: three panicked and the fourth froze. kmsgdump wasn't
able to dump correctly to the disk, so we were not able to capture any dumps.

Our general problems are the same as those desj@google.com described on
Jan 20th: the kernel crashes in tcp_retransmit_try_collapse by
dereferencing a NULL pointer, oopses, panics and kills the interrupt
handlers, and from there on the system is virtually dead.

Today we hit a different problem: kmsgdump was not called and the
system froze immediately after panicking:
---cut
Oops: 0000
CPU: 0
EIP: 0010:[<00000000>]
EFLAGS: 00010287
eax: 9d0be628 ebx: 00000000 ecx: 80210b14 edx: 9eef1560
esi: 00000000 edi: 00000001 ebp: 8d1b7cc0 esp: 8d1b7cb4
ds: 0018 es: 0018 ss: 0018
Process snmpd (pid: 29215, process nr: 408, stackpage=8d1b7000)
Stack: 00000001 8024b844 00000000 8d1b7cdc 8011bdfd 00001e17 9f76bbc2 9b2e9ec0
         8010b7e1 95101b50 9ffe64a0 8010a82c 95101b50 00000000 00000001 9f76bbc2
         9b2e9ec0 9ffe64a0 00000000 9f760018 95100018 ffffff00 80170148 00000010
Call Trace: [<8011bdfd>] [<8010b7e1>] [<8010a82c>] [<80170148>] [<8010b662>]
[<80170099>] [<801713ba>] [<8014e38b>] [<8015bc70>] [<80157a63>] [<80167f6a>]
[<80167b48>] [<8016c7fb>] [<8016c76c>] [<8014ca5e>] [<8016c76c>] [<8014d72f>]
[<8012155c>] [<80126fcf>] [<8011e6f0>] [<80120143>] [<801203ee>] [<8014e077>]
[<801097ec>] [<8010002b>]
Code: <1>Unable to handle kernel NULL pointer dereference at virtual address
00000000 current->tss.cr3 = 137b4000, %cr3 = 137b4000 *pde = 00000000
Warning: trailing garbage ignored on Code: line
  Text: 'Code: <1>Unable to handle kernel NULL pointer dereference at virtual
address 00000000 current->tss.cr3 = 137b4000, %cr3 = 137b4000 *pde = 00000000'
  Garbage: 'Unable to handle kernel NULL pointer dereference at virtual address
00000000 current->tss.cr3 = 137b4000, %cr3 = 137b4000 *pde = 00000000'
Warning, Code looks like message, not hex digits. No disassembly attempted.

>>EIP: 00000000 Before first symbol
Trace: 8011bdfd <do_bottom_half+85/a8>
Trace: 8010b7e1 <do_IRQ+4d/54>
Trace: 8010a82c <common_interrupt+18/20>
Trace: 80170148 <ip_fw_check+330/514>
Trace: 8010b662 <handle_IRQ_event+5a/90>
Trace: 80170099 <ip_fw_check+281/514>
Trace: 801713ba <ipfw_output_check+76/80>
Trace: 8014e38b <call_out_firewall+2f/4c>
Trace: 8015bc70 <ip_build_xmit+24c/304>
Trace: 80157a63 <ip_route_output+6b/fc>
Trace: 80167f6a <udp_sendmsg+302/34c>
Trace: 80167b48 <udp_getfrag+0/dc>
Trace: 8016c7fb <inet_sendmsg+8f/a4>
Trace: 8016c76c <inet_sendmsg+0/a4>
Trace: 8014ca5e <sock_sendmsg+8a/b0>
Trace: 8016c76c <inet_sendmsg+0/a4>
Trace: 8014d72f <sys_sendto+e3/11c>
Trace: 8012155c <do_generic_file_read+5f4/600>
Trace: 80126fcf <free_page_and_swap_cache+5f/64>
Trace: 8011e6f0 <zap_page_range+144/1c4>
Trace: 80120143 <unmap_fixup+11b/120>
Trace: 801203ee <do_munmap+21e/234>
Trace: 8014e077 <sys_socketcall+17b/248>
Trace: 801097ec <system_call+34/38>
Trace: 8010002b <startup_32+2b/a4>
---cut

A packet enters the firewalling checks, snmpd builds its answering packet,
the answer passes the outgoing firewalling checks, the kernel assembles
the final packet, sets its routing information and sends it out;
munmap is called, a system call is started and then startup_32 shows up.
snmpd executes external binaries for its answer, which explains the
do_munmap; the last system call snmpd makes is gettimeofday(), after
which it normally returns to its select().
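
To compare decoded traces across machines, the ksymoops "Trace:" lines can
be split into address, symbol, offset and function size. What follows is only
a minimal sketch under stated assumptions, not something we run in production:
parse_trace, TRACE_RE and SAMPLE are hypothetical names, the sample is simply
the first four lines of the trace above, and the pattern only covers plain
symbol names. Keep in mind that the trace lists every stack word that looks
like a kernel text address, so some entries may be stale stack contents rather
than live callers.

```python
import re

# Matches ksymoops lines such as: Trace: 8011bdfd <do_bottom_half+85/a8>
# (hypothetical helper pattern; assumes 8-digit addresses and plain symbol names)
TRACE_RE = re.compile(
    r"^Trace:\s+([0-9a-fA-F]{8})\s+<(\w+)\+([0-9a-fA-F]+)/([0-9a-fA-F]+)>"
)


def parse_trace(text):
    """Return (address, symbol, offset, size) tuples in the order listed."""
    frames = []
    for line in text.splitlines():
        match = TRACE_RE.match(line.strip())
        if match:
            addr, sym, off, size = match.groups()
            # All numeric fields in the ksymoops output are hexadecimal.
            frames.append((int(addr, 16), sym, int(off, 16), int(size, 16)))
    return frames


# First four decoded lines from the oops above, used as sample input.
SAMPLE = """\
Trace: 8011bdfd <do_bottom_half+85/a8>
Trace: 8010b7e1 <do_IRQ+4d/54>
Trace: 8010a82c <common_interrupt+18/20>
Trace: 80170148 <ip_fw_check+330/514>
"""

if __name__ == "__main__":
    for addr, sym, off, size in parse_trace(SAMPLE):
        print(f"{addr:08x}  {sym} (+0x{off:x} of 0x{size:x} bytes)")
```

Feeding it the full ksymoops output instead of SAMPLE gives one tuple per
"Trace:" line; all other lines are skipped.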

Any ideas out there for my problems?

yours,

Anders
Please cc answers to my address; I only read the weekly summary.
