Linux Problem

Joshua J. Berry (jberry@monte.mvhs.srvusd.k12.ca.us)
Thu, 8 Jul 1999 12:21:03 -0700


Hello. I am having a bit of a problem with my Linux box and I was wondering
if any of you on the mailing lists I'm sending to could help me.

SYSTEM CONFIGURATION INFO:

Processor: Intel Pentium II 333MHz (333.41 bogomips reported by kernel)
RAM: 64 MB with 31 MB swap disk
Hard Disk: 4.3 GB
CD-ROM: 32x IDE/ATAPI
OS: Dual boot (using LILO) between Windows 98 and
Linux-Mandrake 6.0 (a variant of Red Hat 6) with kernel 2.2.9
Video: Trident 3Dimage975 AGP with 4MB VRAM
Sound: Yamaha OPL3/SAx
Network: None

CONTACT INFORMATION:

Full Name: Joshua J. Berry (Josh)
E-Mail Address: Please reply to: jberry@monte.mvhs.srvusd.k12.ca.us

PROBLEM DESCRIPTION:

The first time I noticed anything was wrong was when Emacs (in text mode)
crashed with a SIGSEGV. It would crash every time I tried to exit (using
Ctrl-X, Ctrl-C), and wouldn't return me to a shell prompt no matter what I
did. I had to switch to another terminal and kill myself on the first one
to get it working again. A process list showed Emacs still running (even
though I had been logged out on the first terminal), but in an
uninterruptible sleep (ps status 'D'). Killing Emacs (either w/SIGTERM or
SIGKILL) from another terminal didn't reproduce the problem; it was only
when I tried to exit from within the program. I got rid of Emacs and didn't
think any more about it. The coredump was 0 bytes in length.

I got X Windows set up and working with GNOME, using runlevel 5 and gdm
(GNOME display manager). The problem again manifested itself, but this time
in the GNOME panel (sparoadically after I updated the menu) when I clicked
on the main menu icon. It didn't occur all the time, and it eventually
went away. It didn't matter whether I was using the GNOME Menu Editor or
a shell prompt and a text editor to modify the menu. Again, the panel
coredump was 0 bytes in length.

Since the GNOME panel problem went away, I didn't really think about it
again until I started getting "filesystem busy" error messages when trying
to shut down. I went to runlevel 1 and did a ps listing. A few GNOME
applications were still there, and they were also in ps state 'D'. I
discovered the problem occurred whenever I killed the X server (using
Ctrl+Alt+Bksp), switched runlevels, or halted or rebooted the system. It
did NOT occur when I properly logged out using the "Logout" function in
the GNOME menu. On one of these forays into single-user mode I printed
out a ps listing:

>>> I got rid of the TTY column so it would fit in my mail window. All
the D status procs were tty '?'.

USER PID %CPU %MEM VSZ RSS STAT START TIME COMMAND
root 1 0.1 0.4 1156 272 S 13:01 0:04 init [
root 2 0.0 0.0 0 0 SW 13:01 0:00 [kflushd]
root 3 0.0 0.0 0 0 SW 13:01 0:00 [kpiod]
root 4 0.0 0.0 0 0 SW 13:01 0.00 [kswapd]
root 826 0.0 6.1 6520 3892 D 13:07 0:00 panel --sm-config
root 829 0.0 8.3 7572 5244 D 13:07 0:00 gmc --sm-config-p
root 842 0.0 5.2 5968 3336 D 13:07 0:00 gen_util_applet -
root 846 0.0 5.4 6072 3432 D 13:07 0:00 multiload_applet
root 985 0.0 5.2 5968 3332 D 13:10 0:00 gen_util_applet -
root 989 0.0 5.4 6072 3428 D 13:10 0:00 multiload_applet
root 994 0.0 5.5 5740 3516 D 13:10 0:00 gnome-terminal
root 995 0.0 0.0 0 0 Z 13:10 0:00 [gnome-pty-helpe
root 996 0.0 0.0 0 0 Z 13:10 0:00 [tcsh <defunct>]
root 1055 0.0 5.9 6132 3772 D 13:18 0:01 gedit xdie
root 1526 0.2 5.3 5984 3396 D 13:57 0:00 gnomepager_applet
root 1528 0.2 5.3 6072 3400 D 13:57 0:00 gen_util_applet -
root 1532 0.2 5.5 6180 3496 D 13:57 0:00 multiload_applet
root 1804 0.0 0.5 1156 352 S 13:57 o:oo init [
root 1805 0.0 1.6 1928 1052 S 13:57 0:00 /bin/sh
root 1823 0.0 0.9 1356 584 S 13:59 0:00 lpd
root 1855 0.0 1.3 2528 876 R 14:00 0:00 ps axu
root 1856 0.0 1.6 1928 1052 R 14:00 0:00 /bin/sh

This was after I had killed the X server twice and dropped to runlevel 1.
I looked around /proc at some of the dead filesystems, and got something
like this, which (along with the fact that this problem occurs during Emacs'
exit) suggests to me that it might be a problem in the kernel with memory
allocation/deallocation. (LATER ON: I checked the 'cat mem' command on a
process and got the same error message, even though the process wasn't
'undead')

Polaris -> pwd
/proc
Polaris -> cd 1055
Polaris -> cd fd
Polaris -> ls
total 0
lrwx------ 1 root root 64 Jul 7 14:02 0 -> /dev/pts/0
Irwx------ 1 root root 64 Jul 7 14:02 1 -> /dev/pts/0
lrwx------ 1 root root 64 Jul 7 14:02 2 -> /dev/pts/0
Irwx------ 1 root root 64 Jul 7 14:02 3 -> socket:[2613]
l-wx------ 1 root root 64 Jul 7 14:02 44 -> pipe:[2657]
lr-x------ 1 root root 64 Jul 7 14:02 5 -> pipe:[2657]
Polaris -> cd ..
Polaris -> less mem
read error
Polaris -> cat mem
cat: mem: No such process
Polaris -> ls
total 0
-r--r--r-- 1 root root 0 Jul 7 14:02 cmdline
lrwx------ 1 root root 0 Jul 7 14:02 cwd -> /home/root/
-r-------- 1 root root 0 Jul 7 14:02 environ
lrwx------ 1 root root 0 Jul 7 14:02 exe -> /usr/bin/gedit*
dr-x------ 2 root root 0 Jul 7 14:02 fd/
pr--r--r-- 1 root root 0 Jul 7 14:02 maps|
-rw------- 1 root root 0 Jul 7 14:02 mem
lrwx------ 1 root root 0 Jul 7 14:02 root -> //
-r--r--r-- 1 root root 0 Jul 7 14:02 stat
-r--r--r-- 1 root root 0 Jul 7 14:02 statm
-r--r--r-- 1 root root 0 Jul 7 14:02 status
Polaris -> cat cmdline
geditxdiePolaris -> less cmdline
geditxdiePolaris -> exit

I decided after reviewing this to adjust the kernel (messing only with the
stuff in 'make menuconfig') and recompile it. I removed the experimental
stuff and the SCSI support (I have no SCSI devices on my computer), and
recompiled. No difference; the same symptoms with the GNOME programs were
showing up. Zero-length coredumps were being left in / and /home/root (I
was doing all this testing using the root userid).

I then reloaded Emacs because I wanted to try something else in text mode.
I started Emacs, then looked at the /proc/pid/status file:

Name: emacs
State: S (sleeping)
Pid: 512
PPid: 510
Uid: 0 0 0 0
Gid: 0 0 0 0
Groups: 0 1 2 3 4 6 10
VmSize: 6116 kB
VmLck: 0 kB
VmRSS: 2892 kB
VmData: 124 kB
VmStk: 172 kB
VmExe: 916 kB
VmLib: 2772 kB
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000019814eff
CapInh: 00000000fffffeff
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff

I then tried to exit Emacs normally, which I fully expected to cause a
segfault. The status file looked like this:

Name: emacs
State: D (disk sleep)
Pid: 594
PPid: 497
Uid: 0 0 0 0
Gid: 0 0 0 0
Groups: 0 1 2 3 4 6 10
VmSize: 6116 kB
VmLck: 0 kB
VmRSS: 2956 kB
VmData: 124 kB
VmStk: 172 kB
VmExe: 916 kB
VmLib: 2772 kB
SigPnd: 0000000000000000
SigBlk: 0000000008000000
SigIgn: 0000000010000000
SigCgt: 0000000009814aff
CapInh: 00000000fffffeff
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff

Interestingly enough, this behavior was also sparoadic. I then tried
KILLing the Emacs process with the following change in the status file:

Name: emacs
State: D (disk sleep)
Pid: 630
PPid: 629
Uid: 0 0 0 0
Gid: 0 0 0 0
Groups: 0 1 2 3 4 6 10
VmSize: 6116 kB
VmLck: 0 kB
VmRSS: 2956 kB
VmData: 124 kB
VmStk: 172 kB
VmExe: 916 kB
VmLib: 2772 kB
SigPnd: 0000000000000100
SigBlk: 0000000008000000
SigIgn: 0000000010000000
SigCgt: 0000000009814aff
CapInh: 00000000fffffeff
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff

So far as I can tell, neither KILL nor TERM will shut the process down. I
would appreciate any input you can give me.

OTHER FACTS:

Setting the ulimit -c in bash while I was logged in (not my normal
shell...I prefer tcsh) to something didn't change the behavior.

Another interesting side fact (although it may not be pertinent to the
problem itself); when I tried to install StarOffice 5, the install program
said that I didn't have glibc (libc6) installed on my computer. I checked
the package databases, found glibc and verified the package. All was in
order. Could this be a library issue and not a kernel one?

At this point I have run out of ideas. If anyone has any other thoughts on
this, I would be happy to hear them.

Thanks,
Josh

---------------------------------
Joshua J. Berry
Monte Vista/Venture High Schools
Redwell Technology Corp.

Network and Systems Administrator

"Long live the penguin!"
---------------------------------

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/