2.0.18: trouble with long-running tar processes

Frank Wales (frank@limitless.co.uk)
Sat, 14 Sep 1996 23:43:29 +0000 (BST)


[This is a bit lengthy, but I'm having apparent file system
problems associated with backups, and it's not clear to me
what data is relevant to diagnosis.]

After the various difficulties I had surrounding both a dodgy
power supply and motherboard (both now replaced), and the problems
with 2.0.15 through 2.0.17, I find that the file server system that
is now running 2.0.18 is stable, with two exceptions that I cannot
explain as anything other than kernel problems, of which this is
the more annoying. Consequently, I'm interested in anyone's opinion
on this, and suggestions for how I might tie it down.

Background: one of the things I do each night is copy various portions
of the file system to a dedicated backup disk, which involves running
a script from cron to tar the backed-up portions of the file system
into files on the backup disk. I use the z option, which spawns a
gzip to compress the archives as they're produced. My problem is
that, while most complete, the backup of /home always seems to
tank the machine after about 40 minutes, and I end up with a mostly
dysfunctional box that doesn't crash or hang, but seems unable to
spawn new processes or do anything else useful, forcing me to power-cycle
the box.

As you can see from this partial df listing, /home is moderately large:

Filesystem 1024-blocks Used Available Capacity Mounted on
...
/dev/sdc5 986112 780516 154653 83% /home
/dev/sda2 836483 443941 392542 53% /backup
...

Compressed, all the drives will fit into /backup with a couple of
hundred meg to spare. I know this because /home used to be
fragmented over a few smaller partitions, but still fitted into less
space than /backup now contains. Since I took the hardware upgrade
as an opportunity to re-organise the disks, however, it has
never successfully completed. The last time it went down, these
three Oopses were recorded in syslog:

Sep 8 03:41:07 gromit kernel: general protection: 0000
Sep 8 03:41:07 gromit kernel: CPU: 0
Sep 8 03:41:07 gromit kernel: EIP: 0010:[<00121a38>]
Sep 8 03:41:07 gromit kernel: EFLAGS: 00010206
Sep 8 03:41:07 gromit kernel: eax: 00000910 ebx: 00000000 ecx: f6000181 edx: 00024624
Sep 8 03:41:07 gromit kernel: esi: 0000047e edi: 000003e8 ebp: 001f6504 esp: 0060eee0
Sep 8 03:41:07 gromit kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Sep 8 03:41:07 gromit kernel: Process tar (pid: 24118, process nr: 54, stackpage=0060e000)
Sep 8 03:41:07 gromit kernel: Stack: 00000000 001e32b0 00000000 00121c71 00ae8c64 0060ef74 0000000c 0158702f
Sep 8 03:41:07 gromit kernel: 00158751 00158784 001f6504 00024624 00000001 00ae8c64 0060ef74 0060ef74
Sep 8 03:41:07 gromit kernel: 0000000c 00ae8c64 00024624 00129c7a 00ae8c64 0158702f 0000000c 0060ef74
Sep 8 03:41:07 gromit kernel: Call Trace: [<00121c71>] [<00158751>] [<00158784>] [<00129c7a>] [<00129e80>] [<00129f18>] [<001281d9>]
Sep 8 03:41:07 gromit kernel: [<0010a4b2>]
Sep 8 03:41:07 gromit kernel: Code: 66 83 79 78 00 75 1c ba e7 03 00 00 8a 41 7c 0a 41 7d 75 03

[ksymoops output:]
>>EIP: 121a38 <get_empty_inode+44/144>
Trace: 121c71 <__iget+61/1e8>
Trace: 158751 <ext2_lookup+81/170>
Trace: 158784 <ext2_lookup+b4/170>
Trace: 129c7a <lookup+da/f4>
Trace: 129e80 <_namei+54/bc>
Trace: 129f18 <lnamei+30/48>
Trace: 1281d9 <sys_newlstat+29/58>
Trace: 10a4b2 <system_call+52/80>

Code: 121a38 <get_empty_inode+44/144> cmpw $0x0,0x78(%ecx)
Code: 121a3d <get_empty_inode+49/144> jne 121a5b <get_empty_inode+67/144>
Code: 121a3f <get_empty_inode+4b/144> movl $0x3e7,%edx
Code: 121a44 <get_empty_inode+50/144> movb 0x7c(%ecx),%al
Code: 121a47 <get_empty_inode+53/144> orb 0x7d(%ecx),%al
Code: 121a4a <get_empty_inode+56/144> jne 121a4f <get_empty_inode+5b/144>
Code: 121a4c <get_empty_inode+58/144> Using `/System.map' to map addresses to symbols.

Sep 8 03:41:08 gromit kernel: general protection: 0000
Sep 8 03:41:08 gromit kernel: CPU: 0
Sep 8 03:41:08 gromit kernel: EIP: 0010:[<00121a38>]
Sep 8 03:41:08 gromit kernel: EFLAGS: 00010202
Sep 8 03:41:08 gromit kernel: eax: 00000910 ebx: 00000000 ecx: f6000181 edx: 001f69c0
Sep 8 03:41:08 gromit kernel: esi: 00000480 edi: 000003e8 ebp: 001f69c0 esp: 00659e90
Sep 8 03:41:08 gromit kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Sep 8 03:41:08 gromit kernel: Process diskbackup (pid: 24117, process nr: 53, stackpage=00659000)
Sep 8 03:41:08 gromit kernel: Stack: 00000000 001e3e40 00000000 00121c71 010b5894 00ffbe9c 00000004 0067900e
Sep 8 03:41:08 gromit kernel: 00ffbe9c 00158827 001f69c0 00008f71 00000001 010b5894 00679013 00659f28
Sep 8 03:41:08 gromit kernel: 00000004 00854034 00008f71 00129c7a 010b5894 0067900e 00000004 00659f28
Sep 8 03:41:08 gromit kernel: Call Trace: [<00121c71>] [<00158827>] [<00129c7a>] [<00129d99>] [<00129fca>] [<0012070d>] [<00120893>]
Sep 8 03:41:08 gromit kernel: [<0010a4b2>]
Sep 8 03:41:08 gromit kernel: Code: 66 83 79 78 00 75 1c ba e7 03 00 00 8a 41 7c 0a 41 7d 75 03

>>EIP: 121a38 <get_empty_inode+44/144>
Trace: 121c71 <__iget+61/1e8>
Trace: 158827 <ext2_lookup+157/170>
Trace: 129c7a <lookup+da/f4>
Trace: 129d99 <dir_namei+95/128>
Trace: 129fca <open_namei+52/3c8>
Trace: 12070d <do_open+59/124>
Trace: 120893 <sys_open+3f/78>
Trace: 10a4b2 <system_call+52/80>

Code: 121a38 <get_empty_inode+44/144> cmpw $0x0,0x78(%ecx)
Code: 121a3d <get_empty_inode+49/144> jne 121a5b <get_empty_inode+67/144>
Code: 121a3f <get_empty_inode+4b/144> movl $0x3e7,%edx
Code: 121a44 <get_empty_inode+50/144> movb 0x7c(%ecx),%al
Code: 121a47 <get_empty_inode+53/144> orb 0x7d(%ecx),%al
Code: 121a4a <get_empty_inode+56/144> jne 121a4f <get_empty_inode+5b/144>
Code: 121a4c <get_empty_inode+58/144> Using `/System.map' to map addresses to symbols.

Sep 8 03:41:08 gromit kernel: general protection: 0000
Sep 8 03:41:08 gromit kernel: CPU: 0
Sep 8 03:41:08 gromit kernel: EIP: 0010:[<0012c818>]
Sep 8 03:41:08 gromit kernel: EFLAGS: 00010286
Sep 8 03:41:08 gromit kernel: eax: f000ef6f ebx: 005ac414 ecx: 00000000 edx: 012311e8
Sep 8 03:41:08 gromit kernel: esi: 00000000 edi: f000ef6f ebp: 005aa810 esp: 00659dd4
Sep 8 03:41:08 gromit kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Sep 8 03:41:08 gromit kernel: Process diskbackup (pid: 24117, process nr: 53, stackpage=00659000)
Sep 8 03:41:08 gromit kernel: Stack: 00000001 0012095b 005ac414 00000000 00000001 00000003 00000001 00115092
Sep 8 03:41:08 gromit kernel: 00000000 0000002b 00000014 0065a000 00659e54 0010aa42 0000000b 001b6607
Sep 8 03:41:08 gromit kernel: 00000000 00000480 000003e8 001f69c0 00000003 02800000 02000000 00000018
Sep 8 03:41:08 gromit kernel: Call Trace: [<0012095b>] [<00115092>] [<0010aa42>] [<02800000>] [<02000000>] [<0010adb0>] [<0010ad88>]
Sep 8 03:41:08 gromit kernel: [<0010a63d>] [<00121a38>] [<00121c71>] [<00158827>] [<00129c7a>] [<00129d99>] [<00129fca>] [<0012070d>]
Sep 8 03:41:08 gromit kernel: [<00120893>] [<0010a4b2>]
Sep 8 03:41:08 gromit kernel: Code: 8b 50 48 85 d2 74 22 f6 42 1c 01 74 0f 53 83 c0 48 50 e8 15

>>EIP: 12c818 <locks_remove_locks+c/38>
Trace: 12095b <close_fp+37/5c>
Trace: 115092 <do_exit+112/1ec>
Trace: 10aa42 <die_if_kernel+2b6/2c0>
Trace: 2800000
Trace: 2000000
Trace: 10adb0 <do_general_protection+28/54>
Trace: 10adb0 <do_general_protection+28/54>
Trace: 10a63d <error_code+3d/50>
Trace: 121a38 <get_empty_inode+44/144>
Trace: 121c71 <__iget+61/1e8>
Trace: 158827 <ext2_lookup+157/170>
Trace: 129c7a <lookup+da/f4>
Trace: 129d99 <dir_namei+95/128>
Trace: 129fca <open_namei+52/3c8>
Trace: 12070d <do_open+59/124>
Trace: 120893 <sys_open+3f/78>
Trace: 10a4b2 <system_call+52/80>

Code: 12c818 <locks_remove_locks+c/38> movl 0x48(%eax),%edx
Code: 12c81b <locks_remove_locks+f/38> testl %edx,%edx
Code: 12c81d <locks_remove_locks+11/38> je 12c841 <locks_remove_locks+35/38>
Code: 12c81f <locks_remove_locks+13/38> testb $0x1,0x1c(%edx)
Code: 12c823 <locks_remove_locks+17/38> je 12c834 <locks_remove_locks+28/38>
Code: 12c825 <locks_remove_locks+19/38> pushl %ebx
Code: 12c826 <locks_remove_locks+1a/38> addl $0x48,%eax
Code: 12c829 <locks_remove_locks+1d/38> pushl %eax
Code: 12c82a <locks_remove_locks+1e/38> call 9090002c <_EIP+9090002c>
Code: 12c82f <locks_remove_locks+23/38> nop

[diskbackup is a shell script]

The backup process started at 03:01, so had been running for 40 mins,
and had produced this much output by that stage:

total 274386
-rw-r--r-- 1 root root 279870738 Sep 8 03:41 home.tgz

Other possibly relevant information is that the base system is
a Pentium 120 based on Slackware release 3.0; I'm using gcc 2.7.2
to build the kernel, but my libc.so is still 5.0.9.

Now, I'm quite happy to bring the libraries up to date and re-build
if that's thought to be a plausible way to fix this, but in that
case I'd appreciate some hints as to the most painless way to do
that, since the last time I tried, I was unable to resolve all the
problems I had with various networking commands breaking when
the libraries were upgraded to 5.3.12 (plus make, but I fixed that).

Oh, here's the output of dmesg from my last boot, in case any
of this is relevant.

[...] 80x25, 1 virtual console (max 63)
pcibios_init : BIOS32 Service Directory structure at 0x000fac90
pcibios_init : BIOS32 Service Directory entry at 0xfb150
pcibios_init : PCI BIOS revision 2.10 entry at 0xfb180
Probing PCI hardware.
Calibrating delay loop.. ok - 47.82 BogoMIPS
Memory: 22092k/24576k available (740k kernel code, 384k reserved, 1360k data)
Swansea University Computer Society NET3.035 for Linux 2.0
NET3: Unix domain sockets 0.12 for Linux NET3.035.
Swansea University Computer Society TCP/IP for NET3.034
IP Protocols: IGMP, ICMP, UDP, TCP
VFS: Diskquotas version dquot_5.6.0 initialized
Checking 386/387 coupling... Ok, fpu using exception 16 error reporting.
Checking 'hlt' instruction... Ok.
Linux version 2.0.18 (root@gromit) (gcc version 2.7.2) #8 Fri Sep 6 00:27:56 BST 1996
Serial driver version 4.13 with no serial options enabled
tty00 at 0x03f8 (irq = 4) is a 16550A
tty01 at 0x02f8 (irq = 3) is a 16550A
lp1 at 0x0378, (polling)
Software Watchdog Timer: 0.04, timer margin: 60 sec
loop: registered device at major 7
ide: i82371 PIIX (Triton) on PCI bus 0 function 57
ide0: BM-DMA at 0x3000-0x3007
ide1: BM-DMA at 0x3008-0x300f
hda: WDC AC2340H, 325MB w/128kB Cache, LBA, CHS=330/32/63
hdb: WDC AC2700F, 696MB w/64kB Cache, LBA, CHS=708/32/63
hdc: NEC CD-ROM DRIVE:272, ATAPI CDROM drive
hdd: WDC AC31600H, 1549MB w/128kB Cache, LBA, CHS=3148/16/63, DMA
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
Floppy drive(s): fd0 is 1.44M
Started kswapd v 1.4.2.2
FDC 0 is an 8272A
aha152x: BIOS test: passed, auto configuration: ok, detected 1 controller(s)
aha152x0: vital data: PORTBASE=0x340, IRQ=11, SCSI ID=7, reconnect=enabled, parity=enabled, synchronous=disabled, delay=100, extended translation=disabled
aha152x: trying software interrupt, ok.
scsi0 : Adaptec 152x SCSI driver; $Revision: 1.17 $
scsi : 1 host.
Vendor: HP Model: HP35470A Rev: 7 09
Type: Sequential-Access ANSI SCSI revision: 02
Detected scsi tape st0 at scsi0, channel 0, id 0, lun 0
Vendor: HP Model: D2645 Rev: bHbH
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 1, lun 0
Vendor: HP Model: C2220 : 328 Mb Rev: 0BK2
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdb at scsi0, channel 0, id 5, lun 0
Vendor: HP Model: D2645 Rev: bHbH
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdc at scsi0, channel 0, id 6, lun 0
scsi : detected 1 SCSI tape 3 SCSI disks total.
SCSI device sda: hdwr sector= 512 bytes. Sectors= 2039283 [995 MB] [1.0 GB]
SCSI device sdb: hdwr sector= 512 bytes. Sectors= 641676 [313 MB] [0.3 GB]
SCSI device sdc: hdwr sector= 512 bytes. Sectors= 2039283 [995 MB] [1.0 GB]
ne.c:v1.10 9/23/94 Donald Becker (becker@cesdis.gsfc.nasa.gov)
NE*000 ethercard probe at 0x300: 00 00 e8 ce 74 5b
eth0: NE2000 found at 0x300, using IRQ 5.
Partition check:
sda: sda1 sda2
sdb: sdb1
sdc: sdc1 < sdc5 >
hda: hda1 hda2 hda3 hda4
hdb: hdb1 hdb2 hdb3 hdb4
hdd: hdd1 hdd2
VFS: Mounted root (ext2 filesystem) readonly.
Adding Swap: 66020k swap-space
Adding Swap: 32996k swap-space
hdc: media changed
[...]

Thanks for any suggestions anyone can offer, so that I can put this
annoying problem to bed. FYI, I left memtest86 running on the
box for five hours this evening, and it found no errors after 35 passes.

[The other problem I'm having is that some processes get a segmentation
violation; I've had that mostly during kernel compiles, plus the
odd kswapd or inetd, but others seem to be having that too, so I'm
watching the discussion and accumulating data before I ask for help.]

--
Frank Wales [frank@limitless.co.uk]
"Please state the nature of the Internet emergency."
               --the emergency holographic systems administrator