ext2-fs errors/corruption?

Todd J Derr (infidel+@pitt.edu)
Sun, 7 Jul 1996 15:59:48 -0400 (EDT)


Hello,

I have a slew of ext2-fs errors in my syslog today. The machine in
question is a Triton (ASUS) P133 with the disk on an NCR 53c810. The
kernel is 1.3.100. In .config, I have:

CONFIG_SCSI_NCR53C7xx=y
CONFIG_SCSI_NCR53C7xx_sync=y
CONFIG_SCSI_NCR53C7xx_FAST=y
CONFIG_SCSI_NCR53C7xx_DISCONNECT=y

I don't see anything in the syslog about physical I/O errors, just
some ext2-fs warnings followed by a slew of ext2-fs errors.

Some background about the app that's running. We send out a large
(~45k addresses) mailing list at night (1am-7am) using some mail
software that I wrote. The software basically forks a lot and
delivers in parallel (up to 90 simultaneous connections). Each
process creates and opens a file in a directory called 'debug'. So,
the access pattern for the debug directory is many processes opening
and closing a lot of files (~15k files per night) simultaneously.
Yesterday was special because we also sent the FAQ starting around
5:30pm. The FAQ uses a different directory than the normal mailing.
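
For anyone who wants to try to reproduce the access pattern described
above, here is a rough Python sketch of it. It is not the mail software
itself: the function name hammer_directory, the worker and file counts,
and the Dexample.* file names are all made up for illustration.

```python
import os
import tempfile


def hammer_directory(debug_dir, workers=8, files_per_worker=20):
    """Fork `workers` children that each create, write and close files.

    Returns (number of files now in debug_dir, number of failed children).
    """
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: create its own batch of files, then exit immediately
            # without running any of the parent's cleanup code.
            status = 1
            try:
                for i in range(files_per_worker):
                    name = "Dexample.%d.%d" % (os.getpid(), i)
                    with open(os.path.join(debug_dir, name), "w") as fh:
                        fh.write("debug output\n")
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)

    # Parent: wait for every child and count the ones that failed.
    failures = 0
    for pid in pids:
        _, wait_status = os.waitpid(pid, 0)
        if wait_status != 0:
            failures += 1
    return len(os.listdir(debug_dir)), failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(hammer_directory(scratch))
```

Point it at a scratch directory on a test filesystem, not at the
production disk.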

First, here's what tune2fs -l /dev/sda3 says. As you can see, the
machine has been up for 20 days (since 16 June).

tune2fs 1.02, 16-Jan-96 for EXT2 FS 0.5b, 95/08/09
Filesystem magic number: 0xEF53
Filesystem state: not clean with errors
Errors behavior: Continue
Inode count: 219136
Block count: 873642
Reserved block count: 43682
Free blocks: 241676
Free inodes: 141173
First block: 1
Block size: 1024
Fragment size: 1024
Blocks per group: 8192
Fragments per group: 8192
Inodes per group: 2048
Last mount time: Sun Jun 16 21:48:33 1996
Last write time: Sun Jul 7 15:33:04 1996
Mount count: 2
Maximum mount count: 20
Last checked: Wed May 29 00:35:14 1996
Check interval: 0
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
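
As a sanity check on the figures above, this Python sketch (helper
names are hypothetical) derives the number of block groups and shows
which group holds a given inode. It assumes the usual ext2 rule that
inode numbers start at 1 and are assigned to groups in runs of
"Inodes per group". The inode numbers used are the ones that show up
in the kernel messages quoted further down.

```python
# Values copied from the tune2fs output.
BLOCK_COUNT = 873642
FIRST_BLOCK = 1
BLOCKS_PER_GROUP = 8192
INODES_PER_GROUP = 2048
INODE_COUNT = 219136


def group_count():
    """Number of block groups, rounding the last partial group up."""
    usable = BLOCK_COUNT - FIRST_BLOCK
    return (usable + BLOCKS_PER_GROUP - 1) // BLOCKS_PER_GROUP


def inode_group(ino):
    """Block group that holds inode number `ino` (inodes start at 1)."""
    return (ino - 1) // INODES_PER_GROUP


if __name__ == "__main__":
    print("block groups:", group_count())
    print("inodes implied:", group_count() * INODES_PER_GROUP)
    for ino in (54249, 54253, 73737, 77847):
        print("inode", ino, "-> group", inode_group(ino))
```

The implied inode count agrees with the "Inode count" line, so the
superblock figures are at least self-consistent.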

The first errors appeared when we tried to delete the FAQ debug
directory from the month before. Two of the files could not be
deleted, and at that time we got these messages in the syslog:

Jul 6 17:19:47 wordsmith kernel: EXT2-fs warning (device 08:03):
ext2_unlink: Deleting nonexistent file (54250), 0
Jul 6 17:19:47 wordsmith kernel: EXT2-fs warning (device 08:03):
ext2_free_inode: bit already cleared for inode 54250

These 2 messages are repeated for inodes 54251 and 54252. Those files
are gone (I can't find them on the disk with find -inum). However, the
2 files remaining in the directory are inodes 54249 and 54253, and
their inodes appear to be corrupt (all 1's):

#ls -li
total 0
54249 ?rwsrwsrwt 65535 65535 65535 4294967295 Dec 31 1969
Dniftyserve.or.jp.10598
54253 ?rwsrwsrwt 65535 65535 65535 4294967295 Dec 31 1969
Duconect.net.10613

They didn't get deleted because a normal user was trying to delete
them and the uids don't match, so you get EPERM. I have not tried
deleting them as root (I'm a bit scared to do so :)
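
The "all 1's" listing above is what one would expect if the on-disk
inodes were filled with 0xFF bytes. A small Python sketch, assuming the
standard little-endian ext2 layout for the leading inode fields (mode,
uid, size, four timestamps, gid, link count). The 128-byte buffer is
synthetic, not read from the disk, and decode_inode_head is a made-up
helper name.

```python
import struct
from datetime import datetime, timedelta

# Leading fields of an ext2 inode, little-endian, no padding:
# i_mode, i_uid, i_size, i_atime, i_ctime, i_mtime, i_dtime,
# i_gid, i_links_count
INODE_HEAD = "<HHIIIIIHH"


def decode_inode_head(raw):
    """Decode the first 28 bytes of a raw ext2 inode into a dict."""
    (mode, uid, size, _atime, _ctime, mtime, _dtime,
     gid, links) = struct.unpack_from(INODE_HEAD, raw)
    # Reinterpret the 32-bit mtime as a signed value, which is how a
    # time of 0xFFFFFFFF ends up just before the 1970 epoch.
    signed_mtime = struct.unpack("<i", struct.pack("<I", mtime))[0]
    mtime_utc = datetime(1970, 1, 1) + timedelta(seconds=signed_mtime)
    return {
        "mode": mode,
        "uid": uid,
        "gid": gid,
        "links": links,
        "size": size,
        "mtime_utc": mtime_utc,
    }


if __name__ == "__main__":
    fields = decode_inode_head(b"\xff" * 128)
    for key, value in fields.items():
        print(key, value)
```

Link count, uid and gid of 65535, a size of 4294967295 and a date of
Dec 31 1969 all fall out of the same all-ones pattern.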

The next messages we get are at 9:10pm, while the FAQ is running.
This pair of messages repeats (exactly) a total of 5 times in the
space of 3 seconds. The directory mentioned (77847) is the debug
directory; the inode number in the entry is bogus.

Jul 6 21:10:57 wordsmith kernel: EXT2-fs error (device 08:03):
ext2_find_entry: bad entry in directory #77847: directory entry across blocks -
offset=206848, inode=1638436, rec_len=26436, name_len=24933
Jul 6 21:10:57 wordsmith kernel: EXT2-fs error (device 08:03):
ext2_add_entry: bad entry in directory #77847: directory entry across blocks -
offset=206848, inode=1638436, rec_len=26436, name_len=24933
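
To make the "directory entry across blocks" complaint concrete, here
is a Python sketch of the kind of sanity checks an ext2 directory entry
has to pass. It is modelled loosely on the kernel's checks rather than
copied from them: the order of the tests, the function names, and the 4
in the alignment test are assumptions (the 4 comes from ext2 padding
directory entries to 4-byte boundaries). Block size and inode count are
taken from the tune2fs output above.

```python
BLOCK_SIZE = 1024      # "Block size" from tune2fs
INODE_COUNT = 219136   # "Inode count" from tune2fs


def dir_rec_len(name_len):
    """Smallest record that fits an 8-byte header plus the name,
    rounded up to a multiple of 4."""
    return (name_len + 8 + 3) & ~3


def check_entry(offset, inode, rec_len, name_len):
    """Return the first reason the entry looks bad, or None if it passes."""
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset % BLOCK_SIZE + rec_len > BLOCK_SIZE:
        return "directory entry across blocks"
    if inode > INODE_COUNT:
        return "inode out of bounds"
    return None


if __name__ == "__main__":
    # Values logged for directory #77847.
    print(check_entry(206848, 1638436, 26436, 24933))
```

With the values logged for directory #77847, rec_len passes the
alignment and name-length tests but runs far past the end of the
1024-byte block, which matches the message.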

The next set of errors appears between 3am and 4:17am, while the
normal mailing is going on (using a different directory from the one
above). I get a _lot_ of these messages, over 6000 repetitions of the
pair in total. The directory (73737) is the debug directory.

I get this pair of messages 4045 times from 3:00:30 to 3:08:07:

Jul 7 03:00:30 wordsmith kernel: EXT2-fs error (device 08:03):
ext2_find_entry: bad entry in directory #73737: rec_len % != 0 -
offset=121856, inode=1919053153, rec_len=24937, name_len=25454
Jul 7 03:00:30 wordsmith kernel: EXT2-fs error (device 08:03):
ext2_add_entry: bad entry in directory #73737: rec_len % != 0 -
offset=121856, inode=1919053153, rec_len=24937, name_len=25454

From 3:14:04 to 3:14:05 (3 repetitions), the same messages with:
offset=120832, inode=3244, rec_len=3245, name_len=0

From 3:18:39 to 3:18:53 (47 repetitions):
offset=118784, inode=808595041, rec_len=13878, name_len=55

From 4:13:03 to 4:17:07 (2049 repetitions):
offset=124928, inode=1919905092, rec_len=24935, name_len=11886
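
One way to look at the bogus values is to pack each logged (inode,
rec_len, name_len) triple back into the 8 bytes it would occupy on disk
and print them as characters. The Python sketch below assumes the ext2
directory entry header of this kernel generation: a 32-bit inode
followed by 16-bit rec_len and name_len, all little-endian. The helper
name header_bytes is made up.

```python
import struct

BLOCK_SIZE = 1024  # "Block size" from tune2fs

# (offset, inode, rec_len, name_len) exactly as logged.
LOGGED = [
    (206848, 1638436, 26436, 24933),     # directory #77847
    (121856, 1919053153, 24937, 25454),  # directory #73737
    (120832, 3244, 3245, 0),
    (118784, 808595041, 13878, 55),
    (124928, 1919905092, 24935, 11886),
]


def header_bytes(inode, rec_len, name_len):
    """Rebuild the 8 raw header bytes the kernel decoded these from."""
    return struct.pack("<IHH", inode, rec_len, name_len)


if __name__ == "__main__":
    for offset, inode, rec_len, name_len in LOGGED:
        raw = header_bytes(inode, rec_len, name_len)
        aligned = offset % BLOCK_SIZE == 0
        print(offset, "block-aligned" if aligned else "unaligned", raw)
```

Three of the five headers come out as printable text (b'ambrianc',
b'a.20667\x00' and b'Dmorgan.'), which resembles pieces of the
D<host>.<number> debug file names. The one from directory #77847 holds
the bytes for rec_len=36, name_len=25 followed by "Dgea", as if an
otherwise valid entry were being read 4 bytes too late. Every offset is
a multiple of the 1024-byte block size. This is only a reading of the
numbers, not a diagnosis.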

I have a copy of the entire log (~2.5MB) if someone wants it. I'm a
little worried about fs corruption, and this is our production
machine, so I'm going to have to try to repair the damage soon. Were
there any e2fs problems fixed between 1.3.100 and current (2.0.3?) that
would have affected this?

thanks for any help, and I hope we can track this down!

todd.