Re: SCSI Kernel Problem - BAD

jam@mnsinc.com
Fri, 15 Mar 1996 14:16:11 -0500


Here is another possibly related data point on a 1542 that perhaps
resembles what Eric described.

I was experimenting with the effects of changing the DMA speed on the
1542 BIOS setup menu (my newly installed Seagate ST-43400N seemed
dreadfully slow compared to my old Fujitsu M2624/T running off a
local-bus caching IDE controller). On my 486DX50 local-bus system, the
10 and 8 MB/s settings kept the boot from getting past the 1542
initialization.
The 6.7 MB/s setting seemed to work, so, as a sanity check, I started
several large-scale copies to and from the same drive, from another
drive on the same controller, and to a tape on the same controller. I
also started a number of compares between fresh copies and originals.

My sanity was first challenged by large numbers of miscompares and
then a flood of file system errors. The error messages and diff
outputs were so heavy that I hit the power switch to try to contain
the possible file corruption. I attributed this to having the DMA
speed set too high and so proceeded to set the 1542 DMA speed back to
the default 5 MB/s, reboot, and do a manual fsck. Then I removed all
the new copies and proceeded to let the Red Hat package management
system (rpm -Va) check the integrity of the partitions (/usr and /opt)
that had been involved on the write side of the test.

There were some missing files and, as in Eric's experience, a number of
files containing unrelated data. I think all that I saw both while
doing the fsck and from Red Hat's MD5 and other checks could
hypothetically be attributed to bad file system metadata. I saw
nothing that suggested to me that the actual data within a file was
itself corrupted. I have found no errors on partitions other than
those written to in the test.

I still have /var/log/messages from the experiment. An unsystematic
sample of the messages follows.

Mar 12 09:00:05 athene kernel: Linux version 1.3.70 (root@athene.mnsinc.com) (gcc version 2.7.0) #1 Sun Mar 3 15:03:12 EST 1996
...
Mar 12 09:34:27 athene kernel: EXT2-fs warning (device 08:14): ext2_free_inode: bit already cleared for inode 131114
...
Mar 12 09:34:28 athene kernel: attempt to access beyond end of device
Mar 12 09:34:28 athene kernel: 08:13: rw=0, want=134228481, limit=934912
...
Mar 12 09:34:28 athene kernel: EXT2-fs error (device 08:13): ext2_find_entry: bad entry in directory #181625: rec_len % 4 != 0 - offset=0, inode=3675377092, rec_len=19397, name_len=2878
Mar 12 09:34:29 athene kernel: EXT2-fs warning (device 08:13): ext2_free_blocks: bit already cleared for block 3335
Mar 12 09:34:29 athene kernel: EXT2-fs error (device 08:13): ext2_free_blocks: Freeing blocks not in datazone - block = 134228480, count = 1
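
For anyone reading the numbers in these messages, here is a small
sketch that decodes them. It rests on assumptions not stated in the
messages themselves: that the device field is major:minor in hex, that
major 8 is the SCSI disk driver with 16 minors per disk, and that the
block number and the "limit" value are in comparable units. The helper
names (decode_dev, sd_name, high_bits) are made up for illustration.

```python
# Values copied from the log sample above.
BLOCK = 134228480   # "Freeing blocks not in datazone - block = ..."
LIMIT = 934912      # "limit=" from the beyond-end-of-device message
REC_LEN = 19397     # rec_len from the bad directory entry


def decode_dev(dev):
    """Split a 'MM:mm' device string into (major, minor), assuming hex."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    return major, minor


def sd_name(major, minor):
    """Map a SCSI disk minor to an sdXN name (assumes 16 minors per disk)."""
    if major != 8:
        raise ValueError("only major 8 (assumed SCSI disk) is handled")
    disk = chr(ord("a") + minor // 16)
    part = minor % 16
    return f"sd{disk}{part}" if part else f"sd{disk}"


def high_bits(value, limit):
    """List bit positions set in value above the width needed to hold limit."""
    width = limit.bit_length()
    return [b for b in range(width, value.bit_length()) if (value >> b) & 1]


if __name__ == "__main__":
    for dev in ("08:13", "08:14"):
        print(dev, "->", sd_name(*decode_dev(dev)))
    print("bits set above the limit's width:", high_bits(BLOCK, LIMIT))
    print("block with bit 27 masked off:", BLOCK & ((1 << 27) - 1))
    print("rec_len % 4 =", REC_LEN % 4)
```

Under those assumptions, 08:13 and 08:14 would be sdb3 and sdb4, and
the out-of-range block number equals 10752 with only bit 27 set above
it. That is suggestive at most; it does not by itself say where the
value was damaged.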

(I'd be happy to mail the whole mess to anyone who wants it.)

If it would serve a purpose, I could clear a partition and do a number
of concurrent copies into it in an attempt to reproduce this, perhaps
without touching good files.
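
A minimal sketch of what such a repeat test might look like, assuming
a scratch directory on the cleared partition and one large source
file. The function names (digest_of, copy_and_check, stress) and the
choice of SHA-256 are illustrative only; this is not the procedure
used in the experiment described above.

```python
import hashlib
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def digest_of(path, chunk=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def copy_and_check(src, dst, expected):
    """Copy src to dst, then report whether dst's digest matches expected."""
    shutil.copyfile(src, dst)
    return dst, digest_of(dst) == expected


def stress(src, scratch_dir, copies=8):
    """Make several concurrent copies of src and return those that differ."""
    expected = digest_of(src)
    targets = [os.path.join(scratch_dir, f"copy{i}") for i in range(copies)]
    with ThreadPoolExecutor(max_workers=copies) as pool:
        results = list(pool.map(
            lambda dst: copy_and_check(src, dst, expected), targets))
    return [dst for dst, ok in results if not ok]
```

One caveat: a compare done right after the copy may be satisfied from
the buffer cache rather than the disk, so a real test would want to
unmount and remount the scratch partition (or otherwise flush the
cache) before comparing.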

>>>>> "Eric" == Eric Youngdale "Re: SCSI Kernel Problem - BAD"
>>>>> Fri, 15 Mar 1996 10:24:19 -0500

[[ ... ]]

Eric> I saw one myself just the other day. Someone was
Eric> copying large files off of a cdrom, and putting onto disk.
Eric> Something went wrong (all messages were lost), and there was
Eric> massive disk corruption. The /lib directory was nuked. The
Eric> passwd file was found instead in inetd.conf. /bin/sh was a
Eric> C program. Stuff like that. Good thing I keep a bootable
Eric> partition on an IDE disk, and it was also lucky that the
Eric> system was more or less just an image copy of the Red Hat
Eric> 2.0 live cdrom (i.e. I could just copy /lib back, and
Eric> anything else that looked like it might have been nuked).

[[ ... ]]

jam