Re: libata: dma, io error messages

From: Paul Jakma
Date: Mon Aug 16 2004 - 19:50:41 EST

Next message: john stultz: "Re: boot time, process start time, and NOW time"
Previous message: Miguel Bolanos: "test"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Jeff,

On Fri, 6 Aug 2004, Jeff Garzik wrote:

libata does not (yet) retry cable errors, for example. Paul, don't automatically assume the disk is bad, try swapping cables.

FWIW:

- running dd in a while loop reading 1024k of data at a time from about 1000 odd blocks before/after the block number mentioned in the error messages did not cause any further errors

however:

- when I tried to add the disk (well partition, spanning from 2GB to end of disk at 160GB) back into its RAID-5 array, it errored out every time when resyncing the array, but only when reaching somewhere close to the end of the resync.

(unfortunately, I couldnt get the error messages, it caused all further IO to the RAID-5 to die too and effectively hang the box - i think due to the MD layer not being at all happy with a disk failling while resyncing, indeed, MD was not happy with state of play after reboot and wouldnt bring the array back.[1])

- Rogier Wolff reported to me in private mail that in his experience, WD disks going 'slow' is a WD quirk and an imminent sign of failure of WD disks.

And indeed, the disk did die completely last friday.. it's now been replaced.

How much would it take to get SMART reporting working with libata btw? I'd be very interested in this and helping out if possible.

Jeff

1. NB: It would be nice if the Linux RAID superblock had per-drive UUIDs in addition to the global array UUID. I hotadded the disk to the array without first hotremoving it, and MD was rather confused by the presence of '/dev/sda' twice, as a failed disk and as a spare. Also, the failure of /dev/sda caused /dev/sdb and /dev/sdc to be renamed, after reboot to /dev/sda and /dev/sdb, which further confused matters. A per-component UUID might help MD discriminate and prevent addition of disks which are already in array and also help it recognise that the failed /dev/sda3 in the superblock is not the same as the current (renamed due to discovery after disk failure), /dev/sda3 existing on the system..

It took a lot of touching of wood, checking and rechecking of /etc/raidtab vs lsraid and mdadm -E before doing mkraid --dangerous-no-resync .... to rewrite the superblock and get the array running again despite my mistake and the not-very-friendly-to-fat-fingers properties of md.

regards,
--
Paul Jakma paul@xxxxxxxx paul@xxxxxxxxx Key ID: 64A2FF6A
Fortune:
If only God would give me some clear sign! Like making a large deposit
in my name at a Swiss Bank.
- Woody Allen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: john stultz: "Re: boot time, process start time, and NOW time"
Previous message: Miguel Bolanos: "test"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]