IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not re-do failed op?

From: Daniel B.
Date: Mon Oct 06 2003 - 13:45:11 EST


I just got bitten _again_ by IDE DMA timeout errors and massive
filesystem corruption in kernel 2.4.22, on an Asus A7M266-D dual-Athlon
XP motherboard (AMD 768 chip / amd7441 IDE controller).

(I had turned DMA off in my init scripts, but apparently Debian
unstable's k7-smp configuration enables DMA by default before my init
scripts get control. Ext3 journal "recovery" trashed my system
partition.)

What's going on with the IDE DMA bugs? They have existed since 2.2
(right?), and they still exist as of 2.4.22. Why have they been
around so long? Is it that few kernel developers use the combinations
of hardware or configuration options that expose the bugs (like my
dual-CPU box with IDE, not SCSI, disks)?

Are the DMA bugs believed to be fixed (for real) yet? If so, in which
version?

Is there any consolidated documentation of the combinations of factors
that cause corruption, or of how to reliably avoid corruption (like
all the things to check to make sure your kernel never even tries to
enable DMA)?


Also, why does a DMA timeout cause such corruption? Doesn't the kernel
keep track of uncompleted operations, retain the information needed to
try again, and try again if there's a failure? If not, why not?
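
To make "try again" concrete, here is a rough sketch of the behavior I
have in mind. It is Python, not kernel C, and it is not a claim about how
the IDE layer actually works. The names TransferTimeout,
submit_with_retry and make_flaky_transfer are made up for illustration.

```python
class TransferTimeout(Exception):
    """Stand-in for a DMA timeout; hypothetical, not a kernel error code."""


def submit_with_retry(do_transfer, request, max_attempts=3):
    """Hold on to `request` and re-issue it until it completes.

    Returns the number of attempts used. If every attempt times out,
    the last timeout is re-raised so the caller knows the request
    never reached the disk.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            do_transfer(request)   # the request itself is never discarded
            return attempt
        except TransferTimeout as exc:
            last_error = exc       # remember the failure, then try again
    raise last_error


def make_flaky_transfer(failures, log):
    """Build a fake transfer function that times out `failures` times,
    then succeeds and records the request in `log`."""
    remaining = [failures]

    def transfer(request):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransferTimeout("simulated DMA timeout")
        log.append(request)

    return transfer
```

With a fake transfer that times out twice, the third attempt succeeds and
the request is recorded exactly once. If it times out more often than
max_attempts allows, nothing is recorded and the caller gets the error.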

If it can't try again, shouldn't the kernel at least abort after one
disk-write failure instead of performing additional writes, which
frequently depend on the previous writes? (E.g., suppose I read
block 1's data and write it to block 2, and then write something new
to block 1. If the first write fails but I continue and do the second
write, the old data is destroyed. If the first write fails and I stop
right away, less is destroyed.)
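
The block 1 / block 2 example can be sketched as a toy simulation. This
is Python with an in-memory stand-in for the disk, only meant to show the
ordering problem. ToyDisk, WriteError, move_then_overwrite and demo are
hypothetical names, not anything from the kernel.

```python
class WriteError(Exception):
    """Raised by the toy disk when a write is set up to fail."""


class ToyDisk:
    """In-memory stand-in for a disk; purely illustrative."""

    def __init__(self, blocks, failing_writes=()):
        self.blocks = dict(blocks)
        # 0-based positions, in issue order, of the writes that should fail.
        self.failing_writes = set(failing_writes)
        self.write_count = 0

    def read(self, n):
        return self.blocks[n]

    def write(self, n, data):
        index = self.write_count
        self.write_count += 1
        if index in self.failing_writes:
            raise WriteError(f"write #{index} to block {n} timed out")
        self.blocks[n] = data


def move_then_overwrite(disk, stop_on_error):
    """Copy block 1 to block 2, then put new data in block 1.

    Returns True only if every write succeeded.
    """
    steps = [
        lambda: disk.write(2, disk.read(1)),  # saves the old block 1
        lambda: disk.write(1, "new"),         # destroys the old block 1
    ]
    ok = True
    for step in steps:
        try:
            step()
        except WriteError:
            ok = False
            if stop_on_error:
                break  # do not issue writes that depend on the failed one
    return ok


def demo(stop_on_error):
    """Run the sequence with the first write failing; return final blocks."""
    disk = ToyDisk({1: "old", 2: "empty"}, failing_writes={0})
    move_then_overwrite(disk, stop_on_error)
    return disk.blocks
```

With the first write failing, demo(False) ends with block 1 overwritten
and the old contents gone from both blocks, while demo(True) leaves the
old contents of block 1 in place.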




Daniel
--
Daniel Barclay
dsb@xxxxxxxxx