Re: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Whynotre-do failed op?

From: Daniel B.
Date: Tue Oct 07 2003 - 08:47:05 EST


Ruth Ivimey-Cook wrote:
>
> On Tue, 7 Oct 2003, Valdis.Kletnieks@xxxxxx wrote:
> ...
> But surely what Daniel is complaining about is that the disk never did ack the
> bus transfer.

Yes, much closer to what I meant.

Actually, I'd talking about when the kernel doesn't receive the
acknowledgment (the DMA interrupt).


>
> Consider this as a correct sequence of operations (hope I get it right:-) :
>
> 1. Kernel uses IDE controller to initiate ATA disk write request:
...
> 2. IDE controller DMA used to transfer data to disk unit:
...
> 3. Transfer complete actions: when the required number of words are acked:
> a. IDE DMA controller fires end-of-transfer IRQ
> b. ...
>
> 4. Kernel sees end of transfer IRQ and initiates software ACK of transfer,
> e.g. to remove DMA buffer from 'block dirty' list.
>
> 5. If caching enabled, some time later the data in the drive is written to
> the platter.
>
> Now, the case I believe Daniel is complaining about is that things go well
> through step 1 and perhaps some part of step 2. But, because the drive doesn't
> accept the data or some other error, step 3 doesn't happen.

Actually, I think I'm talking about the very beginning of step 4--
the interrupt request doesn't actually make it to the kernel ("interrupt
lost"?), so the kernel doesn't see the interrupt request.

(I've been assuming that the errors I'm getting are just from DMA
interrupt problems, that is, the drive accepted the data just fine, but
the kernel didn't see the acknowledgement. Of course, it's not clear
how that in itself would cause corruption, so I don't know for sure
that drive-side errors or rejections aren't involved.)


> Consequently, the
> IDE DMA timeout happens, the kernel cries foul and things go wrong. So the
> failure actually looks like this:
>
> 1. Kernel uses IDE controller to initiate ATA disk write request:
> a. Kernel sets up DMA parameters (start, length, timeout)
> b. kernel initiates transfer of 1 sector to disk
> c. (in parallel with b) drive accepts transfer request and waits for data
>
> 2. IDE controller DMA used to transfer data to disk unit:
> a. hardware DMA tries to send 256 16-bit words of data to disk
> b. (in parallel) drive accepts none or, perhaps, some data from bus into
> internal buffer, but not all of it.
>
> 3. After waiting, IDE controller fires DMA timeout IRQ.
>
> 4. Kernel sees IRQ and emits warning message. Tries to reset bus and ....

Actually, I'm thinking of the case where the interrupt request doesn't
make it to the kernel, so the kernel _doesn't_ see any IRQ in the expected
time (and proceeds as you say).


> Have I got this scenario right?

Just about.

(Also, thanks for the DMA details.)



Daniel
--
Daniel Barclay
dsb@xxxxxxxxx
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/