Re: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not re-do failed op?

From: Ruth Ivimey-Cook
Date: Tue Oct 07 2003 - 07:25:57 EST


On Tue, 7 Oct 2003, Valdis.Kletnieks@xxxxxx wrote:
>On Tue, 07 Oct 2003 01:24:19 EDT, "Daniel B." said:
>> So if some command/batch/etc. wasn't acknowledged, why can't the
>> kernel retry the command/batch/etc.?
>The problem is that the disk ack'ed the command when the block went into the
>write cache. You *DONT* in general get back another ack when the block
>actually hits the platters.

But surely what Daniel is complaining about is that the disk never did ack the
bus transfer.

Consider this as a correct sequence of operations (hope I get it right:-) :


1. Kernel uses IDE controller to initiate ATA disk write request:
   a. Kernel sets up DMA parameters (start, length, timeout)
   b. Kernel initiates transfer of 1 sector to disk
   c. (in parallel with b) drive accepts transfer request and waits for data

2. IDE controller DMA used to transfer data to disk unit:
   a. hardware DMA sends 256 16-bit words of data to disk
   b. (in parallel) drive accepts (acks) each word of data as it comes
      over and writes it into internal buffer (be it a write cache or
      just a staging area).

3. Transfer complete actions: when the required number of words are acked:
   a. IDE DMA controller fires end-of-transfer IRQ
   b. (in parallel) if write cache enabled, disk makes sector available to
      be written to disk (e.g. by linking the buffer into the write cache)
      or, if write cache is disabled, initiates transfer to platter.

4. Kernel sees end-of-transfer IRQ and initiates software ACK of transfer,
   e.g. to remove DMA buffer from 'block dirty' list.

5. If caching enabled, some time later the data in the drive is written to
   the platter.
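
To make the steps above concrete, here is a rough toy model in Python. It is
only a sketch of the sequence as I have described it, not real driver code:
ToyDrive, dma_write and the 'dirty' list are names made up for illustration,
and the "IRQ" is just a return value.

```python
from dataclasses import dataclass, field

WORDS_PER_SECTOR = 256  # 256 16-bit words, as in step 2a


@dataclass
class ToyDrive:
    """Hypothetical drive: a staging buffer plus an optional write cache."""
    write_cache_enabled: bool = True
    buffer: list = field(default_factory=list)   # words acked so far
    cache: list = field(default_factory=list)    # sectors awaiting the platter
    platter: list = field(default_factory=list)  # sectors actually written

    def accept_word(self, word):
        # Step 2b: the drive acks each word as it arrives.
        self.buffer.append(word)
        return True

    def transfer_complete(self):
        # Step 3b: pass the sector to the cache, or straight to the platter.
        sector, self.buffer = self.buffer, []
        if self.write_cache_enabled:
            self.cache.append(sector)
        else:
            self.platter.append(sector)

    def flush(self):
        # Step 5: some time later, cached sectors reach the platter.
        self.platter.extend(self.cache)
        self.cache.clear()


def dma_write(drive, sector):
    """Steps 2-3: returns True when the end-of-transfer IRQ would fire."""
    acked = 0
    for word in sector:                 # step 2a
        if drive.accept_word(word):
            acked += 1
    if acked == len(sector):            # step 3: all words acked
        drive.transfer_complete()
        return True                     # step 3a
    return False


drive = ToyDrive(write_cache_enabled=True)
dirty = [list(range(WORDS_PER_SECTOR))]  # stand-in for the 'block dirty' list
if dma_write(drive, dirty[0]):
    dirty.pop(0)                         # step 4: software ACK
```

In this model the kernel's ACK at step 4 happens while drive.platter is still
empty; the sector only gets there once flush() runs, which is the point about
the write cache made in the quoted text.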


Now, the case I believe Daniel is complaining about is that things go well
through step 1 and perhaps some part of step 2. But, because the drive doesn't
accept the data, or because of some other error, step 3 never happens.
Consequently, the IDE DMA timeout fires, the kernel cries foul and things go
wrong. So the failure actually looks like this:


1. Kernel uses IDE controller to initiate ATA disk write request:
   a. Kernel sets up DMA parameters (start, length, timeout)
   b. Kernel initiates transfer of 1 sector to disk
   c. (in parallel with b) drive accepts transfer request and waits for data

2. IDE controller DMA used to transfer data to disk unit:
   a. hardware DMA tries to send 256 16-bit words of data to disk
   b. (in parallel) drive accepts none or, perhaps, some data from bus into
      internal buffer, but not all of it.

3. After waiting, IDE controller fires DMA timeout IRQ.

4. Kernel sees IRQ and emits warning message. Tries to reset bus and ....
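
The failure case, and the retry Daniel is asking about, can be sketched the
same way. Again this is a toy model under loud assumptions: FlakyDrive,
dma_write and write_with_retry are hypothetical names, and the retry loop
shows the behaviour being asked for, not what the kernel actually does after
the reset. In particular it assumes the drive throws away a partial sector and
is usable again after a failed attempt, which may well not hold on real
hardware.

```python
class FlakyDrive:
    """Hypothetical drive that stops acking after `accept_limit` words on its
    first `bad_attempts` transfers, then behaves normally."""

    def __init__(self, accept_limit, bad_attempts):
        self.accept_limit = accept_limit
        self.bad_attempts = bad_attempts
        self.attempts = 0
        self.buffer = []
        self.sectors = []   # complete sectors the drive has accepted

    def begin_transfer(self):
        self.attempts += 1
        self.buffer = []    # assumption: a partial sector is discarded

    def accept_word(self, word):
        failing = self.attempts <= self.bad_attempts
        if failing and len(self.buffer) >= self.accept_limit:
            return False    # step 2b: drive stops accepting data
        self.buffer.append(word)
        return True

    def end_transfer(self):
        self.sectors.append(self.buffer)
        self.buffer = []


def dma_write(drive, sector):
    """Returns 'done' for the end-of-transfer IRQ, 'timeout' otherwise."""
    drive.begin_transfer()
    for word in sector:
        if not drive.accept_word(word):
            return "timeout"            # step 3: DMA timeout IRQ
    drive.end_transfer()
    return "done"


def write_with_retry(drive, sector, max_tries=3):
    """Assumed policy: the transfer was never acked, so the kernel still
    holds the data and can simply send it again."""
    log = []
    for _ in range(max_tries):
        status = dma_write(drive, sector)
        log.append(status)
        if status == "done":
            return True, log
        # step 4 would go here: warning message, bus reset (not modelled)
    return False, log


drive = FlakyDrive(accept_limit=100, bad_attempts=1)
ok, log = write_with_retry(drive, list(range(256)))
```

The thing to notice is that the sector is still sitting in kernel memory when
the timeout fires, so nothing in this model prevents sending it again.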



Have I got this scenario right?

Ruth

--
Ruth Ivimey-Cook
Software engineer and technical writer.
