More on the IDE multiwrite problem

Alan Cox (alan@lxorguk.ukuu.org.uk)
Sat, 18 Dec 1999 00:20:56 +0000 (GMT)


I've been doing some more digging. Having been over the multiwrite code
(with the io unlock fix in it) Im mystified what is going on.

The basic logic seems quite clean, walk the bh list copying sectors
and build up multi-block sized chunks for PIO. It uses a copy of the request
so ought to be immune to any kind of corruption from I/O request races. The
rq it is working on should also be safe from tampering because it is checked
by ll_rw_blk anyway.

The error case appears to have a definite pattern. It is always

multiwrite_intr status=0x50 (or 52) { DriveReady SeekCompete }
buffer list corrupted
[maybe a repeat of the error code]
then either a crash or an I/O error log from end_request

The reports all indicate

2.2.13 is fine
2.2.14preX (7 and higher for definite) fails
2.2.13+ ide.2.2.13.19991111.patch fails

The crash cases look like ide_end_request fed crash to end_that_request_first.
The end of the request comes from ide_multiwrite called from do_rw_disk
from multiwrite_intr from do_rw_disk from start_request from do_rw_request
from ide_do_request from unplug_device. Im not 100% sure some of the trace isnt
noise but its oddly repetitive.

The good news is we can go back to 2.2.13 IDE, and I'd prefer to do this and
live with the odd SMP hangs than this bug which is affecting far more people.

I'll keep on poking around the code. The obvious path for an error would
seem to be

hwgroup->wrq = *rq;
ide_set_handler(drive, &multiwrite_intr, WAIT_CMD)
-> IRQ here <-.
multiwrite_intr gets an error
completes the request
=> END IRQ <-
ide_multiwrite ....
BOOM!

Alan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/