Re: ide_do_drive_cmd() problems

Jens Axboe (axboe@image.dk)
Sun, 12 Sep 1999 20:07:54 +0200


On Sun, Sep 12 1999, Gadi Oxman wrote:
> If, however, somewhere between the call to down() and the release of
> the irq lock, someone calls local __sti() *and* unluckily, the request
> could have been serviced by the drive quickly, we would attempt to grab
> the irq lock on the processor which already owns it and deadlock.

This sounds like a plausible explanation of what is happening.
I was unable to test with the IKD patch, which reduced my
attempts to fixing this to guessing, mainly.

> In the path called by down(), I can see the run_task_queue() call,
> which is called by schedule() before the release_kernel_lock() call.
> This call, for example, can call unplug_device() for another IDE
> interface which has queued requests, and there we would do an __sti()
> in the ide_do_request() function.. Does moving the run_task_queue()
> call after the release_kernel_lock() call makes any difference with
> the original ide_do_drive_cmd()?

No, it still deadlocks when releasing the kernel lock before
run_task_queue(). I found out the hard way, again!

> BTW, our attempt to minimize the race potential between down() and
> up() is probably no longer needed with current kernels, as the semaphores
> are SMP and interrupt safe. The important part is to install the
> semaphore in the request before grabbing the io_request_lock and adding
> it to the queue, and we should safely be able to remove the save_flags(),
> cli() (both global and local) and restore_flags() around the call to
> down(), and let the standard kernel up()/down() functions handle the
> race between them.

This fixed the problem beatifully! I stressed ide_do_drive_cmd()
vigorously (4 drives active, ripping audio and playing video
CD's) for about an hour without a hitch. This still doesn't
explain why ide_do_drive_cmd_wait() works well too - I haven't
looked at the code generated, though. So could we just loose
the save_flags/cli/restore_flags in ide_do_drive_cmd()?

> On another point, the use of:
>
> struct request *cur_rq = drive->queue;
>
> in the rewritten function below outside of the io_request_lock()
> part is dangerous, as an interrupt could occur just between the
> above line and the spin_lock_irqsave(&io_request_lock, flags); line
> and change cur_rq from under us.

Good point. The function was just a desperate measure to attempt
to get _some_ debugging info out of this. Not even a hooked up
serial console could help.

-- 
*  Jens Axboe <axboe@image.dk>
*  Linux CD-ROM Maintainer
*  http://www.kernel.dk

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/