Re: sata_mv port lockup on hotplug (kernel 2.6.38.2)

From: Tejun Heo
Date: Mon Sep 05 2011 - 23:46:01 EST


Hello,

On Fri, Sep 02, 2011 at 05:22:38PM +0100, Bruce Stenning wrote:
> Unfortunately it has so far been quite difficult to reproduce when specifically
> attempting to. In normal use cases I reproduced it twice by unplugging a drive
> from a RAID array with redundancy intact. This was out of around a dozen
> cycles of waiting until redundancy was restored while the unit was under load,
> popping the disk, reinserting, and triggering a RAID rebuild.

Hmm... that's unfortunate.

> I have only twice managed to trigger a lockup deliberately. In both cases the
> tracing showed a scheduled EH which was subsequently not enacted.
>
> How long could it take for the EH to be enacted? In the lockups that I
> have reproduced it did not seem to have recovered minutes later, but perhaps
> if I had waited longer...? I have noticed that error recovery sometimes backs
> off for 8s and even 33s, but it always warns when that sort of delay is coming
> up.

It should happen pretty quickly. In such cases, fastdrain is
activated and all pending commands are aborted if they don't complete
within 3 secs, and then EH should kick in. The backoff comes from the
reset path only, to give breathing time to devices which take a long
time to spin up, and doesn't apply in this case.

> I shall continue to try to track down why the scheduled EH does not happen.

Can you please add some debug printk's to scsi_schedule_eh() and see
whether scsi_eh_wakeup() is invoked from there? It seems likely that
the problem is caused by race conditions around
SHOST_[CANCEL_]RECOVERY flags.
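
To illustrate where the printk's would be useful, here is a rough
user-space model of the decision being traced. It is a sketch, not
kernel code: the host_model struct, the model_* functions and the
reduced transition table are all hypothetical, and it assumes that
the wakeup is only reached when one of the two state changes
succeeds. In the real function the prints would be printk() calls at
the same three points (entry, wakeup, refused state change).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, reduced set of host states (the kernel has more). */
enum host_state {
	HOST_RUNNING,
	HOST_CANCEL,
	HOST_RECOVERY,
	HOST_CANCEL_RECOVERY,
	HOST_DEL,
};

struct host_model {
	enum host_state state;
	int eh_scheduled;	/* stands in for a "EH requested" counter */
	int wakeups;		/* number of times the EH wakeup was reached */
};

/*
 * Illustrative transition rules only, NOT copied from the kernel.
 * Returns 0 if the transition is allowed, -1 otherwise.
 */
static int model_set_state(struct host_model *h, enum host_state new_state)
{
	switch (new_state) {
	case HOST_RECOVERY:
		if (h->state != HOST_RUNNING)
			return -1;
		break;
	case HOST_CANCEL_RECOVERY:
		if (h->state != HOST_CANCEL)
			return -1;
		break;
	default:
		return -1;
	}
	h->state = new_state;
	return 0;
}

/* Stand-in for the wakeup; the trace line shows it was reached. */
static void model_eh_wakeup(struct host_model *h)
{
	h->wakeups++;
	printf("eh_wakeup: state=%d eh_scheduled=%d\n",
	       (int)h->state, h->eh_scheduled);
}

/* Stand-in for the scheduling function, with trace points added. */
static void model_schedule_eh(struct host_model *h)
{
	assert(h != NULL);
	printf("schedule_eh: enter, state=%d\n", (int)h->state);

	if (model_set_state(h, HOST_RECOVERY) == 0 ||
	    model_set_state(h, HOST_CANCEL_RECOVERY) == 0) {
		h->eh_scheduled++;
		model_eh_wakeup(h);
	} else {
		/* The interesting case: EH was requested but never woken. */
		printf("schedule_eh: state change refused, no wakeup (state=%d)\n",
		       (int)h->state);
	}
}
```

If the "refused" line (or its printk equivalent) shows up right before
a lockup, that would point at the host state at that moment rather
than at the EH thread itself.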

Thanks.

--
tejun