RE: [PATCH] usb: uas: fix usb subsystem hang after power off hub port

From: Alan Stern
Date: Thu Apr 04 2019 - 15:33:41 EST


On Thu, 4 Apr 2019 Kento.A.Kobayashi@xxxxxxxx wrote:

> Hi,
>
> >> Root Cause
> >> - A block layer timeout happens after powering off the UAS USB device that is being accessed in the reproduction steps. During the timeout error handling, the SCSI host state becomes SHOST_CANCEL_RECOVERY, which causes I/O to hang so that the lock cannot be released. In the end, the USB subsystem hangs.
> >> The function call path is as follows:
> >> blk_mq_timeout_work
> >> ... -> scsi_times_out (... means some functions are not listed before this function.)
> >> ... -> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY)
> >> ... -> scsi_error_handler
> >> ... -> uas_eh_device_reset_handler
> >> -> usb_lock_device_for_reset <- take lock
> >> -> usb_reset_device
> >> ... -> rebind = uas_post_reset (return 1 since ENODEV)
> >> ... -> usb_unbind_and_rebind_marked_interfaces (rebind=1)
> >> ... -> uas_disconnect (scsi_host_set_state to SHOST_CANCEL_RECOVERY)
> >> ... -> scsi_queue_rq
> >> -> scsi_host_queue_ready(return 0 causes IO hangs up.)
> >
> >How does scsi_queue_rq get called here? As far as I can see, this shouldn't happen.
>
> We confirmed the function call path on Linux 4.9, where this problem occurred, since that is the version we are working on. In Linux 4.9, the last function is scsi_request_fn instead of scsi_queue_rq. In staging.git, we think scsi_queue_rq is called by the following path.
> uas_disconnect
> |- scsi_remove_host
> |- scsi_forget_host
> |- __scsi_remove_device
> |- device_del
> |- bus_remove_device
> |- device_release_driver
> |- device_release_driver_internal
> |- __device_release_driver
> |- drv->remove(dev) (sd_remove)
> |- sd_shutdown
> |- sd_sync_cache
> |- scsi_execute
... (unnecessary internal details elided)
> |- blk_mq_dispatch_rq_list
> |- q->mq_ops->queue_rq (scsi_queue_rq)

So it looks as though the SCSI subsystem doesn't like to have a reset
handler call scsi_remove_host. Commands dispatched by the removal
routines are forced to wait for the reset recovery to finish, which
won't happen until those commands have been completed.
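
To make the circular wait concrete, here is a minimal toy model in C. It is
not kernel code: the toy_* names, the state enum, and the polling loop are
all hypothetical stand-ins, and the real code would block rather than give
up after a bounded number of polls. It only assumes, as the call paths
quoted above suggest, that nothing is dispatched while the host is in a
recovery state.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy host states, loosely named after the SHOST_* values; not the kernel enum. */
enum toy_host_state { TOY_RUNNING, TOY_RECOVERY, TOY_CANCEL_RECOVERY };

struct toy_host {
    enum toy_host_state state;
    int pending;   /* commands queued but not yet dispatched */
};

/* Assumed readiness rule: nothing is dispatched while recovery is in progress. */
static bool toy_queue_ready(const struct toy_host *h)
{
    return h->state == TOY_RUNNING;
}

/* Try to dispatch one pending command; returns true if it completed. */
static bool toy_dispatch(struct toy_host *h)
{
    assert(h->pending > 0);
    if (!toy_queue_ready(h))
        return false;
    h->pending--;
    return true;
}

/*
 * A reset handler that removes the host while recovery is still active.
 * It queues one command (standing in for the one sent via sd_sync_cache)
 * and polls for it. Returns true if the command completed within
 * max_polls attempts.
 */
static bool toy_reset_handler_removes_host(struct toy_host *h, int max_polls)
{
    h->state = TOY_CANCEL_RECOVERY;  /* removal starts inside recovery */
    h->pending++;
    for (int i = 0; i < max_polls; i++)
        if (toy_dispatch(h))
            return true;
    return false;  /* the real code would wait here indefinitely */
}
```

In this model the handler can never see its command complete, because the
state only returns to running after the handler itself has returned.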

Is this a bug in the SCSI core? If not, we need to know what is the
right way to do things when a reset handler detects that the SCSI host
has been hot-unplugged.

James, Martin, any suggestions?

Alan Stern