Re: work queue of scsi fc transports should be serialized

From: Martin Wilck
Date: Mon May 22 2017 - 16:05:04 EST


On Sat, 2017-05-20 at 08:25 +0000, Dashi DS1 Cao wrote:
> On Fri, 2017-05-19 at 09:36 +0000, Dashi DS1 Cao wrote:
> > It seems there is a race of multiple "fc_starget_delete" of the
> > same rport, thus of the same SCSI host. The race leads to the race of
> > scsi_remove_target and it cannot be prevented by the code snippet
> > alone, even of the most recent version:
> >         spin_lock_irqsave(shost->host_lock, flags);
> >         list_for_each_entry(starget, &shost->__targets, siblings) {
> >                 if (starget->state == STARGET_DEL ||
> >                     starget->state == STARGET_REMOVE)
> >                         continue;
> > If there is a possibility that the starget is under deletion (state
> > == STARGET_DEL), it should be possible that list_next_entry(starget,
> > siblings) could cause a read access violation.
> > Hello Dashi,
> > Something else must be going on. From scsi_remove_target():
> > restart:
> >         spin_lock_irqsave(shost->host_lock, flags);
> >         list_for_each_entry(starget, &shost->__targets, siblings) {
> >                 if (starget->state == STARGET_DEL ||
> >                     starget->state == STARGET_REMOVE)
> >                         continue;
> >                 if (starget->dev.parent == dev ||
> >                     &starget->dev == dev) {
> >                         kref_get(&starget->reap_ref);
> >                         starget->state = STARGET_REMOVE;
> >                         spin_unlock_irqrestore(shost->host_lock,
> >                                                flags);
> >                         __scsi_remove_target(starget);
> >                         scsi_target_reap(starget);
> >                         goto restart;
> >                 }
> >         }
> >         spin_unlock_irqrestore(shost->host_lock, flags);
> > In other words, before scsi_remove_target() decides to call
> > __scsi_remove_target(), it changes the target state into
> > STARGET_REMOVE while holding the host lock.
> > This means that scsi_remove_target() won't
> > call __scsi_remove_target() twice and also that it won't invoke
> > list_next_entry(starget, siblings) after starget has been
> > freed.
> > Bart.
>
> In the crashes of Suse 12 sp1, the root cause is the deletion of a
> list node without holding the lock:
>         spin_lock_irqsave(shost->host_lock, flags);
>         list_for_each_entry_safe(starget, tmp, &shost->__targets,
>                                  siblings) {
>                 if (starget->state == STARGET_DEL)
>                         continue;
>                 if (starget->dev.parent == dev ||
>                     &starget->dev == dev) {
>                         /* assuming new targets arrive at the end */
>                         kref_get(&starget->reap_ref);
>                         spin_unlock_irqrestore(shost->host_lock,
>                                                flags);
>
>                         __scsi_remove_target(starget);
>                         list_move_tail(&starget->siblings, &reap_list);
>                         -- this deletion from the shost->__targets list
>                            is done without the lock.
>                         spin_lock_irqsave(shost->host_lock, flags);
>                 }
>         }
>         spin_unlock_irqrestore(shost->host_lock, flags);

I believe this is fixed in SLES12-SP1 kernel 3.12.53-60.30.1, with the
following patch:

* Mon Jan 18 2016 jthumshirn@xxxxxxx
- scsi: restart list search after unlock in scsi_remove_target
 (bsc#944749, bsc#959257).
- Delete
 patches.fixes/0001-SCSI-Fix-hard-lockup-in-scsi_remove_target.patch.
- commit 2490876
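
For readers without the SLES source at hand, the idea behind "restart
list search after unlock" can be sketched in user-space C. This is only
an illustration under stated assumptions, not the SUSE patch itself: a
pthread mutex stands in for shost->host_lock, and struct host, struct
target, host_add(), target_teardown(), remove_targets() and host_count()
are all made-up names. The point it tries to show is that a node is
claimed (state change) while the lock is held, the list is only modified
under the lock, and the walk starts over after the lock was dropped
instead of trusting a saved "next" pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's scsi_target states. */
enum tstate { T_RUNNING, T_REMOVE, T_DEL };

struct target {
    struct target *prev, *next; /* circular list, similar to list_head */
    int parent;                 /* stand-in for starget->dev.parent */
    enum tstate state;
};

struct host {
    pthread_mutex_t lock;       /* stand-in for shost->host_lock */
    struct target head;         /* list sentinel, never removed */
};

static void host_init(struct host *h)
{
    pthread_mutex_init(&h->lock, NULL);
    h->head.prev = h->head.next = &h->head;
}

/* Append a new target at the tail of the list, under the lock. */
static struct target *host_add(struct host *h, int parent)
{
    struct target *t = malloc(sizeof(*t));

    assert(t != NULL);
    t->parent = parent;
    t->state = T_RUNNING;
    pthread_mutex_lock(&h->lock);
    t->prev = h->head.prev;
    t->next = &h->head;
    h->head.prev->next = t;
    h->head.prev = t;
    pthread_mutex_unlock(&h->lock);
    return t;
}

/* Called WITHOUT the lock; takes it only for the actual unlink. */
static void target_teardown(struct host *h, struct target *t)
{
    /* slow, possibly sleeping work would happen here, lock not held */
    pthread_mutex_lock(&h->lock);
    t->state = T_DEL;
    t->prev->next = t->next;
    t->next->prev = t->prev;
    pthread_mutex_unlock(&h->lock);
    free(t);
}

/* Remove every target with the given parent; returns how many. */
static int remove_targets(struct host *h, int parent)
{
    struct target *t;
    int removed = 0;

restart:
    pthread_mutex_lock(&h->lock);
    for (t = h->head.next; t != &h->head; t = t->next) {
        if (t->state != T_RUNNING)
            continue;           /* already claimed by someone else */
        if (t->parent == parent) {
            t->state = T_REMOVE; /* claim it while holding the lock */
            pthread_mutex_unlock(&h->lock);
            target_teardown(h, t);
            removed++;
            goto restart;       /* list may have changed: start over */
        }
    }
    pthread_mutex_unlock(&h->lock);
    return removed;
}

/* Count the targets currently on the list. */
static int host_count(struct host *h)
{
    struct target *t;
    int n = 0;

    pthread_mutex_lock(&h->lock);
    for (t = h->head.next; t != &h->head; t = t->next)
        n++;
    pthread_mutex_unlock(&h->lock);
    return n;
}
```

Restarting costs a rescan per removed target, but it avoids carrying
any list position across the window in which the lock is not held.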

Regards,
Martin

--
Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)