Re: [ofa-general] Re: list corruption on ib_srp load in v2.6.24-rc5

From: David Dillow
Date: Thu Jan 03 2008 - 15:10:33 EST



On Thu, 2008-01-03 at 17:30 +0900, FUJITA Tomonori wrote:
> On Wed, 02 Jan 2008 09:51:38 -0800
> Roland Dreier <rdreier@xxxxxxxxx> wrote:
>
> > > > Can you try this?
> > >
> > > That patched oopsed in scsi_remove_host(), but reversing the order has
> > > survived over 500 insert/probe/remove cycles.
> > >
> > > Tested-by: David Dillow <dillowda@xxxxxxxx>
> > > ---
> > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> > > index 950228f..77e8b90 100644
> > > --- a/drivers/infiniband/ulp/srp/ib_srp.c
> > > +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> > > @@ -2054,6 +2054,7 @@ static void srp_remove_one(struct ib_device *device)
> > > list_for_each_entry_safe(target, tmp_target,
> > > &host->target_list, list) {
> > > scsi_remove_host(target->scsi_host);
> > > + srp_remove_host(target->scsi_host);
> > > srp_disconnect_target(target);
> >
> > Where do we stand on this? What is the right place to put the
> > srp_remove_host? Is there a bug somewhere else?
>
> {sas|fc}_remove_host is called before scsi_remove_host. And in
> srp_remove_work(), we call srp_remove_host and then
> scsi_remove_host. ibmvscsi also calls them in that order.
>
> I thought that I messed up something in srp_transport_class. But I
> can't figure out what's wrong. The above patch works and is unlikely
> to lead to critical problems so I'm fine with it for now.

I added some debugging printk's -- the first word is the function name:
printk(KERN_DEBUG "ib_srp:srp_remove_one %p %p\n", target,
target->scsi_host);

printk(KERN_DEBUG "srp_rport_del %p %p %p %s\n", shost, rport, dev,
dev->kobj.k_name);

printk(KERN_DEBUG "transport_remove_dev %p %d\n", dev,
atomic_read(&dev->kobj.kref.refcount));

printk(KERN_DEBUG "transport_remove_classdev %p\n", dev);

printk(KERN_DEBUG "scsi_target_reap_usercontext %p %p %p\n", shost,
starget, &starget->dev);


And the dmesg output:

ib_srp:srp_remove_one ffff810845498450 ffff810845498000
srp_rport_del ffff810845498000 ffff8108450d6000 ffff8108450d6000 port-3:1
transport_remove_dev ffff8108450d6000 4
transport_remove_classdev ffff8108450d6000
srp_rport_del done
srp_rport_del ffff810845498000 ffff810845123028 ffff810845123028 target3:0:0
transport_remove_dev ffff810845123028 9
srp_rport_del done
transport_remove_dev ffff81084557f920 6
sd 0:0:0:0: [sda] Synchronizing SCSI cache
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
transport_remove_dev ffff8108454f6920 6
sd 0:0:0:1: [sdb] Synchronizing SCSI cache
sd 0:0:0:1: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
scsi_target_reap_usercontext ffff810845498000 ffff810845123000 ffff810845123028
transport_remove_dev ffff810845123028 2


It looks like srp_rport_del() is getting called for a device object it
doesn't own -- target3:0:0. And when scsi_remove_host() goes to remove
it, it is already gone.

Adding

if (strncpy(dev->kobject.k_name, "port-", 5))
return;

to the top of srp_rport_del() fixes the oops, so that seems to confirm
my hypothesis.

When scsi_remove_host() is called before srp_remove_host(), it removes
that "target3:0:0" entry, and all is happy, so the fix already posted
should be fine for 2.6.24, though we may want to fix up
srp_remove_work() as well -- I've not looked at it to see if it would
have the same problem.

As for a better fix, I'm not sure. I'll go out on a limb and bet the
other users of srp_remove_host() may have the same issue.
Dave


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/