Re: [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls

From: Christoph Hellwig
Date: Tue Jun 06 2017 - 06:48:53 EST


On Tue, Jun 06, 2017 at 12:31:42AM -0500, Bjorn Helgaas wrote:
> OK, sorry to be dense; it's taking me a long time to work out the
> details here. It feels like there should be a general principle to
> help figure out where we need locking, and it would be really awesome
> if we could include that in the changelog. But it's not obvious to me
> what that principle would be.

The principle is very simple: every method in struct device_driver
or structures derived from it like struct pci_driver MUST provide
exclusion vs ->remove. Usuaull by using device_lock().

If we don't provide such an exclusion the method call can race with
a removal in one form or another.

> But I'm still nervous because I think both threads will queue
> nvme_reset_work() work items for the same device, and I'm not sure
> they're prepared to run concurrently.

We had another bug in that area, and the fix for that is hopefully
going to go into the next 4.12-rc.

> I don't really think it should be the driver's responsibility to
> understand issues like this and worry about things like
> nvme_reset_work() running concurrently. So I'm thinking maybe the PCI
> core needs to be a little stricter here, but I don't know exactly
> *how*.
>
> What do you think?

The driver core / bus driver must ensure that method calls don't
race with ->remove. There is nothing the driver can do about it,
and the race is just as possible with explicit user removals or
hardware hotplug.