Re: [PATCH] PCI/IOV: update num_VFs earlier

From: Bjorn Helgaas
Date: Tue Oct 01 2019 - 19:45:25 EST


On Fri, Apr 26, 2019 at 10:11:54AM +0200, CREGUT Pierre IMT/OLN wrote:
> I also initially thought that kobject_uevent generated the netlink event
> but this is not the case. This is generated by the specific driver in use.
> For the Intel i40e driver, this is the call to i40e_do_reset_safe in
> i40e_pci_sriov_configure that sends the event.
> It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
> finally calls the generic pci_enable_sriov function.
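
To check that I understand the ordering you describe, here is a toy
user-space model. It is not kernel code and every name in it is made up
for illustration. It assumes the listener reads the count at the instant
the event fires, which is the worst case for the race:

```c
/* Toy user-space model of the ordering; NOT kernel code.
 * All names (toy_iov, toy_listener, ...) are hypothetical. */
struct toy_iov {
    unsigned short num_vfs; /* what a read of sriov_numvfs would return */
    unsigned short seen;    /* value the listener observed at event time */
};

/* Stand-in for a monitor that reads sysfs as soon as the event arrives. */
static void toy_listener(struct toy_iov *iov)
{
    iov->seen = iov->num_vfs;
}

/* Stand-in for the generic enable path committing the new count. */
static void toy_sriov_enable(struct toy_iov *iov, unsigned short n)
{
    iov->num_vfs = n;
}

/* Ordering described above: the driver reset emits the event first,
 * and only afterwards is the new count committed. */
static unsigned short toy_configure_event_first(struct toy_iov *iov,
                                                unsigned short n)
{
    toy_listener(iov);
    toy_sriov_enable(iov, n);
    return iov->seen;
}

/* Alternative ordering: the count is committed before the event fires. */
static unsigned short toy_configure_event_last(struct toy_iov *iov,
                                               unsigned short n)
{
    toy_sriov_enable(iov, n);
    toy_listener(iov);
    return iov->seen;
}
```

Starting from num_vfs == 0 and asking for 4 VFs, the event-first variant
makes the listener see 0 while the event-last variant makes it see 4.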

I don't know anything about netlink. The script from the bugzilla
(https://bugzilla.kernel.org/show_bug.cgi?id=202991) looks like it
runs

  ip monitor dev enp9s0f2

What are the actual netlink events you see? Are they related to a
device being removed?

When we change num_VFs, I think we have to disable any existing VFs
before enabling the new num_VFs, so if you trigger on a netlink
"remove" event, I wouldn't be surprised that reading sriov_numvfs
would give a zero until the new VFs are enabled.

> So the proposed patch works well for the i40e driver (x710 cards) because
> the update to num_VFs is fast enough to be committed before the event is
> received. It may not work with other cards. The same is true for the zero
> value and there is no guarantee for other cards.
>
> The clean solution would be to lock the device in sriov_numvfs_show.
> I guess that there are good reasons why locks have been avoided
> in sysfs getter functions so let us explore other approaches.
>
> We can either return a "not settled" value (-1) or (probably better)
> do not return a value but an error (-EAGAIN returned by the show
> function).
>
> To distinguish this "not settled" situation we can either:
> * overload the meaning of num_VFs (e.g. make it negative),
>   but it is an unsigned short.
> * add a bool to the pci_sriov struct (rather simple, but it modifies
>   a well-established structure).
> * use the fact that not_settled => device is locked and use
>   mutex_is_locked as an over-approximation.
>
> The latter is not perfect but requires minimal changes to
> sriov_numvfs_show:
>
>     if (mutex_is_locked(&dev->mutex))
>         return -EAGAIN;

I thought this was a good idea, but

  - It does break the device_lock() encapsulation a little bit:
    sriov_numvfs_store() uses device_lock(), which happens to be
    implemented as "mutex_lock(&dev->mutex)", but we really shouldn't
    rely on that implementation, and

  - The netlink events are being generated via the NIC driver, and I'm
    a little hesitant about changing the PCI core to deal with timing
    issues "over there".
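
If we did go the -EAGAIN route, user space would have to cope with it.
A rough sketch of what a reader might look like follows. The helper name
read_attr_retry() and the 10 ms delay are made up, and note that
sriov_numvfs does not return -EAGAIN today; that is only the behavior
proposed above:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Read a small decimal attribute (e.g. sriov_numvfs), retrying while
 * read() fails with EAGAIN.
 * ASSUMPTION: EAGAIN from this attribute is the proposal in this thread,
 * not something the kernel does today.
 * Returns 0 and stores the value in *out, or a negative errno.
 */
static int read_attr_retry(const char *path, int max_tries, long *out)
{
    struct timespec delay = { 0, 10 * 1000 * 1000 }; /* 10 ms, arbitrary */
    char buf[32];

    for (int i = 0; i < max_tries; i++) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -errno;

        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        int err = errno;    /* save before close() can overwrite it */
        close(fd);

        if (n > 0) {
            buf[n] = '\0';
            *out = strtol(buf, NULL, 10);
            return 0;
        }
        if (n == 0)
            return -EIO;    /* empty attribute: nothing to parse */
        if (err != EAGAIN)
            return -err;    /* hard error, do not retry */

        nanosleep(&delay, NULL);
    }
    return -EAGAIN;         /* still "not settled" after max_tries */
}
```

Every existing reader of sriov_numvfs would need something like this
loop, which is part of why I'm hesitant.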

> In all cases, the device could be locked or the boolean set just
> after the test. But I don't think there is a case where causality
> would be violated.
>
> Thank you in advance for your recommendations. I will update the
> patch according to your instructions.
>
> On 06/04/2019 at 00:33, Bjorn Helgaas wrote:
> > On Fri, Mar 29, 2019 at 09:00:58AM +0100, Pierre Crégut wrote:
> > > Ensure that iov->num_VFs is set before a netlink message is sent
> > > when the number of VFs is changed. Only the path for num_VFs > 0
> > > is affected. The path for num_VFs = 0 is already correct.
> > >
> > > Monitoring programs can rely on netlink messages to track interface
> > > changes and query their state in /sys. But when sriov_numvfs is set to a
> > > positive value, the netlink message is sent before the value is available
> > > in sysfs. The value read after the message is received is always zero.
> > Thanks, Pierre! Can you clue me in on where exactly the connection
> > from sriov_enable() to netlink is?
> >
> > I see one side of the race is with sriov_numvfs_show(), but I don't
> > know where the netlink message is sent. Is that connected with the
> > kobject_uevent(KOBJ_CHANGE)?
> >
> > One thing this would help with is figuring out exactly how *much*
> > earlier we need to set iov->num_VFs. It looks like the current patch
> > sets it before we actually enable the VFs, so a user could read
> > /sys/.../sriov_numvfs and get the wrong value. Of course, that's
> > unavoidable; the question is whether it's OK to get the new value
> > *before* it actually takes effect, or whether we want to return a
> > stale value until after it takes effect.
> >
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
> > > Signed-off-by: Pierre Crégut <pierre.cregut@xxxxxxxxxx>
> > > ---
> > > note: the behaviour can be tested with the following shell script also
> > > available on the bugzilla (d being the phy device name):
> > >
> > > ip monitor dev $d | grep --line-buffered "^[0-9]*:" | \
> > > while read line; do cat /sys/class/net/$d/device/sriov_numvfs; done
> > >
> > > drivers/pci/iov.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > index 3aa115ed3a65..a9655c10e87f 100644
> > > --- a/drivers/pci/iov.c
> > > +++ b/drivers/pci/iov.c
> > > @@ -351,6 +351,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > > goto err_pcibios;
> > > }
> > > + iov->num_VFs = nr_virtfn;
> > > pci_iov_set_numvfs(dev, nr_virtfn);
> > > iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
> > > pci_cfg_access_lock(dev);
> > > @@ -363,7 +364,6 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > > goto err_pcibios;
> > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> > > - iov->num_VFs = nr_virtfn;
> > > return 0;
> > > @@ -379,6 +379,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > > if (iov->link != dev->devfn)
> > > sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > + iov->num_VFs = 0;
> > > pci_iov_set_numvfs(dev, 0);
> > > return rc;
> > > }
> > > --
> > > 2.17.1
> > >