Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes"

From: Bjorn Helgaas
Date: Tue Feb 13 2024 - 11:00:03 EST


On Tue, Feb 13, 2024 at 09:34:50AM +0200, Leon Romanovsky wrote:
> On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote:
> > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote:
> > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote:
> > > > On 2/9/24 3:52 PM, Jim Harris wrote:
> > > > > If an SR-IOV enabled device is held by vfio, and the device
> > > > > is removed, vfio will hold device lock and notify userspace
> > > > > of the removal. If userspace reads the sriov_numvfs sysfs
> > > > > entry, that thread will be blocked since sriov_numvfs_show()
> > > > > also tries to acquire the device lock. If that same thread
> > > > > is responsible for releasing the device to vfio, it results
> > > > > in a deadlock.
> > > > >
> > > > > The proper way to detect a change to the num_VFs value is to
> > > > > listen for a sysfs event, not to add a device_lock() on the
> > > > > attribute _show() in the kernel.
> >
> > The lock was not about detecting a change; Pierre did this:
> >
> > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \
> > cat ${path}/device/sriov_numvfs; \
> >
> > which I assume works by listening for sysfs events.
>
> It is not, "ip monitor ..." listens to netlink events emitted by
> netdev core and not sysfs events. Sysfs events are not involved in
> this case.

Thanks for correcting my hasty assumption!

> > The problem was that after the event occurred, the sriov_numvfs
> > read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991).
>
> Yes, and it is outcome of such cross-subsytem involvement, which
> is racy by definition. Someone can come with even simpler example of why
> locking sysfs read and write is not a good idea.
>
> For example, let's consider the following scenario with two CPUs and
> locks on sysfs read and write:
>
> CPU1 CPU2
> echo 1 > ${path}/device/sriov_numvfs
> context_switch ->
> cat ${path}/device/sriov_numvfs
> lock
> return 0
> unlock
> context_switch <-
> lock
> set 1
> unlock
>
> CPU1 CPU2
> echo 1 > ${path}/device/sriov_numvfs
> lock
> set 1
> unlock
> context_switch ->
> cat ${path}/device/sriov_numvfs
> lock
> return 1
> unlock
>
> So same scenario will return different values if user doesn't protect
> such case with external to the kernel lock.
>
> But if we return back to Pierre report and if you want to provide
> completely bullet proof solution to solve cross-subsystem interaction,
> you will need to prohibit device probe till sriov_numvfs update is completed.
> However, it is overkill for something that is not a real issue.

Pierre wanted to detect the configuration change and learn the new
num_vfs, which seems like a reasonable thing to do. Is there a way to
do both via netlink or some other mechanism?

> > So I would drop this sentence because I don't think it accurately
> > reflects the reason for 35ff867b7657.
> >
> > > > Since you are reverting a commit that synchronizes SysFS read
> > > > /write, please add some comments about why it is not an
> > > > issue anymore.
> > >
> > > It was never an issue, the idea that sysfs read and write should be
> > > serialized by kernel is not correct by definition.
> >
> > I think it *was* an issue. The behavior Pierre observed at was
> > clearly wrong,
>
> I disagree with this sentence.
>
> > and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs
> > sriov_numvfs reads vs writes") to resolve it.
> >
> > We should try to avoid reintroducing the problem, so I think we should
> > probably squash these two patches and describe it as a deadlock fix
> > instead of dismissing 35ff867b7657 as being based on false premises.
> >
> > It would be awesome if you had time to verify that these patches also
> > resolve the problem you saw, Pierre.
>
> They won't resolve his problem, because he is not listening to sysfs
> events, but rely on something from netdev side.

I guess that means that if we apply this revert, the problem Pierre
reported will return. Obviously the deadlock is more important than
the inconsistency Pierre observed, but from the user's point of view
this will look like a regression.

Maybe listening to netlink and then looking at sysfs isn't the
"correct" way to do this, but I don't want to just casually break
existing user code. If we do contemplate doing the revert, at the
very least we should include specific details about what the user code
*should* do instead, at the level of the actual commands to use
instead of "ip monitor dev; cat ${path}/device/sriov_numvfs".

Bjorn