Re: [PATCH 15/17] vfio/pci: Let enable and disable of interrupt types use same signature

From: Alex Williamson
Date: Thu Feb 08 2024 - 16:08:50 EST


On Wed, 7 Feb 2024 15:30:15 -0800
Reinette Chatre <reinette.chatre@xxxxxxxxx> wrote:

> Hi Alex,
>
> On 2/6/2024 3:19 PM, Alex Williamson wrote:
> > On Tue, 6 Feb 2024 14:22:04 -0800
> > Reinette Chatre <reinette.chatre@xxxxxxxxx> wrote:
> >> On 2/6/2024 2:03 PM, Alex Williamson wrote:
> >>> On Tue, 6 Feb 2024 13:46:37 -0800
> >>> Reinette Chatre <reinette.chatre@xxxxxxxxx> wrote:
> >>>> On 2/5/2024 2:35 PM, Alex Williamson wrote:
> >>>>> On Thu, 1 Feb 2024 20:57:09 -0800
> >>>>> Reinette Chatre <reinette.chatre@xxxxxxxxx> wrote:
> >>>>
> >>>> ..
> >>>>
> >>>>>> @@ -715,13 +724,13 @@ static int vfio_pci_set_intx_trigger(struct vfio_pci_core_device *vdev,
> >>>>>> if (is_intx(vdev))
> >>>>>> return vfio_irq_set_block(vdev, start, count, fds, index);
> >>>>>>
> >>>>>> - ret = vfio_intx_enable(vdev);
> >>>>>> + ret = vfio_intx_enable(vdev, start, count, index);
> >>>>>
> >>>>> Please trace what happens when a user calls SET_IRQS to setup a trigger
> >>>>> eventfd with start = 0, count = 1, followed by any other combination of
> >>>>> start and count values once is_intx() is true. vfio_intx_enable()
> >>>>> cannot be the only place we bounds check the user, all of the INTx
> >>>>> callbacks should be an error or nop if vector != 0. Thanks,
> >>>>>
> >>>>
> >>>> Thank you very much for catching this. I plan to add the vector
> >>>> check to the device_name() and request_interrupt() callbacks. I do
> >>>> not think it is necessary to add the vector check to disable() since
> >>>> it does not operate on a range and from what I can tell it depends on
> >>>> a successful enable() that already contains the vector check. Similar,
> >>>> free_interrupt() requires a successful request_interrupt() (that will
> >>>> have vector check in next version).
> >>>> send_eventfd() requires a valid interrupt context that is only
> >>>> possible if enable() or request_interrupt() succeeded.
> >>>
> >>> Sounds reasonable.
> >>>
> >>>> If user space creates an eventfd with start = 0 and count = 1
> >>>> and then attempts to trigger the eventfd using another combination then
> >>>> the changes in this series will result in a nop while the current
> >>>> implementation will result in -EINVAL. Is this acceptable?
> >>>
> >>> I think by nop, you mean the ioctl returns success. Was the call a
> >>> success? Thanks,
> >>
> >> Yes, I mean the ioctl returns success without taking any
> >> action (nop).
> >>
> >> It is not obvious to me how to interpret "success" because from what I
> >> understand current INTx and MSI/MSI-x are behaving differently when
> >> considering this flow. If I understand correctly, INTx will return
> >> an error if user space attempts to trigger an eventfd that has not
> >> been set up while MSI and MSI-x will return 0.
> >>
> >> I can restore existing INTx behavior by adding more logic and a return
> >> code to the send_eventfd() callback so that the different interrupt types
> >> can maintain their existing behavior.
> >
> > Ah yes, I see the dilemma now. INTx always checked start/count were
> > valid but MSI/X plowed through regardless, and with this series we've
> > standardized the loop around the MSI/X flow.
> >
> > Tricky, but probably doesn't really matter. Unless we break someone.
> >
> > I can ignore that INTx can be masked and signaling a masked vector
> > doesn't do anything, but signaling an unconfigured vector feels like an
> > error condition and trying to create verbiage in the uAPI header to
> > weasel out of that error and unconditionally return success makes me
> > cringe.
> >
> > What if we did this:
> >
> > uint8_t *bools = data;
> > ...
> > for (i = start; i < start + count; i++) {
> > if ((flags & VFIO_IRQ_SET_DATA_NONE) ||
> > ((flags & VFIO_IRQ_SET_DATA_BOOL) && bools[i - start])) {
> > ctx = vfio_irq_ctx_get(vdev, i);
> > if (!ctx || !ctx->trigger)
> > return -EINVAL;
> > intr_ops[index].send_eventfd(vdev, ctx);
> > }
> > }
> >
>
> This looks good. Thank you very much. Will do.
>
> I studied the code more and have one more observation related to this portion
> of the flow:
> From what I can tell this change makes the INTx code more robust. If I
> understand current implementation correctly it seems possible to enable
> INTx but not have interrupt allocated. In this case the interrupt context
> (ctx) will exist but ctx->trigger will be NULL. Current
> vfio_pci_set_intx_trigger()->vfio_send_intx_eventfd() only checks if
> ctx is valid. It looks like it may call eventfd_signal(NULL) where
> pointer is dereferenced.
>
> If this is correct then I think a separate fix that can easily be
> backported may be needed. Something like:

Good find. I think it's a bit more complicated though. There are
several paths to vfio_send_intx_eventfd:

- vfio_intx_handler

This can only be called between request_irq() and free_irq()
where trigger is always valid. igate is not held.

- vfio_pci_intx_unmask

Callers hold igate, additional test of ctx->trigger makes this
safe.

- vfio_pci_set_intx_trigger

Same as above.

- Through unmask eventfd (virqfd)

Here be dragons.

In the virqfd case, a write to the eventfd calls virqfd_wakeup() where
we'll call the handler, vfio_pci_intx_unmask_handler(), and based on
the result schedule the thread, vfio_send_intx_eventfd(). Both of
these look suspicious. They're not called under igate, so testing
ctx->trigger doesn't resolve the race.

I think an option is to wrap the virqfd entry points in igate where we
can then do something similar to your suggestion. I don't think we
want to WARN_ON(!ctx->trigger) because that's then a user reachable
condition. Instead we can just quietly follow the same exit paths.

I think that means we end up with something like this:

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 237beac83809..ace7e1dbc607 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -92,12 +92,21 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
struct vfio_pci_irq_ctx *ctx;

ctx = vfio_irq_ctx_get(vdev, 0);
- if (WARN_ON_ONCE(!ctx))
+ if (WARN_ON_ONCE(!ctx) || !ctx->trigger)
return;
eventfd_signal(ctx->trigger);
}
}

+static void vfio_send_intx_eventfd_virqfd(void *opaque, void *unused)
+{
+ struct vfio_pci_core_device *vdev = opaque;
+
+ mutex_lock(&vdev->igate);
+ vfio_send_intx_eventfd(opaque, unused);
+ mutex_unlock(&vdev->igate);
+}
+
/* Returns true if the INTx vfio_pci_irq_ctx.masked value is changed. */
bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
{
@@ -170,7 +179,7 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
}

ctx = vfio_irq_ctx_get(vdev, 0);
- if (WARN_ON_ONCE(!ctx))
+ if (WARN_ON_ONCE(!ctx) || !ctx->trigger)
goto out_unlock;

if (ctx->masked && !vdev->virq_disabled) {
@@ -194,6 +203,18 @@ static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
return ret;
}

+static int vfio_pci_intx_unmask_handler_virqfd(void *opaque, void *unused)
+{
+ struct vfio_pci_core_device *vdev = opaque;
+ int ret;
+
+ mutex_lock(&vdev->igate);
+ ret = vfio_pci_intx_unmask_handler(opaque, unused);
+ mutex_unlock(&vdev->igate);
+
+ return ret;
+}
+
void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev)
{
if (vfio_pci_intx_unmask_handler(vdev, NULL) > 0)
@@ -572,10 +593,10 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_core_device *vdev,
if (WARN_ON_ONCE(!ctx))
return -EINVAL;
if (fd >= 0)
- return vfio_virqfd_enable((void *) vdev,
- vfio_pci_intx_unmask_handler,
- vfio_send_intx_eventfd, NULL,
- &ctx->unmask, fd);
+ return vfio_virqfd_enable((void *)vdev,
+ vfio_pci_intx_unmask_handler_virqfd,
+ vfio_send_intx_eventfd_virqfd, NULL,
+ &ctx->unmask, fd);

vfio_virqfd_disable(&ctx->unmask);
}


WDYT?

> > And we note the behavior change for MSI/X in the commit log and if
> > someone shouts that we broke them, we can make that an -errno or
> > continue based on is_intx(). Sound ok? Thanks,
>
> I'll be sure to highlight the impact on MSI/MSI-x. Please do expect this
> in the final patch "vfio/pci: Remove duplicate interrupt management flow"
> though since that is where the different flows are merged.
>
> I am not familiar with how all user space interacts with this flow and if/how
> this may break things. I did look at Qemu code and I was not able to find
> where it intentionally triggers MSI/MSI-x interrupts, I could only find it
> for INTx.

Being able to trigger the interrupt via ioctl is more of a diagnostic
feature, not typically used in production.

> If this does break things I would like to also consider moving the
> different behavior into the interrupt type's respective send_eventfd()
> callback instead of adding interrupt type specific code (like
> is_intx()) into the shared flow.

Sure, we can pick the best option in the slim (imo) chance the change
affects anyone. Thanks,

Alex