RE: [RFC PATCH V3 00/26] vfio/pci: Back guest interrupts from Interrupt Message Store (IMS)

From: Tian, Kevin
Date: Tue Nov 07 2023 - 03:29:19 EST


> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Friday, November 3, 2023 11:51 PM
>
> On Fri, 3 Nov 2023 07:23:13 +0000
> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>
> > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > Sent: Friday, November 3, 2023 5:14 AM
> > >
> > > On Thu, 2 Nov 2023 03:14:09 +0000
> > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > >
> > > > > From: Tian, Kevin
> > > > > Sent: Thursday, November 2, 2023 10:52 AM
> > > > >
> > > > > >
> > > > > > Without an in-tree user of this code, we're just chopping up code for
> > > > > > no real purpose. There's no reason that a variant driver requiring
> IMS
> > > > > > couldn't initially implement their own SET_IRQS ioctl. Doing that
> > > > >
> > > > > this is an interesting idea. We haven't seen a real usage which wants
> > > > > such MSI emulation on IMS for variant drivers. but if the code is
> > > > > simple enough to demonstrate the 1st user of IMS it might not be
> > > > > a bad choice. There are additional trap-emulation required in the
> > > > > device MMIO bar (mostly copying MSI permission entry which
> contains
> > > > > PASID info to the corresponding IMS entry). At a glance that area
> > > > > is 4k-aligned so should be doable.
> > > > >
> > > >
> > > > misread the spec. the MSI-X permission table which provides
> > > > auxiliary data to MSI-X table is not 4k-aligned. It sits in the 1st
> > > > 4k page together with many other registers. emulation of them
> > > > could be simple with a native read/write handler but not sure
> > > > whether any of them may sit in a hot path to affect perf due to
> > > > trap...
> > >
> > > I'm not sure if you're referring to a specific device spec or the PCI
> > > spec, but the PCI spec has long included an implementation note
> > > suggesting alignment of the MSI-X vector table and pba and separation
> > > from CSRs, and I see this is now even more strongly worded in the 6.0
> > > spec.
> > >
> > > Note though that for QEMU, these are emulated in the VMM and not
> > > written through to the device. The result of writes to the vector
> > > table in the VMM are translated to vector use/unuse operations, which
> > > we see at the kernel level through SET_IRQS ioctl calls. Are you
> > > expecting to get PASID information written by the guest through the
> > > emulated vector table? That would entail something more than a simple
> > > IMS backend to MSI-X frontend. Thanks,
> > >
> >
> > I was referring to IDXD device spec. Basically it allows a process to
> > submit a descriptor which contains a completion interrupt handle.
> > The handle is the index of a MSI-X entry or IMS entry allocated by
> > the idxd driver. To mark the association between application and
> > related handles the driver records the PASID of the application
> > in an auxiliary structure for MSI-X (called MSI-X permission table)
> > or directly in the IMS entry. This additional info includes whether
> > an MSI-X/IMS entry has PASID enabled and if yes what is the PASID
> > value to be checked against the descriptor.
> >
> > As you said virtualizing MSI-X table itself is via SET_IRQS and it's
> > 4k aligned. Then we also need to capture guest updates to the MSI-X
> > permission table and copy the PASID information into the
> > corresponding IMS entry when using the IMS backend. It's MSI-X
> > permission table not 4k aligned then trapping it will affect adjacent
> > registers.
> >
> > My quick check in idxd spec doesn't reveal an real impact in perf
> > critical path. Most registers are configuration/control registers
> > accessed at driver init time and a few interrupt registers related
> > to errors or administrative purpose.
>
> Right, it looks like you'll need to trap writes to the MSI-X
> Permissions Table via a sparse mmap capability to avoid assumptions
> whether it lives on the same page as the MSI-X vector table or PBA.
> Ideally the hardware folks have considered this to avoid any conflict
> with latency sensitive registers.
>
> The variant driver would use this for collecting the meta data relative
> to the IMS interrupt, but this is all tangential to whether we
> preemptively slice up vfio-pci-core's SET_IRQS ioctl or the iDXD driver
> implements its own.

Agree

>
> And just to be clear, I don't expect the iDXD variant driver to go to
> extraordinary lengths to duplicate the core ioctl, we can certainly
> refactor and export things where it makes sense, but I think it likely
> makes more sense for the variant driver to implement the shell of the
> ioctl rather than trying to multiplex the entire core ioctl with an ops
> structure that's so intimately tied to the core implementation and
> focused only on the MSI-X code paths. Thanks,
>

btw I'll let Reinette to decide whether she wants to implement such
a variant driver or waits until idxd siov driver is ready to demonstrate
the usage. One concern with the variant driver approach is lacking
of a real-world usage (e.g. what IMS brings when guest only wants
MSI-X on an assigned PF). Though it may provide a shorter path
to enable the IMS backend support, also comes with the long-term
maintenance burden.