Re: [PATCH vfio] vfio/pci: remove msi domain on msi disable

From: Jason Gunthorpe
Date: Mon Sep 18 2023 - 20:02:23 EST


On Tue, Sep 19, 2023 at 01:47:37AM +0200, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 20:37, Jason Gunthorpe wrote:
> > On Mon, Sep 18, 2023 at 08:43:21PM +0200, Thomas Gleixner wrote:
> >> On Mon, Sep 18 2023 at 11:17, Jason Gunthorpe wrote:
> >> > On Thu, Sep 14, 2023 at 12:14:06PM -0700, Shannon Nelson wrote:
> >> >> The new MSI dynamic allocation machinery is great for making the irq
> >> >> management more flexible. It includes caching information about the
> >> >> MSI domain which gets reused on each new open of a VFIO fd. However,
> >> >> this causes an issue when the underlying hardware has flexible MSI-x
> >> >> configurations, as a changed configuration doesn't get seen between
> >> >> new opens, and is only refreshed between PCI unbind/bind cycles.
> >> >>
> >> >> In our device we can change the per-VF MSI-x resource allocation
> >> >> without the need for rebooting or function reset. For example,
> >> >>
> >> >> 1. Initial power up and kernel boot:
> >> >> # lspci -s 2e:00.1 -vv | grep MSI-X
> >> >> Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-
> >> >>
> >> >> 2. Device VF configuration change happens with no reset
> >> >
> >> > Is this an out of tree driver problem?
> >> >
> >> > The in-tree way to alter the MSI configuration is via
> >> > sriov_set_msix_vec_count, and there is only one in-tree driver that
> >> > uses it right now.
> >>
> >> Right, but that only addresses the driver specific issues.
> >
> > Sort of.. sriov_vf_msix_count_store() is intended to be the entry
> > point for this and if the kernel grows places that cache the value or
> > something then this function should flush those caches too.
>
> Sorry. What I wanted to say is that the driver callback is not the right
> place to reload the MSI domains after the change.

Oh, that isn't even what Shannon's patch does, it patched VFIO's main
PCI driver - not a sriov_set_msix_vec_count() callback :( Shannon's
scenario doesn't even use sriov_vf_msix_count_store() at all - the AMD
device just randomly changes its MSI count whenever it likes.

> > I suppose flushing happens implicitly because Shannon reports that
> > things work fine if the driver is rebound. Since
> > sriov_vf_msix_count_store() ensures there is no driver bound before
> > proceeding, probe/unprobe must be flushing out everything?
>
> Correct. So sriov_set_msix_vec_count() could just do:
>
> 	ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);
> 	if (!ret)
> 		teardown_msi_domain(pdev);
>
> Right?

It subtly isn't needed: sriov_vf_msix_count_store() already requires
that no driver is bound to the device, and this:

int msi_setup_device_data(struct device *dev)
{
	struct msi_device_data *md;
	int ret, i;

	if (dev->msi.data)
		return 0;

	md = devres_alloc(msi_device_data_release, sizeof(*md), GFP_KERNEL);
	if (!md)
		return -ENOMEM;

already ensures that msi_remove_device_irq_domain() has been called,
because msi_device_data_release() runs as part of the devm shutdown of
the bound driver.
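
As a rough illustration of that lifetime, here is a userspace model, not
kernel code. struct model_device, model_msi_setup() and
model_driver_unbind() are hypothetical names, and cached_vec_count only
stands in for whatever per-device MSI state gets cached. It shows why a
count changed while the data is still attached goes unnoticed, and why
an unbind/bind cycle picks it up:

```c
#include <stdlib.h>

/* Hypothetical stand-in for the cached per-device MSI state. */
struct model_msi_data {
	int cached_vec_count;
};

/* Hypothetical stand-in for struct device plus the hardware's view. */
struct model_device {
	struct model_msi_data *msi_data;	/* models dev->msi.data */
	int hw_vec_count;			/* what the device currently advertises */
};

/* Models msi_setup_device_data(): allocate once, reuse while present. */
static int model_msi_setup(struct model_device *dev)
{
	if (dev->msi_data)
		return 0;	/* existing data is reused, hardware is not re-read */

	dev->msi_data = malloc(sizeof(*dev->msi_data));
	if (!dev->msi_data)
		return -1;

	dev->msi_data->cached_vec_count = dev->hw_vec_count;
	return 0;
}

/* Models the devres release that runs when the bound driver goes away. */
static void model_driver_unbind(struct model_device *dev)
{
	free(dev->msi_data);
	dev->msi_data = NULL;
}
```

In this model, changing hw_vec_count and calling model_msi_setup() again
leaves cached_vec_count at its old value until model_driver_unbind() has
run, which mirrors the behaviour Shannon reported.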

So, the in-tree mechanism to change the MSI-X vector count works. The
crazy mechanism where the device just changes its value without
synchronizing with the OS does not.

I don't think we need to try and fix that..

Jason