Re: [PATCH vfio] vfio/pci: remove msi domain on msi disable

From: Nelson, Shannon
Date: Mon Sep 18 2023 - 13:49:05 EST


On 9/18/2023 7:17 AM, Jason Gunthorpe wrote:

On Thu, Sep 14, 2023 at 12:14:06PM -0700, Shannon Nelson wrote:
The new MSI dynamic allocation machinery is great for making the irq
management more flexible. It includes caching information about the
MSI domain which gets reused on each new open of a VFIO fd. However,
this causes an issue when the underlying hardware has flexible MSI-x
configurations, as a changed configuration doesn't get seen between
new opens, and is only refreshed between PCI unbind/bind cycles.

In our device we can change the per-VF MSI-x resource allocation
without the need for rebooting or function reset. For example,

1. Initial power up and kernel boot:
# lspci -s 2e:00.1 -vv | grep MSI-X
Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-

2. Device VF configuration change happens with no reset

Is this an out of tree driver problem?

No, not an out-of-tree driver, this is the vfio pci core.


The intree way to alter the MSI configuration is via
sriov_set_msix_vec_count, and there is only one in-tree driver that
uses it right now.

If something is going wrong here it should be fixed in the
sriov_set_msix_vec_count() machinery, possibly in the pci core to
synchronize the msi_domain view of the world.

Jason

The sriov_set_msix_vec_count method assumes
(a) the unbind/bind cycle on the VF, and
(b) VF MSIx count change configured from the host
neither of which are the case in our situation.

In our case, the VF device's msix count value found in PCI config space is changed by device configuration management outside of the baremetal host and read by the QEMU instance when it starts up, and then read by the vfio PCI core when QEMU requests the first IRQ. The core code enables the msix range on first IRQ request, and disables it when the vfio fd is closed. It creates the msi_domain on the first call if it doesn't exist, but it does not remove the msi_domain when the irqs are disabled.

The IRQ request call trace looks like:
QEMU: vfio_msix_vector_do_use->ioctl()
(driver.vfio_device_ops.ioctl = vfio_pci_core_ioctl)
vfio_pci_core_ioctl
vfio_pci_ioctl_set_irqs
vfio_pci_set_irqs_ioctl
vfio_pci_set_msi_trigger
vfio_msi_enable
pci_alloc_irq_vectors
pci_alloc_irq_vectors_affinity
__pci_enable_msix_range
pci_setup_msix_device_domain
return if msi_domain exists
pci_create_device_domain
msi_create_device_irq_domain
__msi_create_irq_domain - sets info->hwsize
msi_capability_init
msi_setup_msi_desc
msi_insert_msi_desc
msi_domain_insert_msi_desc
msi_insert_desc
fail if index >= hwsize

On close of the vfio fd, the trace is:
QEMU: close()
driver.vfio_device_ops.close
vfio_pci_core_close_device
vfio_pci_core_disable
vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
VFIO_IRQ_SET_ACTION_TRIGGER,
vdev->irq_type, 0, 0, NULL);
vfio_pci_set_msi_trigger
vfio_msi_disable
pci_free_irq_vectors

The msix vectors are freed, but the msi_domain is not, and the msi_domain holds the MSIx count that it read when it was created. If the device's MSIx count is increased, the next QEMU session will see the new number in PCI config space and try to use that new larger number, but the msi_domain is still using the smaller hwsize and the QEMU IRQ setup fails in msi_insert_desc().

This patch adds a msi_remove_device_irq_domain() call when the irqs are disabled in order to force a new read on the next IRQ allocation cycle. This is limited to only the vfio use of the msi_domain.

I suppose we could add this to the trailing end of callbacks in our own driver, but this looks more like a generic vfio/msi issue than a driver specific thing.

The other possibility is to force the user to always do a bind cycle between QEMU sessions using the VF. This seems to be unnecessary overhead and was not necessary when using the v6.1 kernel. To the user, this looks like a regression - this is how it was reported to me.

sln