Re: virtio-iommu hotplug issue

From: Akihiko Odaki
Date: Thu Apr 13 2023 - 22:51:37 EST


On 2023/04/13 22:39, Eric Auger wrote:
Hi,

On 4/13/23 13:01, Akihiko Odaki wrote:
On 2023/04/13 19:40, Jean-Philippe Brucker wrote:
Hello,

On Thu, Apr 13, 2023 at 01:49:43PM +0900, Akihiko Odaki wrote:
Hi,

Recently I encountered a problem with the combination of Linux's
virtio-iommu driver and QEMU when an SR-IOV virtual function gets
disabled. I'd like to ask what kind of solution is appropriate here
and implement it if possible.

A PCIe device implementing the SR-IOV specification exports a virtual
function, and the guest can enable or disable it at runtime by writing
to a configuration register. To the guest, this effectively looks like
a PCI device being hotplugged.

Just so I understand this better: the guest gets a whole PCIe device PF
that implements SR-IOV, and so the guest can dynamically create VFs?
Out of curiosity, is that a hardware device assigned to the guest with
VFIO, or a device emulated by QEMU?

Yes, that's right. The guest can dynamically create and delete VFs.
The device is emulated by QEMU: igb, an Intel NIC recently added to
QEMU and projected to be released as part of QEMU 8.0.
From the description below, I understand you then bind this emulated
device to VFIO in the guest, correct?

Yes, that's correct.



In such a case, the kernel assumes the endpoint is
detached from the virtio-iommu domain, but QEMU actually does not
detach it.
The QEMU virtio-iommu device executes commands from the virtio-iommu
driver, and my understanding is that the VFIO infra is not at fault
here. As suggested by Jean, a detach command is probably missing.

VFIO just illustrates the problem and the origin of the problem is indeed virtio-iommu.

Regards,
Akihiko Odaki


This inconsistent view of the removed device sometimes prevents the VM
from correctly performing the following procedure, for example:
1. Enable a VF.
2. Disable the VF.
3. Open a vfio container.
4. Open the group which the PF belongs to.
5. Add the group to the vfio container.
6. Map some memory region.
7. Close the group.
8. Close the vfio container.
9. Repeat steps 3-8.

When the VF gets disabled, the kernel assumes the endpoint is detached
from the IOMMU domain, but QEMU actually doesn't detach it. Later, the
domain will be reused in steps 3-8.

In step 7, the PF will be detached. The kernel thinks there is no
endpoint attached and that the mapping the domain holds is cleared, but
the VF endpoint is still attached and the mapping is kept intact.

In step 9, the same domain will be reused again, and the kernel
requests to create a new mapping, but it will conflict with the
existing mapping and result in -EINVAL.
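
The sequence above can be sketched as a toy model of the device-side
state. This is only an illustration, not QEMU or kernel code: the
ToyDevice class, the endpoint names, the domain ID and the address range
are all hypothetical, and the model assumes the device drops a domain's
mappings only when its last endpoint is detached.

```python
import errno

PF, VF = "pf", "vf"   # hypothetical endpoint names
DOMAIN = 1            # hypothetical domain ID that the guest keeps reusing


class ToyDevice:
    """Device-side view of domains (the role QEMU plays in this thread)."""

    def __init__(self):
        self.endpoints = {}  # domain ID -> set of attached endpoints
        self.mappings = {}   # domain ID -> set of (start, end) ranges

    def attach(self, domain, endpoint):
        self.endpoints.setdefault(domain, set()).add(endpoint)
        self.mappings.setdefault(domain, set())

    def detach(self, domain, endpoint):
        self.endpoints.get(domain, set()).discard(endpoint)
        # Assumption: mappings go away only with the last endpoint.
        if not self.endpoints.get(domain):
            self.endpoints.pop(domain, None)
            self.mappings.pop(domain, None)

    def map(self, domain, start, end):
        ranges = self.mappings.setdefault(domain, set())
        if any(start <= e and s <= end for s, e in ranges):
            return -errno.EINVAL  # overlaps a mapping the device still holds
        ranges.add((start, end))
        return 0


def reproduce():
    dev = ToyDevice()
    # Steps 1-2: the VF is enabled and attached, then disabled. The guest
    # forgets the VF without sending a detach, so the device is never told.
    dev.attach(DOMAIN, VF)
    results = []
    for _ in range(2):                                   # steps 3-8, then 9
        dev.attach(DOMAIN, PF)                           # step 5
        results.append(dev.map(DOMAIN, 0x1000, 0x1fff))  # step 6
        dev.detach(DOMAIN, PF)                           # step 7
    return results
```

In this model the first map succeeds and the second one fails, because
the stale VF endpoint keeps the domain and its mapping alive across
step 7.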

This problem can be fixed by either of:
- requesting the detachment of the endpoint from the guest when the PCI
device is unplugged (the VF is disabled)

Yes, I think this is an issue in the virtio-iommu driver, which should
be sending a DETACH request when the VF is disabled, likely from
viommu_release_device(). I'll work on a fix unless you would like to do
it.

It would be nice if you could prepare a fix. I will test your patch
with my workload if you share it with me.

I can help with testing too.

Thanks

Eric

Regards,
Akihiko Odaki


- detecting that the PCI device is gone and automatically detaching it
on the QEMU side.
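
The two options can be compared in a small toy model. This is a sketch
under assumptions, not the actual driver or QEMU implementation: the
ToyDevice class, its unplug() hook, the run() helper and its flags are
hypothetical names, and the model assumes a domain's mappings are
dropped only when its last endpoint is detached.

```python
import errno


class ToyDevice:
    """Device-side view of domains (hypothetical model, not QEMU code)."""

    def __init__(self):
        self.endpoints = {}  # domain ID -> set of attached endpoints
        self.mappings = {}   # domain ID -> set of (start, end) ranges

    def attach(self, domain, endpoint):
        self.endpoints.setdefault(domain, set()).add(endpoint)
        self.mappings.setdefault(domain, set())

    def detach(self, domain, endpoint):
        self.endpoints.get(domain, set()).discard(endpoint)
        if not self.endpoints.get(domain):
            # Last endpoint gone: drop the domain and its mappings.
            self.endpoints.pop(domain, None)
            self.mappings.pop(domain, None)

    def unplug(self, endpoint):
        # Second option: the device notices the endpoint is gone and
        # detaches it from every domain on its own.
        stale = [d for d, eps in self.endpoints.items() if endpoint in eps]
        for domain in stale:
            self.detach(domain, endpoint)

    def map(self, domain, start, end):
        ranges = self.mappings.setdefault(domain, set())
        if any(start <= e and s <= end for s, e in ranges):
            return -errno.EINVAL
        ranges.add((start, end))
        return 0


def run(guest_sends_detach=False, device_auto_detach=False):
    dev = ToyDevice()
    dev.attach(1, "vf")          # VF enabled and attached
    if guest_sends_detach:       # first option: guest requests the detach
        dev.detach(1, "vf")
    if device_auto_detach:       # second option: device cleans up itself
        dev.unplug("vf")
    results = []
    for _ in range(2):           # attach PF, map, detach PF, twice
        dev.attach(1, "pf")
        results.append(dev.map(1, 0x1000, 0x1fff))
        dev.detach(1, "pf")
    return results
```

With neither flag set, the second map fails with -EINVAL; with either
option enabled, both maps succeed in this model.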

It is not completely clear to me which solution is more appropriate, as
the virtio-iommu specification is written in a way independent of the
endpoint mechanism and does not say what should be done when a PCI
device is unplugged.

Yes, I'm not sure it's in scope for the specification. It's more about
software guidance.

Thanks,
Jean