Re: Plan for /dev/ioasid RFC v2

From: David Gibson
Date: Thu Jun 24 2021 - 00:53:35 EST


On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
> On Thu, 17 Jun 2021 07:31:03 +0000
> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > Sent: Thursday, June 17, 2021 3:40 AM
> > > On Wed, 16 Jun 2021 06:43:23 +0000
> > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > > > Sent: Wednesday, June 16, 2021 12:12 AM
> > > > > On Tue, 15 Jun 2021 02:31:39 +0000
> > > > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > > > > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > > > > > Sent: Tuesday, June 15, 2021 12:28 AM
[snip]

> > > > > 3) A dual-function conventional PCI e1000 NIC where the functions
> > > > > are grouped together due to a shared RID.
> > > > >
> > > > > a) Repeat 2.a) and 2.b) such that we have valid, user-accessible
> > > > > devices in the same IOMMU context.
> > > > >
> > > > > b) Function 1 is detached from the IOASID.
> > > > >
> > > > > I think function 1 cannot be placed into a different IOMMU context
> > > > > here; does the detach work? What's the IOMMU context now?
> > > >
> > > > Yes. Function 1 goes back to block-DMA. Since both functions share
> > > > a RID, this effectively puts function 0 into the block-DMA state
> > > > too (though its tracking state may not have changed yet), because
> > > > the shared IOMMU context entry now blocks DMA. In the IOMMU fd,
> > > > function 0 is still attached to the IOASID, so the user still needs
> > > > to do an explicit detach to clear function 0's tracking state.
> > > >
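
If I'm following, the sequence for 3.a) and 3.b) would be roughly the
below (the ioctl and fd names here are invented for illustration; the
RFC's uAPI isn't settled):

	/* fn0_fd / fn1_fd: already-opened fds for the two functions,
	 * which share a RID and hence an IOMMU context entry */
	int iommu_fd = open("/dev/iommu", O_RDWR);
	int ioasid = ioctl(iommu_fd, IOASID_ALLOC);

	/* 3.a) attach both functions to the same IOASID */
	ioctl(fn0_fd, VFIO_DEVICE_ATTACH_IOASID, &ioasid);
	ioctl(fn1_fd, VFIO_DEVICE_ATTACH_IOASID, &ioasid);

	/* 3.b) detaching function 1 moves the shared context entry to
	 * block-DMA, which blocks function 0's DMA as well... */
	ioctl(fn1_fd, VFIO_DEVICE_DETACH_IOASID, &ioasid);

	/* ...but function 0 is still tracked as attached, so the user
	 * must detach it explicitly to clear that state */
	ioctl(fn0_fd, VFIO_DEVICE_DETACH_IOASID, &ioasid);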
> > > > >
> > > > > c) A new IOASID is alloc'd within the existing iommu_fd and function
> > > > > 1 is attached to the new IOASID.
> > > > >
> > > > > Where, how, by whom does this fail?
> > > >
> > > > No need to fail. It can succeed, since doing so only hurts the
> > > > user's own foot.
> > > >
> > > > The only question is how the user learns that a group of devices
> > > > shares a RID, so they know to avoid doing this. I'm curious how it
> > > > is communicated with today's VFIO mechanism. Yes, the group-centric
> > > > VFIO uAPI prevents a group of devices from attaching to multiple
> > > > IOMMU contexts, but I suppose we still need a way to tell the user
> > > > not to do so. In particular, such knowledge should also be
> > > > reflected in the virtual PCI topology when the entire group is
> > > > assigned to a guest, which needs to know this fact when a vIOMMU is
> > > > exposed. I haven't found time to investigate it, but I suppose if
> > > > such a channel exists it could be reused, or in the worst case we
> > > > may need a new device capability interface to convey it...
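
FWIW, the closest thing I can see today is that the user can enumerate
the group's members through sysfs, e.g. (addresses made up):

	$ ls /sys/bus/pci/devices/0000:01:00.0/iommu_group/devices
	0000:01:00.0  0000:01:00.1

That only says the devices are grouped, though, not that the grouping
is because they alias the same RID.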
> > >
> > > No such channel currently exists; it's not an issue today because
> > > IOMMU context is group-based.
> >
> > Interesting... If such a group of devices is assigned to a guest, how
> > does QEMU decide the virtual PCI topology for them? Do they have the
> > same vRID or different ones?
>
> That's the beauty of it: it doesn't matter how many RIDs exist in the
> group or which devices have aliases. The group is the minimum
> granularity of a container, and QEMU knows that a container provides
> a single address space, so a container must exist within a single
> address space in the PCI topology. In a conventional or non-vIOMMU
> topology, the PCI address space is equivalent to the system memory
> address space. When a vIOMMU gets involved, multiple devices within
> the same group must still exist in the same address space, and a
> vPCIe-to-PCI bridge can be used to create that shared address space.
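
Concretely, I take it that's something like the following QEMU
configuration (a minimal sketch; the host BDFs are made up):

	qemu-system-x86_64 -machine q35 \
	    -device intel-iommu \
	    -device pcie-pci-bridge,id=pci.1,bus=pcie.0 \
	    -device vfio-pci,host=01:00.0,bus=pci.1,addr=0x1 \
	    -device vfio-pci,host=01:00.1,bus=pci.1,addr=0x2 \
	    ...

Both functions then sit behind the same conventional PCI bridge, so
the vIOMMU sees them in a single address space.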
>
> I've referred to this as a limitation of type1: we can't put devices
> within the same group into different address spaces, such as behind
> separate vRoot-Ports in a vIOMMU config. But really, who cares? As
> isolation support improves and multi-device groups become rarer, this
> scenario becomes the exception. Buy better hardware to use the
> devices independently.

Also, that limitation is fundamental. Groups in a guest must always
be the same or strictly bigger than groups in the host, because if the
real hardware can't isolate them, then the virtual hardware certainly
can't and the guest kernel shouldn't be given the impression that it
can separate them.
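
Expressed as a check, the invariant is something like this
(illustrative pseudo-C; the types and helpers are made up):

	/*
	 * Every device in a host IOMMU group must land in the same
	 * guest-side address space: guest "groups" may merge host
	 * groups, but must never split one.
	 */
	bool assignment_is_valid(struct host_group *groups, int ngroups)
	{
		for (int i = 0; i < ngroups; i++) {
			struct host_group *g = &groups[i];
			void *as = guest_address_space(g->devices[0]);

			for (int j = 1; j < g->ndevices; j++)
				if (guest_address_space(g->devices[j]) != as)
					return false;	/* group split: invalid */
		}
		return true;
	}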

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
