Re: [RFC] /dev/ioasid uAPI proposal

From: Jason Gunthorpe
Date: Wed Jun 02 2021 - 12:58:46 EST


On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > /* Bind guest I/O page table */
> > > bind_data = {
> > > .ioasid = gva_ioasid;
> > > .addr = gva_pgtable1;
> > > // and format information
> > > };
> > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> >
> > Again I do wonder if this should just be part of alloc_ioasid. Is
> > there any reason to split these things? The only advantage to the
> > split is the device is known, but the device shouldn't impact
> > anything..
>
> I'm pretty sure the device(s) could matter, although they probably
> won't usually.

It is a bit subtle, but the /dev/iommu fd itself is connected to the
devices first. This prevents wildly incompatible devices from being
joined together, and allows some "get info" to report the capability
union of all devices if we want to do that.

The original concept was that devices joined would all have to support
the same IOASID format, at least for the kernel owned map/unmap IOASID
type. Supporting different page table formats may be a reason to
revisit that concept.

There is a small advantage to re-using the IOASID container because of
the get_user_pages caching and pinned accounting management at the FD
level.

I don't know if that small advantage is worth the extra complexity
though.

> But it would certainly be possible for a system to have two
> different host bridges with two different IOMMUs with different
> pagetable formats. Until you know which devices (and therefore
> which host bridge) you're talking about, you don't know what formats
> of pagetable to accept. And if you have devices from *both* bridges
> you can't bind a page table at all - you could theoretically support
> a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> in both formats, but it would be pretty reasonable not to support
> that.

The basic process for a user space owned pgtable mode would be:

1) qemu has to figure out what format of pgtable to use

Presumably it uses query functions using the device label. The
kernel code should look at the entire device path through all the
IOMMU HW to determine what is possible.

Or it already knows because the VM's vIOMMU is running in some
fixed page table format, or the VM's vIOMMU already told it, or
something.

2) qemu creates an IOASID and, based on #1, says 'I want this format'

3) qemu binds the IOASID to the device.

If qemu gets it wrong then the bind just fails.

4) For the next device qemu would have to figure out if it can re-use
an existing IOASID based on the required properties.
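
As a rough illustration of steps 1-4, here is a minimal userspace model in
C. It is only a sketch and not the proposed uAPI: the format enum, the
structs, and the helpers ioasid_alloc(), ioasid_bind_device(), ioasid_get()
and attach_device() are all hypothetical names made up for this example,
and each device is assumed to accept exactly one page table format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page table format identifiers; not part of any real uAPI. */
enum pgtable_fmt { FMT_X86_64_V1, FMT_ARM64_V2 };

/* Hypothetical device model: the single format its IOMMU path accepts. */
struct device {
	int label;
	enum pgtable_fmt hw_fmt;
};

/* Hypothetical model of an IOASID with a user space owned page table. */
struct ioasid {
	int id;
	enum pgtable_fmt fmt;
};

#define MAX_IOASIDS 8

struct ioasid_pool {
	struct ioasid slots[MAX_IOASIDS];
	size_t count;
};

/* Step 2: create an IOASID and say 'I want this format'. */
static struct ioasid *ioasid_alloc(struct ioasid_pool *pool,
				   enum pgtable_fmt fmt)
{
	struct ioasid *ioas;

	if (pool->count >= MAX_IOASIDS)
		return NULL;
	ioas = &pool->slots[pool->count];
	ioas->id = (int)pool->count;
	ioas->fmt = fmt;
	pool->count++;
	return ioas;
}

/* Step 3: bind to the device; a wrong format guess simply fails. */
static int ioasid_bind_device(const struct ioasid *ioas,
			      const struct device *dev)
{
	return ioas->fmt == dev->hw_fmt ? 0 : -1;
}

/* Step 4: re-use an existing IOASID with matching properties, else allocate. */
static struct ioasid *ioasid_get(struct ioasid_pool *pool,
				 enum pgtable_fmt fmt)
{
	size_t i;

	assert(pool->count <= MAX_IOASIDS);
	for (i = 0; i < pool->count; i++)
		if (pool->slots[i].fmt == fmt)
			return &pool->slots[i];
	return ioasid_alloc(pool, fmt);
}

/*
 * Steps 1-4 for one device. 'wanted' is the format chosen in step 1.
 * Returns the bound IOASID, or NULL if the bind fails. On failure a
 * newly allocated IOASID stays in the pool; a real caller would free it.
 */
static struct ioasid *attach_device(struct ioasid_pool *pool,
				    const struct device *dev,
				    enum pgtable_fmt wanted)
{
	struct ioasid *ioas = ioasid_get(pool, wanted);

	if (!ioas || ioasid_bind_device(ioas, dev) != 0)
		return NULL;
	return ioas;
}
```

In this model a second device with the same format gets the same IOASID
back, and a device that cannot take the requested format gets NULL.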

You pointed to the case of mixing vIOMMUs of different platforms. So
it is completely reasonable for qemu to ask for an "ARM 64 bit IOMMU
page table mode v2" while running on an x86 because that is what the
vIOMMU is wired to work with.

Presumably qemu will fall back to software emulation if this is not
possible.

One interesting option for software emulation is to just transform the
ARM page table format to an x86 page table format in userspace and use
nested bind/invalidate to synchronize with the kernel. With SW nesting
I suspect this would be much faster.

Jason