Re: [RFC] /dev/ioasid uAPI proposal

From: Jason Gunthorpe
Date: Wed Jun 02 2021 - 13:19:35 EST


On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote:

> Is there a compelling reason to have all the IOASIDs handled by one
> FD?

There was an answer on this: if every PASID needs an IOASID then there
are too many FDs.

It is difficult to share the get_user_pages cache across FDs.

There are global properties in the /dev/iommu FD, like what devices
are part of it, that are important for group security operations. This
becomes confusing if it is split across many FDs.
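
As an illustration of the page-pinning point, here is a minimal userspace
sketch, assuming one context object stands in for the single FD. The names
iommu_ctx, ctx_pin and MAX_PAGES are hypothetical and not part of the
proposal. Two IOASIDs in the same context that map the same page share one
pin, which would be hard to arrange if each IOASID lived behind its own FD.

```c
#include <stddef.h>

#define MAX_PAGES 8	/* hypothetical tiny cache size for the sketch */

/* Hypothetical per-FD context: one pin cache shared by all its IOASIDs. */
struct iommu_ctx {
	unsigned long vpage[MAX_PAGES];	/* pinned user page addresses */
	unsigned int refs[MAX_PAGES];	/* how many mappings use each pin */
	size_t pin_calls;		/* times the expensive pin really ran */
};

/* Pin @vpage once per context; later users only take a reference. */
static int ctx_pin(struct iommu_ctx *ctx, unsigned long vpage)
{
	size_t free_slot = MAX_PAGES;

	for (size_t i = 0; i < MAX_PAGES; i++) {
		if (ctx->refs[i] && ctx->vpage[i] == vpage) {
			ctx->refs[i]++;	/* cache hit: no new pin needed */
			return 0;
		}
		if (!ctx->refs[i] && free_slot == MAX_PAGES)
			free_slot = i;
	}
	if (free_slot == MAX_PAGES)
		return -1;		/* cache full */

	ctx->vpage[free_slot] = vpage;
	ctx->refs[free_slot] = 1;
	ctx->pin_calls++;		/* stands in for get_user_pages */
	return 0;
}
```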

> > I/O address space can be managed through two protocols, according to
> > whether the corresponding I/O page table is constructed by the kernel or
> > the user. When kernel-managed, a dma mapping protocol (similar to
> > existing VFIO iommu type1) is provided for the user to explicitly specify
> > how the I/O address space is mapped. Otherwise, a different protocol is
> > provided for the user to bind a user-managed I/O page table to the
> > IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> > handling.
> >
> > Pgtable binding protocol can be used only on the child IOASID's, implying
> > IOASID nesting must be enabled. This is because the kernel doesn't trust
> > userspace. Nesting allows the kernel to enforce its DMA isolation policy
> > through the parent IOASID.
>
> To clarify, I'm guessing that's a restriction of likely practice,
> rather than a fundamental API restriction. I can see a couple of
> theoretical future cases where a user-managed pagetable for a "base"
> IOASID would be feasible:
>
> 1) On some fancy future MMU allowing free nesting, where the kernel
> would insert an implicit extra layer translating user addresses
> to physical addresses, and the userspace manages a pagetable with
> its own VAs being the target AS

I would model this by having an "SVA" parent IOASID. An "SVA" IOASID is
one where the IOVA == process VA and the kernel maintains this mapping.

Since the uAPI is so general I do have a general expectation that the
drivers/iommu implementations might need to be a bit more complicated.
For example, if the HW can optimize certain specific graphs of IOASIDs
we would still model them as graphs, and the HW driver would have to
"compile" the graph into the optimal hardware configuration.

This approach has worked reasonably well in other kernel areas.
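
A minimal sketch of what "compiling" a graph could look like, assuming an
imaginary IOMMU that supports at most two stages of translation. Every name
here (ioas_node, ioas_kind, hw_mode, ioas_compile) is hypothetical and not
taken from the proposal or from drivers/iommu.

```c
#include <stddef.h>

/* Hypothetical page-table kinds for a node in the IOASID graph. */
enum ioas_kind { IOAS_KERNEL_MAP, IOAS_SVA, IOAS_USER_PGTABLE };

struct ioas_node {
	enum ioas_kind kind;
	const struct ioas_node *parent;	/* NULL for a root IOASID */
};

/* Hypothetical hardware configurations the driver can select. */
enum hw_mode { HW_SINGLE_STAGE, HW_TWO_STAGE, HW_UNSUPPORTED };

/*
 * "Compile" the chain ending at @leaf: a root maps to one stage of
 * translation, a child of a root maps to two-stage (nested) translation,
 * and anything deeper is rejected by this imaginary hardware.
 */
static enum hw_mode ioas_compile(const struct ioas_node *leaf)
{
	size_t depth = 0;

	for (const struct ioas_node *n = leaf; n; n = n->parent)
		depth++;

	if (depth == 1)
		return HW_SINGLE_STAGE;
	if (depth == 2)
		return HW_TWO_STAGE;
	return HW_UNSUPPORTED;
}
```

The uAPI-level graph stays the same in every case; only the driver's
mapping from graph shape to hardware mode differs per platform.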

> 2) For a purely software virtual device, where its virtual DMA
> engine can interpret user addresses fine

This also sounds like an SVA IOASID.

Depending on the HW, if a device can really only bind to a very narrow
kind of IOASID then it should ask for that (probably platform specific!)
type during its attachment request to drivers/iommu.

eg "I am special hardware and only know how to do PLATFORM_BLAH
transactions, give me an IOASID compatible with that". If the only way
to create "PLATFORM_BLAH" is with an SVA IOASID because BLAH is
hardwired to the CPU ASID then that is just how it is.
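
A minimal sketch of such a typed attachment request. The names ioas_format,
FMT_PLATFORM_BLAH, attach_req and ioas_attach_check are hypothetical, and
the rule that PLATFORM_BLAH requires an SVA IOASID is only the example
assumption from the paragraph above.

```c
#include <errno.h>

/* Hypothetical IOASID formats a device can ask for. */
enum ioas_format { FMT_GENERIC, FMT_PLATFORM_BLAH };

struct ioas {
	enum ioas_format format;
	int is_sva;		/* IOVA == process VA, kernel maintained */
};

/* What the device driver states it needs when attaching. */
struct attach_req {
	enum ioas_format required;
};

/* Return 0 if @ioas satisfies the device's request, -EINVAL otherwise. */
static int ioas_attach_check(const struct attach_req *req,
			     const struct ioas *ioas)
{
	if (ioas->format != req->required)
		return -EINVAL;

	/* Assumption: BLAH is hardwired to the CPU ASID, so only SVA works. */
	if (req->required == FMT_PLATFORM_BLAH && !ioas->is_sva)
		return -EINVAL;

	return 0;
}
```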

> I wonder if there's a way to model this using a nested AS rather than
> requiring special operations. e.g.
>
>   'prereg' IOAS
>   |
>   \- 'rid' IOAS
>      |
>      \- 'pasid' IOAS (maybe)
>
> 'prereg' would have a kernel managed pagetable into which (for
> example) qemu platform code would map all guest memory (using
> IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's
> IO mappings into the 'rid' IOAS in terms of GPA.
>
> This wouldn't quite work as is, because the 'prereg' IOAS would have
> no devices. But we could potentially have another call to mark an
> IOAS as a purely "preregistration" or pure virtual IOAS. Using that
> would be an alternative to attaching devices.

It is one option for sure, and it is where my thinking was going when we
were talking in the other thread. I think the decision is best driven by
the implementation, as the data structure that stores the
preregistration data should be rather purpose built.

> > /*
> > * Map/unmap process virtual addresses to I/O virtual addresses.
> > *
> > * Provide VFIO type1 equivalent semantics. Start with the same
> > * restriction e.g. the unmap size should match those used in the
> > * original mapping call.
> > *
> > * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > * must be already in the preregistered list.
> > *
> > * Input parameters:
> > * - u32 ioasid;
> > * - refer to vfio_iommu_type1_dma_{un}map
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
>
> I'm assuming these would be expected to fail if a user managed
> pagetable has been bound?

Me too, and the same if an SVA page table is in use.

This document would do well to have a list of imagined page table
types and the set of operations that act on them. I think they are all
pretty disjoint.

Your presentation of 'kernel owns the table' vs 'userspace owns the
table' is a useful clarification to call out too.
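
A minimal sketch of such a list, written as a type/operation check. The
enum names and ioas_op_allowed are hypothetical. The kernel-managed and
user-managed rows follow the protocols quoted above; treating an SVA table
as accepting none of these operations is an assumption made only for
illustration.

```c
#include <errno.h>

/* Hypothetical page table types. */
enum pgtable_type { PGT_KERNEL_MAP, PGT_USER_BOUND, PGT_SVA };

/* Hypothetical operation classes from the two protocols. */
enum ioas_op { OP_MAP_DMA, OP_UNMAP_DMA, OP_BIND_PGTABLE, OP_INVALIDATE_IOTLB };

/* Return 0 if @op makes sense on a table of @type, -EINVAL otherwise. */
static int ioas_op_allowed(enum pgtable_type type, enum ioas_op op)
{
	switch (type) {
	case PGT_KERNEL_MAP:	/* kernel owns the table: map/unmap only */
		return (op == OP_MAP_DMA || op == OP_UNMAP_DMA) ? 0 : -EINVAL;
	case PGT_USER_BOUND:	/* userspace owns the table */
		return (op == OP_BIND_PGTABLE ||
			op == OP_INVALIDATE_IOTLB) ? 0 : -EINVAL;
	case PGT_SVA:		/* assumption: mirrors the CPU, no user ops */
		return -EINVAL;
	}
	return -EINVAL;
}
```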

> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
>
> Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> filenames for actual PCI functions. Maybe /dev/vfio/mdev/something
> for mdevs. That leaves other subdirs of /dev/vfio free for future
> non-PCI device types, and /dev/vfio itself for the legacy group
> devices.

There are a bunch of nice options here if we go down this path.

> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid.
>
> Doesn't really affect your example, but note that the PAPR IOMMU does
> not have a passthrough mode, so devices will not initially be attached
> to gpa_ioasid - they will be unusable for DMA until attached to a
> gIOVA ioasid.

I think attachment should always be explicit in the API. If the user
doesn't explicitly ask for a device to be attached to the IOASID then
the iommu driver is free to block it.

If you want passthrough then you have to create a passthrough IOASID
and attach every device to it. Some of those attaches might be NOPs
due to groups.
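
A minimal sketch of explicit attachment, assuming that attachment state is
tracked per group. The names sketch_group, sketch_dev, attach_device and
dma_allowed are hypothetical. A device whose group was never attached stays
blocked, and attaching a second device from an already attached group is a
NOP.

```c
#define NO_IOASID (-1)	/* hypothetical marker: group is not attached */

struct sketch_group {
	int ioasid;		/* NO_IOASID means DMA is blocked */
};

struct sketch_dev {
	struct sketch_group *group;
};

/* Return 1 if the group's attachment changed, 0 if the call was a NOP. */
static int attach_device(struct sketch_dev *dev, int ioasid)
{
	if (dev->group->ioasid == ioasid)
		return 0;	/* another device in the group did the work */
	dev->group->ioasid = ioasid;
	return 1;
}

/* DMA is only permitted after an explicit attach. */
static int dma_allowed(const struct sketch_dev *dev)
{
	return dev->group->ioasid != NO_IOASID;
}
```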

Jason