Re: [RFC] /dev/ioasid uAPI proposal

From: David Gibson
Date: Wed Jun 02 2021 - 03:29:23 EST


On Fri, May 28, 2021 at 04:58:39PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here we assume VFIO will support a new model where every bound
> > device is explicitly listed under /dev/vfio, so a device fd can be
> > acquired without going through the legacy container/group interface.
> > For illustration purposes those devices are just called dev[1...N]:
> >
> > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> > ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity, the examples below are all framed in terms of the
> > virtualization story. They are representative and could easily be
> > adapted to a non-virtualization scenario.
>
> For others: I don't think this is *strictly* necessary; we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable, though.

Leaving aside whether group fds should exist, while they *do* exist,
binding to an IOASID should be done on the group, not on an individual
device.
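
To make that concrete, a group-level bind might look something like
the sketch below. The ioctl name and struct layout are invented for
illustration, not part of the proposal:

    /* Sketch only: bind a whole VFIO group to an IOASID fd.
     * VFIO_GROUP_BIND_IOASID_FD and the struct are illustrative
     * names, not proposed uAPI.
     */
    struct vfio_group_bind_ioasid_fd {
        __u32 argsz;
        __u32 flags;
        __s32 ioasid_fd;    /* fd from open("/dev/ioasid") */
    } gbind = {
        .argsz     = sizeof(gbind),
        .flags     = 0,
        .ioasid_fd = ioasid_fd,
    };
    ioctl(group_fd, VFIO_GROUP_BIND_IOASID_FD, &gbind);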

[snip]
> > /* if dev1 is an ENQCMD-capable mdev, update the CPU PASID
> >  * translation structure through KVM
> >  */
> > pa_data = {
> >     .ioasid_fd   = ioasid_fd,
> >     .ioasid      = gva_ioasid,
> >     .guest_pasid = gpasid1,
> > };
> > ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> Makes sense
>
> > /* Bind guest I/O page table */
> > bind_data = {
> >     .ioasid = gva_ioasid,
> >     .addr   = gva_pgtable1,
> >     // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is that the device is known, but the device shouldn't impact
> anything..

I'm pretty sure the device(s) could matter, although they usually
won't. But it would certainly be possible for a system to have two
different host bridges with two different IOMMUs using different
pagetable formats. Until you know which devices (and therefore which
host bridge) you're talking about, you don't know which pagetable
formats to accept. And if you have devices from *both* bridges you
can't bind a page table at all - you could theoretically support a
kernel-managed pagetable by mirroring each MAP and UNMAP into tables
in both formats, but it would be pretty reasonable not to support
that.
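
One way to handle that would be a format query which only becomes
meaningful once devices are attached, so userspace can discover what
the backing IOMMU(s) will accept before attempting a bind. The ioctl
and fields below are invented for illustration:

    /* Sketch: query acceptable pagetable formats for an IOASID once
     * its attached devices (and hence host bridges / IOMMUs) are
     * known. IOASID_GET_IOMMU_INFO and the fields are illustrative
     * names, not proposed uAPI.
     */
    struct ioasid_iommu_info {
        __u32 argsz;
        __u32 ioasid;
        __u64 pgtable_formats;    /* bitmask of acceptable formats */
    } info = {
        .argsz  = sizeof(info),
        .ioasid = gva_ioasid,
    };
    ioctl(ioasid_fd, IOASID_GET_IOMMU_INFO, &info);
    /* With devices behind two bridges whose IOMMUs disagree, the
     * mask could be empty, and IOASID_BIND_PGTABLE would have to
     * fail.
     */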

> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. This is just the high-level flow from the host IOMMU
> > driver to the guest IOMMU driver and back).
> >
> > - The host IOMMU driver receives a page request with raw fault_data
> > {rid, pasid, addr};
> >
> > - The host IOMMU driver identifies the faulting I/O page table
> > according to information registered by the IOASID fault handler;
> >
> > - The IOASID fault handler is called with the raw fault_data (rid,
> > pasid, addr), which is saved in ioasid_data->fault_data (used for
> > the response);
> >
> > - The IOASID fault handler generates a user fault_data (ioasid,
> > addr), links it into the shared ring buffer, and signals the eventfd
> > to userspace;
>
> Here the rid should be translated to a labeled device, returning the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> was bound, the label might match a rid or a (rid, pasid).

I like the idea of labelling devices when they're attached; it makes
extension to non-PCI devices much more obvious than having to deal
with concrete RIDs.

But remember that we can only (reliably) determine the rid up to the
group boundary. So if you're labelling devices, all devices in a
group would have to have the same label. Or you attach the label to a
group, not a device, which would be a reason to represent the group
as an object again.
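
With labels at group granularity, the label would be supplied at bind
time and handed back in fault reports, along these lines (again, all
names invented for illustration):

    /* Sketch: suppose the earlier group-bind struct grew a
     * caller-chosen label field...
     */
    gbind.label = 1;    /* hypothetical field, chosen by userspace */
    ioctl(group_fd, VFIO_GROUP_BIND_IOASID_FD, &gbind);

    /* ...then fault reports could identify the group by that label
     * rather than by a raw rid.
     */
    struct ioasid_fault_event {
        __u32 ioasid;
        __u32 label;    /* matches the label given at bind time */
        __u32 pasid;    /* where relevant */
        __u64 addr;
    };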

> > - Upon receiving the event, Qemu needs to find the virtual routing
> > information (v_rid + v_pasid) of the device attached to the faulting
> > ioasid. If there are multiple, pick a random one. This should be
> > fine since the purpose is to fix the I/O page table on the guest;
>
> The device label should fix this
>
> > - Qemu finds the pending fault event, converts virtual completion data
> > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> > complete the pending fault;
> >
> > - /dev/ioasid finds the pending fault data {rid, pasid, addr} saved
> > in ioasid_data->fault_data, and then calls the IOMMU API to complete
> > it with {rid, pasid, response_code};
>
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
>
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > The PASID table is put in the GPA space on some platforms, and thus
> > must be updated by the guest. It is treated as another user page
> > table to be bound with the IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every
> > user I/O page table to the kernel, so the same pgtable binding
> > protocol (bind, cache invalidate and fault handling) is unified
> > across platforms.
> >
> > vIOMMUs may include a caching mode (or a paravirtualized mechanism)
> > which, once enabled, requires the guest to invalidate the PASID
> > cache for any change to the PASID table. This allows Qemu to track
> > the lifespan of guest I/O page tables.
> >
> > In the absence of such a capability, Qemu could enable
> > write-protection on the guest PASID table to achieve the same
> > effect.
> >
> > /* After boot */
> > /* Make the vPASID space nested on the GPA space */
> > pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                         gpa_ioasid);
> >
> > /* Attach dev1 to pasidtbl_ioasid */
> > at_data = { .ioasid = pasidtbl_ioasid };
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind the PASID table */
> > bind_data = {
> >     .ioasid = pasidtbl_ioasid,
> >     .addr   = gpa_pasid_table,
> >     // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> > /* vIOMMU detects that a new GVA I/O space has been created */
> > gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                    gpa_ioasid);
> >
> > /* Attach dev1 to the new address space, with gpasid1 */
> > at_data = {
> >     .ioasid     = gva_ioasid,
> >     .flag       = IOASID_ATTACH_USER_PASID,
> >     .user_pasid = gpasid1,
> > };
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind the guest I/O page table. Because SET_PASID_TABLE has been
> >  * used, the kernel will not update the PASID table. Instead, it
> >  * just tracks the bound I/O page table for handling invalidation
> >  * and I/O page faults.
> >  */
> > bind_data = {
> >     .ioasid = gva_ioasid,
> >     .addr   = gva_pgtable1,
> >     // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I still don't quite get the benefit of doing this.
>
> The idea of creating an all-PASID IOASID seems to work better, with
> less fuss, on HW that directly parses the guest's PASID table.
>
> Cache invalidate seems easy enough to support
>
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
>
> It is true that it creates an additional flow qemu has to implement,
> but it does directly mirror the HW.
>
> Jason
>

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
