RE: [RFC] /dev/ioasid uAPI proposal

From: Tian, Kevin
Date: Thu Jun 03 2021 - 02:39:37 EST


> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, May 29, 2021 7:37 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> >
> > /*
> > * Check whether an uAPI extension is supported.
> > *
> > * This is for FD-level capabilities, such as locked page pre-registration.
> > * IOASID-level capabilities are reported through IOASID_GET_INFO.
> > *
> > * Return: 0 if not supported, 1 if supported.
> > */
> > #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0)
>
>
> > /*
> > * Register user space memory where DMA is allowed.
> > *
> > * It pins user pages and does the locked memory accounting so sub-
> > * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> > *
> > * When this ioctl is not used, one user page might be accounted
> > * multiple times when it is mapped by multiple IOASIDs which are
> > * not nested together.
> > *
> > * Input parameters:
> > * - vaddr;
> > * - size;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE,
> IOASID_BASE + 2)
>
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?

yes.

>
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
>
> IMHO this should be merged with the all SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then

Agree. Regarding uAPI there is no difference between SW IOASID and
HW IOASID. The main difference is behind /dev/ioasid, that SW IOASID
is not linked to the IOMMU.

> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.
>
> Either way this seems like a smart direction
>
> > /*
> > * Allocate an IOASID.
> > *
> > * IOASID is the FD-local software handle representing an I/O address
> > * space. Each IOASID is associated with a single I/O page table. User
> > * must call this ioctl to get an IOASID for every I/O address space that is
> > * intended to be enabled in the IOMMU.
> > *
> > * A newly-created IOASID doesn't accept any command before it is
> > * attached to a device. Once attached, an empty I/O page table is
> > * bound with the IOMMU then the user could use either DMA mapping
> > * or pgtable binding commands to manage this I/O page table.
>
> Can the IOASID can be populated before being attached?
>
> > * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> > *
> > * Return: allocated ioasid on success, -errno on failure.
> > */
> > #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)
>
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?

I'll skip below /dev/ioasid uAPI comments about alloc/bind. It's already
covered in other sub-threads.

[...]

> >
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++
>
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?

Exactly

>
> > /*
> > * Bind a vfio_device to the specified IOASID fd
> > *
> > * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> > * vfio device should not be bound to multiple ioasid_fd's.
> > *
> > * Input parameters:
> > * - ioasid_fd;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
>
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

As chatted earlier, either an input or output "device id" is fine here.

>
> >
> > 2.3. KVM uAPI
> > ++++++++++++
> >
> > /*
> > * Update CPU PASID mapping
> > *
> > * This is necessary when ENQCMD will be used in the guest while the
> > * targeted device doesn't accept the vPASID saved in the CPU MSR.
> > *
> > * This command allows user to set/clear the vPASID->pPASID mapping
> > * in the CPU, by providing the IOASID (and FD) information representing
> > * the I/O address space marked by this vPASID.
> > *
> > * Input parameters:
> > * - user_pasid;
> > * - ioasid_fd;
> > * - ioasid;
> > */
> > #define KVM_MAP_PASID _IO(KVMIO, 0xf0)
> > #define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)
>
> It seems simple enough.. So the physical PASID can only be assigned if
> the user has an IOASID that points at it? Thus it is secure?

Yes. The kernel doesn't trust user to provide a random physical PASID.

>
> > 3. Sample structures and helper functions
> >
> > Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> >
> > struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> > int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev
> *dev);
> > int ioasid_unregister_device(struct ioasid_dev *dev);
> >
> > An ioasid_ctx is created for each fd:
> >
> > struct ioasid_ctx {
> > // a list of allocated IOASID data's
> > struct list_head ioasid_list;
>
> Would expect an xarray
>
> > // a list of registered devices
> > struct list_head dev_list;
>
> xarray of device_id

list of ioasid_dev objects. device_id will be put inside each object.

>
> > // a list of pre-registered virtual address ranges
> > struct list_head prereg_list;
>
> Should re-use the existing SW IOASID table, and be an interval tree.

What is the existing SW IOASID table?

>
> > Each registered device is represented by ioasid_dev:
> >
> > struct ioasid_dev {
> > struct list_head next;
> > struct ioasid_ctx *ctx;
> > // always be the physical device
> > struct device *device;
> > struct kref kref;
> > };
> >
> > Because we assume one vfio_device connected to at most one ioasid_fd,
> > here ioasid_dev could be embedded in vfio_device and then linked to
> > ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> > device should be the pointer to the parent device. PASID marking this
> > mdev is specified later when VFIO_ATTACH_IOASID.
>
> Don't embed a struct like this in something with vfio_device - that
> just makes a mess of reference counting by having multiple krefs in
> the same memory block. Keep it as a pointer, the attach operation
> should return a pointer to the above struct.

OK. Also based on the agreement that one device can bind to multiple
fd's, this struct embed approach also doesn't work then.

>
> > An ioasid_data is created when IOASID_ALLOC, as the main object
> > describing characteristics about an I/O page table:
> >
> > struct ioasid_data {
> > // link to ioasid_ctx->ioasid_list
> > struct list_head next;
> >
> > // the IOASID number
> > u32 ioasid;
> >
> > // the handle to convey iommu operations
> > // hold the pgd (TBD until discussing iommu api)
> > struct iommu_domain *domain;
>
> But at least for the first coding draft I would expect to see this API
> presented with no PASID support and a simple 1:1 with iommu_domain.
> How
> PASID gets modeled is the big TBD, right?

yes. As the starting point we will assume 1:1 association. This should
work for PF/VF. But very soon mdev must be considered. I expect
we can start conversation on PASID support once this uAPI proposal
is settled down.

>
> > ioasid_data and iommu_domain have overlapping roles as both are
> > introduced to represent an I/O address space. It is still a big TBD how
> > the two should be corelated or even merged, and whether new iommu
> > ops are required to handle RID+PASID explicitly.
>
> I think it is OK that the uapi and kernel api have different
> structs. The uapi focused one should hold the uapi related data, which
> is what you've shown here, I think.
>
> > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> >
> > struct attach_info {
> > u32 ioasid;
> > // If valid, the PASID to be used physically
> > u32 pasid;
> > };
> > int ioasid_device_attach(struct ioasid_dev *dev,
> > struct attach_info info);
> > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
>
> Honestly, I still prefer this to be highly explicit as this is where
> all device driver authors get invovled:
>
> ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev,
> u32 ioasid);
> ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid,
> struct ioasid_dev *dev, u32 ioasid);

Then better naming it as pci_device_attach_ioasid since the 1st parameter
is struct pci_device?

By keeping physical_pasid as a pointer, you want to remove the last helper
function (ioasid_get_global_pasid) so the global pasid is returned along
with the attach function?

>
> And presumably a variant for ARM non-PCI platform (?) devices.
>
> This could boil down to a __ioasid_device_attach() as you've shown.
>
> > A new object is introduced and linked to ioasid_data->attach_data for
> > each successful attach operation:
> >
> > struct ioasid_attach_data {
> > struct list_head next;
> > struct ioasid_dev *dev;
> > u32 pasid;
> > }
>
> This should be returned as a pointer and detatch should be:
>
> int ioasid_device_detach(struct ioasid_attach_data *);

ok

>
> > As explained in the design section, there is no explicit group enforcement
> > in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> > implicit group check - before every device within an iommu group is
> > attached to this IOASID, the previously-attached devices in this group are
> > put in ioasid_data->partial_devices. The IOASID rejects any command if
> > the partial_devices list is not empty.
>
> It is simple enough. Would be good to design in a diagnostic string so
> userspace can make sense of the failure. Eg return something like
> -EDEADLK and provide an ioctl 'why did EDEADLK happen' ?
>

Make sense.

>
> > Then is the last helper function:
> > u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> > u32 ioasid, bool alloc);
> >
> > ioasid_get_global_pasid is necessary in scenarios where multiple devices
> > want to share a same PASID value on the attached I/O page table (e.g.
> > when ENQCMD is enabled, as explained in next section). We need a
> > centralized place (ioasid_data->pasid) to hold this value (allocated when
> > first called with alloc=true). vfio device driver calls this function (alloc=
> > true) to get the global PASID for an ioasid before calling ioasid_device_
> > attach. KVM also calls this function (alloc=false) to setup PASID translation
> > structure when user calls KVM_MAP_PASID.
>
> When/why would the VFIO driver do this? isn't this just some varient
> of pasid_attach?
>
> ioasid_pci_device_enqcmd_attach(struct pci_device *pdev, u32
> *physical_pasid, struct ioasid_dev *dev, u32 ioasid);
>
> ?

will adopt this way.

>
> > 4. PASID Virtualization
> >
> > When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> > created on the assigned vfio device. This leads to the concepts of
> > "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> > by the guest to mark an GVA address space while pPASID is the one
> > selected by the host and actually routed in the wire.
> >
> > vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
>
> Should the vPASID programmed into the IOASID before calling
> VFIO_ATTACH_IOASID?

No. As explained in earlier reply, when multiple devices are attached
to the same IOASID the guest may link the page table to different
vPASID# cross attached devices. Anyway vPASID is a per-RID thing.

>
> > vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> > device, with two factors to be considered:
> >
> > - Whether vPASID is directly used (vPASID==pPASID) in the wire, or
> > should be instead converted to a newly-allocated one (vPASID!=
> > pPASID);
> >
> > - If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
> > space or a global PASID space (implying sharing pPASID cross devices,
> > e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
> > as part of the process context);
>
> This whole section is 4 really confusing. I think it would be more
> understandable to focus on the list below and minimize the vPASID
>
> > The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> > supported. There are three possible scenarios:
> >
> > (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> > policies.)
>
> This has become unclear. I think this should start by identifying the
> 6 main type of devices and how they can use pPASID/vPASID:
>
> 0) Device is a RID and cannot issue PASID
> 1) Device is a mdev and cannot issue PASID
> 2) Device is a mdev and programs a single fixed PASID during bind,
> does not accept PASID from the guest

There are no vPASID per se in above 3 types. So this section only
focus on the latter 3 types. But I can include them in next version
if it sets the tone clearer.

>
> 3) Device accepts any PASIDs from the guest. No
> vPASID/pPASID translation is possible. (classic vfio_pci)
> 4) Device accepts any PASID from the guest and has an
> internal vPASID/pPASID translation (enhanced vfio_pci)

what is enhanced vfio_pci? In my writing this is for mdev
which doesn't support ENQCMD

> 5) Device accepts and PASID from the guest and relys on
> external vPASID/pPASID translation via ENQCMD (Intel SIOV mdev)
>
> 0-2 don't use vPASID at all
>
> 3-5 consume a vPASID but handle it differently.
>
> I think the 3-5 map into what you are trying to explain in the table
> below, which is the rules for allocating the vPASID depending on which
> of device types 3-5 are present and or mixed.

Exactly

>
> For instance device type 3 requires vPASID == pPASID because it can't
> do translation at all.
>
> This probably all needs to come through clearly in the /dev/ioasid
> interface. Once the attached devices are labled it would make sense to
> have a 'query device' /dev/ioasid IOCTL to report the details based on
> how the device attached and other information.

This is a good point. Another benefit of having a device label.

for 0-2 the device will report no PASID support. Although this may duplicate
with other information (e.g. PCI PASID cap), this provides a vendor-agnostic
way for reporting details around IOASID.

for 3-5 the device will report PASID support. In these cases the user is
expected to always provide a vPASID.

for 5 in addition the device will report a requirement on CPU PASID
translation. For such device the user should talk to KVM to setup the PASID
mapping. This way the user doesn't need to know whether a device is
pdev or mdev. Just follows what device capability reports.

>
> > 2) mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> >
> > PASIDs are also used by kernel to mark the default I/O address space
> > for mdev, thus cannot be delegated to the guest. Instead, the mdev
> > driver must allocate a new pPASID for each vPASID (thus vPASID!=
> > pPASID) and then use pPASID when attaching this mdev to an ioasid.
>
> I don't understand this at all.. What does "PASIDs are also used by
> the kernel" mean?

Just refer to your type-2. Because PASIDs on this device are already used
by the parent driver to mark mdev, we cannot delegate the per-RID space
to the guest.

>
> > The mdev driver needs cache the PASID mapping so in mediation
> > path vPASID programmed by the guest can be converted to pPASID
> > before updating the physical MMIO register.
>
> This is my scenario #4 above. Device and internally virtualize
> vPASID/pPASID - how that is done is up to the device. But this is all
> just labels, when such a device attaches, it should use some specific
> API:
>
> ioasid_pci_device_vpasid_attach(struct pci_device *pdev,
> u32 *physical_pasid, u32 *virtual_pasid, struct ioasid_dev *dev, u32 ioasid);

yes.

>
> And then maintain its internal translation
>
> > In previous thread a PASID range split scheme was discussed to support
> > this combination, but we haven't worked out a clean uAPI design yet.
> > Therefore in this proposal we decide to not support it, implying the
> > user should have some intelligence to avoid such scenario. It could be
> > a TODO task for future.
>
> It really just boils down to how to allocate the PASIDs to get around
> the bad viommu interface that assumes all PASIDs are usable by all
> devices.

viommu (e.g. Intel VT-d) has good interface to restrict how many PASIDs
are available to the guest. There is a PASID size filed in the viommu
register. Here the puzzle is just about how to design a good uAPI to
handle this mixed scenario where vPASID/pPASID are in split range and
must be linked to the same I/O page table together.

I'll see whether this can be afforded after addressing other comments
in this section.

>
> > In spite of those subtle considerations, the kernel implementation could
> > start simple, e.g.:
> >
> > - v==p for pdev;
> > - v!=p and always use a global PASID pool for all mdev's;
>
> Regardless all this mess needs to be hidden from the consuming drivers
> with some simple APIs as above. The driver should indicate what its HW
> can do and the PASID #'s that magically come out of /dev/ioasid should
> be appropriate.
>

Yes, I see how it should work now.

Thanks
Kevin