RE: Plan for /dev/ioasid RFC v2

From: Tian, Kevin
Date: Fri Jun 25 2021 - 06:27:26 EST


Hi, Alex/Joerg/Jason,

I'd like to draw your attention to the updated proposal below. Let's see
whether there is a converged direction to move forward. 😊

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, June 19, 2021 2:23 AM
>
> On Fri, Jun 18, 2021 at 04:57:40PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Friday, June 18, 2021 8:20 AM
> > >
> > > On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
> > >
> > > > I've referred to this as a limitation of type1, that we can't put
> > > > devices within the same group into different address spaces, such as
> > > > behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
> > > > As isolation support improves we see fewer multi-device groups, this
> > > > scenario becomes the exception. Buy better hardware to use the
> devices
> > > > independently.
> > >
> > > This is basically my thinking too, but my conclusion is that we should
> > > not continue to make groups central to the API.
> > >
> > > As I've explained to David this is actually causing functional
> > > problems and mess - and I don't see a clean way to keep groups central
> > > but still have the device in control of what is happening. We need
> > > this device <-> iommu connection to be direct to robustly model all
> > > the things that are in the RFC.
> > >
> > > To keep groups central someone needs to sketch out how to solve
> > > today's mdev SW page table and mdev PASID issues in a clean
> > > way. Device centric is my suggestion on how to make it clean, but I
> > > haven't heard an alternative??
> > >
> > > So, I view the purpose of this discussion to scope out what a
> > > device-centric world looks like and then if we can securely fit in the
> > > legacy non-isolated world on top of that clean future oriented
> > > API. Then decide if it is work worth doing or not.
> > >
> > > To my mind it looks like it is not so bad, granted not every detail is
> > > clear, and no code has be sketched, but I don't see a big scary
> > > blocker emerging. An extra ioctl or two, some special logic that
> > > activates for >1 device groups that looks a lot like VFIO's current
> > > logic..
> > >
> > > At some level I would be perfectly fine if we made the group FD part
> > > of the API for >1 device groups - except that complexifies every user
> > > space implementation to deal with that. It doesn't feel like a good
> > > trade off.
> > >
> >
> > Would it be an acceptable tradeoff by leaving >1 device groups
> > supported only via legacy VFIO (which is anyway kept for backward
> > compatibility), if we think such scenario is being deprecated over
> > time (thus little value to add new features on it)? Then all new
> > sub-systems including vdpa and new vfio only support singleton
> > device group via /dev/iommu...
>
> That might just be a great idea - userspace has to support those APIs
> anyhow, if it can be made trivially obvious to use this fallback even
> though /dev/iommu is available it is a great place to start. It also
> means PASID/etc are naturally blocked off.
>
> Maybe years down the road we will want to harmonize them, so I would
> still sketch it out enough to be confident it could be implemented..
>

First let's align on the high-level goal of supporting multi-device groups
via the IOMMU fd. Based on previous discussions I feel it's fair to say that
we will not provide new features beyond what the vfio group delivers today,
which implies:

1) All devices within the group must share the same address space.

Though it's possible to support multiple address spaces in some cases
(e.g. when grouping is only due to lack of ACS), there are scenarios
(DMA aliasing, RID sharing, etc.) where a single address space is
mandatory. The effort to support multiple spaces is not worthwhile
given that isolation support keeps improving over time.

2) It's not necessary to bind all devices within the group to the IOMMU fd.

The other devices could be left driver-less, or bound to a known driver
which doesn't do DMA. This implies that a group viability mechanism must
be in place which can identify when the group is viable for operation
and BUG_ON() when viability is broken due to user action.

3) The user must be denied access to a device before its group is attached
to a known security context.

If the above goals are agreed, below is the updated proposal for supporting
multi-device groups via a device-centric API. Most ideas come from Jason;
here I try to expand and compose them into a full picture.

In general:

- vfio keeps the existing uAPI sequence, with slightly different semantics:

a) VFIO_GROUP_SET_CONTAINER, as today

b) VFIO_SET_IOMMU with a new iommu type (VFIO_EXTERNAL_IOMMU)
which, once set, tells VFIO not to establish its own security
context.

c) VFIO_GROUP_GET_DEVICE_FD_NEW, carrying additional info
about the external iommu driver (iommu_fd, device_cookie). This
call automatically binds the device to iommu_fd. The device fd is
returned to the user only after successful binding, which implies
that a security context (BLOCK_DMA) has been established for the
entire group. Since the security context is managed by iommu_fd,
the group viability check should be done in the iommu layer, making
the vfio_group_viable() mechanism redundant in this case. (A rough
userspace sketch of this sequence follows below.)
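
To make the sequence concrete, here is a rough userspace sketch. Note
that VFIO_EXTERNAL_IOMMU, VFIO_GROUP_GET_DEVICE_FD_NEW and the struct
layout are only proposed in this mail and do not exist in today's
<linux/vfio.h>; the group path and device name below are placeholders,
and error handling is omitted.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vfio.h>

struct vfio_device_get_fd_new {		/* hypothetical layout */
	__u32 argsz;
	__u32 flags;
	__s32 iommu_fd;			/* /dev/iommu fd to bind against */
	__u64 device_cookie;		/* user cookie identifying the device */
	char  name[64];			/* e.g. "0000:03:00.0" */
};

int open_device(const char *group_path, const char *dev_name, __u64 cookie)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open(group_path, O_RDWR);
	int iommu_fd = open("/dev/iommu", O_RDWR);
	struct vfio_device_get_fd_new get = {
		.argsz = sizeof(get),
		.iommu_fd = iommu_fd,
		.device_cookie = cookie,
	};

	strncpy(get.name, dev_name, sizeof(get.name) - 1);

	/* a) unchanged */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* b) new iommu type: vfio does not set up its own security context */
	ioctl(container, VFIO_SET_IOMMU, VFIO_EXTERNAL_IOMMU);

	/*
	 * c) the device fd is returned only after the device is bound to
	 *    iommu_fd, i.e. after the whole group entered BLOCK_DMA
	 */
	return ioctl(group, VFIO_GROUP_GET_DEVICE_FD_NEW, &get);
}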

- When receiving the binding call for the 1st device in a group, iommu_fd
calls iommu_group_set_block_dma(group, dev->driver) which does
several things (sketched in code after this list):

a) Check group viability. A group is viable only when all devices in
the group are in one of the below states:

* driver-less
* bound to a driver which is the same as dev->driver (vfio in this case)
* bound to an otherwise allowed driver (same list as in vfio)

b) Set block_dma flag for the group and configure the IOMMU to block
DMA for all devices in this group. This could be done by attaching to
a dedicated iommu domain (IOMMU_DOMAIN_BLOCKED) which has
an empty page table.

c) The iommu layer also verifies group viability on the
BUS_NOTIFY_BOUND_DRIVER event. BUG_ON() if viability is broken
while block_dma is set.
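
A minimal kernel-side sketch of this flow, assuming hypothetical
helpers (iommu_group_check_viability(), iommu_group_attach_blocked_domain())
and a block_dma flag whose exact home is not decided here:

static int iommu_group_set_block_dma(struct iommu_group *group,
				     struct device_driver *drv)
{
	int ret;

	/*
	 * a) viable only if every device in the group is driver-less,
	 *    bound to the same driver as the caller (vfio here), or
	 *    bound to an otherwise allowed driver
	 */
	ret = iommu_group_check_viability(group, drv);	/* hypothetical */
	if (ret)
		return ret;

	/*
	 * b) attach the whole group to a dedicated IOMMU_DOMAIN_BLOCKED
	 *    domain with an empty page table so all DMA is rejected
	 */
	ret = iommu_group_attach_blocked_domain(group);	/* hypothetical */
	if (ret)
		return ret;

	/*
	 * c) from now on the BUS_NOTIFY_BOUND_DRIVER notifier re-checks
	 *    viability and BUG_ON()s if it is broken while block_dma is set
	 */
	group->block_dma = true;
	return 0;
}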

- Binding other devices in the group to iommu_fd just succeeds since
the group is already in block_dma.

- When a group is in the block_dma state, all devices in the group (even
those not bound to iommu_fd) switch together between the blocked domain
and an IOASID domain, initiated by attaching to or detaching from an
IOASID. (A sketch of this accounting follows after the list below.)

a) iommu_fd verifies that all bound devices in the same group are
attached to a single IOASID.

b) the 1st device attach in the group calls iommu API to move the
entire group to use the new IOASID domain.

c) the last device detach calls iommu API to move the entire group
back to the blocked domain.
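
A sketch of how iommu_fd could track this per group; the ioasid_fd_*
types/helpers and the domain-switch helpers are illustrative only, not
existing APIs:

static int ioasid_fd_attach_device(struct ioasid_ctx *ctx,
				   struct ioasid_data *ioasid,
				   struct device *dev)
{
	struct ioasid_fd_group *g = ioasid_fd_find_group(ctx, dev);
	int ret;

	/* a) all bound devices in one group must use a single IOASID */
	if (g->ioasid && g->ioasid != ioasid)
		return -EBUSY;

	/*
	 * b) the 1st attach switches the whole group from the blocked
	 *    domain to the IOASID domain
	 */
	if (!g->attach_count) {
		ret = iommu_group_switch_domain(g->iommu_group,
						ioasid->domain); /* hypothetical */
		if (ret)
			return ret;
		g->ioasid = ioasid;
	}
	g->attach_count++;
	return 0;
}

static void ioasid_fd_detach_device(struct ioasid_ctx *ctx, struct device *dev)
{
	struct ioasid_fd_group *g = ioasid_fd_find_group(ctx, dev);

	/* c) the last detach moves the group back to the blocked domain */
	if (!--g->attach_count) {
		iommu_group_attach_blocked_domain(g->iommu_group); /* hypothetical */
		g->ioasid = NULL;
	}
}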

- A device is allowed to be unbound from iommu_fd while other devices
in the group are still bound. In this case the group remains in the
block_dma state, thus the unbound device must not be bound to another
driver which could break group viability.

a) for vfio this unbinding is done automatically when the device fd is closed.

- When vfio requests to unbind the last device in the group, iommu_fd
calls iommu_group_unset_block_dma(group) to move the group out
of the block_dma state. Devices in the group are re-attached to the
default domain from then on. (A sketch of this unbind path follows
below.)
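
A sketch of the unbind side, again with illustrative names only;
bound_count counts devices in the group that are bound to this iommu_fd:

static void ioasid_fd_unbind_device(struct ioasid_ctx *ctx, struct device *dev)
{
	struct ioasid_fd_group *g = ioasid_fd_find_group(ctx, dev);

	/* for vfio this runs automatically when the device fd is closed */
	if (--g->bound_count)
		return;		/* other devices still bound, stay in block_dma */

	/*
	 * Last bound device gone: leave block_dma; devices in the group
	 * are re-attached to the default domain from now on.
	 */
	iommu_group_unset_block_dma(g->iommu_group);	/* counterpart of set */
	ioasid_fd_remove_group(ctx, g);			/* hypothetical */
}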

With this design all the helper functions and uAPIs are kept device-centric
in iommu_fd. It maintains minimal group knowledge internally by tracking
device binding/attaching status within each group and then calling the
proper iommu API when the group status changes.

VFIO still keeps its container/group/device semantics for backward
compatibility.

A new subsystem can completely eliminate group semantics as long as
it finds a way to finish device binding before granting the user
access to the device.

Thanks
Kevin