Re: [PATCH RFC v2 00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

From: Jason Wang
Date: Tue Aug 11 2020 - 23:28:47 EST



On 2020/8/10 3:32 PM, Tian, Kevin wrote:
From: Jason Gunthorpe <jgg@xxxxxxxxxx>
Sent: Friday, August 7, 2020 8:20 PM

On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:

If you see this as an abuse of the framework, then let's identify those
specific issues and come up with a better approach. As we've discussed
before, things like basic PCI config space emulation are acceptable
overhead and low risk (imo) and some degree of register emulation is
well within the territory of an mdev driver.

What troubles me is that idxd already has a direct userspace interface
to its HW, and does userspace DMA. The purpose of this mdev is to
provide a second direct userspace interface that is a little different
and trivially plugs into the virtualization stack.

No. Userspace DMA and subdevice passthrough (what mdev provides)
are two distinct usages IMO (at least in idxd context), and this might
be the main divergence between us, so let me elaborate here.
If we could reach consensus on this matter, which direction to go
would be clearer.

First, a passthrough interface has some unique requirements
which are not commonly observed in a userspace DMA interface, e.g.:

- Tracking DMA dirty pages for live migration;
- A set of interfaces for using SVA inside guest;
    * PASID allocation/free (on some platforms);
    * bind/unbind guest mm/page table (nested translation);
    * invalidate IOMMU cache/iotlb for guest page table changes;
    * report page request from device to guest;
    * forward page response from guest to device;
- Configuring irqbypass for posted interrupt;
- ...

Second, a passthrough interface requires delegating raw controllability
of the subdevice to the guest driver, while the same delegation might not
be required for implementing a userspace DMA interface (especially for
modern devices which support SVA). For example, idxd allows the following
settings per wq (the guest driver may configure them in any combination):
- put in dedicated or shared mode;
- enable/disable SVA;
- associate guest-provided PASID with MSI/IMS entry;
- set threshold;
- allow/deny privileged access;
- allocate/free interrupt handle (enlightened for guest);
- collect error status;
- ...
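
As a rough illustration of what "any combination" means for the host side, here is a minimal sketch in C. All names (struct wq_config, wq_config_from_guest, the size field) are hypothetical and do not come from the idxd driver; the one validation rule shown (threshold must not exceed the wq size) is an assumption chosen only to show where a host-side check would sit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model for illustration; not the idxd driver's real types. */
enum wq_mode { WQ_MODE_DEDICATED, WQ_MODE_SHARED };

struct wq_config {
    enum wq_mode mode;   /* dedicated or shared */
    bool sva;            /* SVA enabled or disabled */
    bool priv;           /* privileged access allowed or denied */
    uint32_t threshold;  /* wq threshold */
    uint32_t size;       /* number of wq entries (assumed, fixed by the host) */
};

/*
 * Passthrough case: the guest driver proposes a full set of settings and
 * the host applies them as a unit. The host does not choose the values;
 * it only rejects requests it cannot honor. The single check below is an
 * assumed example of such a rule.
 */
int wq_config_from_guest(struct wq_config *wq, const struct wq_config *req)
{
    if (req->threshold > wq->size)
        return -1;               /* reject, leave current settings alone */

    wq->mode = req->mode;
    wq->sva = req->sva;
    wq->priv = req->priv;
    wq->threshold = req->threshold;
    /* wq->size stays under host control in this sketch. */
    return 0;
}
```

The point of the sketch is only that every field is guest-settable, so the interface has to carry all of them rather than one fixed profile.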

We plan to support idxd userspace DMA with SVA. The driver just needs
to prepare a wq with a predefined configuration (e.g. shared, SVA,
etc.), bind the process mm to IOMMU (non-nested) and then map
the portal to userspace. Letting userspace do DMA to the
associated wq doesn't change the fact that the wq is still *owned*
and *controlled* by the kernel driver. However, as far as passthrough
is concerned, the wq is considered 'owned' by the guest driver, thus
we need an interface which can support low-level *controllability*
from the guest driver. Mixing the two together would make something
of a mess of the uAPI.
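
The ownership distinction above can be sketched as a small state model in C. Everything here is hypothetical (struct wq_state, the actor enum, the helper names); in particular, the strict prepare/bind/map ordering is an assumption made for the sketch, not something taken from the idxd code.

```c
#include <stdbool.h>

/* Hypothetical model for illustration; not real kernel code. */
enum wq_owner { WQ_OWNER_KERNEL, WQ_OWNER_GUEST };
enum wq_actor { ACTOR_KERNEL_DRIVER, ACTOR_USER_PROCESS, ACTOR_GUEST_DRIVER };

struct wq_state {
    enum wq_owner owner;
    bool configured;     /* predefined configuration applied */
    bool mm_bound;       /* process mm bound to the IOMMU (non-nested) */
    bool portal_mapped;  /* portal mapped into userspace */
};

/* Step 1: the kernel driver prepares the wq with a predefined config. */
int wq_prepare(struct wq_state *wq)
{
    wq->owner = WQ_OWNER_KERNEL;
    wq->configured = true;
    return 0;
}

/* Step 2: bind the process mm; assumed to require a prepared wq. */
int wq_bind_mm(struct wq_state *wq)
{
    if (!wq->configured)
        return -1;
    wq->mm_bound = true;
    return 0;
}

/* Step 3: map the portal; assumed to require a bound mm. */
int wq_map_portal(struct wq_state *wq)
{
    if (!wq->mm_bound)
        return -1;
    wq->portal_mapped = true;
    return 0;
}

/* Work submission only needs the mapped portal. */
bool wq_may_submit(const struct wq_state *wq)
{
    return wq->portal_mapped;
}

/* Reconfiguration is reserved for whoever owns the wq. */
bool wq_may_reconfigure(const struct wq_state *wq, enum wq_actor who)
{
    if (wq->owner == WQ_OWNER_KERNEL)
        return who == ACTOR_KERNEL_DRIVER;
    return who == ACTOR_GUEST_DRIVER;
}
```

In this model a user process can submit work after step 3 but can never reconfigure the wq, while flipping the owner to the guest moves the reconfiguration right to the guest driver, which is the part a userspace DMA uAPI does not need to express.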


So userspace drivers like DPDK can use both of the two uAPIs?



Based on the above two reasons, we see distinct requirements between
userspace DMA and passthrough interfaces, at least in idxd context
(though other devices may have less distinction in-between). Therefore,
we didn't see the value/necessity of reinventing the wheel that mdev
already handles well to evolve a simple application-oriented userspace
DMA interface to a complex guest-driver-oriented passthrough interface.
The complexity of doing so would incur far more kernel-side changes
than the portion of emulation code that you've been concerned about...

I don't think VFIO should be the only entry point to
virtualization. If we say the universe of devices doing user space DMA
must also implement a VFIO mdev to plug into virtualization then it
will be a lot of mdevs.

Certainly VFIO will not be the only entry point, and this has to be a
case-by-case decision.


The problem is that if we tie all controls to the VFIO uAPI, other subsystems like vDPA are likely to duplicate them. I wonder if there is a way to decouple vSVA from the VFIO uAPI?


If a userspace DMA interface can be easily
adapted to be a passthrough one, it might be the choice.


It's not that easy even for VFIO, which required a lot of new uAPIs and infrastructure (e.g. mdev) to be invented.


But for idxd,
we see mdev as a much better fit here, given the big difference between
what userspace DMA requires and what the guest driver requires in this hw.


A weak point of mdev is that it can't serve kernel subsystems other than VFIO. In that case, you need some other infrastructure (like [1]) to do this.

(For idxd, you probably don't need this, but it's pretty common in the case of networking or storage devices.)

Thanks

[1] https://patchwork.kernel.org/patch/11280547/



I would prefer to see that the existing userspace interface have the
extra needed bits for virtualization (eg by having appropriate
internal kernel APIs to make this easy) and all the emulation to build
the synthetic PCI device be done in userspace.

In the end what decides the direction is the amount of changes that
we have to put in the kernel, not whether we call it 'emulation'. For idxd,
adding special passthrough requirements (guest SVA, dirty tracking,
etc.) and raw controllability to the simple userspace DMA interface
would for sure make the kernel more complex than reusing the mdev
framework (plus some degree of emulation mockup behind it). Not to
mention the merit of uAPI compatibility with mdev...

Thanks
Kevin