Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space

From: Baolu Lu
Date: Sun Jun 25 2023 - 23:11:55 EST


On 6/26/23 3:21 AM, Nicolin Chen wrote:
On Sun, Jun 25, 2023 at 02:30:46PM +0800, Baolu Lu wrote:

On 2023/5/31 2:50, Nicolin Chen wrote:
Hi Baolu,

On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:

This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. The use case is nested
translation, where modern IOMMU hardware supports two-stage translation
tables. The second-stage translation table is managed by the host VMM
while the first-stage translation table is owned by the user space.
Hence, any IO page fault that occurs on the first-stage page table
should be delivered to the user space and handled there. The user space
should then return the page fault handling result to the device, top-down,
through the IOMMUFD response uAPI.

User space indicates its capability of handling IO page faults by setting
a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
will then set up its infrastructure for page fault delivery. Together
with the iopf-capable flag, user space should also provide an eventfd
on which it will listen for any bottom-up page fault messages.

On a successful return of the allocation of iopf-capable HWPT, a fault
fd will be returned. User space can open and read fault messages from it
once the eventfd is signaled.
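
To make the delivery flow described above concrete, here is a minimal user-space sketch of "wait for the eventfd, then read the fault fd". It is not the uAPI proposed by the series: the message layout (struct demo_fault_msg), both function names, and the pipe standing in for the fault fd are assumptions made purely for illustration. Only eventfd(2), poll(2), pipe(2), read(2) and write(2) are real interfaces.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical fault message layout; the real one is defined by the series. */
struct demo_fault_msg {
	uint32_t pasid;
	uint32_t perm;
	uint64_t addr;
};

/*
 * Block (up to one second) until the eventfd is signaled, then read one
 * fault message from the fault fd. Returns 0 on success, -1 on failure.
 */
int wait_and_read_fault(int efd, int fault_fd, struct demo_fault_msg *out)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count;

	if (poll(&pfd, 1, 1000) != 1)
		return -1;
	/* Reading the eventfd clears its counter. */
	if (read(efd, &count, sizeof(count)) != sizeof(count))
		return -1;
	if (read(fault_fd, out, sizeof(*out)) != sizeof(*out))
		return -1;
	return 0;
}

/*
 * Stand-in for the kernel side: a pipe plays the role of the fault fd.
 * Queue one message, signal the eventfd, then consume it as above.
 */
int demo_fault_roundtrip(void)
{
	struct demo_fault_msg in = { .pasid = 1, .perm = 0, .addr = 0x1000 };
	struct demo_fault_msg out = { 0 };
	uint64_t one = 1;
	int pipefd[2];
	int ret = -1;
	int efd = eventfd(0, 0);

	if (efd < 0)
		return -1;
	if (pipe(pipefd) < 0) {
		close(efd);
		return -1;
	}
	if (write(pipefd[1], &in, sizeof(in)) == sizeof(in) &&
	    write(efd, &one, sizeof(one)) == sizeof(one) &&
	    wait_and_read_fault(efd, pipefd[0], &out) == 0 &&
	    out.pasid == in.pasid && out.addr == in.addr)
		ret = 0;

	close(pipefd[0]);
	close(pipefd[1]);
	close(efd);
	return ret;
}
```

Calling demo_fault_roundtrip() from a test program exercises the whole loop without any IOMMU hardware. A real VMM would obtain the fault fd from the HWPT allocation rather than from a pipe.
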
I think that, whether the guest has an IOPF capability or not,
the host should always forward any stage-1 fault/error back to
the guest. Yet, the implementation of this series builds with
the IOPF framework that doesn't report IOMMU_FAULT_DMA_UNRECOV.

And I have doubts about using the IOPF framework with that
IOMMU_PAGE_RESP_ASYNC flag: the point of using the IOPF framework
is its bottom half workqueue, because a page response could take
a long time. But adding that flag feels like we don't really
need the bottom half workqueue, i.e. losing the point of using
the IOPF framework, IMHO.

Combining the two facts above, I wonder if we really need to
go through the IOPF framework; can't we just register a user
fault handler in the iommufd directly upon a valid event_fd?
Agreed. We should avoid the workqueue in the SVA IOPF framework. Perhaps we
could go ahead with the code below? It would be registered to the device with
iommu_register_device_fault_handler() in the IOMMU_DEV_FEAT_IOPF enabling
path, and unregistered in the disable path, of course.
Well, for a virtualization use case, I still think it should
be registered in iommufd.

Emm.. you suggest iommufd calls iommu_register_device_fault_handler() to
register its own page fault handler, right?

I have a different opinion. iommu_register_device_fault_handler() is
called to register a fault handler for a device, so it should be called
or initiated by a device driver. The iommufd only needs to install a
per-domain IO page fault handler.

I am considering a use case on Intel platforms. Perhaps it's similar
on other platforms. An SIOV-capable device can support host SVA and
assign mediated devices to user space at the same time. Both host
SVA and mediated devices require IOPF, so there will be multiple places
where a page fault handler needs to be registered.
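
To illustrate the split I have in mind, here is a rough, self-contained sketch (plain C, not kernel code). One per-device entry point is registered once by the device driver, and it only routes each fault to whatever domain is attached for the faulting PASID. Every type and function name in it is hypothetical and made up for this illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; none of these are kernel types. */
struct demo_fault {
	unsigned int pasid;
	unsigned long addr;
};

struct demo_domain;
typedef int (*demo_domain_fault_fn)(struct demo_domain *domain,
				    const struct demo_fault *fault);

struct demo_domain {
	const char *owner;		/* e.g. "host-sva" or "iommufd" */
	demo_domain_fault_fn handler;	/* per-domain IO page fault handler */
	int handled;			/* number of faults seen by this domain */
};

#define DEMO_MAX_PASID 4

struct demo_device {
	/* Domain attached to each PASID of the device (NULL if none). */
	struct demo_domain *attached[DEMO_MAX_PASID];
};

/* Per-device entry point: look up the domain and forward the fault to it. */
int demo_device_fault_entry(struct demo_device *dev,
			    const struct demo_fault *fault)
{
	struct demo_domain *domain;

	if (fault->pasid >= DEMO_MAX_PASID)
		return -1;
	domain = dev->attached[fault->pasid];
	if (!domain || !domain->handler)
		return -1;
	return domain->handler(domain, fault);
}

/* Trivial per-domain handler that only counts the faults it receives. */
int demo_count_fault(struct demo_domain *domain, const struct demo_fault *fault)
{
	(void)fault;
	domain->handled++;
	return 0;
}

/* One device shared by a host SVA domain and an iommufd-owned domain. */
int demo_dispatch(void)
{
	struct demo_domain sva = { "host-sva", demo_count_fault, 0 };
	struct demo_domain user = { "iommufd", demo_count_fault, 0 };
	struct demo_device dev = { { NULL } };
	struct demo_fault f1 = { 1, 0x1000 };
	struct demo_fault f2 = { 2, 0x2000 };
	struct demo_fault f3 = { 3, 0x3000 };

	dev.attached[1] = &sva;
	dev.attached[2] = &user;

	if (demo_device_fault_entry(&dev, &f1) != 0 ||
	    demo_device_fault_entry(&dev, &f2) != 0)
		return -1;
	/* No domain is attached to PASID 3, so this fault must be rejected. */
	if (demo_device_fault_entry(&dev, &f3) != -1)
		return -1;
	return (sva.handled == 1 && user.handled == 1) ? 0 : -1;
}
```

With this shape, host SVA and iommufd each install a handler only on the domains they own, and neither needs to know about the other.
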

Even with a device that has no IOPF/PRI capability, a guest OS
should receive some faults too, if that device causes a
translation failure.

Yes. DMA faults are also a consideration. But I would like to have that
supported in a separate series. As I explained in the previous reply,
we also need to consider the software nested translation case.


And for a vSVA use case, the IOMMU_DEV_FEAT_IOPF feature only
gets enabled in the guest VM, right? How could the host enable
IOMMU_DEV_FEAT_IOPF to trigger this handler?

As mentioned above, this should be initiated by the kernel device
driver: vfio, or possibly a mediated device driver.

Best regards,
baolu