Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

From: Yi Liu
Date: Thu Dec 14 2023 - 06:24:50 EST


On 2023/11/17 21:18, Yi Liu wrote:> This adds the data structure for flushing iotlb for the nested domain
> allocated with IOMMU_HWPT_DATA_VTD_S1 type.
>
> This only supports invalidating IOTLB, but no for device-TLB as device-TLB
> invalidation will be covered automatically in the IOTLB invalidation if the
> underlying IOMMU driver has enabled ATS for the affected device.
>
> Signed-off-by: Yi Liu <yi.l.liu@xxxxxxxxx>
> ---
> include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 7f92cecc87d7..cafd98642abf 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -614,6 +614,42 @@ struct iommu_hwpt_get_dirty_bitmap {
> #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
> IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
>
> +/**
> + * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
> + * stage-1 cache invalidation
> + * @IOMMU_VTD_INV_FLAGS_LEAF: The LEAF flag indicates whether only the
> + * leaf PTE caching needs to be invalidated
> + * and other paging structure caches can be
> + * preserved.
> + */
> +enum iommu_hwpt_vtd_s1_invalidate_flags {
> + IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0,
> +};
> +
> +/**
> + * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
> + * (IOMMU_HWPT_DATA_VTD_S1)
> + * @addr: The start address of the addresses to be invalidated. It needs
> + * to be 4KB aligned.
> + * @npages: Number of contiguous 4K pages to be invalidated.
> + * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
> + * @__reserved: Must be 0
> + *
> + * The Intel VT-d specific invalidation data for user-managed stage-1 cache
> + * invalidation in nested translation. Userspace uses this structure to
> + * tell the impacted cache scope after modifying the stage-1 page table.
> + *
> + * Invalidating all the caches related to the page table by setting @addr
> + * to be 0 and @npages to be __aligned_u64(-1). This includes the
> + * corresponding device-TLB if ATS is enabled on the attached devices.
> + */
> +struct iommu_hwpt_vtd_s1_invalidate {
> + __aligned_u64 addr;
> + __aligned_u64 npages;
> + __u32 flags;
> + __u32 __reserved;
> +};
> +
> /**
> * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
> * @size: sizeof(struct iommu_hwpt_invalidate)

Hi Jason, Kevin,

Per the prior discussion[1], we agreed to move the error reporting into the
driver specific part. On Intel side, we want to report two devTLB
invalidation errors: ICE (invalid completion error) and ITE (invalidation
timeout error). Such errors have an additional SID information to tell
which device failed the devTLB invalidation. I've got the below structure.

+/**
+ * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
+ * (IOMMU_HWPT_DATA_VTD_S1)
+ * @addr: The start address of the addresses to be invalidated. It needs
+ * to be 4KB aligned.
+ * @npages: Number of contiguous 4K pages to be invalidated.
+ * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
+ * @__reserved: Must be 0.
+ * @error: One of enum iommu_hwpt_vtd_s1_invalidate_error_code
+ * @dev_id: The device in the invalidation completion message, it's meaninfful
+ * when @error_code is set to IOMMU_HWPT_VTD_S1_INVALIDATE_DEVTLB_ICE
+ * or IOMMU_HWPT_VTD_S1_INVALIDATE_DEVTLB_ITE.
+ *
+ * The Intel VT-d specific invalidation data for user-managed stage-1 cache
+ * invalidation in nested translation. Userspace uses this structure to
+ * tell the impacted cache scope after modifying the stage-1 page table.
+ *
+ * Invalidating all the caches related to the page table by setting @addr
+ * to be 0 and @npages to be U64_MAX.
+ *
+ * @error_code is meaningful only if the request is handled by kernel. This
+ * can be known by checking struct iommu_hwpt_invalidate::req_num output.
+ * @error_code only covers the errors detected by hardware after submitting
+ * the invalidation. The software detected errors would go through the normal
+ * ioctl errno.
+ */
+struct iommu_hwpt_vtd_s1_invalidate {
+ __aligned_u64 addr;
+ __aligned_u64 npages;
+ __u32 flags;
+ __u32 __reserved;
+ __u32 error;
+ __u32 dev_id;
+};

dev_id is used to report the failed device, userspace should be able to map
it to a vRID, and inject it to VM as part of ITE/ICE error.

However, I got a problem when trying to get dev_id in cache invalidation
path, since this is filled in intel iommu driver. Seems like there is no
good way for it. I've below alternatives to move forward, wish you have
a look.

- Reporting pSID instead of dev_id. This may not work if userspace for
example Qemu cen get a vfio device cdev fd from management stack. Maybe you
have different opinion, do let me know.

- Let iommufd to convert a SID info or device pointer to a dev_id, and then
report it back to userspace. This seems easiest, but breaks layer and also
requires vt-d specific logic. :(

- Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
maintained, then intel iommu can report a vRID back to user. But intel
iommu driver does not have viommu context, no place to hold the vRID->pRID
mapping. TBH. It may require other reasons to introduce it other than the
error reporting need. Anyhow, this requires more thinking and also has
dependency even if it is doable in intel side.

- Only report error code, but no device info at first. May adding the
device info (dev_id or vRID) in future series. In reality, the existing
Qemu vIOMMU does not report ICE, ITE, neither the SID to VM. Also, VT-d
spec defined the ICE/ITE errors first in 2007 spec 1.1, and added SID info
later in 2019 spec 3.1. We may do it in stage as well.

What about your opinion?

[1] https://lore.kernel.org/linux-iommu/a9699f71-805a-4a5a-9282-3ec52e5bc81a@xxxxxxxxx/

--
Regards,
Yi Liu