Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

From: Yi Liu
Date: Tue Nov 21 2023 - 22:50:16 EST


On 2023/11/22 10:32, Baolu Lu wrote:
On 11/21/23 8:17 PM, Jason Gunthorpe wrote:
On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe <jgg@xxxxxxxxxx>
Sent: Tuesday, November 21, 2023 7:05 AM

On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
From: Liu, Yi L <yi.l.liu@xxxxxxxxx>
Sent: Friday, November 17, 2023 9:18 PM

This adds the data structure for flushing iotlb for the nested domain
allocated with IOMMU_HWPT_DATA_VTD_S1 type.

This only supports invalidating IOTLB, but no for device-TLB as device-TLB
invalidation will be covered automatically in the IOTLB invalidation if the
underlying IOMMU driver has enabled ATS for the affected device.

"no for device-TLB" is misleading. Here just say that cache invalidation
request applies to both IOTLB and device TLB (if ATS is enabled ...)

I think we should forward the ATS invalidation from the guest too?
That is what ARM and AMD will have to do, can we keep them all
consistent?

I understand Intel keeps track of enough stuff to know what the RIDs
are, but is it necessary to make it different?

Perhaps ask it the other way around. The intel-iommu driver currently
always flushes the iotlb and device tlb together, so is it necessary to
separate them in the uAPI for no benefit (other than doubled syscalls)? :)

I wish I knew more about Intel CC design to be able to answer that :|

Doesn't the VM issue the ATC flush command regardless? How does it
know it has a working ATC but does not need to flush it?


The Intel VT-d spec doesn't require the driver to flush iotlb and device
tlb together.

The spec has the description below. Although it does not say that the iotlb
and device tlb should be flushed together, it does require that both be
flushed when a page is unmapped.

Chapter 6.5.2.5:
"Since translation requests-without-PASID from a device may be serviced by hardware from the
IOTLB, software must always request IOTLB invalidation (iotlb_inv_dsc) before requesting
corresponding Device-TLB (dev_tlb_inv_dsc) invalidation."
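
To make the ordering in that quote concrete, here is a minimal sketch. It
is not taken from the intel-iommu driver: struct inv_queue,
queue_unmap_invalidation() and the descriptor layout are hypothetical names
used only for illustration. It queues the IOTLB invalidation first and the
device-TLB invalidation second, and only when ATS is enabled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INV_QUEUE_CAPACITY 8

/* Hypothetical descriptor kinds, loosely mirroring iotlb_inv_dsc and
 * dev_tlb_inv_dsc from the spec text. Not the real hardware layout. */
enum inv_kind { INV_IOTLB, INV_DEV_TLB };

struct inv_desc {
	enum inv_kind kind;
	uint64_t addr;
	uint64_t npages;
};

/* Hypothetical fixed-size stand-in for an invalidation queue. */
struct inv_queue {
	struct inv_desc desc[INV_QUEUE_CAPACITY];
	size_t count;
};

/*
 * Queue the invalidations for an unmapped range. The IOTLB descriptor
 * always goes first; the device-TLB descriptor follows only when ATS is
 * enabled for the device. Returns false, queuing nothing, if the queue
 * cannot hold every descriptor that is needed.
 */
static bool queue_unmap_invalidation(struct inv_queue *q, uint64_t addr,
				     uint64_t npages, bool ats_enabled)
{
	size_t needed = ats_enabled ? 2 : 1;

	if (q->count + needed > INV_QUEUE_CAPACITY)
		return false;

	q->desc[q->count++] = (struct inv_desc){
		.kind = INV_IOTLB, .addr = addr, .npages = npages,
	};
	if (ats_enabled)
		q->desc[q->count++] = (struct inv_desc){
			.kind = INV_DEV_TLB, .addr = addr, .npages = npages,
		};
	return true;
}
```

Checking the capacity up front keeps the pair atomic in this sketch: either
both descriptors are queued in the required order or neither is.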

Therefore, the current approach of relying on caching mode
to determine whether device TLB invalidation is necessary appears to be
a performance optimization rather than an architectural requirement.

The vIOMMU driver assumes that it is running within a VM guest when
caching mode is enabled. This assumption leads to an omission of device
TLB invalidation, relying on the hypervisor to perform a combined flush
of the IOTLB and device TLB.

Yes, this is what the current intel iommu driver does. However, whether to
rely on caching mode or not is orthogonal to whether we need two uapis
here. I think the guest iommu driver could submit both iotlb and device tlb
invalidation requests, and Qemu could then skip forwarding the device tlb
invalidation request to the kernel if the kernel iommu driver already
covers the device tlb invalidation when it gets the request to invalidate
the iotlb.
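
As a rough sketch of that decision on the VMM side (the enum, the function
name and the host_iotlb_inv_covers_dev_tlb flag are all hypothetical; this
is neither QEMU code nor an existing uapi):

```c
#include <stdbool.h>

/* Hypothetical classification of an invalidation request trapped from the
 * guest's virtual invalidation queue. */
enum guest_inv_type { GUEST_INV_IOTLB, GUEST_INV_DEV_TLB };

/*
 * Decide whether the VMM forwards a guest request to the host kernel.
 * host_iotlb_inv_covers_dev_tlb is an assumed property of the host driver:
 * true when the host flushes the device TLB itself as part of handling an
 * IOTLB invalidation for an ATS-enabled device.
 */
static bool vmm_should_forward(enum guest_inv_type type,
			       bool host_iotlb_inv_covers_dev_tlb)
{
	/* IOTLB invalidations always go to the host. */
	if (type == GUEST_INV_IOTLB)
		return true;

	/* A device-TLB invalidation is redundant if the host already flushed
	 * the device TLB with the preceding IOTLB invalidation. */
	return !host_iotlb_inv_covers_dev_tlb;
}
```

With this split the guest driver stays correct on any hypervisor, and
dropping the redundant request becomes a VMM-local choice.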

While this optimization aims to reduce VMEXIT overhead, it introduces
potential issues:

- When a Linux guest is running on a hypervisor other than KVM/QEMU, the
  assumption of combined IOTLB and device TLB flushing by the hypervisor
  may be incorrect, potentially leading to missed device TLB
  invalidation.

Hmmm, this appears to be an intel iommu driver bug; it should submit both
iotlb invalidation and device tlb invalidation requests. But as above, I
think this is orthogonal here. KVM/QEMU could define its own uapi based on
its implementation to get the best fit.


- The caching mode doesn't apply to first-stage translation. Therefore,
  if the driver uses first-stage translation and still relies on caching
  mode to determine device TLB invalidation, the optimization fails.

Yes, caching mode does not apply to the first-stage translation table. But
in nested translation, the guest does not need to notify the hypervisor
when a page is unmapped, does it? So whether caching mode applies to the
first-stage translation table does not matter. TBH, I don't see a problem
for this reason. But I agree that the linux guest intel iommu driver needs
to submit both iotlb and device tlb invalidation requests to guarantee it
can work on other hypervisors. And there should be another way to do the
performance optimization.


A more reasonable optimization would be to allocate a bit in the iommu
capability registers. The vIOMMU driver could then leverage this bit to
determine whether it could eliminate a device invalidation request.
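
A minimal sketch of how a guest driver might consume such a bit follows.
The bit does not exist in the VT-d spec today; the macro name and the bit
position are made up purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical capability bit: when set, the (virtual) IOMMU promises that
 * an IOTLB invalidation also flushes the device TLB of affected devices.
 * The position is arbitrary and not defined by any spec.
 */
#define HYPOTHETICAL_CAP_DEVTLB_IMPLIED (1ULL << 0)

/*
 * Returns true when the guest driver must issue an explicit device-TLB
 * invalidation after its IOTLB invalidation.
 */
static bool guest_needs_dev_tlb_inv(uint64_t cap, bool ats_enabled)
{
	/* Without ATS the device has no translation cache to flush. */
	if (!ats_enabled)
		return false;

	/* Skip the request only when the IOMMU advertises the combined
	 * flush; otherwise stay on the architecturally safe path. */
	return !(cap & HYPOTHETICAL_CAP_DEVTLB_IMPLIED);
}
```

Unlike keying off caching mode, the default here is to send the device-TLB
invalidation, so a hypervisor that does not set the bit still gets a
correct guest.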

This may be something the spec can be enhanced with. But again, it would
just let the guest intel iommu driver gain a performance optimization while
still working on other hypervisors. As for this uapi design, considering it
within the linux ecosystem is enough.

--
Regards,
Yi Liu