Re: [PATCH v3 1/1] PCI: Add translated request only flag for pci_enable_pasid()

From: Baolu Lu
Date: Wed Feb 01 2023 - 01:11:57 EST


On 2023/2/1 8:14, Bjorn Helgaas wrote:
On Tue, Jan 31, 2023 at 08:56:13PM +0800, Baolu Lu wrote:
On 2023/1/31 2:38, Bjorn Helgaas wrote:
PCI: Add translated request only flag for pci_enable_pasid()

The PCIe fabric routes Memory Requests based on the TLP address, ignoring
the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
Enable PASID only when ACS RR & UF enabled on upstream path") requires
some ACS features being supported on device's upstream path when enabling
PCI/PASID.

Looking up 201007ef707a to see what ensuring system integrity means,
it prevents Memory Requests with PASID, which should always be routed
to the RC, from being mistakenly routed as peer-to-peer requests.

Yes, exactly. I will update above paragraph like below

The PCIe fabric routes Memory Requests based on the TLP address, ignoring
the PASID. In order to prevent Memory Requests with PASID, which should
always be routed to the RC, from being mistakenly routed as peer-to-peer
requests, commit 201007ef707a ("PCI: Enable PASID only when ACS RR & UF
enabled on upstream path") requires some ACS features being supported on
device's upstream path when enabling PCI/PASID.



However, above change causes the Linux kernel boots to black screen on a
system with below graphic device:

We need a PCIe concept-level description of the issue first, i.e., in
terms of DMA, PASID, ACS, etc. Then we can mention the AMD GPU issue
as an instance.

How about below description?

Thanks, this is exactly the sort of thing I'm looking for. But my
understanding of ATS/PRI/PASID is weak, so I'm still working through
this. Tell me when I say something wrong below...

PCIe endpoints can use ATS to request DMA remapping hardware to
translate an IOVA to its mapped physical address. If the translation is
missing or the permissions are insufficient, the PRI is used to trigger
an I/O page fault. The IOMMU driver will fill the mapping with desired
permissions and return the translated address to the device.

In PCIe spec language, I think you're saying that a PCIe Function may
contain an ATC. If the ATC Capability Enable bit is set, the Function
can issue Translation Requests.

The TA (aka IOMMU) will respond with a Translation Completion. If the
Completion is a CplD, it contains the translated address and the
Function can store the entry in its ATC. I assume the I/O page fault
case corresponds to a Cpl (with no data) meaning that the TA could not
translate the address.

If the TA doesn't have a mapping with the desired permissions, and the
Function's Page Request Capability Enable bit is set, it may issue a
Page Request Message. It's up to the TA/IOMMU to make this message
visible to the OS, which can make the page resident, create an IOMMU
mapping, and enable a PRG Response Message. After the Function
receives the PRG Response Message, it would issue another Translation
Request.

The translated address is specified by the IOMMU driver. The IOMMU
driver ensures that the address is a DMA buffer address instead of any
P2P address in the PCI fabric. Therefore, any translated memory request
will eventually be routed to IOMMU regardless of whether there is ACS
control in the up-streaming path.

A Memory Request with an address that is not a P2P address, i.e., it
is not contained in any bridge aperture, will *always* be routed
toward the RC, won't it? Isn't that the case regardless of whether
the address is translated or untranslated, and even regardless of ACS?

IIUC, ACS basically causes peer-to-peer requests to be routed upstream
instead of directly to the peer.

OK, reading this again, I realize that I just restated exactly what
you had already written, sorry about that.

Never mind. It's really a good chance for me to learn how to describe
this from the perspective of the PCI subsystem.. :-)


AMD GPU is one of those devices.

I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities?
And furthermore, that the GPU *always* uses Translated addresses with
PASID?

So I guess what's going on here is that if:

- A device only uses PASID with Translated addresses, and
- those Translated addresses are never P2P addresses, then
- those transactions will always be routed to the RC.

And this applies even if there is no ACS or ACS doesn't support
PCI_ACS_RR and PCI_ACS_UF.

Yes.


The black screen happens because ... ?

What can we include in the commit log to help people find this fix? I
see these in the bugzilla:

WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80
WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50

(These look like defects in pdev_pri_ats_enable(), so really just
distractions)

kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
BUG: kernel NULL pointer dereference, address: 0000000000000058
RIP: 0010:report_iommu_fault+0x11/0x90

I couldn't figure out the NULL pointer dereference. I expected it to
be from a BUG() or similar in report_iommu_fault(), but I don't see
that.

Above has been answered by Vasant Hegde in a separated reply.

Best regards,
baolu