Re: Question about reserved_regions w/ Intel IOMMU

From: Robin Murphy
Date: Thu Jun 08 2023 - 14:02:19 EST


On 2023-06-08 18:10, Alexander Duyck wrote:
On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <ashok_raj@xxxxxxxxxxxxxxx> wrote:

On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx> wrote:

On 6/8/23 7:03 AM, Alexander Duyck wrote:
On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
<alexander.duyck@xxxxxxxxx> wrote:

I am running into a DMA issue that appears to be a conflict between
ACS and IOMMU. As per the documentation I can find, the IOMMU is
supposed to create reserved regions for MSI and the memory window
behind the root port. However looking at reserved_regions I am not
seeing that. I only see the reservation for the MSI.

So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
# cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
0x00000000fee00000 0x00000000feefffff msi

Shouldn't there also be a memory window for the region behind the root
port to prevent any possible peer-to-peer access?
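
For anyone wanting to script this check, here is a minimal Python sketch, assuming only the three-column "start end type" layout shown in the output above. The helper names and the 0xc0000000 window used in the usage note are hypothetical and purely for illustration; they are not taken from the system under test.

```python
def parse_reserved_regions(text):
    """Parse '<start> <end> <type>' lines into (start, end, type) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            # Skip blank or unexpected lines rather than guessing at them.
            continue
        start, end, kind = fields
        # int(..., 16) accepts the leading "0x" prefix.
        regions.append((int(start, 16), int(end, 16), kind))
    return regions


def is_covered(regions, start, end):
    """Return True if the inclusive range [start, end] lies inside one region."""
    return any(r_start <= start and end <= r_end
               for r_start, r_end, _kind in regions)


# Sample text copied from the reserved_regions output quoted above.
sample = "0x00000000fee00000 0x00000000feefffff msi\n"
regions = parse_reserved_regions(sample)
```

With only the MSI entry present, a call such as is_covered(regions, 0xc0000000, 0xc00fffff) for a made-up bridge window returns False, which matches the observation that the window is not listed.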

Since the iommu portion of the email bounced I figured I would fix
that and provide some additional info.

I added some instrumentation to the kernel to dump the resources found
in iova_reserve_pci_windows. From what I can tell it is finding the
correct resources for the Memory and Prefetchable regions behind the
root port. It seems to be calling reserve_iova which is successfully
allocating an iova to reserve the region.

However still no luck on why it isn't showing up in reserved_regions.

Perhaps I can ask the opposite question: why should it show up in
reserved_regions? Why does the iommu subsystem block any possible
peer-to-peer DMA access? Isn't that a decision for the device driver?

The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
which is not related to peer-to-peer accesses.

The problem is if the IOVA overlaps with the physical addresses of
other devices that can be routed to via ACS redirect. As such if ACS
redirect is enabled a host IOVA could be directed to another device on
the switch instead. To prevent that we need to reserve those addresses
to avoid address space collisions.
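
The collision described above amounts to an interval-overlap test between an IOVA range and the physical ranges of peer BARs. A minimal Python sketch follows, assuming inclusive [start, end] ranges. The function names and the BAR addresses are hypothetical placeholders and not values from the failing system.

```python
def ranges_overlap(a_start, a_end, b_start, b_end):
    """Return True if inclusive ranges [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def colliding_bars(iova_start, length, peer_bars):
    """List the names of peer BARs whose physical range overlaps the IOVA range."""
    iova_end = iova_start + length - 1
    return [name for name, (bar_start, bar_end) in peer_bars.items()
            if ranges_overlap(iova_start, iova_end, bar_start, bar_end)]


# Hypothetical BAR layout for a neighboring device behind the same switch.
peer_bars = {"peer BAR0": (0xc8000000, 0xc80fffff)}
```

Any IOVA range for which colliding_bars() returns a non-empty list is one that a switch with ACS redirect could end up routing to the neighbor instead of the host, which is why those ranges would need to be kept out of the IOVA allocator.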

Our test case is just to perform DMA to/from the host on one device on
a switch, and what we are seeing is that when we hit an IOVA that
matches up with the physical address of the neighboring device's BAR0,
we see an AER followed by a hot reset.

Any untranslated address from a device must be forwarded to the IOMMU when
ACS is enabled, correct? I guess if you want true p2p, then you would need
to map so that the hpa turns into the peer address, but it's always a round
trip to the IOMMU.

This assumes all parts are doing the Request Redirect "correctly". In
our case there is a PCIe switch we are trying to debug and we have a
few working theories. One concern I have is that the switch may be
throwing an ACS violation for us using an address that matches a
neighboring device instead of redirecting it to the upstream port. If
we pull the switch and just run on the root complex the issue seems to
be resolved so I started poking into the code which led me to the
documentation pointing out what is supposed to be reserved based on
the root complex and MSI regions.

As a part of going down that rabbit hole I realized that the
reserved_regions seems to only list the MSI reservation. However after
digging a bit deeper it seems like there is code to reserve the memory
behind the root complex in the IOVA but it doesn't look like that is
visible anywhere and is the piece I am currently trying to sort out.
What I am working on is trying to figure out if the system that is
failing is actually reserving that memory region in the IOVA, or if
that is somehow not happening in our test setup.

How old's the kernel? Before 5.11, intel-iommu wasn't hooked up to
iommu-dma so didn't do quite the same thing - it only reserved whatever
specific PCI memory resources existed at boot, rather than the whole
window as iommu-dma does. Either way, ftrace on reserve_iova() (or just
whack a print in there) should suffice to see what's happened.

Robin.