Re: Question about reserved_regions w/ Intel IOMMU

From: Ashok Raj
Date: Thu Jun 08 2023 - 13:52:20 EST


On Thu, Jun 08, 2023 at 10:10:54AM -0700, Alexander Duyck wrote:
> On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <ashok_raj@xxxxxxxxxxxxxxx> wrote:
> >
> > On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> > > On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > > > > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > > > > <alexander.duyck@xxxxxxxxx> wrote:
> > > > >>
> > > > >> I am running into a DMA issue that appears to be a conflict between
> > > > >> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> > > > >> supposed to create reserved regions for MSI and the memory window
> > > > >> behind the root port. However looking at reserved_regions I am not
> > > > >> seeing that. I only see the reservation for the MSI.
> > > > >>
> > > > >> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> > > > >> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> > > > >> 0x00000000fee00000 0x00000000feefffff msi
> > > > >>
> > > > >> Shouldn't there also be a memory window for the region behind the root
> > > > >> port to prevent any possible peer-to-peer access?
> > > > >
> > > > > Since the iommu portion of the email bounced I figured I would fix
> > > > > that and provide some additional info.
> > > > >
> > > > > I added some instrumentation to the kernel to dump the resources found
> > > > > in iova_reserve_pci_windows. From what I can tell it is finding the
> > > > > correct resources for the Memory and Prefetchable regions behind the
> > > > > root port. It seems to be calling reserve_iova which is successfully
> > > > > allocating an iova to reserve the region.
> > > > >
> > > > > However still no luck on why it isn't showing up in reserved_regions.
> > > >
> > > > Perhaps I can ask the opposite question: why should it show up in
> > > > reserved_regions? Why does the iommu subsystem block any possible peer-
> > > > to-peer DMA access? Isn't that a decision of the device driver?
> > > >
> > > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > > which is not related to peer-to-peer accesses.
> > >
> > > The problem is if the IOVA overlaps with the physical addresses of
> > > other devices that can be routed to via ACS redirect. As such if ACS
> > > redirect is enabled a host IOVA could be directed to another device on
> > > the switch instead. To prevent that we need to reserve those addresses
> > > to avoid address space collisions.
>
> Our test case is just to perform DMA to/from the host on one device on
> a switch and what we are seeing is that when we hit an IOVA that
> matches up with the physical address of the neighboring device's BAR0
> then we are seeing an AER followed by a hot reset.

ACS is always confusing.. Does your NIC have a DTLB?

If request redirect is set, and the Egress is enabled, then all
transactions should go upstream to the root-port->IOMMU before being
served.

In my 6.0 spec it's in section 6.12.3, "ACS Peer-to-Peer Control Interactions".

And maybe lspci would show how things are set up in the switch?
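
For reading that lspci output, the field of interest is the ACS Control
register (shown as ACSCtl by lspci -vvv). Below is a minimal Python sketch
for decoding a raw register value. The bit values follow the PCI_ACS_*
definitions in Linux's include/uapi/linux/pci_regs.h; the helper names, and
the choice of RR, CR and UF as the "everything goes upstream" set, are
assumptions made for illustration only.

```python
# ACS Control register bits, named after the PCI_ACS_* constants in
# Linux's include/uapi/linux/pci_regs.h.
ACS_CTRL_BITS = {
    0x01: "SV (Source Validation)",
    0x02: "TB (Translation Blocking)",
    0x04: "RR (P2P Request Redirect)",
    0x08: "CR (P2P Completion Redirect)",
    0x10: "UF (Upstream Forwarding)",
    0x20: "EC (P2P Egress Control)",
    0x40: "DT (Direct Translated P2P)",
}


def decode_acs_ctrl(value):
    """Return the names of the ACS controls enabled in an ACS Control value."""
    return [name for bit, name in ACS_CTRL_BITS.items() if value & bit]


def redirects_p2p_upstream(value):
    """Hypothetical check: are RR, CR and UF all enabled on this port?

    This is an illustrative shorthand, not a rule taken from the spec text.
    """
    needed = 0x04 | 0x08 | 0x10  # RR | CR | UF
    return (value & needed) == needed


if __name__ == "__main__":
    ctrl = 0x1D  # example value only: SV + RR + CR + UF
    for name in decode_acs_ctrl(ctrl):
        print(name)
    print("redirects upstream:", redirects_p2p_upstream(ctrl))
```

Comparing the decoded value on each downstream port of the switch against
the root port may show whether one port is set up differently.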

>
> > Any untranslated address from a device must be forwarded to the IOMMU when
> > ACS is enabled, correct? I guess if you want true p2p, then you would need
> > to map so that the hpa turns into the peer address, but it's always a round
> > trip to the IOMMU.
>
> This assumes all parts are doing the Request Redirect "correctly". In
> our case there is a PCIe switch we are trying to debug and we have a
> few working theories. One concern I have is that the switch may be
> throwing an ACS violation for us using an address that matches a
> neighboring device instead of redirecting it to the upstream port. If
> we pull the switch and just run on the root complex the issue seems to
> be resolved so I started poking into the code which led me to the
> documentation pointing out what is supposed to be reserved based on
> the root complex and MSI regions.
>
> As a part of going down that rabbit hole I realized that the
> reserved_regions seems to only list the MSI reservation. However after
> digging a bit deeper it seems like there is code to reserve the memory
> behind the root complex in the IOVA but it doesn't look like that is
> visible anywhere and is the piece I am currently trying to sort out.
> What I am working on is trying to figure out if the system that is
> failing is actually reserving that memory region in the IOVA, or if
> that is somehow not happening in our test setup.

I suspect that with the IOMMU there is no need to punch holes like we do for
MSI. I vaguely recall that very early IOMMU code did that, but our
knowledge of ACS was weak (not that it has improved :-)).

Knowing how the switch and root ports are set up for forwarding may give
some clues. The easy option may be to forcibly add the range to the reserved
regions and see whether the ACS violation goes away.
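
To see which part of a neighboring BAR would need to be added, a small
Python sketch along these lines could compare the BAR range against the
contents of reserved_regions. It assumes the "start end type" line format
shown earlier in the thread; the function names and the BAR address in the
example are hypothetical. As noted above, the PCI window reservation made in
iova_reserve_pci_windows() may not be reflected in reserved_regions, so a
gap reported here does not by itself prove the IOVA allocator can hand out
that address.

```python
def parse_reserved_regions(text):
    """Parse reserved_regions content into (start, end, type) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        start, end, kind = fields
        regions.append((int(start, 16), int(end, 16), kind))
    return regions


def uncovered_parts(bar_start, bar_end, regions):
    """Return the sub-ranges of [bar_start, bar_end] (inclusive) that no
    reserved region covers."""
    gaps = []
    cursor = bar_start
    for start, end, _kind in sorted(regions):
        if end < cursor:
            continue  # region lies entirely below the part still unchecked
        if start > bar_end:
            break  # region (and all later ones) lie above the BAR
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = end + 1
        if cursor > bar_end:
            break
    if cursor <= bar_end:
        gaps.append((cursor, bar_end))
    return gaps


if __name__ == "__main__":
    # Content as reported earlier in the thread.
    text = "0x00000000fee00000 0x00000000feefffff msi\n"
    regions = parse_reserved_regions(text)
    # Hypothetical BAR0 range of the neighboring device.
    for lo, hi in uncovered_parts(0xC0000000, 0xC00FFFFF, regions):
        print(f"not reserved: {lo:#018x} {hi:#018x}")
```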

Baolu might have some better ideas.

--
Cheers,
Ashok

Bike Shedding: (a.k.a Parkinson's Law of Triviality)
- When the discussion on a topic is inversely proportionate to the gravity of
the topic.