Re: Question about reserved_regions w/ Intel IOMMU

From: Alexander Duyck
Date: Thu Jun 08 2023 - 13:11:38 EST


On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <ashok_raj@xxxxxxxxxxxxxxx> wrote:
>
> On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> > On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > > > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > > > <alexander.duyck@xxxxxxxxx> wrote:
> > > >>
> > > >> I am running into a DMA issue that appears to be a conflict between
> > > >> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> > > >> supposed to create reserved regions for MSI and the memory window
> > > >> behind the root port. However looking at reserved_regions I am not
> > > >> seeing that. I only see the reservation for the MSI.
> > > >>
> > > >> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> > > >> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> > > >> 0x00000000fee00000 0x00000000feefffff msi
> > > >>
> > > >> Shouldn't there also be a memory window for the region behind the root
> > > >> port to prevent any possible peer-to-peer access?
> > > >
> > > > Since the iommu portion of the email bounced I figured I would fix
> > > > that and provide some additional info.
> > > >
> > > > I added some instrumentation to the kernel to dump the resources found
> > > > in iova_reserve_pci_windows. From what I can tell it is finding the
> > > > correct resources for the Memory and Prefetchable regions behind the
> > > > root port. It seems to be calling reserve_iova which is successfully
> > > > allocating an iova to reserve the region.
> > > >
> > > > However still no luck on why it isn't showing up in reserved_regions.
> > >
> > > Perhaps I can ask the opposite question: why should it show up in
> > > reserved_regions? Why does the iommu subsystem block any possible
> > > peer-to-peer DMA access? Isn't that a decision for the device driver?
> > >
> > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > which is not related to peer-to-peer accesses.
> >
> > The problem is if the IOVA overlaps with the physical addresses of
> > other devices that can be routed to via ACS redirect. As such if ACS
> > redirect is enabled a host IOVA could be directed to another device on
> > the switch instead. To prevent that we need to reserve those addresses
> > to avoid address space collisions.

Our test case is just to perform DMA to/from the host on one device
behind a switch. What we are seeing is that when we hit an IOVA that
matches the physical address of the neighboring device's BAR0, we get
an AER error followed by a hot reset.
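
As a rough illustration, the collision described above (an IOVA range
landing on a neighboring device's BAR) could be checked offline with a
sketch like the one below. It assumes the BAR ranges come from the
neighbor's sysfs `resource` file, where each line holds start, end and
flags in hex. The function names are hypothetical helpers, not kernel
interfaces.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (start, end)


def parse_pci_resource(text: str) -> List[Range]:
    """Parse the text of a PCI device's sysfs `resource` file.

    Each line is expected to look like "0x<start> 0x<end> 0x<flags>".
    Slots that are all zero (unused BARs) are skipped.
    """
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        start, end = int(fields[0], 16), int(fields[1], 16)
        if start == 0 and end == 0:
            continue  # unused resource slot
        ranges.append((start, end))
    return ranges


def overlaps(a: Range, b: Range) -> bool:
    """True if two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def colliding_bars(iova: Range, bars: List[Range]) -> List[Range]:
    """Return every BAR range that the given IOVA range overlaps."""
    return [bar for bar in bars if overlaps(iova, bar)]
```

To use it, read /sys/bus/pci/devices/<neighbor BDF>/resource, pass the
text to parse_pci_resource(), and call colliding_bars() with the IOVA
range of the failing DMA mapping.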

> Any untranslated address from a device must be forwarded to the IOMMU when
> ACS is enabled, correct? I guess if you want true p2p, then you would need
> to map so that the hpa turns into the peer address.. but it's always a round
> trip to IOMMU.

This assumes all parts are doing the Request Redirect "correctly". In
our case there is a PCIe switch we are trying to debug and we have a
few working theories. One concern I have is that the switch may be
flagging an ACS violation when we use an address that matches a
neighboring device, instead of redirecting the request to the upstream
port. If we pull the switch and run directly on the root complex the
issue seems to go away. That is what got me poking into the code, which
led me to the documentation describing what is supposed to be reserved:
the memory windows behind the root complex and the MSI region.

While going down that rabbit hole I realized that reserved_regions
seems to list only the MSI reservation. After digging a bit deeper, it
looks like there is code to reserve the memory behind the root complex
in the IOVA space, but that reservation does not appear to be visible
anywhere, and that is the piece I am currently trying to sort out. What
I am trying to figure out is whether the failing system is actually
reserving that memory region in the IOVA space, or whether that is
somehow not happening in our test setup.
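
For what it is worth, the check against reserved_regions can be scripted.
The sketch below parses the "start end type" lines in the format shown
earlier in this thread and reports whether a given window is fully
covered by any listed region. The helper names are hypothetical, and it
only inspects what sysfs exposes; it cannot see reservations made
directly in the IOVA allocator.

```python
from typing import List, Tuple

Region = Tuple[int, int, str]  # inclusive start, inclusive end, type


def parse_reserved_regions(text: str) -> List[Region]:
    """Parse the text of an iommu_group `reserved_regions` file.

    Each line is expected to look like "0x<start> 0x<end> <type>".
    Lines that do not have exactly three fields are ignored.
    """
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        regions.append((int(fields[0], 16), int(fields[1], 16), fields[2]))
    return regions


def is_covered(window: Tuple[int, int], regions: List[Region]) -> bool:
    """True if the inclusive window lies entirely inside one region."""
    start, end = window
    return any(r_start <= start and end <= r_end
               for r_start, r_end, _ in regions)
```

Feeding it the output quoted above yields a single "msi" region, so any
memory window behind the root port would be reported as not covered,
which matches what is seen on the failing system.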