Re: Question about reserved_regions w/ Intel IOMMU

From: Alexander Duyck
Date: Thu Jun 08 2023 - 14:16:35 EST


On Thu, Jun 8, 2023 at 10:52 AM Ashok Raj <ashok.raj@xxxxxxxxx> wrote:
>
> On Thu, Jun 08, 2023 at 10:10:54AM -0700, Alexander Duyck wrote:
> > On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <ashok_raj@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> > > > On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > > > > > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > > > > > <alexander.duyck@xxxxxxxxx> wrote:
> > > > > >>
> > > > > >> I am running into a DMA issue that appears to be a conflict between
> > > > > >> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> > > > > >> supposed to create reserved regions for MSI and the memory window
> > > > > >> behind the root port. However looking at reserved_regions I am not
> > > > > >> seeing that. I only see the reservation for the MSI.
> > > > > >>
> > > > > >> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> > > > > >> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> > > > > >> 0x00000000fee00000 0x00000000feefffff msi
> > > > > >>
> > > > > >> Shouldn't there also be a memory window for the region behind the root
> > > > > >> port to prevent any possible peer-to-peer access?
> > > > > >
> > > > > > Since the iommu portion of the email bounced I figured I would fix
> > > > > > that and provide some additional info.
> > > > > >
> > > > > > I added some instrumentation to the kernel to dump the resources found
> > > > > > in iova_reserve_pci_windows. From what I can tell it is finding the
> > > > > > correct resources for the Memory and Prefetchable regions behind the
> > > > > > root port. It seems to be calling reserve_iova which is successfully
> > > > > > allocating an iova to reserve the region.
> > > > > >
> > > > > > However still no luck on why it isn't showing up in reserved_regions.
> > > > >
> > > > > Perhaps I can ask the opposite question, why it should show up in
> > > > > reserved_regions? Why does the iommu subsystem block any possible peer-
> > > > > to-peer DMA access? Isn't that a decision of the device driver?
> > > > >
> > > > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > > > which is not related to peer-to-peer accesses.
> > > >
> > > > The problem is if the IOVA overlaps with the physical addresses of
> > > > other devices that can be routed to via ACS redirect. As such if ACS
> > > > redirect is enabled a host IOVA could be directed to another device on
> > > > the switch instead. To prevent that we need to reserve those addresses
> > > > to avoid address space collisions.
> >
> > Our test case is just to perform DMA to/from the host on one device on
> > a switch and what we are seeing is that when we hit an IOVA that
> > matches up with the physical address of the neighboring device's BAR0
> > then we are seeing an AER followed by a hot reset.
>
> ACS is always confusing.. Does your NIC have a DTLB?

No. It is using the IOMMU for all address translation. I am also
pushing back on the test being used. It is always possible they have
implemented something incorrectly and are overrunning a buffer into
the reserved IOVA region, in which case the overlap is just a
coincidence.
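
To make the two cases concrete, here is a rough sketch of the check I
have in mind, assuming we can log the IOVA and length of each DMA from
the test. The function name and the BAR0 addresses are made up for
illustration; they are not values from the failing system.

```python
def dma_hits_window(iova, length, win_start, win_end):
    """Return True if the DMA [iova, iova + length) touches the
    inclusive address window [win_start, win_end]."""
    if length <= 0:
        return False
    dma_end = iova + length - 1
    return iova <= win_end and win_start <= dma_end


# Hypothetical 1 MiB BAR0 of the neighboring device (illustration only).
BAR0_START = 0xF0000000
BAR0_END = 0xF00FFFFF

# IOVA allocated inside the neighbor's BAR0: an address collision.
print(dma_hits_window(0xF0080000, 4096, BAR0_START, BAR0_END))  # True
# Buffer ends just below the window: no overlap.
print(dma_hits_window(0xEFFFF000, 4096, BAR0_START, BAR0_END))  # False
# Same buffer overrun by one page: now it runs into the window.
print(dma_hits_window(0xEFFFF000, 8192, BAR0_START, BAR0_END))  # True
```

The first and third cases look the same on the wire, which is why I
would want the mapped length as well as the faulting address before
blaming the switch.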

> If request redirect is set, and the Egress is enabled, then all
> transactions should go upstream to the root-port->IOMMU before being
> served.
>
> In my 6.0 spec it's in 6.12.3, ACS Peer-to-Peer Control Interactions?
>
> And maybe lspci would show how things are set up in the switch?

We were setting Request Redirect only, no Egress Control. I agree,
based on the config everything should just go upstream. However, if we
eliminate the switch or put things in passthrough mode the problem
goes away.
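
For comparing against the lspci output, a small sketch that decodes an
ACS Control register value into the same flag names lspci prints on its
ACSCtl line. The bit positions follow the PCI_ACS_* definitions in the
kernel's pci_regs.h as I understand them; the 0x0004 value is only my
assumption of what "Request Redirect only" looks like, not a dump from
our switch.

```python
# ACS Control register bits, in lspci's "ACSCtl:" display order.
ACS_CTRL_FLAGS = [
    (0x0001, "SrcValid"),     # Source Validation
    (0x0002, "TransBlk"),     # Translation Blocking
    (0x0004, "ReqRedir"),     # P2P Request Redirect
    (0x0008, "CmpltRedir"),   # P2P Completion Redirect
    (0x0010, "UpstreamFwd"),  # Upstream Forwarding
    (0x0020, "EgressCtrl"),   # P2P Egress Control
    (0x0040, "DirectTrans"),  # Direct Translated P2P
]


def decode_acs_ctrl(value):
    """Format a 16-bit ACS Control value as 'Name+'/'Name-' flags."""
    return " ".join(
        name + ("+" if value & mask else "-")
        for mask, name in ACS_CTRL_FLAGS
    )


# Assumed value: Request Redirect set, everything else clear.
print(decode_acs_ctrl(0x0004))
```

If the downstream ports on the switch show anything other than
ReqRedir+ with EgressCtrl-, then the config is not what we think it is.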

> >
> > > Any untranslated address from a device must be forwarded to the IOMMU when
> > > ACS is enabled, correct? I guess if you want true p2p, then you would need
> > > to map so that the hpa turns into the peer address, but it's always a round
> > > trip to the IOMMU.
> >
> > This assumes all parts are doing the Request Redirect "correctly". In
> > our case there is a PCIe switch we are trying to debug and we have a
> > few working theories. One concern I have is that the switch may be
> > throwing an ACS violation for us using an address that matches a
> > neighboring device instead of redirecting it to the upstream port. If
> > we pull the switch and just run on the root complex the issue seems to
> > be resolved so I started poking into the code which led me to the
> > documentation pointing out what is supposed to be reserved based on
> > the root complex and MSI regions.
> >
> > As a part of going down that rabbit hole I realized that the
> > reserved_regions seems to only list the MSI reservation. However after
> > digging a bit deeper it seems like there is code to reserve the memory
> > behind the root complex in the IOVA but it doesn't look like that is
> > visible anywhere and is the piece I am currently trying to sort out.
> > What I am working on is trying to figure out if the system that is
> > failing is actually reserving that memory region in the IOVA, or if
> > that is somehow not happening in our test setup.
>
> I suspect with IOMMU, there is no need to punch holes like we do for the
> MSI. In very early code in IOMMU I vaguely recall we did that, but our
> knowledge of ACS was weak. (Not that it has improved :-)).

The hole has to do mostly with avoiding any possibility of misrouting
things, or at least that was my understanding after reading it.

> Knowing how the switch and root ports are set up with forwarding may help
> with some clues. An easy option may be to forcibly add the range to the
> reserved regions and see whether the ACS violation goes away.
>
> Baolu might have some better ideas.

In theory the range should already be reserved, so I'm working with
the team having the issue to verify that now.
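
As a starting point for that check, a minimal sketch that parses
reserved_regions text in the "start end type" format shown earlier in
the thread and tests whether a given range is covered. The helper names
and the bridge window address are hypothetical.

```python
def parse_reserved_regions(text):
    """Parse 'start end type' lines into (start, end, type) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        regions.append((int(fields[0], 16), int(fields[1], 16), fields[2]))
    return regions


def is_covered(regions, start, end):
    """True if a single region fully contains the inclusive [start, end]."""
    return any(r_start <= start and end <= r_end
               for r_start, r_end, _ in regions)


# Contents as reported for 0000:83:00.0 earlier in this thread.
sample = "0x00000000fee00000 0x00000000feefffff msi\n"
regions = parse_reserved_regions(sample)

print(is_covered(regions, 0xFEE00000, 0xFEE00FFF))  # True (MSI range)
# Hypothetical bridge memory window, not taken from the real system.
print(is_covered(regions, 0xF0000000, 0xF00FFFFF))  # False
```

Keep in mind a "False" here does not prove the IOVA allocator can hand
out that range, since the reservations made in
iova_reserve_pci_windows do not appear to show up in this file.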

Thanks,

- Alex