Re: Question about reserved_regions w/ Intel IOMMU

From: Alexander Duyck
Date: Fri Jun 16 2023 - 11:28:03 EST


On Fri, Jun 16, 2023 at 5:20 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Fri, Jun 16, 2023 at 08:39:46AM +0000, Tian, Kevin wrote:
> > +Alex
> >
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Tuesday, June 13, 2023 11:54 PM
> > >
> > > On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote:
> > >
> > > > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > > > which is not related to peer-to-peer accesses.
> > > >
> > > > > Right, in general the IOMMU driver cannot be held responsible for
> > > > > whatever might happen upstream of the IOMMU input.
> > >
> > > The driver yes, but..
> > >
> > > > The DMA layer carves PCI windows out of its IOVA space
> > > > unconditionally because we know that they *might* be problematic,
> > > > and we don't have any specific constraints on our IOVA layout so
> > > > it's no big deal to just sacrifice some space for simplicity.
> > >
> > > This is a problem for everything using UNMANAGED domains. If the iommu
> > > API user picks an IOVA it should be able to expect it to work. If the
> > > interconnect fails to allow it to work then this has to be discovered
> > > otherwise UNMANAGED domains are not usable at all.
> > >
> > > Eg vfio and iommufd are also in trouble on these configurations.
> > >
> >
> > If those PCI windows are problematic e.g. due to ACS they belong to
> > a single iommu group. If a vfio user opens all the devices in that group
> > then it can discover and reserve those windows in its IOVA space.
>
> How? We don't even exclude the single device's BAR if there is no ACS?

The issue here was a defective ACS on a PCIe switch.

> > The problem is that the user may not open all the devices then
> > currently there is no way for it to know the windows on those
> > unopened devices.
> >
> > Curious why nobody complains about this gap before this thread...
>
> Probably because it only matters if you have a real PCIe switch in the
> system, which is pretty rare.

So just FYI, I am pretty sure we have a partitioned PCIe switch that
has FW issues. Specifically, it doesn't seem to be honoring the
Redirect Request bit, so requests that are supposed to go to the root
complex/IOMMU are instead getting redirected to an NVMe device on the
same physical PCIe switch. We are in the process of getting that
sorted out now, and in the meantime we are using the forcedac option
to keep the IOMMU out of the 32b address space that was causing the
issue.

The reason for my original request is more about the user experience
of trying to figure out what is reserved and what isn't. It seems like
the IOVA space will have reservations that are not visible to the end
user. When I look through reserved_regions in sysfs, it lists only
the MSI regions that are reserved, and maybe some regions such as the
memory for USB, while in reality we may be reserving IOVA regions in
iova_reserve_pci_windows() that will not be exposed without adding
probes or printk debugging.
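
For reference, here is a rough sketch of what can be checked from
userspace today. It assumes each line of
/sys/kernel/iommu_groups/<N>/reserved_regions has the form
"<start> <end> <type>" with hex addresses. The helper names are
hypothetical and the "direct" entry in the sample is made up for
illustration. Nothing here can see the windows carved out by
iova_reserve_pci_windows(), which is exactly the gap described above.

```python
from pathlib import Path


def parse_reserved_regions(text):
    """Parse '<start> <end> <type>' lines into (start, end, type) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        start, end, kind = fields
        regions.append((int(start, 16), int(end, 16), kind))
    return regions


def overlaps_reserved(regions, iova, size):
    """Return the regions intersecting [iova, iova + size - 1] (end inclusive)."""
    last = iova + size - 1
    return [r for r in regions if r[0] <= last and iova <= r[1]]


def read_group_regions(group, root="/sys/kernel/iommu_groups"):
    """Read and parse reserved_regions for one IOMMU group (path layout assumed)."""
    path = Path(root) / str(group) / "reserved_regions"
    return parse_reserved_regions(path.read_text())


# Illustrative sample only; the second entry is invented for the example.
SAMPLE = """\
0x00000000fee00000 0x00000000feefffff msi
0x000000007b000000 0x000000007b01ffff direct
"""
```

Calling overlaps_reserved(parse_reserved_regions(SAMPLE), iova, size)
tells you whether a candidate IOVA range collides with anything the
kernel currently reports, but a clean result only covers the regions
sysfs exposes.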