Re: Question about reserved_regions w/ Intel IOMMU

From: Alexander Duyck
Date: Tue Jun 20 2023 - 10:59:30 EST


On Mon, Jun 19, 2023 at 7:02 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Mon, Jun 19, 2023 at 11:20:58AM +0100, Robin Murphy wrote:
> > On 2023-06-16 19:59, Jason Gunthorpe wrote:
> > > On Fri, Jun 16, 2023 at 05:34:53PM +0100, Robin Murphy wrote:
> > > >
> > > > If the system has working ACS configured correctly, then this issue should
> > > > be moot;
> > >
> > > Yes
> > >
> > > > if it doesn't, then a VFIO user is going to get a whole group of
> > > > peer devices if they're getting anything at all, so it doesn't seem entirely
> > > > unreasonable to leave it up to them to check that all those devices'
> > > > resources play well with their expected memory map.
> > >
> > > I think the kernel should be helping here.. 'go figure it out from
> > > lspci' is a very convoluted and obscure uAPI, and I don't see things
> > > like DPDK actually doing that.
> > >
> > > IMHO the uAPI expectation is that the kernel informs userspace what
> > > the usable IOVA is, if bridge windows and lack of ACS are rendering
> > > address space unusable then VFIO/iommufd should return it as excluded
> > > as well.
> > >
> > > If we are going to do that then all UNAMANGED domain users should
> > > follow the same logic.
> > >
> > > We probably have avoided bug reports because of how rare it would be
> > > to see a switch and an UNMANAGED domain using scenario together -
> > > especially with ACS turned off.
> > >
> > > So it is really narrow niche.. Obscure enough I'm not going to make
> > > patches :)
> >
> > The main thing is that we've already been round this once before; we tried
> > it 6 years ago and then reverted it a year later for causing more problems
> > than it solved:
>
> As I said earlier in this thread if we do it for VFIO then the
> calculation must be precise and consider bus details like
> ACS/etc. eg VFIO on an ACS system should not report any new regions.
>
> It looks like that thread confirms we can't create reserved regions
> which are wrong :)
>
> I think Alex is saying the same things I'm saying in that thread too:
>
> https://lore.kernel.org/all/20180226161310.061ce3a8@xxxxxxxxx/
>
> (b) is what the kernel should help prevent.
>
> And it is clear there are today scenarios where a VFIO user will get
> data loss because the reported valid IOVA from the kernel is
> incorrect. Fixing this is hard, much harder than what commit
> 273df9635385 ("iommu/dma: Make PCI window reservation generic") has.

I think this may have gone off down a rathole as my original question
wasn't anything about adding extra reserved regions. It was about
exposing what the IOVA is already reserving so it could be user
visible.

The issue was that the reservation(s) didn't appear in the
reserved_regions sysfs file, and it required adding probes or printk
debugging in order to figure out what is reserved and what is not.
Specifically what I was trying to point out is that there are regions
reserved in iova_reserve_pci_windows() that are not user/admin
visible. The function reserve_iova doesn't do anything to track the
reservation so that it can be recalled later for display. It made
things harder to debug as I wasn't sure if the addresses I was seeing
were valid for the IOMMU or not since I didn't know if they were
supposed to be reserved and the documentation I had found implied they
were.