Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication

From: Niklas Schnelle
Date: Tue Dec 06 2022 - 05:15:18 EST


On Mon, 2022-12-05 at 18:24 +0000, Robin Murphy wrote:
> On 2022-12-02 14:29, Niklas Schnelle wrote:
> > On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote:
> > > On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote:
> > > > On 2022-11-29 12:00, Niklas Schnelle wrote:
> > > > > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
> > > > > > On 2022-11-16 17:16, Niklas Schnelle wrote:
> > > > > > > When RPCIT indicates that the underlying hypervisor has run out of
> > > > > > > resources it often means that its IOVA space is exhausted and IOVAs need
> > > > > > > to be freed before new ones can be created. By triggering a flush of the
> > > > > > > IOVA queue we can get the queued IOVAs freed and also get the new
> > > > > > > mapping established during the global flush.
> > > > > >
> > > > > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
> > > > > > exhausted and fail the DMA API call before even getting as far as
> > > > > > iommu_map(), though? Or is there some less obvious limitation like a
> > > > > > maximum total number of distinct IOVA regions regardless of size?
> > > > >
> > > > > Well, yes and no. Your thinking is of course correct: if the advertised
> > > > > available IOVA space can be fully utilized without exhausting
> > > > > hypervisor resources, we won't trigger this case. Sadly, however, there
> > > > > are complications. The most obvious is that in QEMU/KVM the
> > > > > restriction of the IOVA space to what QEMU can actually have mapped at
> > > > > once was only added recently[0]; prior to that we would regularly go
> > > > > through this "I'm out of resources, free me some IOVAs" dance with our
> > > > > existing DMA API implementation, where it just triggers an early cycle
> > > > > of freeing all unused IOVAs followed by a global flush. On z/VM I know
> > > > > of no situation where this is triggered; that said, this signalling is
> > > > > architected, so z/VM may have corner cases where it does this. On our
> > > > > bare metal hypervisor (no paging) this return code is unused and IOTLB
> > > > > flushes are simply hardware cache flushes, as on bare metal platforms.
> > > > >
> > > > > [0]
> > > > > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@xxxxxxxxxxxxx/
> > > >
> > > > That sheds a bit more light, thanks, although I'm still not confident I
> > > > fully understand the whole setup. AFAICS that patch looks to me like
> > > > it's putting a fixed limit on the size of the usable address space. That
> > > > in turn implies that "free some IOVAs and try again" might be a red
> > > > herring and never going to work; for your current implementation, what
> > > > that presumably means in reality is "free some IOVAs, resetting the
> > > > allocator to start allocating lower down in the address space where it
> > > > will happen to be below that limit, and try again", but the iommu-dma
> > > > allocator won't do that. If it doesn't know that some arbitrary range
> > > > below the top of the driver-advertised aperture is unusable, it will
> > > > just keep allocating IOVAs up there and mappings will always fail.
> > > >
> > > > If the driver can't accurately represent the usable IOVA space via the
> > > > aperture and/or reserved regions, then this whole approach seems doomed.
> > > > If on the other hand I've misunderstood and you can actually still use
> > > > any address, just not all of them at the same time,
> > >
> > >
> > > This is exactly it: the problem is a limit on the number of IOVAs that
> > > are concurrently mapped. In QEMU pass-through the tightest limit is
> > > usually the one set by the host kernel parameter
> > > vfio_iommu_type1.dma_entry_limit, which defaults to 65535 mappings. With
> > > IOMMU_DOMAIN_DMA we stay under this limit without extra action, but once
> > > there is a flush queue (including the existing per-CPU one), where each
> > > entry may keep many pages lazily unmapped, this is easily hit with fio
> > > bandwidth tests on an NVMe. For this case this patch works reliably
> > > because, not counting the lazily freed entries, the number of actually
> > > active mappings is of course similar to the number of active mappings
> > > under IOMMU_DOMAIN_DMA.
> > >
> > > > then it might in
> > > > fact be considerably easier to skip the flush queue mechanism entirely
> > > > and implement this internally to the driver - basically make .iotlb_sync
> > > > a no-op for non-strict DMA domains,
> > >
> > > I'm assuming you mean .iotlb_sync_map above.
>
> No, I did mean .iotlb_sync, however on reflection that was under the
> assumption that it's OK for the hypervisor to see a new mapping for a
> previously-used IOVA without having seen it explicitly unmapped in
> between. Fair enough if that isn't the case, but if it is then your
> pagetable can essentially act as the "flush queue" by itself.

Hmm, I see. We do want the hypervisor to see mappings become invalid
before they are changed to a new valid mapping. In the case of e.g.
QEMU/KVM this allows the hypervisor to itself rely on the hardware to
fill in the IOTLB on map, i.e. use a no-op .iotlb_sync_map and an
.iotlb_sync that flushes the hardware IOTLB. It also allows QEMU to
emulate the IOMMU with just the map/unmap primitives of vfio/iommufd.
Thanks to that, I recently found that vfio-pci pass-through works in
combination with an emulated s390 guest on an x86_64 host, albeit very
slowly of course.

>
> > > > put the corresponding RPCIT flush
> > > > and retry in .sync_map, then allow that to propagate an error back to
> > > > iommu_map() if the new mapping still hasn't taken.
> > > >
> > > > Thanks,
> > > > Robin.
> > >
> > > Hmm, interesting. This would leave the IOVAs in the flush queue lazily
> > > unmapped, and thus still block their re-use, but free their host
> > > resources via a global RPCIT, allowing the guest to use a different
> > > portion of the IOVA space with those resources. It could work, though
> > > I need to test it, but it feels a bit clunky.
> > >
> > > Maybe we can go cleaner while using this idea of not having to flush
> > > the queue but just freeing the host-side resources. If we allowed
> > > .iotlb_sync_map to return an error that fails the mapping operation,
> > > then we could do it all in there. In the normal case it just does the
> > > RPCIT, but if that reports that the hypervisor ran out of resources, it
> > > does another global RPCIT allowing the hypervisor to free IOVAs that
> > > were lazily unmapped. If the latter succeeds, all is good; if not, the
> > > mapping operation fails. Logically this makes sense too:
> > > .iotlb_sync_map is the final step of syncing the mapping to the host,
> > > which can fail just like the mapping operation itself.
> > >
> > > Apart from the case of running out of active IOVAs, this would also
> > > handle the other useful error case when using .iotlb_sync_map for
> > > shadowing: where it fails because the host ran into a pinned-pages
> > > limit or out of actual memory. Not by fixing it, but at least we would
> > > get a failed mapping operation.
> > >
> > > The other callbacks .flush_iotlb_all and .iotlb_sync
> > > could stay the same, as they are only used for unmapped pages, where we
> > > can't reasonably run out of host resources, neither active IOVAs
> > > nor pinned pages.
> > >
> >
> > Ok, I've done some testing with the above idea and it seems to work
> > great. I've verified that my version of QEMU (without Matt's IOVA
> > aperture restriction patch) creates the RPCIT out-of-resource
> > indications, and that the global flush in .iotlb_sync_map is then
> > triggered and allows QEMU to unpin pages and free IOVAs while the guest
> > still has them lazily unmapped (sitting in the flush queue) and thus
> > uses different IOVAs.
> >
> > @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it
> > can return an error, and then failing the mapping operation, I think
> > this is a great approach. One advantage over the previous approach of
> > flushing the queue is that this should work for the pure IOMMU API too.
>
> Whatever happens I think allowing .iotlb_sync_map to propagate an error
> out through iommu_map() is an appropriate thing to do - it sounds like
> s390 might technically need that for regular IOMMU API correctness in
> some circumstances anyway. Besides, even in the cases where it
> represents "simple" TLB maintenance, there are potentially ways that
> could fail (like timing out if the IOMMU has gone completely wrong), so
> it wouldn't seem entirely unreasonable if a driver might want to report
> overall failure if it can't guarantee that the new mapping will actually
> be usable.
>
> Thanks,
> Robin.
>

Great, then I'll use this approach for v3, or maybe even send it
separately as a prerequisite IOMMU fix, since yes, this is also required
for the IOMMU API to work in a guest where the hypervisor may report
running out of resources.