RE: [RFC] /dev/ioasid uAPI proposal

From: Tian, Kevin
Date: Thu Jun 03 2021 - 01:19:19 EST


> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Thursday, June 3, 2021 12:15 PM
>
> On Thu, 3 Jun 2021 03:22:27 +0000
> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>
> > > From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > Sent: Thursday, June 3, 2021 10:51 AM
> > >
> > > On Wed, 2 Jun 2021 19:45:36 -0300
> > > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > >
> > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > > >
> > > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > > from the guest page table... what page table?
> > > >
> > > > I see my confusion now, the phrasing in your earlier remark led me
> > > > think this was about allowing the no-snoop performance enhancement
> in
> > > > some restricted way.
> > > >
> > > > It is really about blocking no-snoop 100% of the time and then
> > > > disabling the dangerous wbinvd when the block is successful.
> > > >
> > > > Didn't closely read the kvm code :\
> > > >
> > > > If it was about allowing the optimization then I'd expect the guest to
> > > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > > hypervisor and plumb the whole thing through. Hence my remark about
> > > > the guest page tables..
> > > >
> > > > So really the test is just 'were we able to block it' ?
> > >
> > > Yup. Do we really still consider that there's some performance benefit
> > > to be had by enabling a device to use no-snoop? This seems largely a
> > > legacy thing.
> >
> > Yes, there is indeed performance benefit for device to use no-snoop,
> > e.g. 8K display and some imaging processing path, etc. The problem is
> > that the IOMMU for such devices is typically a different one from the
> > default IOMMU for most devices. This special IOMMU may not have
> > the ability of enforcing snoop on no-snoop PCI traffic then this fact
> > must be understood by KVM to do proper mtrr/pat/wbinvd virtualization
> > for such devices to work correctly.
>
> The case where the IOMMU does not support snoop-control for such a
> device already works fine, we can't prevent no-snoop so KVM will
> emulate wbinvd. The harder one is if we should opt to allow no-snoop
> even if the IOMMU does support snoop-control.

In other discussion we are leaning toward a per-device capability
reporting scheme through /dev/ioasid (or /dev/iommu as the new
name). It seems natural to also allow setting a capability e.g. no-
snoop for a device if underlying IOMMU driver allows it.

>
> > > > > This support existed before mdev, IIRC we needed it for direct
> > > > > assignment of NVIDIA GPUs.
> > > >
> > > > Probably because they ignored the disable no-snoop bits in the control
> > > > block, or reset them in some insane way to "fix" broken bioses and
> > > > kept using it even though by all rights qemu would have tried hard to
> > > > turn it off via the config space. Processing no-snoop without a
> > > > working wbinvd would be fatal. Yeesh
> > > >
> > > > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > > > had..
> > > >
> > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY
> > > and !IOMMU_CAP_CACHE_COHERENCY
> > > > domains.
> > > >
> > > > This doesn't actually matter. If you mix them together then kvm
> > > > will turn on wbinvd anyhow, so we don't need to use the
> DMA_PTE_SNP
> > > > anywhere in this VM.
> > > >
> > > > This if two IOMMU's are joined together into a single /dev/ioasid
> > > > then we can just make them both pretend to be
> > > > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> > >
> > > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY
> then
> > > we
> > > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > > available based on the per domain support available. That gives us the
> > > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > > because they used to have a device attached where the domain required
> > > it and we can't atomically remap with new flags to perform the same as
> > > a VM that never had that device attached in the first place.
> > >
> > > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > > >
> > > > What we want to do here is iterate over every ioasid associated
> > > > with the group fd that is passed into kvm.
> > >
> > > Yeah, we need some better names, binding a device to an ioasid (fd) but
> > > then attaching a device to an allocated ioasid (non-fd)... I assume
> > > you're talking about the latter ioasid.
> > >
> > > > Today the group fd has a single container which specifies the
> > > > single ioasid so this is being done trivially.
> > > >
> > > > To reorg we want to get the ioasid from the device not the
> > > > group (see my note to David about the groups vs device rational)
> > > >
> > > > This is just iterating over each vfio_device in the group and
> > > > querying the ioasid it is using.
> > >
> > > The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> > > anyway, we still need to account for all the RIDs and aliases of a
> > > group.
> > >
> > > > Or perhaps more directly: an op attaching the vfio_device to the
> > > > kvm and having some simple helper
> > > > '(un)register ioasid with kvm (kvm, ioasid)'
> > > > that the vfio_device driver can call that just sorts this out.
> > >
> > > We could almost eliminate the device notion altogether here, use an
> > > ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> > > change to the composition of the device set for the ioasid, which is
> > > why we currently do it on addition or removal of a group, where the
> > > group has a consistent set of IOMMU properties. Register a notifier
> > > callback via the ioasidfd? Thanks,
> > >
> >
> > When discussing I/O page fault support in another thread, the consensus
> > is that an device handle will be registered (by user) or allocated (return
> > to user) in /dev/ioasid when binding the device to ioasid fd. From this
> > angle we can register {ioasid_fd, device_handle} to KVM and then call
> > something like ioasidfd_device_is_coherent() to get the property.
> > Anyway the coherency is a per-device property which is not changed
> > by how many I/O page tables are attached to it.
>
> The mechanics are different, but this is pretty similar in concept to
> KVM learning coherence using the groupfd today. Do we want to
> compromise on kernel control of wbinvd emulation to allow userspace to
> make such decisions? Ownership of a device might be reason enough to
> allow the user that privilege. Thanks,
>

I think so. In the end it's still decided by the underlying IOMMU driver.
If the IOMMU driver doesn't allow user to opt for no-snoop, it's exactly
same as today's groupfd approach. Otherwise an user-opted policy
implies that the decision is delegated to userspace.

Thanks
Kevin