Re: [RFC] /dev/ioasid uAPI proposal

From: Alex Williamson
Date: Wed Jun 02 2021 - 22:51:01 EST


On Wed, 2 Jun 2021 19:45:36 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
>
> > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > from the guest page table... what page table?
>
> I see my confusion now, the phrasing in your earlier remark led me
> think this was about allowing the no-snoop performance enhancement in
> some restricted way.
>
> It is really about blocking no-snoop 100% of the time and then
> disabling the dangerous wbinvd when the block is successful.
>
> Didn't closely read the kvm code :\
>
> If it was about allowing the optimization then I'd expect the guest to
> enable no-snoopable regions via it's vIOMMU and realize them to the
> hypervisor and plumb the whole thing through. Hence my remark about
> the guest page tables..
>
> So really the test is just 'were we able to block it' ?

Yup. Do we really still consider that there's some performance benefit
to be had by enabling a device to use no-snoop? This seems largely a
legacy thing.

> > This support existed before mdev, IIRC we needed it for direct
> > assignment of NVIDIA GPUs.
>
> Probably because they ignored the disable no-snoop bits in the control
> block, or reset them in some insane way to "fix" broken bioses and
> kept using it even though by all rights qemu would have tried hard to
> turn it off via the config space. Processing no-snoop without a
> working wbinvd would be fatal. Yeesh
>
> But Ok, back the /dev/ioasid. This answers a few lingering questions I
> had..
>
> 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> domains.
>
> This doesn't actually matter. If you mix them together then kvm
> will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> anywhere in this VM.
>
> This if two IOMMU's are joined together into a single /dev/ioasid
> then we can just make them both pretend to be
> !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.

Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
available based on the per domain support available. That gives us the
most consistent behavior, ie. we don't have VMs emulating wbinvd
because they used to have a device attached where the domain required
it and we can't atomically remap with new flags to perform the same as
a VM that never had that device attached in the first place.

> 2) How to fit this part of kvm in some new /dev/ioasid world
>
> What we want to do here is iterate over every ioasid associated
> with the group fd that is passed into kvm.

Yeah, we need some better names, binding a device to an ioasid (fd) but
then attaching a device to an allocated ioasid (non-fd)... I assume
you're talking about the latter ioasid.

> Today the group fd has a single container which specifies the
> single ioasid so this is being done trivially.
>
> To reorg we want to get the ioasid from the device not the
> group (see my note to David about the groups vs device rational)
>
> This is just iterating over each vfio_device in the group and
> querying the ioasid it is using.

The IOMMU API group interfaces is largely iommu_group_for_each_dev()
anyway, we still need to account for all the RIDs and aliases of a
group.

> Or perhaps more directly: an op attaching the vfio_device to the
> kvm and having some simple helper
> '(un)register ioasid with kvm (kvm, ioasid)'
> that the vfio_device driver can call that just sorts this out.

We could almost eliminate the device notion altogether here, use an
ioasidfd_for_each_ioasid() but we really want a way to trigger on each
change to the composition of the device set for the ioasid, which is
why we currently do it on addition or removal of a group, where the
group has a consistent set of IOMMU properties. Register a notifier
callback via the ioasidfd? Thanks,

Alex