Re: [RFC] /dev/ioasid uAPI proposal

From: Alex Williamson
Date: Thu Jun 03 2021 - 16:02:11 EST


On Thu, 3 Jun 2021 09:34:01 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Wed, Jun 02, 2021 at 08:50:54PM -0600, Alex Williamson wrote:
> > On Wed, 2 Jun 2021 19:45:36 -0300
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >
> > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > >
> > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > from the guest page table... what page table?
> > >
> > > I see my confusion now, the phrasing in your earlier remark led me to
> > > think this was about allowing the no-snoop performance enhancement in
> > > some restricted way.
> > >
> > > It is really about blocking no-snoop 100% of the time and then
> > > disabling the dangerous wbinvd when the block is successful.
> > >
> > > Didn't closely read the kvm code :\
> > >
> > > If it were about allowing the optimization then I'd expect the guest to
> > > enable no-snoopable regions via its vIOMMU, relay them to the
> > > hypervisor, and plumb the whole thing through. Hence my remark about
> > > the guest page tables..
> > >
> > > So really the test is just 'were we able to block it' ?
> >
> > Yup. Do we really still consider that there's some performance benefit
> > to be had by enabling a device to use no-snoop? This seems largely a
> > legacy thing.
>
> I've had some no-snoop discussions recently.. The issue didn't
> vanish, it is still expensive going through all that cache hardware.
>
> > > But OK, back to /dev/ioasid. This answers a few lingering questions I
> > > had..
> > >
> > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > > domains.
> > >
> > > This doesn't actually matter. If you mix them together then kvm
> > > will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > > anywhere in this VM.
> > >
> > > Thus if two IOMMUs are joined together into a single /dev/ioasid
> > > then we can just make them both pretend to be
> > > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> >
> > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> > need to emulate wbinvd, but no, we'll use IOMMU_CACHE any time it's
> > available based on per-domain support. That gives us the most
> > consistent behavior, ie. we don't have VMs emulating wbinvd because
> > they once had a device attached whose domain required it and we can't
> > atomically remap with new flags to perform the same as a VM that
> > never had that device attached in the first place.
>
> I think we are saying the same thing..

Hrm? I think I'm saying the opposite of your "both not set
IOMMU_CACHE". IOMMU_CACHE is the mapping flag that enables
DMA_PTE_SNP. Maybe you're using IOMMU_CACHE as the state reported to
KVM?
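
For reference, this is roughly the VT-d map path as I remember it,
paraphrased rather than copied, so treat the exact condition as
approximate; the point is just that IOMMU_CACHE is consumed per
mapping:

	/* roughly what intel_iommu_map() does with the prot flags */
	if (iommu_prot & IOMMU_READ)
		prot |= DMA_PTE_READ;
	if (iommu_prot & IOMMU_WRITE)
		prot |= DMA_PTE_WRITE;
	if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
		prot |= DMA_PTE_SNP;

ie. asking for IOMMU_CACHE on a domain without snoop control is
silently ignored, so the state that matters to KVM is the domain
capability, not whether we happened to pass IOMMU_CACHE.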

> > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > >
> > > What we want to do here is iterate over every ioasid associated
> > > with the group fd that is passed into kvm.
> >
> > Yeah, we need some better names: binding a device to an ioasid (fd)
> > but then attaching a device to an allocated ioasid (non-fd)... I
> > assume you're talking about the latter ioasid.
>
> Fingers crossed on RFCv2.. Here I mean the IOASID object inside the
> /dev/iommu FD. The vfio_device would have some kref handle to the
> in-kernel representation of it. So we can interact with it..
>
> > > Or perhaps more directly: an op attaching the vfio_device to the
> > > kvm and having some simple helper
> > > '(un)register ioasid with kvm (kvm, ioasid)'
> > > that the vfio_device driver can call that just sorts this out.
> >
> > We could almost eliminate the device notion altogether here, use an
> > ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> > change to the composition of the device set for the ioasid, which is
> > why we currently do it on addition or removal of a group, where the
> > group has a consistent set of IOMMU properties.
>
> That is another quite good option, just forget about trying to be
> highly specific and feed in the /dev/ioasid FD and have kvm ask "does
> anything in here not enforce snoop?"
>
> With something appropriate to track/block changing that answer.
>
> It doesn't solve the problem to connect kvm to AP and kvmgt though

It does not; we'll probably need a vfio ioctl to gratuitously announce
the KVM fd to each device. I think some devices might currently fail
their open callback if that linkage isn't already available though, so
it's not clear when that announcement should happen. It can't
currently be a VFIO_DEVICE ioctl since getting the device fd requires
an open, but this proposal requires the vfio device fd to be available
without any setup, so presumably the driver open callback won't have
been called yet at that point.
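
Just to make that concrete, the sort of thing I mean; strawman only,
the ioctl name and layout below are invented for illustration:

	/*
	 * Strawman uAPI, nothing like this exists today.  A per-device
	 * ioctl handing the KVM fd to the vfio device driver so drivers
	 * like vfio-ap and kvmgt can find their kvm instance.
	 */
	struct vfio_device_set_kvm {
		__u32	argsz;
		__u32	flags;
		__s32	kvm_fd;		/* fd of the VM this device serves */
	};
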
Maybe that's part of the attach phase now... I'm not sure; it's not clear
clear when the vfio device uAPI starts being available in the process
of setting up the ioasid. Thanks,

Alex