Re: [RFC] /dev/ioasid uAPI proposal

From: Alex Williamson
Date: Fri Jun 04 2021 - 11:26:27 EST


[Cc +Paolo]

On Fri, 4 Jun 2021 09:28:30 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Fri, Jun 04, 2021 at 08:38:26AM +0000, Tian, Kevin wrote:
> > > I think more to drive the replacement design; if we can't figure out
> > > how to do something other than backwards compatibility trickery in the
> > > kernel, it's probably going to bite us. Thanks,
> >
> > I'm a bit lost on the desired flow in your minds. Here is one flow based
> > on my understanding of this discussion. Please comment whether it
> > matches your thinking:
> >
> > 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;
> >
> > 1) Qemu binds dev1 to ioasid_fd;
> >
> > 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_
> > CACHE info i.e. whether underlying IOMMU can enforce snoop;
> >
> > 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
> > needs to figure out whether dev1 wants to do no-snoop. This might
> > be based a fixed vendor/class list or specified by user;
> >
> > 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
> > flag is specified to decide the page table format, which is supposed
> > to match dev1;
>
> > 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this
> > point, specify snoop/no-snoop again. If not supported by related
> > iommu or different from what gpa_ioasid has, attach fails.
>
> Why do we need to specify it again?

My thought as well.

> If the IOASID was created with the "block no-snoop" flag then it is
> blocked in that IOASID, and that blocking sets the page table format.
>
> The only question is if we can successfully attach a device to the
> page table, or not.
>
> The KVM interface is a bit tricky because Alex said this is partially
> security, wbinvd is only enabled if someone has a FD to a device that
> can support no-snoop.
>
> Personally I think this got way too complicated, the KVM interface
> should simply be
>
> ioctl(KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label)
> ioctl(KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label)
>
> and let qemu sort it out based on command flags, detection, whatever.
>
> 'ioasidfd, device_label' is the security proof that Alex asked
> for. This needs to be some device in the ioasidfd that declares it is
> capabale of no-snoop. Eg vfio_pci would always declare it is capable
> of no-snoop.
>
> No kernel call backs, no kernel auto-sync/etc. If qemu mismatches the
> IOASID block no-snoop flag with the KVM_x_INCOHERENT_DMA state then it
> is just a kernel-harmless uerspace bug.
>
> Then user space can decide which of the various axis's it wants to
> optimize for.

Let's make sure the KVM folks are part of this decision; a re-cap for
them, KVM currently automatically enables wbinvd emulation when
potentially non-coherent devices are present which is determined solely
based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
to essentially force no-snoop transactions from a device to be cache
coherent. This synchronization is triggered via the kvm-vfio device,
where QEMU creates the device and adds/removes vfio group fd
descriptors as an additionally layer to prevent the user from enabling
wbinvd emulation on a whim.

IIRC, this latter association was considered a security/DoS issue to
prevent a malicious guest/userspace from creating a disproportionate
system load.

Where would KVM stand on allowing more direct userspace control of
wbinvd behavior? Would arbitrary control be acceptable or should we
continue to require it only in association to a device requiring it for
correct operation.

A wrinkle in "correct operation" is that while the IOMMU may be able to
force no-snoop transactions to be coherent, in the scenario described
in the previous reply, the user may intend to use non-coherent DMA
regardless of the IOMMU capabilities due to their own optimization
policy. There's a whole spectrum here, including aspects we can't
determine around the device driver's intentions to use non-coherent
transactions, the user's policy in trading hypervisor overhead for
cache coherence overhead, etc. Thanks,

Alex