RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

From: Tian, Kevin
Date: Thu Oct 21 2021 - 23:08:49 EST


> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Friday, October 22, 2021 7:31 AM
>
> On Thu, Oct 21, 2021 at 02:26:00AM +0000, Tian, Kevin wrote:
>
> > But in reality only Intel integrated GPUs have this special no-snoop
> > trick (fixed knowledge), with a dedicated IOMMU which doesn't
> > support enforce-snoop format at all. In this case there is no choice
> > that the user can further make.
>
> huh? That is not true at all. no-snoop is a PCIe spec behavior, any
> device can trigger it

Yes, I should have said Intel GPU 'drivers'.

>
> What is true today is that only Intel GPU drivers are crazy enough to
> use it on Linux without platform support.
>
> > Also per Christoph's comment no-snoop is not an encouraged
> > usage overall.
>
> I wouldn't say that, I think Christoph said using it without API
> support through the DMA layer is very wrong.

OK, it sounds like I drew the wrong impression from the previous discussion.

>
> DMA layer support could be added if there was interest, all the pieces
> are there to do it.
>
> > Given that I wonder whether the current vfio model better suites for
> > this corner case, i.e. just let the kernel to handle instead of
> > exposing it in uAPI. The simple policy (as vfio does) is to
> > automatically set enforce-snoop when the target IOMMU supports it,
> > otherwise enable vfio/kvm contract to handle no-snoop requirement.
>
> IMHO you need to model it as the KVM people said - if KVM can execute
> a real wbinvd in a VM then an ioctl should be available to normal
> userspace to run the same instruction.
>
> So, figure out some rules to add a wbinvd ioctl to iommufd that makes
> some kind of sense and logically kvm is just triggering that ioctl,
> including whatever security model protects it.

The wbinvd instruction is x86-specific. Here we'd want a generic cache
invalidation ioctl backed by some form of arch callback, though x86 is
the only platform of concern for now.
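
To make that concrete, below is a rough sketch of how such a generic
ioctl could be dispatched through an arch hook. Everything named
iommufd_* and the uapi struct are made up for illustration;
wbinvd_on_all_cpus() is the existing x86 helper:

#include <linux/errno.h>
#include <linux/types.h>
#ifdef CONFIG_X86
#include <asm/smp.h>            /* wbinvd_on_all_cpus() */
#endif

/*
 * Hypothetical uapi: a generic "invalidate CPU caches for non-coherent
 * DMA" command. The struct and all iommufd_* names below are made up
 * for this sketch.
 */
struct iommu_cache_invalidate {
        __u32 size;             /* sizeof(struct), for extensibility */
        __u32 flags;            /* must be 0 for now */
};

#ifdef CONFIG_X86
static int iommufd_arch_cache_invalidate(void)
{
        /* Simplest possible policy: flush the whole cache hierarchy on every CPU */
        wbinvd_on_all_cpus();
        return 0;
}
#else
static int iommufd_arch_cache_invalidate(void)
{
        return -EOPNOTSUPP;     /* no comparable mechanism defined (yet) */
}
#endif

static int iommufd_cache_invalidate(struct iommu_cache_invalidate *cmd)
{
        if (cmd->flags)
                return -EOPNOTSUPP;
        return iommufd_arch_cache_invalidate();
}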

>
> I have no idea what security model makes sense for wbinvd, that is the
> major question you have to answer.

wbinvd flushes the entire cache on the local CPU. It's more of a
performance-isolation problem, but nothing can prevent that once the
user is allowed to call this ioctl. This is the main reason why wbinvd
is a privileged instruction and why KVM emulates it as a nop unless an
assigned device has a no-snoop requirement. Alternatively the user may
use clflush, which is unprivileged and invalidates a specific cache
line, though it is not efficient for flushing a big buffer.
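
For illustration, flushing a buffer from userspace with clflush looks
roughly like the below. This is only a sketch: the function name and
the hardcoded 64-byte line size are mine (the real line size should be
read from CPUID), and it relies on the SSE2 intrinsics:

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>          /* _mm_clflush(), _mm_mfence() */

/* Assumption for the sketch; the real value comes from CPUID leaf 1. */
#define CACHE_LINE_SIZE 64

static void flush_buffer_clflush(const void *buf, size_t len)
{
        /* Round down to a cache-line boundary so the first line is covered. */
        uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
        uintptr_t end = (uintptr_t)buf + len;

        /* Evict every cache line covering the buffer, one clflush per line. */
        for (; p < end; p += CACHE_LINE_SIZE)
                _mm_clflush((const void *)p);

        /* clflush is only guaranteed to be ordered by mfence. */
        _mm_mfence();
}

This is fine for a small region but becomes very slow for a large
buffer, which is why wbinvd is attractive in the first place.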

One tricky thing is that the process might be scheduled onto different
CPUs between writing buffers and calling the wbinvd ioctl. Since wbinvd
only acts on the local CPU, the ioctl has to issue wbinvd on all CPUs
that this process has previously been scheduled on.

KVM maintains a dirty CPU mask for this via its preempt notifier (see
kvm_sched_in/out).

Is there any concern if iommufd follows the same mechanism? Currently
it looks like the preempt notifier is only used by KVM, and I'm not
sure whether there are strong criteria around using it. Also, this
local-CPU behavior may not apply to all platforms (so perhaps it's
better hidden behind an arch callback?).
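
For reference, the KVM-like scheme I have in mind would look roughly as
below, refining the all-CPU flush to only the CPUs the task has
actually run on. This is a pure sketch: all iommufd_* names are made
up, while the preempt notifier API, on_each_cpu_mask() and wbinvd() are
existing kernel interfaces (mirroring what KVM does with
vcpu->arch.wbinvd_dirty_mask):

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/preempt.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/smp.h>
#include <asm/special_insns.h>  /* wbinvd() */

struct iommufd_wbinvd_ctx {
        struct preempt_notifier pn;
        cpumask_var_t dirty_cpus;       /* CPUs this task ran on since the last flush */
};

static void iommufd_sched_in(struct preempt_notifier *pn, int cpu)
{
        struct iommufd_wbinvd_ctx *ctx =
                container_of(pn, struct iommufd_wbinvd_ctx, pn);

        /* Remember every CPU the task gets scheduled onto. */
        cpumask_set_cpu(cpu, ctx->dirty_cpus);
}

static void iommufd_sched_out(struct preempt_notifier *pn,
                              struct task_struct *next)
{
        /* Nothing to do; the mask is cleared by the flush path below. */
}

static struct preempt_ops iommufd_preempt_ops = {
        .sched_in       = iommufd_sched_in,
        .sched_out      = iommufd_sched_out,
};

static void iommufd_wbinvd_ipi(void *unused)
{
        wbinvd();       /* flush the entire cache on this CPU */
}

static int iommufd_wbinvd_ctx_init(struct iommufd_wbinvd_ctx *ctx)
{
        if (!zalloc_cpumask_var(&ctx->dirty_cpus, GFP_KERNEL))
                return -ENOMEM;

        preempt_notifier_inc();
        preempt_notifier_init(&ctx->pn, &iommufd_preempt_ops);
        preempt_disable();
        preempt_notifier_register(&ctx->pn);    /* start tracking the current task */
        preempt_enable();
        return 0;
        /* teardown (unregister/dec/free) omitted from the sketch */
}

/* Called from the (hypothetical) cache-invalidation ioctl. */
static void iommufd_flush_dirty_cpus(struct iommufd_wbinvd_ctx *ctx)
{
        int cpu = get_cpu();

        /* wbinvd only acts locally, so IPI every CPU the task has dirtied. */
        cpumask_set_cpu(cpu, ctx->dirty_cpus);
        on_each_cpu_mask(ctx->dirty_cpus, iommufd_wbinvd_ipi, NULL, 1);
        cpumask_clear(ctx->dirty_cpus);
        put_cpu();
}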

>
> And obviously none of this should be hidden behind a private API to
> KVM.
>
> > I don't see any interest in implementing an Intel GPU driver fully
> > in userspace. If just talking about possibility, a separate uAPI can
> > be still introduced to allow the userspace to issue wbinvd as Paolo
> > suggested.
> >
> > One side-effect of doing so is that then we may have to support
> > multiple domains per IOAS when Intel GPU and other devices are
> > attached to the same IOAS.
>
> I think we already said the IOAS should represent a single IO page
> table layout?

Yes. I was just talking about a possible tradeoff if the aforementioned
option were feasible. Based on the above discussion we'll go back to
the one-IOAS-one-layout model.

>
> So if there is a need for incompatible layouts then the IOAS should be
> duplicated.
>
> Otherwise, I also think the iommu core code should eventually learn to
> share the io page table across HW instances. Eg ARM has a similar
> efficiency issue if there are multiple SMMU HW blocks.
>

Or we could introduce an 'alias IOAS' concept, where any change to one
IOAS is automatically replayed on its alias, for the case where two
IOASes are created only because of incompatible layouts.
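
Roughly, the replay could look like the below. This is a pure sketch:
none of these types or helpers exist, and the map/unmap names are made
up just to show the idea of mirroring every change onto the alias:

#include <linux/types.h>

struct iommufd_ioas;            /* opaque for the sketch */

/* Hypothetical single-IOAS map/unmap helpers. */
int iommufd_ioas_map_one(struct iommufd_ioas *ioas, unsigned long iova,
                         unsigned long vaddr, size_t length, int iommu_prot);
void iommufd_ioas_unmap_one(struct iommufd_ioas *ioas, unsigned long iova,
                            size_t length);

/* Two page tables that only differ in enforce-snoop capability. */
struct iommufd_ioas_alias {
        struct iommufd_ioas *primary;   /* e.g. enforce-snoop layout */
        struct iommufd_ioas *alias;     /* e.g. non-enforce-snoop layout */
};

/* Every map on the primary IOAS is replayed on the alias, unwound on failure. */
static int ioas_alias_map(struct iommufd_ioas_alias *p, unsigned long iova,
                          unsigned long vaddr, size_t length, int iommu_prot)
{
        int rc;

        rc = iommufd_ioas_map_one(p->primary, iova, vaddr, length, iommu_prot);
        if (rc)
                return rc;

        rc = iommufd_ioas_map_one(p->alias, iova, vaddr, length, iommu_prot);
        if (rc)
                iommufd_ioas_unmap_one(p->primary, iova, length);
        return rc;
}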

Thanks
Kevin