Re: [PATCH 1/4] iommu: Add iommu_device_group callback andiommu_group sysfs entry

From: Alex Williamson
Date: Thu Dec 01 2011 - 17:38:07 EST


On Fri, 2011-12-02 at 08:46 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2011-12-01 at 07:34 -0700, Alex Williamson wrote:
>
> > We've got multiple levels when we add qemu and guests into the mix. A
> > group is the smallest assignable unit for kernel->userspace. In fact,
> > vfio is constructed so that the user cannot do anything with a group
> > until all the devices of the group are bound to the vfio bus drivers.
> > Qemu, as a userspace driver, must therefore take ownership of the entire
> > group. However, there's no requirement that a userspace driver must
> > make use of all the devices in the group, so qemu is free to expose
> > individual devices from the group to the guest.
>
> Sure but that has nothing to do with your kernel->user API. As you said,
> you still need to take "ownership" of the entire group. If qemu chose to
> only present to the guest part of that group, it's qemu's problem.
>
> Put it differently. One day, maybe, we'll finally get some clues and
> implement a proper way for admin tools to know what can or cannot be put
> into a guest. In any case, how would you present to the user
> (administrator) the groups ? By individual devices with odd behaviours /
> errors etc... when the said user/admin tries to put individual devices
> into different partitions while they belong in the same group ? Or by
> just exposing ... groups ?

Why are we even talking about this here? We both agree on the
kernel->user ownership model and it's qemu's decision what to do from
there. x86 will typically have groups with a single device. When
that's not the case, it really doesn't seem that hard for an admin tool
to notify the user and ask them if they want to also assign the
dependent device or leave it unused. The complexity vs "hey, why did
this other device just appear on my guest and disappear from my host" or
even "why won't this stupid tool let me separate these two devices"
feels like a wash.

> So while those various tools will want to see what's in the group (to
> represent it to the user, to expose it to the guest, for other finer
> grained operations such as MSI setup, etc...) the basic unit of
> ownership is and remains the group, and I don't see how it makes sense
> to have your underlying iommu interface operate on anything else.

Are we going to tear apart every aspect of the IOMMU API, or can we use
some pieces of it as it exists and incrementally improve it? Yes, if
the IOMMU really can't distinguish between devices in a group, it
probably doesn't matter which of them you attach to a domain. However,
groups of devices aren't a basic unit of work for the rest of the kernel
and even if the iommu device decoders can't tell the difference between
transactions from hardware, the iommu driver might care for accounting
purposes.

> > IMHO, it doesn't make
> > sense to have a default model saying "I know you just wanted the nic,
> > but it's in the same group as this graphics card, so surprise, you get
> > both!". Obviously if a user does want to expose multiple devices from a
> > group to a guest, we support that too.
>
> No you don't get it. That's the wrong way around. The user will be
> presented to start with with a group of nic+graphics card so that user
> knows from the beginning what's going on.
>
> Because your other option is to put the nic in ... and suddenly have the
> graphic card go away from the host without showing up in a guest. That
> won't be any better than having it "just show up" in the guest without
> asking for it.
>
> IE. In both cases, it's wrong. The only thing that makes sense is from
> the beginning, expose those two as a group so that the user has no other
> choice but put them both into the guest at once and have it represented
> to the user as such to begin with.

People have very different ideas about what's intuitive and how
interfaces should work, just look at gnome3/unity versus the useful
desktops we used to have. Neither of us are likely to write the piece
of management code that does this and the design decisions we're making
here enable either usage model.

> > Spitting groups among multiple VMs or between VM and native host drivers
> > defeats the purpose of the group. Neither of these are allowed.
> >
> > > Btw, did we get a quirk for the Ricoh multi-function devices which all
> > > need to be in the same group because they do all their DMA from function
> > > zero? I think we need another similar quirk for a Marvell SATA
> > > controller which seems to do its AHCI DMA from its IDE function; see
> > > https://bugzilla.redhat.com/757166
> >
> > No, as I mentioned, groups are currently for iommu_ops, not dma_ops,
> > though it makes sense that iommu drivers could use the group info or
> > create common quirk infrastructure for handling broken devices like
> > these. Thanks,
>
> Which is why the group info in iommu_ops should just be a representation
> of something else under the hood that needs to be sorted out more
> clearly than what we are doing at the moment.

I will be happy to review your patches to transform groups from an
identifier into an object and make them the pervasive unit of work in
the iommu layer. The above problem can be solved by:

struct pci_dev *pci_quirk_dma_dev(struct pci_dev *pdev)
{
if (some match code)
return lookup different device

return pdev
}

Which would get called by both the dma_ops code and iommu_device_group.
Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/