Re: [PATCH 1/4] iommu: Add iommu_device_group callback and iommu_group sysfs entry

From: Alex Williamson
Date: Thu Dec 01 2011 - 02:28:40 EST


On Wed, 2011-11-30 at 18:05 -0800, Chris Wright wrote:
> * Benjamin Herrenschmidt (benh@xxxxxxxxxxxxxxxxxxx) wrote:
> > On Wed, 2011-11-30 at 17:04 -0800, Chris Wright wrote:
> > > Heh. Put it another way. Generating the group ID is left up to the
> > > IOMMU. This will break down when there's a system with multiple IOMMUs
> > > on the same bus_type that don't have any awareness of one another. This
> > > is not the case for the existing series and x86 hw.
> > >
> > > I'm not opposed to doing the allocation and ptr as id (taking care for
> > > possibility that PCI hotplug/unplug/replug could reuse the same memory
> > > for group id, however). Just pointing out that the current system works
> > > as is, and there's some value in its simplicity (overloading ID ==
> > > group structure + pretty printing ID in sysfs, for example).
> >
> > Well, ID can work even with multiple domains since we have domains
> > numbers. bdfn is 16-bit, which leaves 16-bit for the domain number,
> > which is sufficient.
> >
> > So by encoding (domain << 16) | bdfn, we can get away with a 32-bit
> > number... it just sucks.
>
> Yup, that's just what Alex did for VT-d ;)
>
> +	union {
> +		struct {
> +			u8 devfn;
> +			u8 bus;
> +			u16 segment;
> +		} pci;
> +		u32 group;
> +	} id;
>
> Just that the alias table used for AMD IOMMU to map bdf -> requestor ID
> is not multi-segment aware, so the id is only bdf of bridge.
>
> > Note that on pseries, I wouldn't use bdfn anyway, I would use my
> > internal "PE#" which is also a number that I can constrain to 16 bits.
> >
> > So I can work with a number as long as it's at least an unsigned int
> > (32-bit), but I think it somewhat sucks, and will impose gratuitous
> > number <-> structure conversions all over, but if we keep the whole
> > group thing an iommu specific data structure, then let's stick to the
> > number and move on with life.
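
For illustration, a minimal sketch of the (domain << 16) | bdfn packing
described above, assuming a PCI device; the helper name is hypothetical
and not part of the series:

#include <linux/pci.h>

/*
 * Pack the PCI segment (domain) number and bus/devfn into a single
 * 32-bit group ID, as in the encoding discussed above.
 */
static u32 pci_dev_to_groupid(struct pci_dev *pdev)
{
	u16 domain = pci_domain_nr(pdev->bus);
	u16 bdfn = ((u16)pdev->bus->number << 8) | pdev->devfn;

	return ((u32)domain << 16) | bdfn;
}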

I think you're overemphasizing these number <-> struct conversions.
Nowhere in the iommu/driver core code is there a need for this. It's a
simple struct device -> groupid translation. Even in the vfio code we
only rarely need to do lookups based on groupid, and never in anything
resembling a performance path.
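
As a rough sketch of what that translation amounts to (the exact
callback signature in the series may differ), the core simply defers to
the bus's IOMMU driver:

#include <linux/device.h>
#include <linux/errno.h>
#include <linux/iommu.h>

/* struct device -> groupid: ask the IOMMU driver behind the device's bus. */
int iommu_device_group(struct device *dev, unsigned int *groupid)
{
	if (iommu_present(dev->bus) && dev->bus->iommu_ops->device_group)
		return dev->bus->iommu_ops->device_group(dev, groupid);

	return -ENODEV;
}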

> > We might get better results if we kept the number as
> >
> > struct iommu_group_id {
> > 	u16 domain;
> > 	u16 group;
> > };
> >
> > (Or a union of that with an unsigned int)
> >
> > That way the domain information is available generically (can be matched
> > with pci_domain_nr() for example), and sysfs can then be laid out as
> >
> > /sys/bus/pci/groups/<domain>/<id>
> >
> > Which is nicer than having enormous IDs
>
> Seems fine to me (although I missed /sys/bus/pci/groups/ introduction),
> except that I think the freescale folks aren't interested in PCI, which
> is one reason why the thing is just an opaque id.

I missed the /sys/bus/pci/groups/ introduction as well. Groupids are
not PCI specific. FWIW, the current sysfs representation looks
something like this:

$ cat /sys/bus/pci/devices/0000:01:10.6/iommu_group
384

$ ls -l /sys/devices/virtual/vfio/pci:384
total 0
-r--r--r-- 1 root root 4096 Nov 30 16:25 dev
drwxr-xr-x 2 root root 0 Nov 30 16:25 devices
drwxr-xr-x 2 root root 0 Nov 30 16:25 power
lrwxrwxrwx 1 root root 0 Nov 30 16:24 subsystem -> ../../../../class/vfio
-rw-r--r-- 1 root root 4096 Nov 30 16:24 uevent

$ cat /sys/devices/virtual/vfio/pci:384/dev
252:28

$ ls -l /dev/vfio/pci:384
crw-rw---- 1 root root 252, 28 Nov 30 16:24 /dev/vfio/pci:384

$ ls -l /sys/devices/virtual/vfio/pci:384/devices
total 0
lrwxrwxrwx 1 root root 0 Nov 30 16:36 0000:01:10.0 -> ../../../../pci0000:00/0000:00:01.0/0000:01:10.0
lrwxrwxrwx 1 root root 0 Nov 30 16:25 0000:01:10.1 -> ../../../../pci0000:00/0000:00:01.0/0000:01:10.1
lrwxrwxrwx 1 root root 0 Nov 30 16:36 0000:01:10.2 -> ../../../../pci0000:00/0000:00:01.0/0000:01:10.2
lrwxrwxrwx 1 root root 0 Nov 30 16:25 0000:01:10.3 -> ../../../../pci0000:00/0000:00:01.0/0000:01:10.3
lrwxrwxrwx 1 root root 0 Nov 30 16:36 0000:01:10.4 -> ../../../../pci0000:00/0000:00:01.0/0000:01:10.4
lrwxrwxrwx 1 root root 0 Nov 30 16:25 0000:01:10.5 -> ../../../../pci0000:00/0000:00:01.0/0000:01:10.5
lrwxrwxrwx 1 root root 0 Nov 30 16:36 0000:01:10.6 -> ../../../../pci0000:00/0000:00:01.0/0000:01:10.6
lrwxrwxrwx 1 root root 0 Nov 30 16:25 0000:01:10.7 -> ../../../../pci0000:00/0000:00:01.0/0000:01:10.7

$ for i in $(find /sys/devices/virtual/vfio/pci:384/devices/ -type l); do cat $i/iommu_group; echo; done
384
384
384
384
384
384
384
384
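
The "pci:<groupid>" name is just the bus name and the group ID joined
with a colon; a hypothetical sketch (the helper below is not from the
series):

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/slab.h>

/* Compose a group name like "pci:384" from {bus_type, groupid}. */
static char *group_name(struct bus_type *bus, unsigned int groupid)
{
	return kasprintf(GFP_KERNEL, "%s:%u", bus->name, groupid);
}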

I'm a little annoyed with the pci:384 notation, but that seems to be how
it works out best for using {bus_type, groupid}. Thanks,

Alex
