Re: [PATCH v4 4/7] iommu: Switch __iommu_domain_alloc() to device ops

From: Robin Murphy
Date: Mon Oct 02 2023 - 15:02:35 EST


On 02/10/2023 3:16 pm, Jason Gunthorpe wrote:
On Mon, Oct 02, 2023 at 02:49:12PM +0100, Robin Murphy wrote:
@@ -2120,20 +2120,30 @@ static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
return domain;
}
-static struct iommu_domain *
-__iommu_group_domain_alloc(struct iommu_group *group, unsigned int type)
+static int __iommu_domain_alloc_dev(struct device *dev, void *data)
{

Why?

Because out of 3 callsites, two were in a context which now needed to
make the iommu_group_first_dev() call itself anyway, thus recalculating
it twice more with the mutex still held would clearly be silly, and so
rather than bother refactoring the __iommu_group_domain_alloc() helper I
squashed it into its one remaining user. I don't know a good way to tell
git-format-patch to show a particular range of lines as a complete
removal of one function plus the addition of an entirely unrelated new
one which just happens to be in the same place.

The point of this design is that drivers are not allowed to
allocate different things for devices in the same group. So we always
force the driver to see only the first device in the group even if we
have a more specific device available in the call chain.

OK, but I can't read your mind. All I have visibility of is some
code which factored out a helper function for a sequence common to
several users, as is typical; neither the commit message of 8359cf39acba
nor the cover letter from that series provide any obvious explanation of
this "design".

This patch has undone this design and passed in more specific devs :(

Um... Good? I mean in 3/4 cases it's literally the exact same code just
factored out again, while the one I've added picks some arbitrary device
in a different way. But the first device in the group list is still just
that - some arbitrary device - it's by no means guaranteed to be the
*same* device each time the group is re-locked, so pretending it's some
kind of useful invariant is misleading and wrong. Frankly now that you
*have* explained the design intent here, it's only made me more set on
removing it for being unclear, overcomplicated, and yet fundamentally
useless.

Yes, we absolutely expect that if dev_iommu_ops(dev)->device_group for
two devices returns the same group, the driver should bloody well give
the same result for either dev_iommu_ops(dev)->domain_alloc, but there's
no practical way to enforce that at this level of code. If a driver ever
did have inconsistent behaviour across devices within a group, trying to
always use the "one true device" would only help *hide* that and make it
harder to debug, not at all prevent it.

The new code here:

struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
{
- if (bus == NULL || bus->iommu_ops == NULL)
+ struct device *dev = NULL;
+
+ /* We always check the whole bus, so the return value isn't useful */
+ bus_for_each_dev(bus, NULL, &dev, __iommu_domain_alloc_dev);
+ if (!dev)
return NULL;
- return __iommu_domain_alloc(bus->iommu_ops, NULL,
- IOMMU_DOMAIN_UNMANAGED);
+
+ return __iommu_domain_alloc(dev, IOMMU_DOMAIN_UNMANAGED);
}
EXPORT_SYMBOL_GPL(iommu_domain_alloc);

Should just obtain any group for the bus and pass that to
__iommu_group_domain_alloc().

Also, how does the locking work here? Definately can't pass dev
outside the bus_for_each_dev() like this.

It works like in most of the rest of the API right now... this is still
the process of cleaning up the old ugly stuff in order to be able to
evolve the API foundations to the point where we *can* have a reasonable
locking scheme. It's a multi-step process, and I'd really like to get
the internal bus-ops-to-instance-ops transition completed in order to
then clean up the "of_iommu_configure() at driver probe time"
catastrophe, get iommu_domain_alloc() callers to pass a valid device
themselves, teach the ARM DMA ops about default domains (for which it's
a great help that you've now got the driver-side groundwork done,
thanks!), and then maybe we can finally have nice things.

If this needs to sweep over arbitary devices that are not the caller's
probe'd device it needs to hold at least the device_lock to prevent
racing with release.

So I'd structure this to find the matching device, lock the
device_lock, get the group refcount, unlock the device_lock then
get the group_mutex, check for non-empty and then call
__iommu_group_domain_alloc()

...and as your reward you'd get much the same deadlocks as with the last
attempt to bring device_lock into it, thanks to
arm_iommu_create_mapping(), and drivers calling iommu_domain_alloc()
from their own probe routines :(

FYI the diff below is what I've hacked out so far, but I'll test it with
a fresh head tomorrow (just pasting it in here I've noticed at least one
bug...)

(there is a missing lockdep annotation in
__iommu_group_domain_alloc(), the group mutex is needed)

(fear not, iommu_group_first_dev() brings that to those places!)

Thanks,
Robin.

----->8-----
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3d29434d57c6..78366e988e31 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2128,27 +2128,41 @@ static struct iommu_domain *__iommu_domain_alloc(struct device *dev,
static int __iommu_domain_alloc_dev(struct device *dev, void *data)
{
struct device **alloc_dev = data;
+ struct iommu_group *group = iommu_group_get(dev);
- if (!dev_has_iommu(dev))
+ if (!group)
return 0;
- WARN_ONCE(*alloc_dev && dev_iommu_ops(dev) != dev_iommu_ops(*alloc_dev),
- "Multiple IOMMU drivers present, which the public IOMMU API can't fully support yet. You may still need to disable one or more to get the expected result here, sorry!\n");
+ mutex_lock(&group->mutex);
+ /* Theoretically we could have raced against release */
+ if (list_empty(&group->devices)) {
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+ return 0;
+ }
- *alloc_dev = dev;
+ if (!*alloc_dev)
+ *alloc_dev = dev;
+
+ WARN_ONCE(dev_iommu_ops(dev) != dev_iommu_ops(*alloc_dev),
+ "Multiple IOMMU drivers present, which the public IOMMU API can't fully support yet. You may still need to disable one or more to get the expected result here, sorry!\n");
return 0;
}
struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
{
struct device *dev = NULL;
+ struct iommu_domain *dom;
/* We always check the whole bus, so the return value isn't useful */
bus_for_each_dev(bus, NULL, &dev, __iommu_domain_alloc_dev);
if (!dev)
return NULL;
- return __iommu_domain_alloc(dev, IOMMU_DOMAIN_UNMANAGED);
+ dom = __iommu_domain_alloc(dev, IOMMU_DOMAIN_UNMANAGED);
+ mutex_unlock(&dev->iommu_group->mutex);
+ iommu_group_put(dev->iommu_group);
+ return dom;
}
EXPORT_SYMBOL_GPL(iommu_domain_alloc);