Re: Race between of_iommu_configure() and iommu_probe_device()

From: Konrad Dybcio
Date: Fri Nov 03 2023 - 08:48:39 EST




On 3.11.2023 12:55, Hector Martin wrote:
> I just hit a crash in of_iommu_xlate() -> apple_dart_of_xlate() because
> dev->iommu was NULL. of_iommu_xlate() first calls iommu_fwspec_init
> which calls dev_iommu_get(), which allocates that member if NULL. That
> means it got freed in between, but the only thing that can do that is
> dev_iommu_free(), which is called from __iommu_probe_device() in the
> error path. That is serialized via a static lock, but not against the
> xlate stuff.
>
> I think the specific sequence of events was as follows:
>
> - IOMMU driver has not probed yet
> - Device driver tries to probe, and gets deferred via of_iommu_xlate()
> -> driver_deferred_probe_check_state() because there are no IOMMU ops yet
> - IOMMU driver probes
> - IOMMU driver registration triggers device probes
> - IOMMU device probe fails, because there is no fwnode/OF data yet (e.g.
> apple_dart_probe_device returns ENODEV if dev_iommu_priv_get() returns
> NULL, and that is set in apple_dart_of_xlate())
> - __iommu_probe_device is in the error exit path, and at this exact
> point a parallel device probe is running of_iommu_xlate()
> - of_iommu_xlate() calls iommu_fwspec_init(), which ensures dev->iommu
> is non-NULL, which at this point it is
> - immediately after that, __iommu_probe_device() calls dev_iommu_free()
> since it is in the process of erroring out. This frees and sets
> dev->iommu to NULL.
> - of_iommu_xlate() calls ops->of_xlate()
> - apple_dart_of_xlate() calls dev_iommu_priv_set(), which crashes
> because dev->iommu is now NULL.
>
> As far as I can tell it's not just the specific driver xlate call
> setting priv that's the problem here, but there is one big race between
> the entire fwspec codepath (accessing dev->iommu->fwspec) and
> __iommu_probe_device() (allocating and freeing dev->iommu).
>
> Thinking about this whole thing is making my brain hurt. Thoughts? How
> do we fix this?
FWIW I've been getting inexplicable boot-time crashes that sometimes
spew out a fraction of a log line like:

[x.yyyyyyyy] addr.iommu

on some Qualcomm devices every now and then for quite some time..
Not very common though. Might be this, might be something else..

Konrad