On 3.11.2023 12:55, Hector Martin wrote:
I just hit a crash in of_iommu_xlate() -> apple_dart_of_xlate() becauseFWIW I've been getting inexplicable boot-time crashes that sometimes
dev->iommu was NULL. of_iommu_xlate() first calls iommu_fwspec_init
which calls dev_iommu_get(), which allocates that member if NULL. That
means it got freed in between, but the only thing that can do that is
dev_iommu_free(), which is called from __iommu_probe_device() in the
error path. That is serialized via a static lock, but not against the
xlate stuff.
I think the specific sequence of events was as follows:
- IOMMU driver has not probed yet
- Device driver tries to probe, and gets deferred via of_iommu_xlate()
-> driver_deferred_probe_check_state() because there are no IOMMU ops yet
- IOMMU driver probes
- IOMMU driver registration triggers device probes
- IOMMU device probe fails, because there is no fwnode/OF data yet (e.g.
apple_dart_probe_device returns ENODEV if dev_iommu_priv_get() returns
NULL, and that is set in apple_dart_of_xlate())
- __iommu_probe_device is in the error exit path, and at this exact
point a parallel device probe is running of_iommu_xlate()
- of_iommu_xlate() calls iommu_fwspec_init(), which ensures dev->iommu
is non-NULL, which at this point it is
- immediately after that, __iommu_probe_device() calls dev_iommu_free()
since it is in the process of erroring out. This frees and sets
dev->iommu to NULL.
- of_iommu_xlate() calls ops->of_xlate()
- apple_dart_of_xlate() calls dev_iommu_priv_set(), which crashes
because dev->iommu is now NULL.
As far as I can tell it's not just the specific driver xlate call
setting priv that's the problem here, but there is one big race between
the entire fwspec codepath (accessing dev->iommu->fwspec) and
__iommu_probe_device() (allocating and freeing dev->iommu).
Thinking about this whole thing is making my brain hurt. Thoughts? How
do we fix this?
spew out a fraction of a log line like:
[x.yyyyyyyy] addr.iommu
on some Qualcomm devices every now and then for quite some time..
Not very common though. Might be this, might be something else..