Re: [PATCH] iommu/vt-d: Check for non-NULL domain on device release

From: Eric Badger
Date: Tue Jan 16 2024 - 18:16:13 EST


On Tue, Jan 16, 2024 at 11:22:15AM -0400, Jason Gunthorpe wrote:
> On Sat, Jan 13, 2024 at 10:17:13AM -0800, Eric Badger wrote:
> > In the kdump kernel, the IOMMU will operate in deferred_attach mode. In
> > this mode, info->domain may not yet be assigned by the time the
> > release_device function is called, which leads to the following crash in
> > the crashkernel:
>
> This never worked right? But why are you getting to release in a crash
> kernel in the first place, that seems kinda weird..

I first encountered this crash as part of an upgrade from 5.15.125 to
6.6.0. I'd guess this has not worked since p2sb_bar() was plumbed into
i801_probe(). Prior to that, there was similar code in i801_probe(), but
no call to pci_stop_and_remove_bus_device().

>
> > BUG: kernel NULL pointer dereference, address: 000000000000003c
> > ...
> > RIP: 0010:do_raw_spin_lock+0xa/0xa0
> > ...
> > _raw_spin_lock_irqsave+0x1b/0x30
> > intel_iommu_release_device+0x96/0x170
> > iommu_deinit_device+0x39/0xf0
> > __iommu_group_remove_device+0xa0/0xd0
> > iommu_bus_notifier+0x55/0xb0
> > notifier_call_chain+0x5a/0xd0
> > blocking_notifier_call_chain+0x41/0x60
> > bus_notify+0x34/0x50
> > device_del+0x269/0x3d0
> > pci_remove_bus_device+0x77/0x100
> > p2sb_bar+0xae/0x1d0
> > ...
> > i801_probe+0x423/0x740
> >
> > Signed-off-by: Eric Badger <ebadger@xxxxxxxxxxxxxxx>
>
> It should have a fixes line, but I'm not sure what it is..

Starting from my known-working 5.15.125 base, I worked my way forward to
try to find the offenders. I think it's mainly these two:

dd8a25c557e1 ("iommu: Remove deferred attach check from __iommu_detach_device()")
586081d3f6b1 ("iommu/vt-d: Remove DEFER_DEVICE_DOMAIN_INFO")

Prior to the first one, we always end up doing a non-deferred attach on
the device, so it's not possible to have a not-yet-attached device at
release time.

The second one looks to be necessary to get the situation where we have
non-NULL info, but NULL info->domain.

Since there was a lot of refactoring done between 5.15.125 and 6.6.0, I
couldn't just cherry pick those two back to 5.15.125. But I was able to
cherry pick these two sequences with minimal manual massaging and that
reproed the crash in the crashkernel in intel_iommu_release_device() on
5.15.125:

a688b376fa5e iommu/vt-d: Remove DEFER_DEVICE_DOMAIN_INFO
c2ed7ea06b6b iommu/vt-d: Remove domain and devinfo mempool
d195dd05b950 iommu/vt-d: Remove iova_cache_get/put()
0822172fa474 iommu/vt-d: Remove finding domain in dmar_insert_one_dev_info()
443d995073fb iommu/vt-d: Remove intel_iommu::domains

f394cb40eaa7 iommu: Remove deferred attach check from __iommu_detach_device()
09d782454860 iommu: Remove unused argument in is_attach_deferred
35c672b9e7ed iommu: Add DMA ownership management interfaces
6b9638d70dff iommu: Use right way to retrieve iommu_ops

>
> > ---
> > drivers/iommu/intel/iommu.c | 10 ++++++----
> > 1 file changed, 6 insertions(+), 4 deletions(-)
>
> Unfortunately this issue is likely systemic in all the drivers.
>
> Something I've been thinking about for a while now is to have the
> option for the core code release to set the driver to a specific
> releasing domain (probably a blocking or identity global static)
>
> A lot of drivers are open coding this internally in their release
> functions, like Intel. But it really doesn't deserve to be special.
>
> Obviously if we structure things like that then this problem goes away
> too. It looks to me like domain_context_clear_one() is doing BLOCKED
> to me..
>
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 6fb5f6fceea1..26e450259889 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -3750,7 +3750,6 @@ static void domain_context_clear(struct device_domain_info *info)
> > static void dmar_remove_one_dev_info(struct device *dev)
> > {
> > struct device_domain_info *info = dev_iommu_priv_get(dev);
> > - struct dmar_domain *domain = info->domain;
> > struct intel_iommu *iommu = info->iommu;
> > unsigned long flags;
> >
> > @@ -3763,11 +3762,14 @@ static void dmar_remove_one_dev_info(struct device *dev)
> > domain_context_clear(info);
> > }
> >
> > - spin_lock_irqsave(&domain->lock, flags);
> > + if (!info->domain)
> > + return;
> > +
> > + spin_lock_irqsave(&info->domain->lock, flags);
> > list_del(&info->link);
> > - spin_unlock_irqrestore(&domain->lock, flags);
> > + spin_unlock_irqrestore(&info->domain->lock, flags);
> >
> > - domain_detach_iommu(domain, iommu);
> > + domain_detach_iommu(info->domain, iommu);
> > info->domain = NULL;
> > }
>
> This function is almost a copy of device_block_translation(), with
> some bugs added :\
>
> Lu can they be made the same? It seems to me BLOCKED could always
> clear the domain context, even in scalable mode?
>
> Anyhow, my suggestion is more like:
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 15699fe5418e3d..6c683fd4bc05ec 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -3747,30 +3747,6 @@ static void domain_context_clear(struct device_domain_info *info)
> &domain_context_clear_one_cb, info);
> }
>
> -static void dmar_remove_one_dev_info(struct device *dev)
> -{
> - struct device_domain_info *info = dev_iommu_priv_get(dev);
> - struct dmar_domain *domain = info->domain;
> - struct intel_iommu *iommu = info->iommu;
> - unsigned long flags;
> -
> - if (!dev_is_real_dma_subdevice(info->dev)) {
> - if (dev_is_pci(info->dev) && sm_supported(iommu))
> - intel_pasid_tear_down_entry(iommu, info->dev,
> - IOMMU_NO_PASID, false);
> -
> - iommu_disable_pci_caps(info);
> - domain_context_clear(info);
> - }
> -
> - spin_lock_irqsave(&domain->lock, flags);
> - list_del(&info->link);
> - spin_unlock_irqrestore(&domain->lock, flags);
> -
> - domain_detach_iommu(domain, iommu);
> - info->domain = NULL;
> -}
> -
> /*
> * Clear the page table pointer in context or pasid table entries so that
> * all DMA requests without PASID from the device are blocked. If the page
> @@ -4284,7 +4260,6 @@ static void intel_iommu_release_device(struct device *dev)
> {
> struct device_domain_info *info = dev_iommu_priv_get(dev);
>
> - dmar_remove_one_dev_info(dev);
> intel_pasid_free_table(dev);
> intel_iommu_debugfs_remove_dev(info);
> kfree(info);
> @@ -4743,6 +4718,7 @@ static const struct iommu_dirty_ops intel_dirty_ops = {
>
> const struct iommu_ops intel_iommu_ops = {
> .blocked_domain = &blocking_domain,
> + .release_domain = &blocking_domain,
> .capable = intel_iommu_capable,
> .hw_info = intel_iommu_hw_info,
> .domain_alloc = intel_iommu_domain_alloc,
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 068f6508e57c9b..9fbbd83a82d4e9 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -442,6 +442,8 @@ static int iommu_init_device(struct device *dev, const struct iommu_ops *ops)
> err_unlink:
> iommu_device_unlink(iommu_dev, dev);
> err_release:
> + if (ops->release_domain)
> + ops->release_domain->ops->attach_dev(ops->release_domain, dev);
> if (ops->release_device)
> ops->release_device(dev);
> err_module_put:
> @@ -461,6 +463,15 @@ static void iommu_deinit_device(struct device *dev)
>
> iommu_device_unlink(dev->iommu->iommu_dev, dev);
>
> + /*
> + * If the iommu driver provides release_domain then the core code
> + * ensures that domain is attached prior to calling release_device.
> + * Drivers can use this to enforce a translation on the idle iommu.
> + * Usually the global static blocked_domain is a good choice.
> + */
> + if (ops->release_domain)
> + ops->release_domain->ops->attach_dev(ops->release_domain, dev);
> +
> /*
> * release_device() must stop using any attached domain on the device.
> * If there are still other devices in the group they are not effected
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index c34d0ef1703d68..1e8be165c878d8 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -511,6 +511,7 @@ struct iommu_ops {
> struct module *owner;
> struct iommu_domain *identity_domain;
> struct iommu_domain *blocked_domain;
> + struct iommu_domain *release_domain;
> struct iommu_domain *default_domain;
> };
>

FWIW, I took this patch for a spin, appears to work fine.

Thanks,
Eric