Re: [PATCH v4 1/2] iommu/s390: Fix race with release_device ops

From: Matthew Rosato
Date: Fri Sep 02 2022 - 14:29:41 EST


On 9/2/22 1:21 PM, Jason Gunthorpe wrote:
> On Fri, Sep 02, 2022 at 01:11:09PM -0400, Matthew Rosato wrote:
>> On 9/1/22 4:37 PM, Jason Gunthorpe wrote:
>>> On Thu, Sep 01, 2022 at 12:14:24PM -0400, Matthew Rosato wrote:
>>>> On 9/1/22 6:25 AM, Robin Murphy wrote:
>>>>> On 2022-08-31 21:12, Matthew Rosato wrote:
>>>>>> With commit fa7e9ecc5e1c ("iommu/s390: Tolerate repeat attach_dev
>>>>>> calls") s390-iommu is supposed to handle dynamic switching between IOMMU
>>>>>> domains and the DMA API handling.  However, this commit does not
>>>>>> sufficiently handle the case where the device is released via a call
>>>>>> to the release_device op as it may occur at the same time as an opposing
>>>>>> attach_dev or detach_dev since the group mutex is not held over
>>>>>> release_device.  This was observed if the device is deconfigured during a
>>>>>> small window during vfio-pci initialization and can result in WARNs and
>>>>>> potential kernel panics.
>>>>>
>>>>> Hmm, the more I think about it, something doesn't sit right about this whole situation... release_device is called via the notifier from device_del() after the device has been removed from its parent bus and largely dismantled; it should definitely not still have a driver bound by that point, so how is VFIO doing things that manage to race at all?
>>>>>
>>>>> Robin.
>>>>
>>>> So, I generally have seen the issue manifest as one of the calls
>>>> into the iommu core from __vfio_group_unset_container
>>>> (e.g. iommu_deatch_group via vfio_type1_iommu) failing with a WARN.
>>>> This happens when the vfio group fd is released, which could be
>>>> coming e.g. from a userspace ioctl VFIO_GROUP_UNSET_CONTAINER.
>>>> AFAICT there's nothing serializing the notion of calling into the
>>>> iommu core here against a device that is simultaneously going
>>>> through release_device (because we don't enter release_device with
>>>> the group mutex held), resulting in unpredictable behavior between
>>>> the dueling attach_dev/detach_dev and the release_device for
>>>> s390-iommu at least.
>>>
>>> Oh, this is a vfio bug.
>>
>> I've been running with your diff applied today on s390 and this
>> indeed fixes the issue by preventing the detach-after-release coming
>> out of vfio.
>
> Heh, I'm shocked it worked at all
>
> I've been trying to understand Robin's latest remarks because maybe I
> don't really understand your situation right.
>
> IMHO this is definately a VFIO bug, because in a single-device group
> we must not allow the domain to remain attached past remove(). Or more
> broadly we shouldn't be holding ownership of a group without also
> having a driver attached.>
> But this dicussion with Robin about multi-device groups and hotplug
> makes me wonder what your situation is? There is certainly something
> interesting there too, and this can't be a solution to that problem.
>

It's just a single-device group, that's all we do in s390 today. Of course, as we move forward to consume s390-iommu for other use-cases (e.g. Niklas DMA conversion) then yeah I think we will still need something within s390-iommu to handle competing e.g. attach/detach and release_device calls.

For this particular issue, the scenario is triggered by deconfiguring that one host device (effectively, pull the plug) while the device is passed to QEMU via vfio-pci.

>> Can you send as a patch for review?
>
> After I wrote this I had a better idea, to avoid the completion and
> just fully orphan the group fd.
>
> And the patch is kind of messy
>
> Can you forward me the backtrace you hit also?

It manifests in a few different ways (depends on the timing of the iommu_detach_group vs release_device), but this is by far the most common:

[ 402.718355] iommu driver failed to attach the default/blocking domain
[ 402.718380] WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
[ 402.718389] Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
[ 402.718439] CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G W 6.0.0-rc3 #5
[ 402.718442] Hardware name: IBM 3931 A01 782 (LPAR)
[ 402.718443] Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
[ 402.718447] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 402.718450] Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
[ 402.718453] 00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
[ 402.718455] 00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
[ 402.718457] 00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
[ 402.718464] Krnl Code: 000000095bb10d18: c020003d56fc larl %r2,000000095c2bbb10
000000095bb10d1e: c0e50019d901 brasl %r14,000000095be4bf20
#000000095bb10d24: af000000 mc 0,0
>000000095bb10d28: b904002a lgr %r2,%r10
000000095bb10d2c: ebaff0a00004 lmg %r10,%r15,160(%r15)
000000095bb10d32: c0f4001aa867 brcl 15,000000095be65e00
000000095bb10d38: c004002168e0 brcl 0,000000095bf3def8
000000095bb10d3e: eb6ff0480024 stmg %r6,%r15,72(%r15)
[ 402.718532] Call Trace:
[ 402.718534] [<000000095bb10d28>] iommu_detach_group+0x70/0x80
[ 402.718538] ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
[ 402.718540] [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
[ 402.718546] [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
[ 402.718552] [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
[ 402.718557] pci 0004:00:00.0: Removing from iommu group 4
[ 402.718555] [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
[ 402.718562] [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
[ 402.718566] [<000000095be6c072>] system_call+0x82/0xb0
[ 402.718570] Last Breaking-Event-Address:
[ 402.718571] [<000000095be4bf80>] __warn_printk+0x60/0x68


The WARN is hit because the attach fails due to the simultaneously in-flight release_device.

The subject patch was basically attempting to serialize the value that was being stomped over within s390-iommu (zdev->s390_domain) as a result of the competing iommu_detach_group + release_device. By ensuring the vfio group is cleaned up before the release, I no longer see the competing threads, with release_device happening after the group is cleaned up.

>
> (Though I'm not sure I can get to this promptly, I have only 4 working
> days before LPC and still many things to do)
>
> Thanks,
> Jason