Re: [PATCH 2/8] bus: fsl-mc: handle DMA config deferral in ACPI case

From: Laurentiu Tudor
Date: Thu Nov 18 2021 - 07:42:02 EST




On 11/17/2021 7:00 PM, Daniel Thompson wrote:
> On Wed, Nov 17, 2021 at 05:30:32PM +0200, Laurentiu Tudor wrote:
>> On 11/17/2021 3:59 PM, Daniel Thompson wrote:
>>> On Wed, Nov 17, 2021 at 03:07:51PM +0200, Laurentiu Tudor wrote:
>>>> On 11/12/2021 7:31 PM, Daniel Thompson wrote:
>>>>> On Thu, Nov 11, 2021 at 06:36:58PM +0100, Jon Nettleton wrote:
>>>>>> On Thu, Nov 11, 2021 at 6:23 PM Daniel Thompson
>>>>>> <daniel.thompson@xxxxxxxxxx> wrote:
>>>>>> The correct solution for the problem you are seeing is the ACPI
>>>>>> maintainers figuring out how to land the IORT RMR patchset. Until
>>>>>> that is done the only workaround is setting "arm-smmu.disable_bypass=0
>>>>>> iommu.passthrough=1" on the kernel commandline. The latter option is
>>>>>> required since 5.15 and I haven't had time or energy to figure out
>>>>>> why. The proper solution is to just land the IORT RMR patchset and
>>>>>> let HoneyComb run with the SMMU enabled.
>>>>>
>>>>> Thanks for the update. I'll probably adopt iommu.passthrough=1 for now.
>>>>> That allows me to adopt a distro kernel when it updates to v5.15.
>>>>
>>>> The "iommu.passthrough=1" kernel arg shouldn't be needed. By chance, do
>>>> you remember what errors you were seeing? What was failing?
>>>
>>> For all testing of v5.15 I had "arm-smmu.disable_bypass=0" set because I
>>> was guided to enable that by the error messages in older kernels ;-) .
>>>
>>> Anyhow without "iommu.passthrough=1" (and without the patch from this thread
>>> reverted) then the logs are being massively spammed with error messages:
>>>
>>> ~~~
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
>>> arm_smmu_context_fault: 1697259 callbacks suppressed
>>> ~~~
>>>
>>> This results in a relatively simple workstation (LX2 + nVidia GT-710 + USB
>>> for networking) becoming unresponsive. How long to fail is a little
>>> unpredictable. I assumed that the weight of such dense log messages
>>> eventually gets into a timing pattern that prevented any useful
>>> interrupts from being serviced... but that is only a guess.
>>>
>>
>> A few comments here:
>> - I'm suspecting that the PCI video card is triggering the smmu faults.
>> Would it be possible to give it a try with the card out and without
>> "iommu.passthrough=1"?
>
> The PCIe video card does not cause the smmu faults. These still manifest
> when the card is removed (and with same IOVA).
>
>
>> - the IOVAs look weird to me, they should look something like
>> 0xffffxxxxxx or so. Maybe there are issues in the nvidia driver?
>
> I guess there could be, but why would a problem that bisects down to
> a change in the fsl-mc-bus initialization configuration alter the
> behaviour of the PCIe graphics driver?
>
>
>> - Would it be possible to share a full boot log? I'm thinking that it
>> would be interesting to see how the devices are allocated in iommu groups.
>
> See
> https://gist.github.com/daniel-thompson/07489561f14965fd1af7d5bd4340f54b
>
> It contains three files, all gathered with the GPU removed:
>
> * Logs from unmodified v5.15 with iommu.passthrough=1 set
> (networking is good).
> * Logs from v5.15 patched with the revert I shared earlier in
> the thread (networking is good).
> * Logs from v5.15 without iommu.passthrough=1 set (many SMMU messages,
> networking is broken).
>

Ok, it appears there was some confusion on my side, sorry about it.
So, to summarize:
- the "arm-smmu.disable_bypass=0" workaround is not enough in the ACPI
scenario but works for DT based boot
- the result of reverting the patch is that the IOMMU for MC is no
longer configured (MC device does not get configured in SMMU) leading to
"arm-smmu.disable_bypass=0" being sufficient
- for ACPI to boot without "iommu.passthrough=1" the IORT RMR patches
are required
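
For anyone comparing boot logs against the summary above, here is a rough
sketch of a check for the two workaround parameters on a kernel command
line. The function names are hypothetical and not part of any kernel
tooling. It assumes whitespace-separated tokens (quoted values containing
spaces are not handled) and treats '-' and '_' in parameter names as
equivalent, as the kernel does.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}.

    Bare flags (no '=') map to None. Dashes in names are folded to
    underscores so both spellings of a parameter compare equal.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name.replace("-", "_")] = value if sep else None
    return params


def workaround_status(cmdline: str) -> dict:
    """Report which of the two workarounds from this thread are set."""
    params = parse_cmdline(cmdline)
    return {
        # arm-smmu.disable_bypass=0: by itself only enough for DT boot
        "disable_bypass_off": params.get("arm_smmu.disable_bypass") == "0",
        # iommu.passthrough=1: needed for ACPI until the IORT RMR patches land
        "passthrough_on": params.get("iommu.passthrough") == "1",
    }
```

On a running system the input would typically be the contents of
/proc/cmdline, e.g. workaround_status(open("/proc/cmdline").read()).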

---
Best Regards, Laurentiu