Re: [PATCH 2/8] bus: fsl-mc: handle DMA config deferral in ACPI case

From: Daniel Thompson
Date: Wed Nov 17 2021 - 12:00:21 EST


On Wed, Nov 17, 2021 at 05:30:32PM +0200, Laurentiu Tudor wrote:
> On 11/17/2021 3:59 PM, Daniel Thompson wrote:
> > On Wed, Nov 17, 2021 at 03:07:51PM +0200, Laurentiu Tudor wrote:
> >> On 11/12/2021 7:31 PM, Daniel Thompson wrote:
> >>> On Thu, Nov 11, 2021 at 06:36:58PM +0100, Jon Nettleton wrote:
> >>>> On Thu, Nov 11, 2021 at 6:23 PM Daniel Thompson
> >>>> <daniel.thompson@xxxxxxxxxx> wrote:
> >>>> The correct solution for the problem you are seeing is the ACPI
> >>>> maintainers figuring out how to land the IORT RMR patchset. Until
> >>>> that is done the only workaround is setting "arm-smmu.disable_bypass=0
> >>>> iommu.passthrough=1" on the kernel commandline. The latter option is
> >>>> required since 5.15 and I haven't had time or energy to figure out
> >>>> why. The proper solution is to just land the IORT RMR patchset and
> >>>> let HoneyComb run with the SMMU enabled.
> >>>
> >>> Thanks for the update. I'll probably adopt iommu.passthrough=1 for now.
> >>> That allows me to adopt a distro kernel when it updates to v5.15.
> >>
> >> The "iommu.passthrough=1" kernel arg shouldn't be needed. By chance, do
> >> you remember what errors you were seeing? What was failing?
> >
> > For all testing of v5.15 I had "arm-smmu.disable_bypass=0" set because I
> > was guided to enable that by the error messages in older kernels ;-) .
> >
> > Anyhow without "iommu.passthrough=1" (and without the patch from this thread
> > reverted) then the logs are being massively spammed with error messages:
> >
> > ~~~
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> > arm_smmu_context_fault: 1697259 callbacks suppressed
> > ~~~
> >
> > This results in a relatively simple workstation (LX2 + nVidia GT-710 + USB
> > for networking) becoming unresponsive. How long to fail is a little
> > unpredictable. I assumed that the weight of such dense log messages
> > eventually gets into a timing pattern that prevented any useful
> > interrupts from being serviced... but that is only a guess.
> >
>
> Few comments here:
> - I'm suspecting that the PCI video card is triggering the smmu faults.
> Would it be possible to give it a try with the card out and without
> "iommu.passthrough=1"?

The PCIe video card does not cause the SMMU faults. These still manifest
when the card is removed (and with the same IOVA).
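
For anyone wanting to check the "same IOVA" observation against their own
logs, here is a minimal sketch that tallies the fault lines by IOVA and
cbfrsynra. It assumes the message format is exactly the one quoted above;
the function name tally_faults is hypothetical and the register values are
kept raw rather than decoded.

```python
import re
from collections import Counter

# Matches the "Unhandled context fault" lines in the format quoted above.
# Assumption: other kernels may print these fields differently.
FAULT_RE = re.compile(
    r"Unhandled context fault: fsr=(0x[0-9a-fA-F]+), iova=(0x[0-9a-fA-F]+), "
    r"fsynr=(0x[0-9a-fA-F]+), cbfrsynra=(0x[0-9a-fA-F]+), cb=(\d+)"
)


def tally_faults(lines):
    """Count fault lines, keyed by (iova, cbfrsynra) as raw integers."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match is None:
            continue  # e.g. the "callbacks suppressed" line
        iova = int(match.group(2), 16)
        cbfrsynra = int(match.group(4), 16)
        counts[(iova, cbfrsynra)] += 1
    return counts
```

Feeding it a saved log, e.g. tally_faults(open("dmesg.txt")), should make it
easy to see whether one IOVA dominates. Rate-limited messages are not
counted, so the totals are only a lower bound.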


> - the IOVAs look weird to me, they should look something like
> 0xffffxxxxxx or so. Maybe there are issues in the nvidia driver?

I guess there could be, but why would a problem that bisects down to
a change in the fsl-mc-bus initialization configuration alter the
behaviour of the PCIe graphics driver?


> - Would it be possible to share a full boot log? I'm thinking that it
> would be interesting to see how the devices are allocated in iommu groups.

See
https://gist.github.com/daniel-thompson/07489561f14965fd1af7d5bd4340f54b

It contains three files, all gathered with the GPU removed:

* Logs from unmodified v5.15 with iommu.passthrough=1 set
(networking is good).
* Logs from v5.15 patched with the revert I shared earlier in
the thread (networking is good).
* Logs from v5.15 without iommu.passthrough=1 set (many SMMU messages,
networking is broken).
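
The IOMMU group allocation asked about above can also be read from sysfs on
a running system, not only from the boot log. A rough sketch, assuming the
usual /sys/kernel/iommu_groups/<N>/devices layout (the helper name
iommu_groups is hypothetical):

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group name: sorted list of device names} found under root."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # no IOMMU groups exposed on this system
    for name in os.listdir(root):
        devdir = os.path.join(root, name, "devices")
        if os.path.isdir(devdir):
            groups[name] = sorted(os.listdir(devdir))
    return groups


def format_groups(groups):
    """Render the mapping as one 'group N: dev dev ...' line per group."""
    return "\n".join(
        "group {}: {}".format(name, " ".join(devs))
        for name, devs in sorted(groups.items())
    )
```

Running print(format_groups(iommu_groups())) on each of the three
configurations would show whether the fsl-mc devices end up in different
groups between them.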


Daniel.