[PROBLEM] crashkernel gets stuck at DMAR-IR: Copied IR table for dmar1 from previous kernel

From: Matthew Ruffell
Date: Sun Nov 14 2021 - 23:33:13 EST


Dear IOMMU Subsystem Maintainers,

I have been debugging an issue with Nathan Langford, CC here, for some months
now, along with Alex Williamson on the linux-pci mailing list, and I just wanted
to check that we aren't also running into an IOMMU bug when enabling IRQ
remapping in the crashkernel.

Nathan has a system with 8x GeForce RTX 2080 Ti graphics cards, and we are
passing through multiple GPUs to a KVM VM via vfio-pci. When we pass through 2x
GPUs that share the same upstream PCI switch and reboot the VM a handful of
times, an IRQ storm occurs and locks up the host system.

System Information:
- SuperMicro X9DRG-O(T)F
- 8x Nvidia GeForce RTX 2080 Ti GPUs
- Ubuntu 20.04 LTS
- 5.14.0 mainline kernel
- libvirt 6.0.0-0ubuntu8.10
- qemu 4.2-3ubuntu6.16
- intel_iommu=on

In the logs we see:

irq 31: nobody cared (try booting with the "irqpoll" option)
Call Trace:
<IRQ>
dump_stack_lvl+0x4a/0x5f
dump_stack+0x10/0x12
__report_bad_irq+0x3a/0xaf
note_interrupt.cold+0xb/0x60
handle_irq_event_percpu+0x72/0x80
handle_irq_event+0x3b/0x60
handle_fasteoi_irq+0x9c/0x150
__common_interrupt+0x4b/0xb0
common_interrupt+0x4a/0xa0
asm_common_interrupt+0x1e/0x40
RIP: 0010:__do_softirq+0x73/0x2ae
handlers:
[<00000000b16da31d>] vfio_intx_handler
Disabling IRQ #31

Extra details on LKML / linux-pci:
https://lkml.org/lkml/2021/9/13/85

Now, Nathan has "kernel.hardlockup_panic = 1" set, so the lockup causes the
kernel to panic and reboot into the crashkernel, and this is where the IOMMU
issues begin.

The crashkernel loads, and gets as far as:

DMAR: Host address width 46
DMAR: DRHD base: 0x000000fbffe000 flags: 0x0
DMAR: dmar0: reg_base_addr fbffe000 ver 1:0 cap d2078c106f0466 ecap f020de
DMAR: DRHD base: 0x000000cbffc000 flags: 0x1
DMAR: dmar1: reg_base_addr cbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
DMAR: RMRR base: 0x0000005f21a000 end: 0x0000005f228fff
DMAR: ATSR flags: 0x0
DMAR: RHSA base: 0x000000fbffe000 proximity domain: 0x1
DMAR: RHSA base: 0x000000cbffc000 proximity domain: 0x0
DMAR-IR: IOAPIC id 3 under DRHD base 0xfbffe000 IOMMU 0
DMAR-IR: IOAPIC id 0 under DRHD base 0xcbffc000 IOMMU 1
DMAR-IR: IOAPIC id 2 under DRHD base 0xcbffc000 IOMMU 1
DMAR-IR: HPET id 0 under DRHD base 0xcbffc000
[ 3.271530] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 3.282572] DMAR-IR: Copied IR table for dmar0 from previous kernel
[ 13.291319] DMAR-IR: Copied IR table for dmar1 from previous kernel

I added the timestamps for the last three entries. There is a ten second hang
between copying the IR table from dmar0 and copying the IR table from dmar1.
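
In case it is useful for checking other logs, here is a minimal Python sketch
of how that gap can be measured from timestamped dmesg lines. The function name
find_gaps and the one second threshold are my own choices for illustration; the
sample lines are the three timestamped entries quoted above.

```python
import re

# Matches dmesg lines of the form "[   3.282572] message".
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

def find_gaps(lines, threshold=1.0):
    """Return (gap_seconds, previous_message, next_message) for every pair
    of consecutive timestamped lines at least `threshold` seconds apart."""
    gaps = []
    prev = None
    for line in lines:
        m = TS_RE.match(line.strip())
        if not m:
            continue  # skip lines without a timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if prev is not None and ts - prev[0] >= threshold:
            gaps.append((round(ts - prev[0], 6), prev[1], msg))
        prev = (ts, msg)
    return gaps

LOG = """\
[ 3.271530] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 3.282572] DMAR-IR: Copied IR table for dmar0 from previous kernel
[ 13.291319] DMAR-IR: Copied IR table for dmar1 from previous kernel
"""

for gap, before, after in find_gaps(LOG.splitlines()):
    print(f"{gap:.3f}s between {before!r} and {after!r}")
```

On the excerpt above this reports a single gap, between the dmar0 and dmar1
messages.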

After this, the kernel just hangs, and the system has to be hard rebooted.

Full dmesg:
https://paste.ubuntu.com/p/M7Bdyk9YV7/

We never see the next message, which does appear when we trigger a crash with
plain old sysrq-trigger:

DMAR-IR: Enabled IRQ remapping in x2apic mode

Would an ongoing IRQ storm prevent IRQ remapping being enabled?

From my understanding, when we start the crashkernel, PCI devices are in an
undefined state and could keep sending DMA or IRQ requests to the crashkernel.
This could break things through data corruption, or cause IRQs to be blocked if
we get too many spurious IRQs. That would then cause problems when we try to
re-initialise these PCI devices while their IRQs are blocked.

This is why we copy the old IR tables from the dmar regions and unblock blocked
IRQs. But if an IRQ storm is ongoing, is there anything we can really do? Is it
a bug to just hang here, or is it an indication that the system administrator
needs to go and do a full hardware reset?
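
For reference, this is my rough mental model of how an IRQ ends up blocked
after too many spurious IRQs, written as a small Python sketch. It is a
simplification of note_interrupt() in kernel/irq/spurious.c as I understand it.
The 100,000 and 99,900 thresholds come from my reading of that file and should
be checked against the kernel in question. The class and method names are made
up for illustration, and the time-based handling in the real code is left out.

```python
class SpuriousIrqModel:
    """Simplified model of the kernel's spurious IRQ detection (assumption:
    thresholds as read from kernel/irq/spurious.c; verify for your kernel)."""

    WINDOW = 100_000          # interrupts counted before each check
    UNHANDLED_LIMIT = 99_900  # more unhandled than this disables the line

    def __init__(self):
        self.irq_count = 0
        self.irqs_unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled):
        """Record one interrupt; `handled` is True if some handler claimed it."""
        if self.disabled:
            return
        if not handled:
            self.irqs_unhandled += 1
        self.irq_count += 1
        if self.irq_count < self.WINDOW:
            return
        # End of window: decide whether the line is stuck, then reset.
        self.irq_count = 0
        if self.irqs_unhandled > self.UNHANDLED_LIMIT:
            self.disabled = True  # corresponds to "Disabling IRQ #N"
        self.irqs_unhandled = 0


# A storm where no handler ever claims the interrupt.
storm = SpuriousIrqModel()
for _ in range(SpuriousIrqModel.WINDOW):
    storm.note_interrupt(handled=False)
print("storm disabled:", storm.disabled)

# A busy but healthy line where every other interrupt is claimed.
healthy = SpuriousIrqModel()
for i in range(2 * SpuriousIrqModel.WINDOW):
    healthy.note_interrupt(handled=(i % 2 == 0))
print("healthy disabled:", healthy.disabled)
```

In this model a line is only disabled when almost every interrupt in a window
goes unclaimed, which matches the "nobody cared" report for IRQ 31 above.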

Please let us know if you need any additional debugging information. We can
build patched kernels if you need extra debug output.

Thanks,
Matthew