Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management

From: Yujie Liu
Date: Thu Apr 27 2023 - 21:36:51 EST


Hi Thomas,

On Wed, Apr 26, 2023 at 02:08:54PM +0200, Thomas Gleixner wrote:
> On Tue, Apr 25 2023 at 11:16, kernel test robot wrote:
> > kernel test robot noticed "WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical" on:
> >
> > commit: 13eb5c4e7d2fb860d3dc5f63d910e3acf78dfd28 ("[PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management")
> > url: https://github.com/intel-lab-lkp/linux/commits/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230410-235853
> > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 6f3ee0e22b4c62f44b8fa3c8de6e369a4d112a75
> > patch link: https://lore.kernel.org/all/20230410155721.3720991-4-sdonthineni@xxxxxxxxxx/
> > patch subject: [PATCH v3 3/3] genirq: Use the maple tree for IRQ
> > descriptors management
>
> This happens during CPU hot-unplug.
>
> [ 206.930774][ T228] block/008 => sdb2 (do IO while hotplugging CPUs)
> [ 206.935757][ T2086] run blktests block/008 at 2023-04-22 16:27:25
> [ 207.199359][ T2086] smpboot: CPU 2 is now offline
>
> [ 207.468574][ T30] WARNING: CPU: 3 PID: 30 at arch/x86/kernel/apic/ipi.c:299 default_send_IPI_mask_logical+0x40/0x44
> [ 207.568426][ T30] CPU: 3 PID: 30 Comm: migration/3 Tainted: G S E 6.2.0-rc4-00051-g13eb5c4e7d2f #1
> [ 207.588372][ T30] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0xf5/0x138
> [ 207.596649][ T30] EIP: default_send_IPI_mask_logical+0x40/0x44
>
> This warns because fixup_irqs() sends an IPI to an offline CPU. In this
> case to CPU3 which just cleared its online bit and is about to vanish:
>
> [ 207.622147][ T30] EAX: 00000008 EBX: 00000002 ECX: fffffffc EDX: 00000022
>
> EAX contains the target and ECX the inverted online mask. That's
> probably the ata2 interrupt as that later detects a timeout:
>
> [ 238.826212][ T174] ata2.00: exception Emask 0x0 SAct 0x3c00000 SErr 0x0 action 0x6 frozen
> [ 238.834522][ T174] ata2.00: failed command: READ FPDMA QUEUED
> [ 238.840378][ T174] ata2.00: cmd 60/08:b0:90:3e:90/00:00:25:00:00/40 tag 22 ncq dma 4096 in
> [ 238.840378][ T174] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>
> Which means that migrating the interrupt away from the outgoing CPU3
> failed for yet to understand reasons.
>
> The patch in question is changing the interrupt descriptor storage and
> with that also the iterator function. But I can't spot anything wrong
> right now.
>
> But what I can spot is this:
>
> [ 0.000000][ T0] Linux version 6.2.0-rc4-00051-g13eb5c4e7d2f
>
> IOW, that test is based on some random upstream version, which lacks
> about 30 commits to maple_tree, where 12 of them have 'fix' in the
> commit subject.
>
> Can you please retest this on v6.3 and report back when the problem
> persists?

Thanks for your help looking into this problem.

The problem persists when tested on v6.3, but not 100% reproducible.
We ran the test on v6.3 and v6.3+patch each for 20 runs. There are 9
failed runs on v6.3+patch, while v6.3 is all clean. Full dmesg is
attached.

=========================================================================================
compiler/disk/kconfig/rootfs/tbox_group/test/testcase:
gcc-11/1SSD/i386-debian-10.3-func/debian-11.1-i386-20220923.cgz/lkp-skl-d06/block-group-00/blktests

commit:
v6.3
32c58fc685e5c ("genirq: Use the maple tree for IRQ descriptors management")

v6.3 32c58fc685e5cd6b5947a5f8e9a
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:20 45% 9:20 dmesg.EIP:default_send_IPI_mask_logical
:20 45% 9:20 dmesg.WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical
:20 45% 9:20 blktests.block/008.fail
:20 45% 9:20 blktests.block/012.fail


[ 214.484584][ T235] block/008 => sdb2 (do IO while hotplugging CPUs)
[ 214.484586][ T235]
[ 214.489534][ T2125] run blktests block/008 at 2023-04-27 10:06:38
[ 214.749748][ T2125] smpboot: CPU 3 is now offline
[ 215.009645][ T28] ------------[ cut here ]------------
[ 215.014948][ T28] WARNING: CPU: 2 PID: 28 at arch/x86/kernel/apic/ipi.c:299 default_send_IPI_mask_logical (arch/x86/kernel/apic/ipi.c:299 (discriminator 1))
[ 215.025527][ T28] Modules linked in: loop(E) dm_multipath(E) dm_mod(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) intel_rapl_msr(E) intel_rapl_common(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) sd_mod(E) t10_pi(E) coretemp(E) crc64_rocksoft_generic(E) kvm_intel(E) crc64_rocksoft(E) crc64(E) kvm(E) irqbypass(E) ipmi_devintf(E) ipmi_msghandler(E) crc32_pclmul(E) i915(E) drm_buddy(E) crc32c_intel(E) tpm_infineon(E) drm_display_helper(E) aesni_intel(E) crypto_simd(E) mei_wdt(E) cec(E) tpm_tis(E) cryptd(E) ahci(E) drm_kms_helper(E) psmouse(E) tpm_tis_core(E) xhci_pci(E) mei_me(E) i2c_i801(E) rapl(E) syscopyarea(E) wmi_bmof(E) xhci_hcd(E) usbcore(E) intel_cstate(E) libahci(E) evdev(E) serio_raw(E) i2c_smbus(E) sysfillrect(E) sysimgblt(E) intel_uncore(E) tpm(E) video(E) libata(E) mei(E) intel_pch_thermal(E) ttm(E) ie31200_edac(E) usb_common(E) rng_core(E) wmi(E) acpi_pad(E) button(E) fuse(E) drm(E) configfs(E) autofs4(E) [last unloaded: null_blk(E)]
[ 215.112194][ T28] CPU: 2 PID: 28 Comm: migration/2 Tainted: G S E 6.3.0-00003-g32c58fc685e5 #1
[ 215.122171][ T28] Hardware name: HP HP Z238 Microtower Workstation/8183, BIOS N51 Ver. 01.63 10/05/2017
[ 215.131716][ T28] Stopper: multi_cpu_stop+0x0/0xf4 <- stop_machine_cpuslocked (kernel/stop_machine.c:429 kernel/stop_machine.c:469 kernel/stop_machine.c:619)
[ 215.139966][ T28] EIP: default_send_IPI_mask_logical (arch/x86/kernel/apic/ipi.c:299 (discriminator 1))
[ 215.145961][ T28] Code: 2b c2 f7 d1 85 c1 75 22 b9 00 08 00 00 e8 cc fb ff ff 80 e7 02 74 01 fb 8b 5d fc c9 c3 8d 74 26 00 90 c3 8d b4 26 00 00 00 00 <0f> 0b eb da 3e 8d 74 26 00 a1 64 b8 2b c2 f6 c4 02 75 0d 31 c0 c3
All code
========
0: 2b c2 sub %edx,%eax
2: f7 d1 not %ecx
4: 85 c1 test %eax,%ecx
6: 75 22 jne 0x2a
8: b9 00 08 00 00 mov $0x800,%ecx
d: e8 cc fb ff ff call 0xfffffffffffffbde
12: 80 e7 02 and $0x2,%bh
15: 74 01 je 0x18
17: fb sti
18: 8b 5d fc mov -0x4(%rbp),%ebx
1b: c9 leave
1c: c3 ret
1d: 8d 74 26 00 lea 0x0(%rsi,%riz,1),%esi
21: 90 nop
22: c3 ret
23: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
2a:* 0f 0b ud2 <-- trapping instruction
2c: eb da jmp 0x8
2e: 3e 8d 74 26 00 ds lea 0x0(%rsi,%riz,1),%esi
33: a1 64 b8 2b c2 f6 c4 movabs 0x7502c4f6c22bb864,%eax
3a: 02 75
3c: 0d .byte 0xd
3d: 31 c0 xor %eax,%eax
3f: c3 ret

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: eb da jmp 0xffffffffffffffde
4: 3e 8d 74 26 00 ds lea 0x0(%rsi,%riz,1),%esi
9: a1 64 b8 2b c2 f6 c4 movabs 0x7502c4f6c22bb864,%eax
10: 02 75
12: 0d .byte 0xd
13: 31 c0 xor %eax,%eax
15: c3 ret
[ 215.165363][ T28] EAX: 00000004 EBX: 00000002 ECX: fffffffc EDX: 00000023
[ 215.172308][ T28] ESI: 00000023 EDI: c1045a80 EBP: c30fdeb8 ESP: c30fdeb4
[ 215.179248][ T28] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010002
[ 215.186717][ T28] CR0: 80050033 CR2: a95b2f4c CR3: 02400000 CR4: 003506f0
[ 215.193662][ T28] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 215.200601][ T28] DR6: fffe0ff0 DR7: 00000400
[ 215.205126][ T28] Call Trace:
[ 215.208266][ T28] default_send_IPI_single (arch/x86/kernel/apic/ipi.c:228)
[ 215.213396][ T28] apic_retrigger_irq (arch/x86/kernel/apic/vector.c:880)
[ 215.218095][ T28] irq_chip_retrigger_hierarchy (kernel/irq/chip.c:1473)
[ 215.223658][ T28] fixup_irqs (arch/x86/kernel/irq.c:371)
[ 215.227753][ T28] cpu_disable_common (arch/x86/kernel/smpboot.c:1663)
[ 215.232451][ T28] native_cpu_disable (arch/x86/kernel/smpboot.c:1694)
[ 215.237150][ T28] take_cpu_down (kernel/cpu.c:1035)
[ 215.241415][ T28] multi_cpu_stop (kernel/stop_machine.c:240)
[ 215.245766][ T28] cpu_stopper_thread (kernel/stop_machine.c:512)
[ 215.250549][ T28] ? stop_machine_yield+0x4/0x4
[ 215.255246][ T28] smpboot_thread_fn (kernel/smpboot.c:164 (discriminator 4))
[ 215.259943][ T28] kthread (kernel/kthread.c:376)
[ 215.263688][ T28] ? sort_range (kernel/smpboot.c:107)
[ 215.267858][ T28] ? kthread_complete_and_exit (kernel/kthread.c:331)
[ 215.273334][ T28] ret_from_fork (arch/x86/entry/entry_32.S:770)
[ 215.277599][ T28] ---[ end trace 0000000000000000 ]---

--
Best Regards,
Yujie

Attachment: dmesg.xz
Description: application/xz