Re: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB

From: kernel test robot
Date: Thu Oct 19 2023 - 10:56:31 EST




Hello,

kernel test robot noticed "WARNING:at_kernel/sched/core.c:#nohz_csd_func" on:

commit: 7b0c45f5095f8868fb14cc4e1745befdf58d173c ("[PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB")
url: https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/sched-fair-Avoid-unnecessary-IPIs-for-ILB/20231006-003907
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3006adf3be79cde4d14b1800b963b82b6e5572e0
patch link: https://lore.kernel.org/all/20231005161727.1855004-1-joel@xxxxxxxxxxxxxxxxx/
patch subject: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB

in testcase: blktests
version: blktests-x86_64-3f75e62-1_20231017
with the following parameters:

disk: 1SSD
test: nvme-group-00
nvme_trtype: rdma



compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory

(please refer to the attached dmesg/kmsg for the entire log/backtrace)


+----------------+------------+------------+
|                | 3006adf3be | 7b0c45f509 |
+----------------+------------+------------+
| boot_successes | 0          | 3          |
+----------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
| Closes: https://lore.kernel.org/oe-lkp/202310192232.750e5c5b-oliver.sang@xxxxxxxxx


[ 55.309389][ C1] ------------[ cut here ]------------
[ 55.315508][ C1] WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:1182 nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.325508][ C1] Modules linked in: intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp btrfs blake2b_generic kvm_intel xor kvm raid6_pq zstd_compress irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel libcrc32c sha512_ssse3 crc32c_intel ipmi_ssif rapl nvme intel_cstate nvme_core mei_me ast t10_pi dax_hmem drm_shmem_helper crc64_rocksoft_generic idxd crc64_rocksoft mei drm_kms_helper wmi idxd_bus joydev i2c_ismt crc64 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad drm fuse ip_tables
[ 55.380240][ C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.6.0-rc4-00038-g7b0c45f5095f #1
[ 55.390037][ C1] RIP: 0010:nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.396018][ C1] Code: 84 c0 74 06 0f 8e d3 00 00 00 45 88 b4 24 28 0a 00 00 48 83 c4 08 bf 07 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 22 b0 f6 ff <0f> 0b e9 1b fe ff ff e8 76 6e 72 00 e9 66 fd ff ff e8 cc 6e 72 00
All code
========
0: 84 c0 test %al,%al
2: 74 06 je 0xa
4: 0f 8e d3 00 00 00 jle 0xdd
a: 45 88 b4 24 28 0a 00 mov %r14b,0xa28(%r12)
11: 00
12: 48 83 c4 08 add $0x8,%rsp
16: bf 07 00 00 00 mov $0x7,%edi
1b: 5b pop %rbx
1c: 41 5c pop %r12
1e: 41 5d pop %r13
20: 41 5e pop %r14
22: 41 5f pop %r15
24: 5d pop %rbp
25: e9 22 b0 f6 ff jmpq 0xfffffffffff6b04c
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 1b fe ff ff jmpq 0xfffffffffffffe4c
31: e8 76 6e 72 00 callq 0x726eac
36: e9 66 fd ff ff jmpq 0xfffffffffffffda1
3b: e8 cc 6e 72 00 callq 0x726f0c

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 1b fe ff ff jmpq 0xfffffffffffffe22
7: e8 76 6e 72 00 callq 0x726e82
c: e9 66 fd ff ff jmpq 0xfffffffffffffd77
11: e8 cc 6e 72 00 callq 0x726ee2
[ 55.418037][ C1] RSP: 0018:ffa00000001f8f58 EFLAGS: 00010046
[ 55.424802][ C1] RAX: 0000000000000000 RBX: 000000000003a100 RCX: ffffffff8444c928
[ 55.433718][ C1] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ff110017fc8ba164
[ 55.442631][ C1] RBP: ffa00000001f8f88 R08: 0000000000000001 R09: ffe21c02ff91742c
[ 55.451542][ C1] R10: ff110017fc8ba167 R11: ffa00000001f8ff8 R12: ff110017fc8ba100
[ 55.460461][ C1] R13: ff110017fc8ba164 R14: 0000000000000000 R15: 0000000000000001
[ 55.470959][ C1] FS: 0000000000000000(0000) GS:ff110017fc880000(0000) knlGS:0000000000000000
[ 55.482348][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 55.491067][ C1] CR2: 00007fabd7bff699 CR3: 000000407de46006 CR4: 0000000000f71ee0
[ 55.501337][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 55.511601][ C1] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 55.521859][ C1] PKRU: 55555554
[ 55.527131][ C1] Call Trace:
[ 55.532072][ C1] <IRQ>
[ 55.536527][ C1] ? __warn (kernel/panic.c:673)
[ 55.542341][ C1] ? nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.548935][ C1] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 55.555241][ C1] ? handle_bug (arch/x86/kernel/traps.c:237)
[ 55.561323][ C1] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1))
[ 55.567792][ C1] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
[ 55.574671][ C1] ? nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.581230][ C1] ? nohz_csd_func (arch/x86/include/asm/atomic.h:23 arch/x86/include/asm/atomic.h:135 include/linux/atomic/atomic-arch-fallback.h:1433 include/linux/atomic/atomic-arch-fallback.h:1565 include/linux/atomic/atomic-instrumented.h:862 kernel/sched/core.c:1181)
[ 55.587667][ C1] ? task_mm_cid_work (kernel/sched/core.c:1173)
[ 55.594511][ C1] __flush_smp_call_function_queue (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/csd.h:64 kernel/smp.c:134 kernel/smp.c:531)
[ 55.602619][ C1] __sysvec_call_function_single (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 arch/x86/include/asm/trace/irq_vectors.h:99 arch/x86/kernel/smp.c:293)
[ 55.610431][ C1] sysvec_call_function_single (arch/x86/kernel/smp.c:287 (discriminator 14))
[ 55.617918][ C1] </IRQ>
[ 55.622373][ C1] <TASK>
[ 55.624388][ C2] ------------[ cut here ]------------
[ 55.625607][ C1] asm_sysvec_call_function_single (arch/x86/include/asm/idtentry.h:652)
[ 55.631669][ C2] WARNING: CPU: 2 PID: 0 at kernel/sched/core.c:1182 nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.638279][ C1] RIP: _nohz_idle_balance+0xd9/0x7f0
[ 55.648220][ C2] Modules linked in:
[ 55.655250][ C1] Code: 48 74 0a c7 05 c0 0f ce 04 00 00 00 00 8b 44 24 2c 83 e0 08 89 44 24 14 74 0a c7 05 ad 0f ce 04 00 00 00 00 f0 83 44 24 fc 00 <49> c7 c5 10 c4 3f 85 41 83 c4 01 48 b8 00 00 00 00 00 fc ff df 4c
All code
========
0: 48 74 0a rex.W je 0xd
3: c7 05 c0 0f ce 04 00 movl $0x0,0x4ce0fc0(%rip) # 0x4ce0fcd
a: 00 00 00
d: 8b 44 24 2c mov 0x2c(%rsp),%eax
11: 83 e0 08 and $0x8,%eax
14: 89 44 24 14 mov %eax,0x14(%rsp)
18: 74 0a je 0x24
1a: c7 05 ad 0f ce 04 00 movl $0x0,0x4ce0fad(%rip) # 0x4ce0fd1
21: 00 00 00
24: f0 83 44 24 fc 00 lock addl $0x0,-0x4(%rsp)
2a:* 49 c7 c5 10 c4 3f 85 mov $0xffffffff853fc410,%r13 <-- trapping instruction
31: 41 83 c4 01 add $0x1,%r12d
35: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
3c: fc ff df
3f: 4c rex.WR

Code starting with the faulting instruction
===========================================
0: 49 c7 c5 10 c4 3f 85 mov $0xffffffff853fc410,%r13
7: 41 83 c4 01 add $0x1,%r12d
b: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
12: fc ff df
15: 4c rex.WR
[ 55.656534][ C2] intel_rapl_msr
[ 55.660852][ C1] RSP: 0018:ffa000000865fdb0 EFLAGS: 00000246
[ 55.682783][ C2] intel_rapl_common
[ 55.686779][ C1] RAX: 0000000000000008 RBX: 0000000000000001 RCX: ffffffff812b76c7
[ 55.693493][ C2] x86_pkg_temp_thermal
[ 55.697774][ C1] RDX: dffffc0000000000 RSI: 0000000000000008 RDI: 0000000000000001
[ 55.706653][ C2] intel_powerclamp
[ 55.711232][ C1] RBP: ffa000000865fe90 R08: 0000000000000001 R09: ffe21c02ff91742c
[ 55.720120][ C2] coretemp btrfs
[ 55.724298][ C1] R10: ff110017fc8ba167 R11: 0000000000000014 R12: 0000000000000001
[ 55.733156][ C2] blake2b_generic
[ 55.737157][ C1] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 55.746019][ C2] kvm_intel xor
[ 55.750115][ C1] ? nohz_run_idle_balance (arch/x86/include/asm/atomic.h:23 arch/x86/include/asm/atomic.h:135 include/linux/atomic/atomic-arch-fallback.h:1433 include/linux/atomic/atomic-arch-fallback.h:1565 include/linux/atomic/atomic-instrumented.h:862 kernel/sched/fair.c:11954)
[ 55.758991][ C2] kvm
[ 55.762902][ C1] ? clockevents_program_event (kernel/time/clockevents.c:336 (discriminator 3))
[ 55.768839][ C2] raid6_pq zstd_compress
[ 55.771772][ C1] ? rebalance_domains (kernel/sched/fair.c:11826)
[ 55.778197][ C2] irqbypass
[ 55.782972][ C1] ? __flush_smp_call_function_queue (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/csd.h:64 kernel/smp.c:134 kernel/smp.c:531)
[ 55.788612][ C2] crct10dif_pclmul crc32_pclmul
[ 55.792132][ C1] do_idle (arch/x86/include/asm/current.h:41 include/linux/sched/idle.h:31 kernel/sched/idle.c:255)
[ 55.799153][ C2] ghash_clmulni_intel
[ 55.804630][ C1] cpu_startup_entry (kernel/sched/idle.c:379 (discriminator 1))
[ 55.809028][ C2] libcrc32c sha512_ssse3
[ 55.813499][ C1] start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
[ 55.818765][ C2] crc32c_intel
[ 55.823547][ C1] ? set_cpu_sibling_map (arch/x86/kernel/smpboot.c:240)
[ 55.828795][ C2] ipmi_ssif
[ 55.832605][ C1] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)
[ 55.838652][ C2] rapl
[ 55.842150][ C1] </TASK>
[ 55.848889][ C2] nvme
[ 55.851904][ C1] ---[ end trace 0000000000000000 ]---
[ 55.855226][ C2] intel_cstate
[ 55.856376][ T1] systemd[1]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 55.929716][ C2] nvme_core mei_me ast t10_pi dax_hmem drm_shmem_helper crc64_rocksoft_generic idxd crc64_rocksoft mei drm_kms_helper wmi idxd_bus joydev i2c_ismt crc64 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad drm fuse ip_tables
[ 55.958456][ C2] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G W 6.6.0-rc4-00038-g7b0c45f5095f #1
[ 55.971146][ C2] RIP: 0010:nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.978359][ C2] Code: 84 c0 74 06 0f 8e d3 00 00 00 45 88 b4 24 28 0a 00 00 48 83 c4 08 bf 07 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 22 b0 f6 ff <0f> 0b e9 1b fe ff ff e8 76 6e 72 00 e9 66 fd ff ff e8 cc 6e 72 00
All code
========
0: 84 c0 test %al,%al
2: 74 06 je 0xa
4: 0f 8e d3 00 00 00 jle 0xdd
a: 45 88 b4 24 28 0a 00 mov %r14b,0xa28(%r12)
11: 00
12: 48 83 c4 08 add $0x8,%rsp
16: bf 07 00 00 00 mov $0x7,%edi
1b: 5b pop %rbx
1c: 41 5c pop %r12
1e: 41 5d pop %r13
20: 41 5e pop %r14
22: 41 5f pop %r15
24: 5d pop %rbp
25: e9 22 b0 f6 ff jmpq 0xfffffffffff6b04c
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 1b fe ff ff jmpq 0xfffffffffffffe4c
31: e8 76 6e 72 00 callq 0x726eac
36: e9 66 fd ff ff jmpq 0xfffffffffffffda1
3b: e8 cc 6e 72 00 callq 0x726f0c

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 1b fe ff ff jmpq 0xfffffffffffffe22
7: e8 76 6e 72 00 callq 0x726e82
c: e9 66 fd ff ff jmpq 0xfffffffffffffd77
11: e8 cc 6e 72 00 callq 0x726ee2
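
The C1 and C2 reports above are interleaved, so the second warning's
"Modules linked in:" list is spread between lines of the first call trace.
Splitting the log by the context tag in each line prefix makes the two
reports easier to read. The following is only a minimal sketch, assuming the
"[ time][ tag]" prefix format shown in this log; the function name
split_by_context is hypothetical and is not part of lkp-tests or the kernel
tree.

```python
import re
from collections import defaultdict

# Matches the "[   55.309389][    C1] message" prefix seen in the log above.
# The tag is assumed to be one capital letter plus digits (C1, C2, T1, ...).
PREFIX = re.compile(r"^\[\s*(\d+\.\d+)\]\[\s*([A-Z]\d+)\]\s?(.*)$")


def split_by_context(lines):
    """Group dmesg lines by context tag, keeping the original order per tag."""
    groups = defaultdict(list)
    last_tag = None
    for line in lines:
        m = PREFIX.match(line)
        if m:
            last_tag = m.group(2)
            groups[last_tag].append((float(m.group(1)), m.group(3)))
        elif last_tag is not None and line.strip():
            # Unprefixed lines (e.g. the decoded instruction listing) are
            # attached to the most recently seen tag, with no timestamp.
            groups[last_tag].append((None, line.rstrip()))
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "[ 55.622373][ C1] <TASK>",
        "[ 55.624388][ C2] ------------[ cut here ]------------",
        "[ 55.625607][ C1] asm_sysvec_call_function_single (arch/x86/include/asm/idtentry.h:652)",
    ]
    for tag, entries in split_by_context(sample).items():
        print(tag)
        for stamp, text in entries:
            print("   ", stamp, text)
```

Attaching unprefixed lines to the previous tag is a heuristic: it suits the
decoded "All code" listings here, which directly follow the "Code:" line they
decode, but it may misattribute lines in logs with heavier interleaving.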


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231019/202310192232.750e5c5b-oliver.sang@xxxxxxxxx



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki