Re: WARNING at blktests block/008 in ttwu_queue_wakelist()

From: Shinichiro Kawasaki
Date: Mon Jun 28 2021 - 00:43:57 EST


On Jun 10, 2021 / 16:32, Hillf Danton wrote:
> On Wed, 9 Jun 2021 06:55:59 +0000 Damien Le Moal wrote:
> >+ Jens and linux-kernel
> >
> >On 2021/06/09 15:53, Shinichiro Kawasaki wrote:
> >> Hi there,
> >>
> >> Let me share a blktests failure. When I ran blktests on the kernel v5.13-rc5,
> >> block/008 failed. A WARNING below was the cause of the failure.
> >>
> >> WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> >>
> >> I'm trying to recreate the failure by repeating the test case, but so far
> >> I have not been able to. This failure looks rare, but I have actually
> >> observed it 3 times in the past year.
> >>
> >> 1) Oct/2020, kernel: v5.9-rc7 test dev: dm-flakey on AHCI-SATA SMR HDD, log [1]
> >> 2) Mar/2021, kernel: v5.12-rc2 test dev: AHCI-SATA SMR HDD, log [2]
> >> 3) Jun/2021, kernel: v5.13-rc5 test dev: dm-linear on null_blk zoned, log [3]
> >>
> >> The test case block/008 does IO during CPU hotplug, and the WARNING in
> >> ttwu_queue_wakelist() checks "WARN_ON_ONCE(cpu == smp_processor_id())".
> >> So it is likely that the test case triggers the warning, but I don't have
> >> a clear view of why and how it was triggered. It was observed on various
> >> block devices, so I would like to ask the linux-block experts if anyone
> >> can tell what is going on. Comments would be appreciated.
>
> [...]
>
> >> [40041.712804][T135817] ------------[ cut here ]------------
> >> [40041.718489][T135817] WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> >> [40041.728311][T135817] Modules linked in: null_blk dm_flakey iscsi_target_mod tcm_loop target_core_pscsi target_core_file target_core_iblock nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw bridge iptable_security stp llc ip_set rfkill nf_tables target_core_user target_core_mod nfnetlink ip6table_filter ip6_tables iptable_filter sunrpc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_wdt intel_pmc_bxt iTCO_vendor_support rapl intel_cstate intel_uncore joydev lpc_ich i2c_i801 i2c_smbus ses enclosure mei_me mei ipmi_ssif ioatdma wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zram ip_tables ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt
> >> [40041.728481][T135817] fb_sys_fops cec drm_ttm_helper ttm ghash_clmulni_intel drm igb mpt3sas nvme dca i2c_algo_bit nvme_core raid_class scsi_transport_sas pkcs8_key_parser [last unloaded: null_blk]
> >> [40041.832215][T135817] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-rc5+ #1
> >> [40041.839262][T135817] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 3.2 11/22/2019
> >> [40041.847434][T135817] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> >> [40041.853266][T135817] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4c 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 00 eb a0 <0f> 0b 45 31 ff e9 cb fe ff ff 48 89 04 24 e8 49 71 64 00 48 8b 04
> >> [40041.872793][T135817] RSP: 0018:ffff888106bff348 EFLAGS: 00010046
> >> [40041.878800][T135817] RAX: 0000000000000001 RBX: ffff888117ec3240 RCX: ffff888811440000
> >> [40041.886711][T135817] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffb603d6e8
> >> [40041.894625][T135817] RBP: 0000000000000001 R08: ffffffffb603d6e8 R09: ffffffffb6ba6167
> >> [40041.902537][T135817] R10: fffffbfff6d74c2c R11: 0000000000000001 R12: 0000000000000000
> >> [40041.910451][T135817] R13: ffff88881145fd00 R14: 0000000000000001 R15: ffff888811440001
> >> [40041.918364][T135817] FS: 00007f8eabf14b80(0000) GS:ffff888811440000(0000) knlGS:0000000000000000
> >> [40041.927229][T135817] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [40041.933756][T135817] CR2: 000055ce81e2cc78 CR3: 000000011be92005 CR4: 00000000001706e0
> >> [40041.941669][T135817] Call Trace:
> >> [40041.944895][T135817] ? lock_is_held_type+0x98/0x110
> >> [40041.949860][T135817] try_to_wake_up+0x6f9/0x15e0
>
> 2) __queue_work
>        raw_spin_lock(&pwq->pool->lock) with irq disabled
>        insert_work
>          wake_up_worker(pool);
>            wake_up_process first_idle_worker(pool);
>
> Even if the waker happens to be running on the worker's CPU, what is weird
> is that an idle worker can trigger the warning, given
>
>     if (smp_load_acquire(&p->on_cpu) &&
>         ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
>         goto unlock;
>
> because p->on_cpu must have been false for quite a while.
>
> Is there any chance for CPU hotplug to make a difference?
>
> Thoughts are welcome.
>
> Hillf

Hillf, thank you very much for the comments, and sorry about this late reply.

I walked through the related functions to understand your comments, but I
have to say that I still don't have enough background knowledge to provide
valuable comments back to you. I understand that the waker and the wakee are
on the same CPU, and that it is weird that p->on_cpu is true. This looks as
if the task scheduler fails to keep track of task status across the task
migrations triggered by CPU hotplug, but as far as I can tell from the
comments in kernel/sched/core.c, CPU hotplug and task migration are taken
into account in the design and implementation.
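
For reference, my (condensed) reading of the code around the WARNING in
v5.13 is below. This is a paraphrase from memory of kernel/sched/core.c,
not a verbatim quote, so please check the actual source:

    static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
    {
            if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
                    /*
                     * Queueing a remote wakeup to the current CPU makes no
                     * sense; this is the check at core.c:3175 that block/008
                     * somehow manages to trip.
                     */
                    if (WARN_ON_ONCE(cpu == smp_processor_id()))
                            return false;

                    sched_clock_cpu(cpu); /* sync clocks across CPUs */
                    __ttwu_queue_wakelist(p, cpu, wake_flags);
                    return true;
            }

            return false;
    }

If I read ttwu_queue_cond() correctly, the cache-sharing check can never
return true when cpu == smp_processor_id(), so the warning should only be
reachable through the WF_ON_CPU branch in the snippet you quoted.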

The blktests test case block/008 runs I/O against a test target block device
for 30 seconds. During this I/O, it repeatedly offlines and onlines CPUs:
when N CPUs are available, it offlines N-1 of them so that only one CPU
remains online, then onlines all of the CPUs again. It repeats this
offline/online cycle until the I/O workload completes. While only one CPU is
online, the waker and the wakee can end up on the same CPU, or one of them
might have been migrated to that last online CPU from another CPU. But it is
still not clear to me why this results in the WARNING.
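
To illustrate, the hotplug part of the test is roughly equivalent to the
loop below. This is only a simplified sketch of my understanding, not the
actual blktests script, and $fio_pid is just a placeholder for the fio
process the test starts:

    # Cycle CPUs offline/online while the fio workload is running
    # (simplified; the real test decides itself which CPUs to touch).
    while kill -0 "$fio_pid" 2>/dev/null; do
            # Offline every CPU except cpu0, leaving a single online CPU.
            for cpu in /sys/devices/system/cpu/cpu[1-9]*; do
                    echo 0 > "$cpu/online"
            done
            # Then bring all of them back online.
            for cpu in /sys/devices/system/cpu/cpu[1-9]*; do
                    echo 1 > "$cpu/online"
            done
    done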

Now I'm working on recreating the failure. By repeating the test cases in
the "block" group on the kernel v5.13-rc5, I was able to recreate it; it
took 3 to 5 hours. The test target block device was a null_blk device with a
rather unique configuration (a zoned device whose zone capacity is smaller
than its zone size). I will try to confirm that the failure can also be
recreated with the latest kernel version and with other block devices.
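
For reference, a null_blk device of that kind can be set up along these
lines. This is only a sketch with illustrative sizes; the point is just
that zone_capacity is smaller than zone_size:

    # Zoned null_blk whose zone capacity (MiB) is smaller than its zone
    # size (MiB); the exact values are not important here.
    modprobe null_blk nr_devices=1 zoned=1 zone_size=64 zone_capacity=32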

--
Best Regards,
Shin'ichiro Kawasaki

>
> >> [40041.954567][T135817] ? migrate_swap_stop+0x970/0x970
> >> [40041.959618][T135817] ? insert_work+0x193/0x2e0
> >> [40041.964152][T135817] __queue_work+0x4e8/0xdb0
> >> [40041.968599][T135817] ? try_to_grab_pending.part.0+0x439/0x530
> >> [40041.974429][T135817] ? hctx_unlock+0x7d/0xe0
> >> [40041.978790][T135817] mod_delayed_work_on+0x8c/0x100
> >> [40041.983755][T135817] ? cancel_delayed_work+0x1f0/0x1f0
> >> [40041.988982][T135817] ? __blk_mq_delay_run_hw_queue+0x95/0x4b0
> >> [40041.994817][T135817] kblockd_mod_delayed_work_on+0x17/0x20
>
> 1) This tries to queue work on kblockd_workqueue, which is a bound per-cpu
> workqueue, so the kworker is bound to one CPU.
>
> >> [40042.000390][T135817] blk_mq_run_hw_queue+0x125/0x270
> >> [40042.005439][T135817] ? blk_mq_delay_run_hw_queues+0x410/0x410
> >> [40042.011275][T135817] blk_mq_sched_insert_request+0x294/0x420
> >> [40042.017020][T135817] ? __blk_mq_sched_bio_merge+0x340/0x340
> >> [40042.022677][T135817] ? blk_mq_rq_ctx_init+0x99a/0xe80
> >> [40042.027819][T135817] blk_mq_submit_bio+0xb37/0x15f0
> >> [40042.032786][T135817] ? blk_mq_try_issue_list_directly+0x940/0x940
> >> [40042.038966][T135817] ? percpu_ref_put_many.constprop.0+0x82/0x1b0
> >> [40042.045148][T135817] submit_bio_noacct+0x79c/0xe60
> >> [40042.050023][T135817] ? blk_queue_enter+0x810/0x810
> >> [40042.054902][T135817] ? find_held_lock+0x2c/0x110
> >> [40042.059611][T135817] submit_bio+0xe4/0x4c0
> >> [40042.063795][T135817] ? submit_bio_noacct+0xe60/0xe60
> >> [40042.068847][T135817] ? bio_add_zone_append_page+0x2c0/0x2c0
> >> [40042.074507][T135817] ? bio_release_pages.part.0+0x10c/0x3d0
> >> [40042.080167][T135817] ? __blkdev_direct_IO_simple+0x436/0x7d0
> >> [40042.085912][T135817] __blkdev_direct_IO_simple+0x40d/0x7d0
> >> [40042.091488][T135817] ? bd_link_disk_holder+0x6e0/0x6e0
> >> [40042.096711][T135817] ? __lock_acquire+0xbbc/0x51b0
> >> [40042.101591][T135817] ? mark_lock+0xe4/0x18a0
> >> [40042.105953][T135817] ? set_init_blocksize.isra.0+0x140/0x140
> >> [40042.111698][T135817] ? mark_lock+0xe4/0x18a0
> >> [40042.116054][T135817] ? lock_is_held_type+0x98/0x110
> >> [40042.121021][T135817] ? find_held_lock+0x2c/0x110
> >> [40042.125730][T135817] blkdev_direct_IO+0xd23/0x11d0
> >> [40042.130607][T135817] ? __lock_acquire+0x15c2/0x51b0
> >> [40042.135577][T135817] ? bd_prepare_to_claim+0x2a0/0x2a0
> >> [40042.140799][T135817] ? __lock_acquire+0x15c2/0x51b0
> >> [40042.145770][T135817] generic_file_read_iter+0x1f4/0x470
> >> [40042.151080][T135817] blkdev_read_iter+0x100/0x190
> >> [40042.155871][T135817] new_sync_read+0x352/0x5d0
> >> [40042.160403][T135817] ? __ia32_sys_llseek+0x310/0x310
> >> [40042.165458][T135817] ? __cond_resched+0x15/0x30
> >> [40042.170076][T135817] ? inode_security+0x56/0xf0
> >> [40042.174698][T135817] vfs_read+0x26c/0x470
> >> [40042.178796][T135817] __x64_sys_pread64+0x17d/0x1d0
> >> [40042.183674][T135817] ? vfs_read+0x470/0x470
> >> [40042.187946][T135817] ? syscall_enter_from_user_mode+0x27/0x70
> >> [40042.193780][T135817] ? trace_hardirqs_on+0x1c/0x110
> >> [40042.198747][T135817] do_syscall_64+0x40/0x80
> >> [40042.203107][T135817] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> [40042.208939][T135817] RIP: 0033:0x7f8eb5cdb96f
> >> [40042.213297][T135817] Code: 08 89 3c 24 48 89 4c 24 18 e8 7d f3 ff ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 cd f3 ff ff 48 8b
> >> [40042.232824][T135817] RSP: 002b:00007fffaa10c9d0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
> >> [40042.241169][T135817] RAX: ffffffffffffffda RBX: 00000000012e2540 RCX: 00007f8eb5cdb96f
> >> [40042.249074][T135817] RDX: 0000000000001000 RSI: 000000000130a000 RDI: 0000000000000009
> >> [40042.256985][T135817] RBP: 00000000012e2540 R08: 0000000000000000 R09: 00007fffaa163080
> >> [40042.264898][T135817] R10: 0000000000862000 R11: 0000000000000293 R12: 00007f8e96829458
> >> [40042.272813][T135817] R13: 0000000000000000 R14: 0000000000001000 R15: 00000000012e2568
> >> [40042.280732][T135817] irq event stamp: 1042448
> >> [40042.285084][T135817] hardirqs last enabled at (1042447): [<ffffffffb320209e>] try_to_grab_pending.part.0+0x1ae/0x530
> >> [40042.295683][T135817] hardirqs last disabled at (1042448): [<ffffffffb3202b3b>] mod_delayed_work_on+0xcb/0x100
> >> [40042.305591][T135817] softirqs last enabled at (1041222): [<ffffffffb31b5316>] __irq_exit_rcu+0x1c6/0x270
> >> [40042.315150][T135817] softirqs last disabled at (1041215): [<ffffffffb31b5316>] __irq_exit_rcu+0x1c6/0x270
> >> [40042.324711][T135817] ---[ end trace b6f997c9a553aa02 ]---
> >> [40042.330118][ C1]
> >> [40042.330119][ C1] ======================================================
> >> [40042.330121][ C1] WARNING: possible circular locking dependency detected
> >> [40042.330123][ C1] 5.13.0-rc5+ #1 Not tainted
> >> [40042.330125][ C1] ------------------------------------------------------
> >> [40042.330126][ C1] fio/135817 is trying to acquire lock:
> >> [40042.330128][ C1] ffffffffb641acb8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x13/0x70
> >> [40042.330134][ C1]
> >> [40042.330135][ C1] but task is already holding lock:
> >> [40042.330137][ C1] ffff888117ec3f30 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x88/0x15e0
> >> [40042.330142][ C1]
> >> [40042.330144][ C1] which lock already depends on the new lock.
> >> [40042.330145][ C1]
> >> [40042.330146][ C1]
> >> [40042.330147][ C1] the existing dependency chain (in reverse order) is:
> >> [40042.330149][ C1]
> >> [40042.330150][ C1] -> #1 (&p->pi_lock){-.-.}-{2:2}:
> >> [40042.330155][ C1] _raw_spin_lock_irqsave+0x3b/0x60
> >> [40042.330156][ C1] try_to_wake_up+0x88/0x15e0
> >> [40042.330157][ C1] up+0x7a/0xb0
> >> [40042.330159][ C1] __up_console_sem+0x2d/0x60
> >> [40042.330161][ C1] console_unlock+0x3f7/0x840
> >> [40042.330162][ C1] vprintk_emit+0x257/0x420
> >> [40042.330163][ C1] devkmsg_emit.constprop.0+0x95/0xb1
> >> [40042.330165][ C1] devkmsg_write.cold+0x48/0x74
> >> [40042.330166][ C1] do_iter_readv_writev+0x32b/0x6b0
> >> [40042.330167][ C1] do_iter_write+0x137/0x5d0
> >> [40042.330169][ C1] vfs_writev+0x147/0x4a0
> >> [40042.330170][ C1] do_writev+0x100/0x260
> >> [40042.330172][ C1] do_syscall_64+0x40/0x80
> >> [40042.330173][ C1] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> [40042.330175][ C1]
> >> [40042.330176][ C1] -> #0 ((console_sem).lock){-.-.}-{2:2}:
> >> [40042.330181][ C1] __lock_acquire+0x2940/0x51b0
> >> [40042.330182][ C1] lock_acquire+0x181/0x6a0
> >> [40042.330183][ C1] _raw_spin_lock_irqsave+0x3b/0x60
> >> [40042.330184][ C1] down_trylock+0x13/0x70
> >> [40042.330185][ C1] __down_trylock_console_sem+0x24/0x90
> >> [40042.330186][ C1] vprintk_emit+0x230/0x420
> >> [40042.330187][ C1] printk+0x96/0xb2
> >> [40042.330187][ C1] report_bug.cold+0x31/0x53
> >> [40042.330188][ C1] handle_bug+0x3c/0x70
> >> [40042.330189][ C1] exc_invalid_op+0x14/0x40
> >> [40042.330189][ C1] asm_exc_invalid_op+0x12/0x20
> >> [40042.330190][ C1] ttwu_queue_wakelist+0x284/0x2f0
> >> [40042.330191][ C1] try_to_wake_up+0x6f9/0x15e0
> >> [40042.330192][ C1] __queue_work+0x4e8/0xdb0
> >> [40042.330193][ C1] mod_delayed_work_on+0x8c/0x100
> >> [40042.330194][ C1] kblockd_mod_delayed_work_on+0x17/0x20
> >> [40042.330194][ C1] blk_mq_run_hw_queue+0x125/0x270
> >> [40042.330195][ C1] blk_mq_sched_insert_request+0x294/0x420
> >> [40042.330196][ C1] blk_mq_submit_bio+0xb37/0x15f0
> >> [40042.330197][ C1] submit_bio_noacct+0x79c/0xe60
> >> [40042.330198][ C1] submit_bio+0xe4/0x4c0
> >> [40042.330199][ C1] __blkdev_direct_IO_simple+0x40d/0x7d0
> >> [40042.330199][ C1] blkdev_direct_IO+0xd23/0x11d0
> >> [40042.330200][ C1] generic_file_read_iter+0x1f4/0x470
> >> [40042.330201][ C1] blkdev_read_iter+0x100/0x190
> >> [40042.330202][ C1] new_sync_read+0x352/0x5d0
> >> [40042.330203][ C1] vfs_read+0x26c/0x470
> >> [40042.330203][ C1] __x64_sys_pread64+0x17d/0x1d0
> >> [40042.330204][ C1] do_syscall_64+0x40/0x80
> >> [40042.330205][ C1] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> [40042.330206][ C1]
> >> [40042.330206][ C1] other info that might help us debug this:
> >> [40042.330207][ C1]
> >> [40042.330208][ C1] Possible unsafe locking scenario:
> >> [40042.330208][ C1]
> >> [40042.330209][ C1]        CPU0                    CPU1
> >> [40042.330210][ C1]        ----                    ----
> >> [40042.330210][ C1]   lock(&p->pi_lock);
> >> [40042.330212][ C1]                                lock((console_sem).lock);
> >> [40042.330214][ C1]                                lock(&p->pi_lock);
> >> [40042.330216][ C1]   lock((console_sem).lock);
> >> [40042.330217][ C1]
> >> [40042.330218][ C1] *** DEADLOCK ***
> >> [40042.330218][ C1]
> >> [40042.330219][ C1] 3 locks held by fio/135817:
> >> [40042.330220][ C1] #0: ffffffffb643ae80 (rcu_read_lock){....}-{1:2}, at: __queue_work+0xa0/0xdb0
> >> [40042.330223][ C1] #1: ffff8888114730d8 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0x2ce/0xdb0
> >> [40042.330227][ C1] #2: ffff888117ec3f30 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x88/0x15e0
> >> [40042.330230][ C1]
> >> [40042.330231][ C1] stack backtrace:
> >> [40042.330231][ C1] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-rc5+ #1
> >> [40042.330232][ C1] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 3.2 11/22/2019
> >> [40042.330233][ C1] Call Trace:
> >> [40042.330234][ C1] dump_stack+0x93/0xc2
> >> [40042.330234][ C1] check_noncircular+0x235/0x2b0
> >> [40042.330235][ C1] ? print_circular_bug+0x1f0/0x1f0
> >> [40042.330236][ C1] ? enable_ptr_key_workfn+0x20/0x20
> >> [40042.330237][ C1] ? desc_read+0x218/0x2e0
> >> [40042.330237][ C1] ? vsnprintf+0x830/0x15c0
> >> [40042.330238][ C1] __lock_acquire+0x2940/0x51b0
> >> [40042.330239][ C1] ? printk_sprint+0x89/0x110
> >> [40042.330239][ C1] ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
> >> [40042.330240][ C1] ? vprintk_store+0x581/0x630
> >> [40042.330241][ C1] lock_acquire+0x181/0x6a0
> >> [40042.330242][ C1] ? down_trylock+0x13/0x70
> >> [40042.330242][ C1] ? lock_release+0x680/0x680
> >> [40042.330243][ C1] ? lock_chain_count+0x20/0x20
> >> [40042.330244][ C1] ? lock_downgrade+0x6a0/0x6a0
> >> [40042.330245][ C1] ? printk+0x96/0xb2
> >> [40042.330246][ C1] _raw_spin_lock_irqsave+0x3b/0x60
> >> [40042.330248][ C1] ? down_trylock+0x13/0x70
> >> [40042.330249][ C1] down_trylock+0x13/0x70
> >> [40042.330250][ C1] ? printk+0x96/0xb2
> >> [40042.330252][ C1] __down_trylock_console_sem+0x24/0x90
> >> [40042.330253][ C1] vprintk_emit+0x230/0x420
> >> [40042.330254][ C1] ? ttwu_queue_wakelist+0x284/0x2f0
> >> [40042.330255][ C1] printk+0x96/0xb2
> >> [40042.330256][ C1] ? record_print_text.cold+0x11/0x11
> >> [40042.330258][ C1] report_bug.cold+0x31/0x53
> >> [40042.330259][ C1] ? ttwu_queue_wakelist+0x284/0x2f0
> >> [40042.330260][ C1] handle_bug+0x3c/0x70
> >> [40042.330262][ C1] exc_invalid_op+0x14/0x40
> >> [40042.330263][ C1] asm_exc_invalid_op+0x12/0x20
> >> [40042.330264][ C1] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> >> [40042.330266][ C1] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4c 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 00 eb a0 <0f> 0b 45 31 ff e9 cb
> >> [40042.330269][ C1] Lost 69 message(s)!
> >>
> >
> >
> >--
> >Damien Le Moal
> >Western Digital Research