Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED

From: Schspa Shi
Date: Sat Jul 30 2022 - 00:20:08 EST



Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:

> On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>>
>>
>> CC Peter.
>> Peter has changed the CPU binding code in workqueue.c.
>
> [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
> [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
> ^^^^^^^^^^^^^^^^^^^^^
>
> I think we can ignore this as being some ancient kernel. Please try
> something recent.

Hi Peter,

I spent a few days writing a test case and have reproduced the problem
on kernel 5.19 (v5.19-rc8-rt8). I think it is time to review the V3
patch as a fix.

The V3 patch is at:
https://lore.kernel.org/all/20220714031645.28004-1-schspa@xxxxxxxxx/
Please help review it.

The test branch is:
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8

I think this code is new enough to demonstrate that the problem persists.

The log is as follows:

[ 3103.198684] ------------[ cut here ]------------
[ 3103.198684] Dying CPU not properly vacated!
[ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] Modules linked in: work_test(O)
[ 3103.198684] CPU: 1 PID: 23 Comm: migration/1 Tainted: G O 5.19.0-rc8-rt8 #1
[ 3103.198684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 3103.198684] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0x132/0x170
[ 3103.198684] RIP: 0010:sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] Code: 00 e9 a1 c1 40 ff 48 c7 c7 48 91 94 82 e8 99 29 00 00 48 c7 c7 00 5e 53 83 e9 e3 10 50 ff 48 c7 c7 98 91 94 82 e8 4f ec ff ff <0f> 0b 44 8b ab 10 0a 00 00 8b 4b 04 48 c7 c6 cd 37 93 82 48 c7 c7
[ 3103.198684] RSP: 0000:ffffc900000dbda0 EFLAGS: 00010086
[ 3103.198684] RAX: 0000000000000000 RBX: ffff88813bcaa280 RCX: 0000000000000000
[ 3103.198684] RDX: 0000000000000003 RSI: ffffffff82953971 RDI: 00000000ffffffff
[ 3103.198684] RBP: 0000000000000082 R08: 00000000000021ed R09: ffffc900000dbd38
[ 3103.198684] R10: 0000000000000001 R11: ffffffffffffffff R12: 0000000000000060
[ 3103.198684] R13: ffffffff810a9040 R14: ffffffff82c555c0 R15: 0000000000000000
[ 3103.198684] FS: 0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000
[ 3103.198684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3103.198684] CR2: 00007f85acd18010 CR3: 0000000102578000 CR4: 0000000000350ee0
[ 3103.198684] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3103.198684] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3103.198684] Call Trace:
[ 3103.198684] <TASK>
[ 3103.198684] ? sched_cpu_wait_empty+0x60/0x60
[ 3103.198684] cpuhp_invoke_callback+0x3a4/0x5f0
[ 3103.198684] take_cpu_down+0x71/0xd0
[ 3103.198684] multi_cpu_stop+0x5c/0xf0
[ 3103.198684] ? stop_machine_yield+0x10/0x10
[ 3103.198684] cpu_stopper_thread+0x82/0x130
[ 3103.198684] smpboot_thread_fn+0x1bb/0x2b0
[ 3103.198684] ? sort_range+0x20/0x20
[ 3103.198684] kthread+0xfe/0x120
[ 3103.198684] ? kthread_complete_and_exit+0x20/0x20
[ 3103.198684] ret_from_fork+0x1f/0x30
[ 3103.198684] </TASK>
[ 3103.198684] Kernel panic - not syncing: panic_on_warn set ...
[ 3103.198684] CPU: 1 PID: 23 Comm: migration/1 Tainted: G O 5.19.0-rc8-rt8 #1
[ 3103.198684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 3103.198684] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0x132/0x170
[ 3103.198684] Call Trace:
[ 3103.198684] <TASK>
[ 3103.198684] dump_stack_lvl+0x34/0x48
[ 3103.198684] panic+0xf8/0x299
[ 3103.198684] ? sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] __warn.cold+0x43/0xba
[ 3103.198684] ? sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] report_bug+0x9d/0xc0
[ 3103.198684] handle_bug+0x3c/0x70
[ 3103.198684] exc_invalid_op+0x14/0x70
[ 3103.198684] asm_exc_invalid_op+0x16/0x20
[ 3103.198684] RIP: 0010:sched_cpu_dying.cold+0xc/0xc3

--
BRs
Schspa Shi