Re: re. Spurious wakeup on a newly created kthread

From: Petr Mladek
Date: Tue Jun 28 2022 - 11:08:19 EST


On Mon 2022-06-27 10:07:22, Michal Hocko wrote:
> On Sat 25-06-22 14:00:10, Tejun Heo wrote:
> [...]
> > tl;dr is that the worker creation code expects a newly created worker
> > kthread to sit tight until the creator finishes setting up stuff and
> > sends the initial wakeup. However, something, which wasn't identified
> > in the report (Petr, it'd be great if you can find out who did the
> > wakeup), wakes up the new kthread before the creation path is done
> > with init which causes the new kthread to try to deref a NULL pointer.
>
> One thing that hasn't been really mentioned explicitly but it might be
> really important. 5.3.18-2-rt is the RT patchset backported to 5.3
> based SLES kernel. We do not know whether the upstream RT patchset is
> affected the same way and due to nature of the customer we cannot really
> have this tested on that kernel.
>
> I have tried to dig that out from the crash dump but no luck
> unfortunately. I was able to find the kthread pointer on few stacks:

I tried to dig more but I am not sure if it explains the problem.

The newly created worker was woken from the "sysmon" process because
it was on the wake_q_sleeper queue for its own &task->sighand->siglock.
It should not cause problems if the worker was really blocked
and waiting for this spin lock.

It looks to me like ptrace might, in theory, cause the problematic
wakeup.

Note that it is a SUSE 5.3 kernel with the backported RT patchset.


See below for more details.

Michal found this process by searching for ffff9ca77e9a8000
on process stacks.
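
Presumably with something like the following, though I am only
guessing the exact crash command ("search -t" limits the search to
the kernel stacks of all tasks):

crash> search -t ffff9ca77e9a8000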

> sysmon is
> crash> bt ffff9ca75e6349c0
> PID: 842 TASK: ffff9ca75e6349c0 CPU: 0 COMMAND: "sysmon"
> #0 [ffffb0f6808e7be8] __schedule at ffffffffaa94bef1
> #1 [ffffb0f6808e7c78] preempt_schedule_common at ffffffffaa94c5d1
> #2 [ffffb0f6808e7c80] ___preempt_schedule at ffffffffaa201bb6
> #3 [ffffb0f6808e7cd8] rt_mutex_postunlock at ffffffffaa309f6b
> #4 [ffffb0f6808e7ce8] rt_mutex_futex_unlock at ffffffffaa94ea7c
> #5 [ffffb0f6808e7d30] rt_spin_unlock at ffffffffaa950ace
> #6 [ffffb0f6808e7d38] proc_pid_status at ffffffffaa4c9ff4
> #7 [ffffb0f6808e7dd8] proc_single_show at ffffffffaa4c3801
> #8 [ffffb0f6808e7e10] seq_read at ffffffffaa47c1b8
> #9 [ffffb0f6808e7e70] vfs_read at ffffffffaa455fd9
> #10 [ffffb0f6808e7ea0] ksys_read at ffffffffaa456364
>
> so collecting /proc/<pid>/status data and releasing the sighand lock.


In fact, I found the pointer to the task struct twice there:

1. Direct pointer to task struct:

The stack of the process "sysmon" is here:

#5 [ffffb0f6808e7d30] rt_spin_unlock+0x3e at ffffffffaa950ace
ffffb0f6808e7d38: proc_pid_status+1636
#6 [ffffb0f6808e7d38] proc_pid_status+0x664 at ffffffffaa4c9ff4
ffffb0f6808e7d40: init_user_ns init_pid_ns
ffffb0f6808e7d50: 0000000000000000 0000000000000000
ffffb0f6808e7d60: 00003f550019fca0 ffff9ca700000000
ffffb0f6808e7d70: 0000000000000000 0000000000000000
ffffb0f6808e7d80: 0000000000000000 0000000000000000
ffffb0f6808e7d90: ffffffffffffffff 0000000000000000
ffffb0f6808e7da0: 2c93096daf3bc600 [ffff9ca77e9a8000:task_struct]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ffffb0f6808e7db0: [ffff9ca7602f6408:proc_inode_cache] [ffff9ca77e934908:seq_file]
ffffb0f6808e7dc0: [ffff9ca71bd7ad80:pid] init_pid_ns
ffffb0f6808e7dd0: [ffff9ca77e934908:seq_file] proc_single_show+81

So, the pointer is on an actively used and valid stack.

It seems to be the task_struct pointer that was passed to proc_pid_status().
Therefore, proc_pid_status() is showing the status of the newly created
worker that got woken prematurely.

According to the assembly, the call chain is:

+ proc_pid_status()
+ task_sig()
+ unlock_task_sighand()
+ spin_unlock_irqrestore(&task->sighand->siglock)
+ rt_spin_unlock()
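
For context, the matching upstream code looks roughly like this (my
paraphrase of include/linux/sched/signal.h; on RT the siglock is a
sleeping lock, so the unlock ends up in rt_spin_unlock()):

static inline void unlock_task_sighand(struct task_struct *task,
                                       unsigned long *flags)
{
        /* releases the siglock taken earlier by lock_task_sighand() */
        spin_unlock_irqrestore(&task->sighand->siglock, *flags);
}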


2. Indirect pointer to task struct:

The process "sysmon" is interrupted in rt_mutex_postunlock().
It looks in this kernel like:

void rt_mutex_postunlock(struct wake_q_head *wake_q,
                         struct wake_q_head *wake_sleeper_q)
{
        wake_up_q(wake_q);
        wake_up_q_sleeper(wake_sleeper_q);

        /* Pairs with preempt_disable() in rt_mutex_slowunlock() */
        preempt_enable();
}

Note that pointers to wake_q and wake_sleeper_q are passed as
parameters.

They point to the related fields in task_struct:

crash> struct -x -o task_struct | grep wake_q
[0x860] struct wake_q_node wake_q;
[0x868] struct wake_q_node wake_q_sleeper;
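
For context, a wake queue entry is just a single-linked list node
embedded in task_struct; this is roughly the upstream definition
(the wake_q_sleeper node next to it comes from the RT patchset):

struct wake_q_node {
        struct wake_q_node *next;
};

/* in struct task_struct */
struct wake_q_node wake_q;              /* offset 0x860 here */
struct wake_q_node wake_q_sleeper;      /* offset 0x868 here, RT only */

So an entry on the queue is a pointer into the middle of some task's
task_struct.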

The relevant part of the stack of "sysmon" is:

crash> bt -FFxs 842
[...]
#2 [ffffb0f6808e7c80] ___preempt_schedule+0x16 at ffffffffaa201bb6
ffffb0f6808e7c88: [ffff9ca764ee90bd:kmalloc-4k] ffffb0f6808e7c20
ffffb0f6808e7c98: 0000000000027340 fair_sched_class
ffffb0f6808e7ca8: 0000000000000003 0000000000000000
ffffb0f6808e7cb8: 0000000000000000 0000000000000202
ffffb0f6808e7cc8: [ffff9ca77e9a8028:task_struct] ffffb0f6808e7d08
ffffb0f6808e7cd8: rt_mutex_postunlock+43
#3 [ffffb0f6808e7cd8] rt_mutex_postunlock+0x2b at ffffffffaa309f6b
ffffb0f6808e7ce0: [ffff9ca762400000:sighand_cache] rt_mutex_futex_unlock+172
#4 [ffffb0f6808e7ce8] rt_mutex_futex_unlock+0xac at ffffffffaa94ea7c
ffffb0f6808e7cf0: 0000000000000246 0000000000000001
ffffb0f6808e7d00: ffffb0f6808e7cf8 [ffff9ca77e9a8868:task_struct]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ffffb0f6808e7d10: [ffff9ca77e9a8868:task_struct] 2c93096daf3bc600
ffffb0f6808e7d20: [ffff9ca77e934908:seq_file] 0000000000000000
ffffb0f6808e7d30: rt_spin_unlock+62

and if we look at the task struct via the offset to wake_q_sleeper:

crash> struct task_struct -l 0x868 ffff9ca77e9a8868 | grep -e pid
pid = 16213,
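
I.e. 0xffff9ca77e9a8868 - 0x868 = 0xffff9ca77e9a8000, which is exactly
the task_struct of the newly created worker found above.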

So, the newly created kthread, PID 16213, was on the
wake_q_sleeper queue related to its own &task->sighand->siglock.
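
For reference, the queue is processed roughly like this (my paraphrase
of the upstream wake_up_q(); the RT wake_up_q_sleeper() variant should
be analogous, just walking the wake_q_sleeper nodes):

void wake_up_q(struct wake_q_head *head)
{
        struct wake_q_node *node = head->first;

        while (node != WAKE_Q_TAIL) {
                struct task_struct *task;

                /* the node is embedded in task_struct, see the offsets above */
                task = container_of(node, struct task_struct, wake_q);

                node = node->next;
                task->wake_q.next = NULL;

                /* unconditional wakeup of whoever is on the queue */
                wake_up_process(task);
                put_task_struct(task);
        }
}

So once the new kthread ended up on this queue, it was going to be
woken regardless of why it went to sleep.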


Doubts:

Everything should be fine if the kthread was really blocked and waiting for
&task->sighand->siglock.

Is there any other reason why the kthread could end up on the
wake_q_sleeper queue even when it did not take &task->sighand->siglock
directly?

+ RT-specific hacks to avoid priority inversion of processes?
+ ptrace?

Or is it possible that the following code takes task->sighand->siglock
on an RT kernel?

static int kthread(void *_create)
{
        [...]
        /* OK, tell user we're spawned, wait for stop or wakeup */
        __set_current_state(TASK_UNINTERRUPTIBLE);
        create->result = current;
        /*
         * Thread is going to call schedule(), do not preempt it,
         * or the creator may spend more time in wait_task_inactive().
         */
        preempt_disable();
        complete(done);
        schedule_preempt_disabled();
        [...]
}
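
For completeness, this is my simplified reading of the creator side in
upstream kernel/workqueue.c (the SLE kernel might differ in details).
The wake_up_process() at the end is the "official" first wakeup that
the new kthread is supposed to wait for; if I read the code correctly,
worker->pool is only assigned during worker_attach_to_pool(), which
would explain the NULL dereference when the worker runs prematurely:

static struct worker *create_worker(struct worker_pool *pool)
{
        [...]
        worker->task = kthread_create_on_node(worker_thread, worker,
                                              pool->node, "kworker/%s", id_buf);
        [...]
        /* successful, attach the worker to the pool (sets worker->pool) */
        worker_attach_to_pool(worker, pool);

        /* start the newly created worker */
        spin_lock_irq(&pool->lock);
        worker->pool->nr_workers++;
        worker_enter_idle(worker);
        wake_up_process(worker->task);
        spin_unlock_irq(&pool->lock);

        return worker;
        [...]
}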

Some more info:

The last running times are:

crash> ps -l
[115147050548884] [RU] PID: 16127 TASK: ffff9ca77e9abb00 CPU: 0 COMMAND: "kworker/0:0H"
[115147050481426] [IN] PID: 1829 TASK: ffff9ca71f581d80 CPU: 0 COMMAND: "vReqJob"
[115147050467716] [IN] PID: 704 TASK: ffff9ca7624e0ec0 CPU: 0 COMMAND: "irq/127-eth0-Tx"
[115147050456263] [RU] PID: 16213 TASK: ffff9ca77e9a8000 CPU: 1 COMMAND: "kworker/0:2H"
[115147050394683] [IN] PID: 2035 TASK: ffff9ca71e252c40 CPU: 1 COMMAND: "vReqJob"
[115147050370507] [IN] PID: 2146 TASK: ffff9ca71dcc6740 CPU: 1 COMMAND: "vReqJob"
[115147050363317] [RU] PID: 842 TASK: ffff9ca75e6349c0 CPU: 0 COMMAND: "sysmon"
[115147050334723] [IN] PID: 1875 TASK: ffff9ca71f6c1d80 CPU: 1 COMMAND: "vReqJob"
[115147050287774] [IN] PID: 705 TASK: ffff9ca7624e0000 CPU: 1 COMMAND: "irq/128-eth0-Tx"
[115147050279924] [RU] PID: 2 TASK: ffff9ca700381d80 CPU: 0 COMMAND: "kthreadd"

I am not sure what the vReqJob tasks do. The new worker was scheduled on
CPU1 right after vReqJob, PID 2035. In theory, that process might
have triggered the rescheduling as well, but I do not see it:

crash> bt 2035
PID: 2035 TASK: ffff9ca71e252c40 CPU: 1 COMMAND: "vReqJob"
#0 [ffffb0f68300fbb0] __schedule at ffffffffaa94bef1
#1 [ffffb0f68300fc40] schedule at ffffffffaa94c4e6
#2 [ffffb0f68300fc50] futex_wait_queue_me at ffffffffaa33e2c0
#3 [ffffb0f68300fc88] futex_wait at ffffffffaa33f199
#4 [ffffb0f68300fd98] do_futex at ffffffffaa341062
#5 [ffffb0f68300fe70] __x64_sys_futex at ffffffffaa341a7e
#6 [ffffb0f68300fee0] do_syscall_64 at ffffffffaa2025f0
#7 [ffffb0f68300ff50] entry_SYSCALL_64_after_hwframe at ffffffffaaa0008c
RIP: 00007f7317e0487d RSP: 00007f7208777cd0 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007f72f00209d0 RCX: 00007f7317e0487d
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f72f00209fc
RBP: 0000000000000000 R8: 0000000000000000 R9: 00007f72f800eef0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f72f0020998
R13: 000000001df92377 R14: 0000000000000000 R15: 00007f72f00209fc
ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b

From the stack, I do not think that it is related:

#3 [ffffb0f68300fc88] futex_wait+0xd9 at ffffffffaa33f199
ffffb0f68300fc90: 00007f72f00209fc ffffffff00000000
ffffb0f68300fca0: ffff9ca77e4eec00 0000000000000001
ffffb0f68300fcb0: 4063c3da32ce9400 ffffb0f68300fe68
ffffb0f68300fcc0: ffffb0f68300fe40 ffffb0f68300fe68
ffffb0f68300fcd0: ffffb0f68300fe40 [ffff9ca762a0ec00:sock_inode_cache]
ffffb0f68300fce0: 0000000000004000 ___sys_sendmsg+153
ffffb0f68300fcf0: 0000000000000064 ffffb0f68300fcf8
ffffb0f68300fd00: ffffb0f68300fcf8 ffff9ca77e4eec38
ffffb0f68300fd10: ffffb0f682befd08 [ffff9ca71e252c40:task_struct]
ffffb0f68300fd20: ffff9ca77e4eec08 [ffff9ca764b2b400:mm_struct]
ffffb0f68300fd30: 00007f72f0020000 00000000000009fc
ffffb0f68300fd40: 0000000000000000 0000000000000000
ffffb0f68300fd50: 0000000000000000 00000000ffffffff
ffffb0f68300fd60: 4063c3da32ce9400 0000000000000000
ffffb0f68300fd70: 0000000000000000 0000000000000000
ffffb0f68300fd80: 0000000000000000 00007f72f800eef0
ffffb0f68300fd90: 00007f72f00209fc do_futex+322
#4 [ffffb0f68300fd98] do_futex+0x142 at ffffffffaa341062
ffffb0f68300fda0: __switch_to_asm+52 __switch_to_asm+64
ffffb0f68300fdb0: __switch_to_asm+52 __switch_to_asm+64
ffffb0f68300fdc0: __switch_to_asm+52 __switch_to_asm+64
ffffb0f68300fdd0: __switch_to_asm+52 __switch_to_asm+64
ffffb0f68300fde0: audit_filter_rules.constprop.22+583 [ffff9ca7632b1e40:cred_jar]
ffffb0f68300fdf0: __switch_to_asm+52 [ffff9ca71e252c40:task_struct]
ffffb0f68300fe00: __switch_to_asm+52 0000000132ce9400
ffffb0f68300fe10: ffffb0f68300fe6c ffffb0f68300fe60
ffffb0f68300fe20: 0000000000004000 0000000000000000
ffffb0f68300fe30: auditd_test_task+56 4063c3da32ce9400
ffffb0f68300fe40: 0000000000000080 0000000000000000
ffffb0f68300fe50: 00007f72f00209fc 0000000000000000
ffffb0f68300fe60: 00007f72f800eef0 0000000000000000
ffffb0f68300fe70: __x64_sys_futex+94
#5 [ffffb0f68300fe70] __x64_sys_futex+0x5e at ffffffffaa341a7e
ffffb0f68300fe78: 00007f72ffffffff 0000000000000000
ffffb0f68300fe88: auditd_test_task+56 [ffff9ca71e246400:kmalloc-1k]
ffffb0f68300fe98: 00000000000000ca __audit_syscall_entry+235
ffffb0f68300fea8: 4063c3da32ce9400 00000000000000ca
ffffb0f68300feb8: ffffb0f68300ff58 00000000000000ca
ffffb0f68300fec8: 0000000000000000 0000000000000000
ffffb0f68300fed8: 0000000000000000 do_syscall_64+128


Best Regards,
Petr