Re: [PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state

From: Alexander Gordeev
Date: Tue Jun 21 2022 - 11:19:02 EST


On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote:
> Alexander Gordeev <agordeev@xxxxxxxxxxxxx> writes:
>
> > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote:
> >> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >>
> >> Currently ptrace_stop() / do_signal_stop() rely on the special states
> >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
> >> state exists only in task->__state and nowhere else.
> >>
> >> There's two spots of bother with this:
> >>
> >> - PREEMPT_RT has task->saved_state which complicates matters,
> >> meaning task_is_{traced,stopped}() needs to check an additional
> >> variable.
> >>
> >> - An alternative freezer implementation that itself relies on a
> >> special TASK state would loose TASK_TRACED/TASK_STOPPED and will
> >> result in misbehaviour.
> >>
> >> As such, add additional state to task->jobctl to track this state
> >> outside of task->__state.
> >>
> >> NOTE: this doesn't actually fix anything yet, just adds extra state.
> >>
> >> --EWB
> >> * didn't add a unnecessary newline in signal.h
> >> * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
> >> instead of in signal_wake_up_state. This prevents the clearing
> >> of TASK_STOPPED and TASK_TRACED from getting lost.
> >> * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
> >
> > Hi Eric, Peter,
> >
> > On s390 this patch triggers warning at kernel/ptrace.c:272 when
> > kill_child testcase from strace tool is repeatedly used (the source
> > is attached for reference):
> >
> > while :; do
> > strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child
> > done
> >
> > It normally takes few minutes to cause the warning in -rc3, but FWIW
> > it hits almost immediately for ptrace_stop-cleanup-for-v5.19 tag of
> > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.
> >
> > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't
> > fail") suggests this WARN_ON_ONCE() is not really expected, yet we
> > observe a child in __TASK_TRACED state. Could you please comment here?
> >
>
> For clarity the warning is that the child is not in __TASK_TRACED state.
>
> The code is waiting for the code to stop in the scheduler in the
> __TASK_TRACED state so that it can safely read and change the
> processes state. Some of that state is not even saved until the
> process is scheduled out so we have to wait until the process
> is stopped in the scheduler.

So I assume (checked actually) the return 0 below from kernel/sched/core.c:
wait_task_inactive() is where it bails out:

3303 while (task_running(rq, p)) {
3304 if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
3305 return 0;
3306 cpu_relax();
3307 }

Yet, the child task is always found in __TASK_TRACED state (as seen
in crash dumps):

> 101447 11342 13 ce3a8100 RU 0.0 10040 4412 strace
101450 101447 0 bb04b200 TR 0.0 2272 1136 kill_child
108261 101447 2 d0b10100 TR 0.0 2272 532 kill_child
crash> task bb04b200 __state
PID: 101450 TASK: bb04b200 CPU: 0 COMMAND: "kill_child"
__state = 8,

crash> task d0b10100 __state
PID: 108261 TASK: d0b10100 CPU: 2 COMMAND: "kill_child"
__state = 8,

> At least on s390 it looks like there is a race between SIGKILL and
> ptrace_check_attach. That isn't good.
>
> Reading the code below there is something missing because I don't see
> anything making ptrace calls, and ptrace_check_attach (which contains
> the warning) only happens in the ptrace syscall.

That is what I believe strace does when calling that code:

strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child

> Eric
>
>
>
> > Thanks!
> >
> > /*
> > * Check for the corner case that previously lead to segfault
> > * due to an attempt to access unitialised tcp->s_ent.
> > *
> > * 13994 ????( <unfinished ...>
> > * ...
> > * 13994 <... ???? resumed>) = ?
> > *
> > * Copyright (c) 2019 The strace developers.
> > * All rights reserved.
> > *
> > * SPDX-License-Identifier: GPL-2.0-or-later
> > */
> >
> > #include "tests.h"
> >
> > #include <sched.h>
> > #include <signal.h>
> > #include <unistd.h>
> > #include <sys/mman.h>
> > #include <sys/wait.h>
> >
> > #define ITERS 10000
> > #define SC_ITERS 10000
> >
> > int
> > main(void)
> > {
> > volatile sig_atomic_t *const mem =
> > mmap(NULL, get_page_size(), PROT_READ | PROT_WRITE,
> > MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> > if (mem == MAP_FAILED)
> > perror_msg_and_fail("mmap");
> >
> > for (unsigned int i = 0; i < ITERS; ++i) {
> > mem[0] = mem[1] = 0;
> >
> > const pid_t pid = fork();
> > if (pid < 0)
> > perror_msg_and_fail("fork");
> >
> > if (!pid) {
> > /* wait for the parent */
> > while (!mem[0])
> > ;
> > /* let the parent know we are running */
> > mem[1] = 1;
> >
> > for (unsigned int j = 0; j < SC_ITERS; j++)
> > sched_yield();
> >
> > pause();
> > return 0;
> > }
> >
> > /* let the child know we are running */
> > mem[0] = 1;
> > /* wait for the child */
> > while (!mem[1])
> > ;
> >
> > if (kill(pid, SIGKILL))
> > perror_msg_and_fail("kill");
> > if (wait(NULL) != pid)
> > perror_msg_and_fail("wait");
> > }
> >
> > return 0;
> > }