Re: [PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state

From: Eric W. Biederman
Date: Tue Jun 21 2022 - 13:47:51 EST


Alexander Gordeev <agordeev@xxxxxxxxxxxxx> writes:

> On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote:
>> Alexander Gordeev <agordeev@xxxxxxxxxxxxx> writes:
>>
>> > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote:
>> >> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> >>
>> >> Currently ptrace_stop() / do_signal_stop() rely on the special states
>> >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
>> >> state exists only in task->__state and nowhere else.
>> >>
>> >> There's two spots of bother with this:
>> >>
>> >> - PREEMPT_RT has task->saved_state which complicates matters,
>> >> meaning task_is_{traced,stopped}() needs to check an additional
>> >> variable.
>> >>
>> >> - An alternative freezer implementation that itself relies on a
>> >> special TASK state would lose TASK_TRACED/TASK_STOPPED and will
>> >> result in misbehaviour.
>> >>
>> >> As such, add additional state to task->jobctl to track this state
>> >> outside of task->__state.
>> >>
>> >> NOTE: this doesn't actually fix anything yet, just adds extra state.
>> >>
>> >> --EWB
>> >> * didn't add an unnecessary newline in signal.h
>> >> * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
>> >> instead of in signal_wake_up_state. This prevents the clearing
>> >> of TASK_STOPPED and TASK_TRACED from getting lost.
>> >> * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
>> >
>> > Hi Eric, Peter,
>> >
>> > On s390 this patch triggers a warning at kernel/ptrace.c:272 when
>> > the kill_child testcase from the strace tool is run repeatedly (the
>> > source is attached for reference):
>> >
>> > while :; do
>> > strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child
>> > done
>> >
>> > It normally takes a few minutes to cause the warning in -rc3, but FWIW
>> > it hits almost immediately for ptrace_stop-cleanup-for-v5.19 tag of
>> > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.
>> >
>> > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't
>> > fail") suggests this WARN_ON_ONCE() is not really expected, yet we
>> > observe a child in __TASK_TRACED state. Could you please comment here?
>> >
>>
>> For clarity the warning is that the child is not in __TASK_TRACED state.
>>
>> The code is waiting for the process to stop in the scheduler in the
>> __TASK_TRACED state so that it can safely read and change the
>> process's state. Some of that state is not even saved until the
>> process is scheduled out, so we have to wait until the process
>> is stopped in the scheduler.
>
> So I assume (checked actually) the return 0 below from kernel/sched/core.c:
> wait_task_inactive() is where it bails out:
>
> 	while (task_running(rq, p)) {
> 		if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> 			return 0;
> 		cpu_relax();
> 	}
>
> Yet, the child task is always found in __TASK_TRACED state (as seen
> in crash dumps):
>
> > 101447 11342 13 ce3a8100 RU 0.0 10040 4412 strace
> 101450 101447 0 bb04b200 TR 0.0 2272 1136 kill_child
> 108261 101447 2 d0b10100 TR 0.0 2272 532 kill_child
> crash> task bb04b200 __state
> PID: 101450 TASK: bb04b200 CPU: 0 COMMAND: "kill_child"
> __state = 8,
>
> crash> task d0b10100 __state
> PID: 108261 TASK: d0b10100 CPU: 2 COMMAND: "kill_child"
> __state = 8,

That is weird.

>> At least on s390 it looks like there is a race between SIGKILL and
>> ptrace_check_attach. That isn't good.
>>
>> Reading the code below there is something missing because I don't see
>> anything making ptrace calls, and ptrace_check_attach (which contains
>> the warning) only happens in the ptrace syscall.
>
> That is what I believe strace does when calling that code:
>
> strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child

Thank you. That was my braino.

I will have to see if it reproduces for me on x86 (I don't have an
s390). Perhaps if I can reproduce it I can guess what is going wrong.

So far it appears WARN_ON_ONCE has nothing to warn about, yet it is
warning.
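
For reference, here is a rough userspace sketch of the early return
quoted above. It is not kernel code: struct task_model,
wait_inactive_model() and the flip_at_tick/flip_to fields are made up
for illustration, and the "concurrent writer" is purely an assumption.
Nothing in this thread shows that such a writer exists. The only value
taken from the thread is __state = 8 for __TASK_TRACED. It shows that
the loop returns 0 only if the state is seen to differ from match_state
while the task is still on a CPU, even if the state later reads as 8
again.

```c
#include <assert.h>
#include <stddef.h>

/* Value seen in the crash output: __state = 8 is __TASK_TRACED. */
#define MODEL_TASK_TRACED 0x0008u

/* Hypothetical stand-in for the few fields the quoted loop looks at. */
struct task_model {
	unsigned int state;   /* models p->__state */
	int on_cpu_ticks;     /* spins left before the task is off the CPU */
	int flip_at_tick;     /* tick at which state is overwritten, or -1 */
	unsigned int flip_to; /* state written at flip_at_tick (assumption) */
};

/*
 * Models only the quoted busy-wait. While the task is still "running",
 * a state mismatch makes the caller give up with 0. Returns 1 once the
 * task is off the CPU and no mismatch was observed.
 */
static int wait_inactive_model(struct task_model *p, unsigned int match_state)
{
	int tick = 0;

	assert(p != NULL);
	while (p->on_cpu_ticks > 0) {
		if (tick == p->flip_at_tick)
			p->state = p->flip_to; /* simulated concurrent writer */
		if (match_state && p->state != match_state)
			return 0;
		p->on_cpu_ticks--;
		tick++;
	}
	return 1;
}
```

With state held at MODEL_TASK_TRACED for the whole spin the model
returns 1. It returns 0 only when flip_at_tick lands inside the spin,
which is the kind of window I would be looking for.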

Eric