Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach

From: Bernd Edlinger
Date: Mon Jan 22 2024 - 08:23:55 EST


On 1/17/24 17:38, Oleg Nesterov wrote:
> On 01/17, Bernd Edlinger wrote:
>> Yes. but the tracer has to do its job, and that is ptrace_attach the
>> remaining treads, it does not know that it would avoid a dead-lock
>> when it calls wait(), instead of ptrace_attach. It does not know
>> that the tracee has just called execve in one of the not yet traced
>> threads.
>
> Hmm. I don't understand you.

Certainly I am willing to rephrase this until it is understandable.
Probably I have just not yet found the proper way to describe the issue
here, and your help in resolving that documentation issue is very important
to me.

>
> I agree we have a problem which should be fixed. Just the changelog
> looks confusing to me, imo it doesn't explain the race/problem clearly.
>

I am trying here to summarize what the test case "attach" in
/tools/testing/selftests/ptrace/vmaccess.c does.

I think it models the use case of a tracer that is trying to attach
to a multi-threaded process that is executing execve in a not-yet
traced thread while a different sub-thread is already traced,
it is not relevant that the test case uses PTRACE_TRACEME, to make
the sub-thead traced, the same would happen if the tracer uses
some out-of-band mechanism like /proc/pid/task to learn the thread_id
of the sub-threads and uses ptrace_attach to each of them.

The test case hits the dead-lock because there is a race condition
between before the PTRACE_ATTACH, and it cannot know that the
exit event from the sub-thread is already pending before the
PTRACE_ATTACH. Of course a real tracer will not sleep a whole
second before a PTRACE_ATTACH, but even if it does a
waitpid(-1, &s, WNOHANG) immediately before the PTRACE_ATTACH
there is a tiny chance that the execve is entered just immediately
after waitpid has indicated that there is currently not
event pending.

>>>> + if (unlikely(t->ptrace)
>>>> + && (t != tsk->group_leader || !t->exit_state))
>>>> + unsafe_execve_in_progress = true;
>>>
>>> The !t->exit_state is not right... This sub-thread can already be a zombie
>>> with ->exit_state != 0 but see above, it won't be reaped until the debugger
>>> does wait().
>>>
>>
>> I dont think so.
>> de_thread() handles the group_leader different than normal threads.
>
> I don't follow...
>
> I didn't say that t is a group leader. I said it can be a zombie sub-thread
> with ->exit_state != 0.
>

the condition here is

(t != tsk->group_leader || !t->exit_state)

so in other words, if t is a sub-thread, i.e. t != tsk->group_leader
then the t->exit_state does not count, and the deadlock is possible.

But if t it is a group leader, then t == tsk->group_leader, but a
deadlock is only possible when t->exit_state == 0 at this time.
The most likely reason for this is PTRACE_O_TRACEEXIT.

I will add a new test case that demonstrates this in the next iteration
of this patch. Here is a preview of what I have right now:

/*
* Same test as previous, except that
* the group leader is ptraced first,
* but this time with PTRACE_O_TRACEEXIT,
* and the thread that does execve is
* not yet ptraced. This exercises the
* code block in de_thread where the
* if (!thread_group_leader(tsk)) {
* is executed and enters a wait state.
*/
static long thread2_tid;
static void *thread2(void *arg)
{
thread2_tid = syscall(__NR_gettid);
sleep(2);
execlp("false", "false", NULL);
return NULL;
}

TEST(attach2)
{
int s, k, pid = fork();

if (!pid) {
pthread_t pt;

pthread_create(&pt, NULL, thread2, NULL);
pthread_join(pt, NULL);
return;
}

sleep(1);
k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
ASSERT_EQ(k, 0);
thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
ASSERT_NE(thread2_tid, -1);
ASSERT_NE(thread2_tid, 0);
ASSERT_NE(thread2_tid, pid);
k = waitpid(-1, &s, WNOHANG);
ASSERT_EQ(k, 0);
sleep(2);
/* deadlock may happen here */
k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, WNOHANG);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
k = waitpid(-1, &s, WNOHANG);
ASSERT_EQ(k, 0);
k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
k = waitpid(-1, &s, WNOHANG);
ASSERT_EQ(k, 0);
k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
k = waitpid(-1, &s, WNOHANG);
ASSERT_EQ(k, 0);
k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFEXITED(s), 1);
ASSERT_EQ(WEXITSTATUS(s), 1);
k = waitpid(-1, NULL, 0);
ASSERT_EQ(k, -1);
ASSERT_EQ(errno, ECHILD);
}

So the traced process does the execve in the sub-thread,
and the tracer attaches the thread leader first, and when
the next PTRACE_ATTACH happens, the thread leader is stopped
because of the PTRACE_O_TACEEXIT. So at that time,
the t->exit_state == 0, and we receive the following:

k = waitpid(-1, &s, WNOHANG);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGTRAP);

yet the de_thread is not finished now, but only when

k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);

happens.

The only thing, that is admittedly pretty confusing here, is
the fact that the thread2_tid morphs into group leader's pid,
at this time, and is therefore never heard of again.

So pid refers to the former thread2_tid from now on, and the
former group leader does not enter the usual zombie state here,
because the sub-thread takes over it's role.


>>>> + if (unlikely(unsafe_execve_in_progress)) {
>>>> + spin_unlock_irq(lock);
>>>> + sig->exec_bprm = bprm;
>>>> + mutex_unlock(&sig->cred_guard_mutex);
>>>> + spin_lock_irq(lock);
>>>
>>> I don't understand why do we need to unlock and lock siglock here...
>>
>> That is just a precaution because I did want to release the
>> mutexes exactly in the reverse order as they were acquired.
>
> To me this adds the unnecessary complication.
>

Well, the proposed change creates this sequence

mutex_lock(&sig_cred_guard_mutex);
spin_lock_irq(lock);
mutex_unlock(&sig_cred_guard_mutex);
spin_unlock_irq(lock);

I wanted to avoid that, because in a usual real-time os,
I'd expect the mutex_unlock to schedule another waiting
task, regardless of the spin lock state.

Are you saying that doing this is safe to do in linux,
because the scheduling does not happen except when
explicitly asked for e.g. by calling schedule() ?
And would that also be safe for real time linux port ?

>>> But my main question is why do we need the unsafe_execve_in_progress boolean.
>>> If this patch is correct and de_thread() can drop and re-acquire cread_guard_mutex
>>> when one of the threads is traced, then why can't we do this unconditionally ?
>>>
>>
>> I just wanted to keep the impact of the change as small as possible,
>
> But the unsafe_execve_in_progress logic increases the impact and complicates
> the patch.
>
> I think the fix should be as simple as possible. (to be honest, right now
> I don't think this is a right approach).
>

The main concern was when a set-suid program is executed by execve.
Then it makes a difference if the current thread is traced before the
execve or not. That means if the current thread is already traced,
the decision, which credentials will be used is different than otherwise.

So currently there are two possbilities, either the trace happens
before the execve, and the suid-bit will be ignored, or the trace
happens after the execve, but it is checked that the now potentially
more privileged credentials allow the tracer to proceed.

With this patch we will have a third prossibility, that is in order
to avoid the possible dead-lock we allow the suid-bit to take effect,
but only if the tracer's privileges allow both to attach the current
credentials and the new credentials. But I would only do that as
a last resort, to avoid the possible dead-lock, and not unless a dead-lock
is really expected to happen.


Thanks
Bernd.