Re: [PATCH v14] exec: Fix dead-lock in de_thread with ptrace_attach

From: Bernd Edlinger
Date: Tue Jan 23 2024 - 13:30:20 EST


On 1/22/24 22:30, Kees Cook wrote:
> On Mon, Jan 22, 2024 at 02:24:37PM +0100, Bernd Edlinger wrote:
>> The main concern was when a set-suid program is executed by execve.
>> Then it makes a difference if the current thread is traced before the
>> execve or not. That means if the current thread is already traced,
>> the decision, which credentials will be used is different than otherwise.
>>
>> So currently there are two possbilities, either the trace happens
>> before the execve, and the suid-bit will be ignored, or the trace
>> happens after the execve, but it is checked that the now potentially
>> more privileged credentials allow the tracer to proceed.
>>
>> With this patch we will have a third prossibility, that is in order
>> to avoid the possible dead-lock we allow the suid-bit to take effect,
>> but only if the tracer's privileges allow both to attach the current
>> credentials and the new credentials. But I would only do that as
>> a last resort, to avoid the possible dead-lock, and not unless a dead-lock
>> is really expected to happen.
>
> Instead of doing this special cred check (which I am worried could
> become fragile -- I'd prefer all privilege checks happen in the same
> place and in the same way...), could we just fail the ptrace_attach of
> the execve?
>

Hmm, yes. That is also possible, and that was actually my first approach,
but I think the current patch is superior. I have nevertheless tried it
again, to get a better picture of the differences between those two approaches.

See below for how that alternative approach would look like:
+ the advantage of that would be simplicity.
+ it avoids the dead-lock in the tracer.
- it is an API change, which we normally try to avoid.
- the adjusted test case(s) show that the tracer cannot successfully
attach to the resulting process before the execve'd process starts up.
So although there is no suid process involved in my test cases,
the traced program simply escapes out of the tracer's control.

The advantage of the current approach would be:
+ it avoids the dead-lock in the tracer
+ it avoids a potentially breaking API change.
+ the tracer is normally able to successfully attach to the
resulting process after the execve completes, before it starts
to execute.
+ the debug experience is just better.
- worst case that can happen, is that the security policy denies the
tracer the access to the new process after the execve. In that case
the PTRACE_ATTACH will fail each time it is attempted, in a similar way
as the the alternate approach. But the overall result is still correct.
The privileged process escapes, and that is okay in that case.
- it is theoretically possible that the security engine gets confused by
the additional call to security_ptrace_access_check, but that will be
something that can be fixed, when it happens.

However my main motivation, why I started this work was the security implication.

I assume the tracer is a privileged process, like an anti virus program,
that supervises all processes and if it detects some anomaly it can
ptrace attach to the target, check what it does and prevent it from doing
bad things.

- Currently a non-privileged program can potentially send such a privileged
tracer into a deadlock.
- With the alternative patch below that non-privileged can no longer send the
tracer into a deadlock, but it can still quickly escape out of the tracer's
control.
- But with my latest patch a sufficiently privileged tracer can neither be
sent into a deadlock nor can the attached process escape. Mission completed.


Thanks
Bernd.


Here is the alternative patch for reference:
diff --git a/fs/exec.c b/fs/exec.c
index e88249a1ce07..0a948f5821b7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1045,6 +1045,8 @@ static int de_thread(struct task_struct *tsk)
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
+ struct task_struct *t;
+ bool unsafe_execve_in_progress = false;

if (thread_group_empty(tsk))
goto no_thread_group;
@@ -1067,6 +1069,18 @@ static int de_thread(struct task_struct *tsk)
if (!thread_group_leader(tsk))
sig->notify_count--;

+ for_other_threads(tsk, t) {
+ if (unlikely(t->ptrace)
+ && (t != tsk->group_leader || !t->exit_state))
+ unsafe_execve_in_progress = true;
+ }
+
+ if (unlikely(unsafe_execve_in_progress)) {
+ spin_unlock_irq(lock);
+ mutex_unlock(&sig->cred_guard_mutex);
+ spin_lock_irq(lock);
+ }
+
while (sig->notify_count) {
__set_current_state(TASK_KILLABLE);
spin_unlock_irq(lock);
@@ -1157,6 +1171,9 @@ static int de_thread(struct task_struct *tsk)
release_task(leader);
}

+ if (unlikely(unsafe_execve_in_progress))
+ mutex_lock(&sig->cred_guard_mutex);
+
sig->group_exec_task = NULL;
sig->notify_count = 0;

@@ -1168,6 +1185,9 @@ static int de_thread(struct task_struct *tsk)
return 0;

killed:
+ if (unlikely(unsafe_execve_in_progress))
+ mutex_lock(&sig->cred_guard_mutex);
+
/* protects against exit_notify() and __exit_signal() */
read_lock(&tasklist_lock);
sig->group_exec_task = NULL;
@@ -1479,6 +1499,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
return -ERESTARTNOINTR;

+ if (unlikely(current->signal->group_exec_task)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ return -ERESTARTNOINTR;
+ }
+
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 98a031ac2648..55816320c103 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2785,6 +2785,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
if (rv < 0)
goto out_free;

+ if (unlikely(current->signal->group_exec_task)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ rv = -ERESTARTNOINTR;
+ goto out_free;
+ }
+
rv = security_setprocattr(PROC_I(inode)->op.lsmid,
file->f_path.dentry->d_name.name, page,
count);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 2fabd497d659..162e4c8f7b08 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -444,6 +444,9 @@ static int ptrace_attach(struct task_struct *task, long request,
scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
&task->signal->cred_guard_mutex) {

+ if (unlikely(task->signal->group_exec_task))
+ return -EAGAIN;
+
scoped_guard (task_lock, task) {
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
if (retval)
@@ -491,6 +494,14 @@ static int ptrace_traceme(void)
{
int ret = -EPERM;

+ if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+ return -ERESTARTNOINTR;
+
+ if (unlikely(current->signal->group_exec_task)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ return -ERESTARTNOINTR;
+ }
+
write_lock_irq(&tasklist_lock);
/* Are we already being traced? */
if (!current->ptrace) {
@@ -506,6 +517,7 @@ static int ptrace_traceme(void)
}
}
write_unlock_irq(&tasklist_lock);
+ mutex_unlock(&current->signal->cred_guard_mutex);

return ret;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index aca7b437882e..6a136d6ddf7c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
* Make sure we cannot change seccomp or nnp state via TSYNC
* while another thread is in the middle of calling exec.
*/
- if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
- mutex_lock_killable(&current->signal->cred_guard_mutex))
- goto out_put_fd;
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+ if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+ goto out_put_fd;
+
+ if (unlikely(current->signal->group_exec_task)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ goto out_put_fd;
+ }
+ }

spin_lock_irq(&current->sighand->siglock);

diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..7a51a350a068 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -14,6 +14,7 @@
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>
+#include <sys/syscall.h>

static void *thread(void *arg)
{
@@ -23,7 +24,7 @@ static void *thread(void *arg)

TEST(vmaccess)
{
- int f, pid = fork();
+ int s, f, pid = fork();
char mm[64];

if (!pid) {
@@ -31,19 +32,42 @@ TEST(vmaccess)

pthread_create(&pt, NULL, thread, NULL);
pthread_join(pt, NULL);
- execlp("true", "true", NULL);
+ execlp("false", "false", NULL);
+ return;
}

sleep(1);
sprintf(mm, "/proc/%d/mem", pid);
+ /* deadlock did happen here */
f = open(mm, O_RDONLY);
ASSERT_GE(f, 0);
close(f);
- f = kill(pid, SIGCONT);
- ASSERT_EQ(f, 0);
+ f = waitpid(-1, &s, WNOHANG);
+ ASSERT_NE(f, -1);
+ ASSERT_NE(f, 0);
+ ASSERT_NE(f, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 0);
+ f = waitpid(-1, &s, 0);
+ ASSERT_EQ(f, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 1);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_EQ(f, -1);
+ ASSERT_EQ(errno, ECHILD);
}

-TEST(attach)
+/*
+ * Same test as previous, except that
+ * we try to ptrace the group leader,
+ * which is about to call execve,
+ * when the other thread is already ptraced.
+ * This exercises the code in de_thread
+ * where it is waiting inside the
+ * while (sig->notify_count) {
+ * loop.
+ */
+TEST(attach1)
{
int s, k, pid = fork();

@@ -52,19 +76,67 @@ TEST(attach)

pthread_create(&pt, NULL, thread, NULL);
pthread_join(pt, NULL);
- execlp("sleep", "sleep", "2", NULL);
+ execlp("false", "false", NULL);
+ return;
}

sleep(1);
+ /* deadlock may happen here */
k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
- ASSERT_EQ(errno, EAGAIN);
ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, EAGAIN);
k = waitpid(-1, &s, WNOHANG);
ASSERT_NE(k, -1);
ASSERT_NE(k, 0);
ASSERT_NE(k, pid);
ASSERT_EQ(WIFEXITED(s), 1);
ASSERT_EQ(WEXITSTATUS(s), 0);
+ k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, EAGAIN);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 1);
+ k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, ESRCH);
+ k = waitpid(-1, NULL, 0);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, ECHILD);
+}
+
+/*
+ * Same test as previous, except that
+ * the group leader is ptraced first,
+ * but this time with PTRACE_O_TRACEEXIT,
+ * and the thread that does execve is
+ * not yet ptraced. This exercises the
+ * code block in de_thread where the
+ * if (!thread_group_leader(tsk)) {
+ * is executed and enters a wait state.
+ */
+static long thread2_tid;
+static void *thread2(void *arg)
+{
+ thread2_tid = syscall(__NR_gettid);
+ sleep(2);
+ execlp("false", "false", NULL);
+ return NULL;
+}
+
+TEST(attach2)
+{
+ int s, k, pid = fork();
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread2, NULL);
+ pthread_join(pt, NULL);
+ return;
+ }
+
sleep(1);
k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
ASSERT_EQ(k, 0);
@@ -72,12 +144,40 @@ TEST(attach)
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
- k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+ k = ptrace(PTRACE_SETOPTIONS, pid, 0L, PTRACE_O_TRACEEXIT);
+ ASSERT_EQ(k, 0);
+ thread2_tid = ptrace(PTRACE_PEEKDATA, pid, &thread2_tid, 0L);
+ ASSERT_NE(thread2_tid, -1);
+ ASSERT_NE(thread2_tid, 0);
+ ASSERT_NE(thread2_tid, pid);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, 0);
+ sleep(2);
+ /* deadlock may happen here */
+ k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, EAGAIN);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+ k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, EAGAIN);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_EQ(k, 0);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
+ k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, EAGAIN);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFEXITED(s), 1);
- ASSERT_EQ(WEXITSTATUS(s), 0);
+ ASSERT_EQ(WEXITSTATUS(s), 1);
+ k = ptrace(PTRACE_ATTACH, thread2_tid, 0L, 0L);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, ESRCH);
k = waitpid(-1, NULL, 0);
ASSERT_EQ(k, -1);
ASSERT_EQ(errno, ECHILD);