Re: [PATCH RFC] signalfd: add support for SFD_TASK

From: Jens Axboe
Date: Wed Nov 27 2019 - 19:41:48 EST


On 11/27/19 4:27 PM, Jann Horn wrote:
On Wed, Nov 27, 2019 at 9:48 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
On 11/27/19 12:23 PM, Jann Horn wrote:
On Wed, Nov 27, 2019 at 6:11 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
I posted this a few weeks back, took another look at it and refined it a
bit. I'd like some input on the viability of this approach.

A new signalfd setup flag is added, SFD_TASK. This is only valid if used
with SFD_CLOEXEC. If set, the task setting up the signalfd descriptor is
remembered in the signalfd context, and will be the one we use for
checking signals in the poll/read handlers in signalfd.

This is needed to make signalfd useful with io_uring and aio, of which
the former in particular has my interest.

I _think_ this is sane. To prevent the case of a task clearing O_CLOEXEC
on the signalfd descriptor, forking, and then exiting, we grab a
reference to the task when we assign it. If that original task exits, we
catch it in signalfd_flush() and ensure waiters are woken up.

Mh... that's not really reliable, because you only get ->flush() from
the last exiting thread (or more precisely, the last exiting task that
shares the files_struct).

What is your goal here? To have a reference to a task without keeping
the entire task_struct around in memory if someone leaks the signalfd
to another process - basically like a weak pointer? If so, you could
store a refcounted reference to "struct pid" instead of a refcounted
reference to the task_struct, and then do the lookup of the
task_struct on ->poll and ->read (similar to what procfs does).

Yeah, I think that works out much better (and cleaner). How about this,
then? Follows your advice and turns it into a struct pid instead. I
don't particularly like the -ESRCH in dequeue and setup, what do you
think? For poll, POLLERR seems like a prudent choice.

-ESRCH may be kinda weird, but I also can't think of anything
better... and it does describe the problem pretty accurately: The task
whose signal state you're trying to inspect is gone. I went through
the list of errnos, and everything else sounded more weird...

Right, that's why I ultimately ended up with -ESRCH. But I'll take that
as concensus :-)

One more thing, though: We'll have to figure out some way to
invalidate the fd when the target goes through execve(), in particular
if it's a setuid execution. Otherwise we'll be able to just steal
signals that were intended for the other task, that's probably not
good.

So we should:
a) prevent using ->wait() on an old signalfd once the task has gone
through execve()
b) kick off all existing waiters
c) most importantly, prevent ->read() on an old signalfd once the
task has gone through execve()

We probably want to avoid using the cred_guard_mutex here, since it is
quite broad and has some deadlocking issues; it might make sense to
put the update of ->self_exec_id in fs/exec.c under something like the
siglock, and then for a) and c) we can check whether the
->self_exec_id changed while holding the siglock, and for b) we can
add a call to signalfd_cleanup() after the ->self_exec_id change.

OK, that seems like one for after the break. Was hoping there'd be a
more trivial way to accomplish that, I'll give it some thought.

+static void signalfd_put_task(struct signalfd_ctx *ctx, struct task_struct *tsk)
+{
+ if (ctx->task_pid)
+ put_task_struct(tsk);
+}
+
+static struct task_struct *signalfd_get_task(struct signalfd_ctx *ctx)
+{
+ if (ctx->task_pid)
+ return get_pid_task(ctx->task_pid, PIDTYPE_PID);
+
+ return current;
+}

This works, and I guess it's a question of coding style... but I'd
kinda prefer to do the refcount operation in both cases, so that the
semantics of the returned reference are simply "holds a reference"
instead of "either holds a reference or borrows from current depending
on ctx->task_pid". But if you feel strongly about it, feel free to
keep it as-is.

I don't feel super strongly about it, but I wanted to avoid adding an
unnecessary get/put of the current task for the existing use cases of
signalfd. So I'll probably just keep it as-is.

- add_wait_queue(&current->sighand->signalfd_wqh, &wait);
+ add_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
- ret = dequeue_signal(current, &ctx->sigmask, info);
+ ret = dequeue_signal(tsk, &ctx->sigmask, info);
if (ret != 0)
break;
if (signal_pending(current)) {
ret = -ERESTARTSYS;
break;
}
- spin_unlock_irq(&current->sighand->siglock);
+ spin_unlock_irq(&tsk->sighand->siglock);
schedule();

Should we be dropping the reference to the task before schedule() and
re-acquiring it afterwards so that if we're blocked on a signalfd read
and then the corresponding task dies, the refcount can drop to zero
and we can get woken up? Probably doesn't matter, but seems a bit
cleaner to me.

That would be simple enough to do, as we know that tsk is either still
the same, or we need to abort. Hence no need to fiddle waitqueues at
that point. I'll make that change.

- spin_lock_irq(&current->sighand->siglock);
+ spin_lock_irq(&tsk->sighand->siglock);
}
- spin_unlock_irq(&current->sighand->siglock);
+ spin_unlock_irq(&tsk->sighand->siglock);

- remove_wait_queue(&current->sighand->signalfd_wqh, &wait);
+ remove_wait_queue(&tsk->sighand->signalfd_wqh, &wait);
__set_current_state(TASK_RUNNING);

+ signalfd_put_task(ctx, tsk);
return ret;
}

@@ -267,19 +296,24 @@ static int do_signalfd4(int ufd, sigset_t *mask, int flags)
/* Check the SFD_* constants for consistency. */
BUILD_BUG_ON(SFD_CLOEXEC != O_CLOEXEC);
BUILD_BUG_ON(SFD_NONBLOCK != O_NONBLOCK);
+ BUILD_BUG_ON(SFD_TASK & (SFD_CLOEXEC | SFD_NONBLOCK));

- if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK))
+ if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK | SFD_TASK))
+ return -EINVAL;
+ if ((flags & (SFD_CLOEXEC | SFD_TASK)) == SFD_TASK)
return -EINVAL;

(non-actionable comment: It seems kinda weird that you can specify
these parameters with no effect for the `uffd != -1` case... but since
the existing parameters already work that way, I guess it's
consistent.)

Yeah, just following what it already does, though I do agree it is weird
with the two separate cases and it only impacting one of them. Didn't
want to make it behave differently.

--
Jens Axboe