Re: [PATCH v2 0/5] pid: add pidfd_open()

From: Jonathan Kowalski
Date: Mon Apr 01 2019 - 12:27:47 EST


On Mon, Apr 1, 2019 at 5:15 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Apr 1, 2019 at 9:07 AM Jonathan Kowalski <bl0pbl33p@xxxxxxxxx> wrote:
> >
> > With the POLLHUP model on a simple pidfd, you'd know when the process
> > you were referring to is dead (and one can map POLLPRI to dead and
> > POLLHUP to zombie, etc).
>
> Adding ->poll() to the pidfd should be easy. Again, it would be
> trivially be made to work for the directory fd you get from
> /proc/<pid> too.
>
> Yeah, yeah, pollable directories are odd, but the vfs layer doesn't
> care about things like "is this a directory or not". It will just call
> the f_op->poll() method.

I know, Andy even sent a patch for that long back. The question is,
this sure solves the immediate usecase, but it inhibits some very
powerful (and natural) things from being realised in the future, and
makes some choices harder.

Currently, pidfd_send_signal doesn't work across PID namespaces. It
would be possible to make it work, but some things need to be taken
care of, precisely, that one allows a task to open pidfds for tasks
*it* can see. Why? because you essentially isolate the PID namespace,
so your open() for this namespace suddenly doesn't start opening
things it cannot see through some other namespace (i.e. /proc),
precisely how you cannot open sockets in network namespaces from the
outside, though if you can setns, you should be able to (same with
pidfds). This makes for a nice delegation model, I can essentially put
a task in a namespace with no other tasks, keep pushing pidfds into
the damn thing, and subject to kernel permissions and capabilities it
can yield in the owner userns, signal the said task. You can extend
this to ptrace and other things, by making them accept a pidfd. This
means userspace has to explicitly pass such descriptors around to make
this work, like it does today (and how I can use an open socket and
accept connections whilst living in totally isolated network
namespace).

Besides that, /proc comes with too much stuff, it should be possible
to go from pidfd to /proc/<PID> and do whatever you wish to, but
atleast two things that require varying levels of capabilities of
inspection, the latter of which can be isolated by mount namespaces
even if the process would usially be allowed to peek into it and read
the entire thing, do not end being munged together. I can choose to
pass both, but if /proc dir fds *are* pidfds, you need the entire
complexity of masking and whatnot (which would be usable on its own,
no doubt), making directory descriptors pollable and readable, etc
etc.

>
> Linus