Re: [PATCH v1 1/2] pid: add pidfd_open()

From: Christian Brauner
Date: Thu May 16 2019 - 10:59:21 EST


On Thu, May 16, 2019 at 04:27:00PM +0200, Oleg Nesterov wrote:
> On 05/16, Christian Brauner wrote:
> >
> > With the introduction of pidfds through CLONE_PIDFD it is possible to
> > create pidfds at process creation time.
>
> Now I am wondering why do we need CLONE_PIDFD, you can just do
>
> pid = fork();
> pidfd_open(pid);

CLONE_PIDFD eliminates the race at the source and lets us make one
syscall instead of two. That'll obviously matter even more when we
enable CLONE_THREAD | CLONE_PIDFD.
pidfd_open() is really just a necessity for anyone who does non-parent
process management, e.g. LMK or service managers.
I would also like to reserve the ability at some point (e.g. with cloneX
or something similar) to specify additional flags at process creation
time that modify pidfd behavior.
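
For illustration, here is a minimal userspace sketch of the
fork()-then-pidfd_open() pattern quoted above. It assumes the syscall
lands as proposed in this patch, that it is invoked through raw
syscall(2), and that it gets number 434 on the common architectures.
The helper names pidfd_open_raw(), fork_then_pidfd_open() and
kill_and_reap() are made up for this example and are not part of the
patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: pidfd_open is syscall 434 if the libc headers lack it. */
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif

/* Hypothetical thin wrapper around the raw syscall. */
static int pidfd_open_raw(pid_t pid, unsigned int flags)
{
	return (int)syscall(__NR_pidfd_open, pid, flags);
}

/*
 * Fork a child that just sleeps, then open a pidfd for it.
 * Returns the pidfd, or -1 with errno set. *child receives the pid
 * whenever fork() succeeded, so the caller can always reap it.
 */
static int fork_then_pidfd_open(pid_t *child)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		pause();	/* child: wait until the parent kills us */
		_exit(0);
	}
	*child = pid;
	/*
	 * The parent has not reaped the child yet, so (assuming SIGCHLD
	 * is not being ignored) the pid cannot have been recycled here.
	 * A non-parent caller has no such guarantee.
	 */
	return pidfd_open_raw(pid, 0);
}

/* Terminate and reap the child created above. */
static void kill_and_reap(pid_t child)
{
	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
}
```

A caller would do fd = fork_then_pidfd_open(&child), use the fd, then
close() it and call kill_and_reap(child). On a kernel without the
syscall the open fails with ENOSYS. This only shows the parent-side
pattern; it still costs two syscalls where CLONE_PIDFD needs one.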

>
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > +	int fd, ret;
> > +	struct pid *p;
> > +	struct task_struct *tsk;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	if (pid <= 0)
> > +		return -EINVAL;
> > +
> > +	p = find_get_pid(pid);
> > +	if (!p)
> > +		return -ESRCH;
> > +
> > +	ret = 0;
> > +	rcu_read_lock();
> > +	/*
> > +	 * If this returns non-NULL the pid was used as a thread-group
> > +	 * leader. Note, we race with exec here: If it changes the
> > +	 * thread-group leader we might return the old leader.
> > +	 */
> > +	tsk = pid_task(p, PIDTYPE_TGID);
> > +	if (!tsk)
> > +		ret = -ESRCH;
> > +	rcu_read_unlock();
> > +
> > +	fd = ret ?: pidfd_create(p);
> > +	put_pid(p);
> > +	return fd;
> > +}
>
> Looks correct, feel free to add Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx>
>
> But why do we need task_struct *tsk?
>
> 	rcu_read_lock();
> 	if (!pid_task(p, PIDTYPE_TGID))
> 		ret = -ESRCH;
> 	rcu_read_unlock();

Sure, that's simpler. I'll rework and add your Reviewed-by.

>
> and in fact we do not even need rcu_read_lock(), we could do
>
> 	// shut up rcu_dereference_check()
> 	rcu_lock_acquire(&rcu_lock_map);
> 	if (!pid_task(p, PIDTYPE_TGID))
> 		ret = -ESRCH;
> 	rcu_lock_release(&rcu_lock_map);
>
> Well... I won't insist, but the comment about the race with exec looks a bit
> confusing to me. It is true, but we do not care at all, we are not going to
> use the task_struct returned by pid_task().

Yeah, I can remove it.

Thanks!
Christian