Re: [GIT PULL] pidfd updates

From: Christian Brauner
Date: Tue Apr 25 2023 - 08:34:34 EST


On Tue, Apr 25, 2023 at 07:04:27AM +0100, Al Viro wrote:
> On Mon, Apr 24, 2023 at 01:24:24PM -0700, Linus Torvalds wrote:
>
> > But I really think a potentially much nicer model would have been to
> > extend our "get_unused_fd_flags()" model.
> >
> > IOW, we could have instead marked the 'struct file *' in the file
> > descriptor table as being "not ready yet".
> >
> > I wonder how nasty it would have been to have the low bit of the
> > 'struct file *' mark "not ready to be used yet" or something similar.
> > You already can't just access the 'fdt->fd[]' array willy-nilly since
> > we have both normal RCU issues _and_ the somewhat unusual spectre
> > array indexing issues.
> >
> > So looking around with
> >
> > git grep -e '->fd\['
> >
> > we seem to be pretty good about that and it probably wouldn't be too
> > horrid to add a "check low bit isn't set" to the rules.
> >
> > Then pidfd_prepare() could actually install the file pointer in the fd
> > table, just marked as "not ready", and then instead of "fd_install()",
> > you'd have "fd_expose(fd)" or something.
> >
> > I dislike interfaces that return two different things. Particularly
> > ones that are supposed to be there to make things easy for the user. I
> > think your pidfd_prepare() helper fails that "make it easy to use"
> > test.
> >
> > Hmm?
>
> I'm not fond of "return two things" kind of helpers, but I'm even less
> fond of "return fd, file is already there" ones, TBH. {__,}pidfd_prepare()
> users are thankfully very limited in the things they do to the file that
> had been returned, but that really invites abuse.

It's only exposed to kernel core code for good reasons.

>
> The deeper in call chain we mess with descriptor table, the more painful it
> gets, IME.
>
> Speaking of {__,}pidfd_prepare(), I wonder if we wouldn't be better off
> with get_unused_fd_flags() lifted into the callers - all three of those
> (fanotify copy_event_to_user(), copy_process() and pidfd_create()).
> Switch from anon_inode_getfd() to anon_inode_getfile() certainly
> made sense, ditto for combining it with get_pid(), but mixing
> get_unused_fd_flags() into that is a mistake, IMO.

I agree with almost everything here except for get_unused_fd_flags()
being lifted into the callers. That's what I tried to get rid of in
kernel/fork.c.

That pattern invites misunderstandings. Just look at what we did in
kernel/fork.c earlier:

        retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
        [...]
        pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
                                     O_RDWR | O_CLOEXEC);

Both get_unused_fd_flags() and *_getfile() take flag arguments. Sure,
for us this is pretty straightforward since we've seen that code a
million times. For others it's confusing why there are two flag
arguments. Sure, we could use one flags variable but it would still be
weird to look at.

But with this api we also force all users to remember that they need to
clean up the fd and the file, or at least one of them, depending on
where they fail.
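
To make the cleanup burden concrete, here is a small userspace toy model
of the reserve + install pattern. Everything prefixed toy_ is made up
for this sketch and only mimics the shape of get_unused_fd_flags(),
put_unused_fd(), fput() and fd_install(); it is not kernel code and
ignores locking and refcounting entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_NR_FDS 4

struct toy_file { int unused; };

static bool toy_fd_used[TOY_NR_FDS];           /* reserved or installed */
static struct toy_file *toy_fdt[TOY_NR_FDS];   /* installed files */
static int toy_files_alive;                    /* leak detector */

/* Reserve the lowest free descriptor number without installing a file. */
static int toy_get_unused_fd(void)
{
	for (int fd = 0; fd < TOY_NR_FDS; fd++) {
		if (!toy_fd_used[fd]) {
			toy_fd_used[fd] = true;
			return fd;
		}
	}
	return -1;
}

/* Give back a descriptor that was reserved but never installed. */
static void toy_put_unused_fd(int fd)
{
	assert(fd >= 0 && fd < TOY_NR_FDS && toy_fd_used[fd] && !toy_fdt[fd]);
	toy_fd_used[fd] = false;
}

/* Create a file; 'fail' simulates an allocation failure. */
static struct toy_file *toy_getfile(bool fail)
{
	struct toy_file *f = fail ? NULL : calloc(1, sizeof(*f));

	if (f)
		toy_files_alive++;
	return f;
}

static void toy_fput(struct toy_file *f)
{
	free(f);
	toy_files_alive--;
}

enum toy_fail { TOY_OK, TOY_FAIL_GETFILE, TOY_FAIL_LATE };

/* The caller-side pattern: what needs undoing depends on where we fail. */
static int toy_create(enum toy_fail where)
{
	struct toy_file *f;
	int fd = toy_get_unused_fd();

	if (fd < 0)
		return -1;

	f = toy_getfile(where == TOY_FAIL_GETFILE);
	if (!f) {
		toy_put_unused_fd(fd);	/* only the fd exists so far */
		return -1;
	}

	if (where == TOY_FAIL_LATE) {
		toy_fput(f);		/* now both need undoing */
		toy_put_unused_fd(fd);
		return -1;
	}

	toy_fdt[fd] = f;		/* the "fd_install()" step */
	return fd;
}
```

Each failure point has its own undo list, and every caller of the
two-step api has to get that right by hand.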

But conceptually the fd and the file belong together. After all it's the
file we're about to make that reserved fd refer to.

But I'm not here to lambast this api. It works nicely overall and that
reserve + install model is pretty elegant. But see my boring
compromise proposal below...

>
> As for your suggestion... let's see what it leads to.
>
> Suppose we add such entries (reserved, hold a reference to file,
> marked "not yet available" in the LSB). From the current tree POV those
> would be equivalent to descriptor already reserved, but fd_install() not
> done. So behaviour of existing primitives should be the same as for this
> situation, except for fd_install() and put_unused_fd().
>
> * pick_file(), __fget_files_rcu(), iterate_fd(), files_lookup_fd_raw(),
> loop in dup_fd(), io_close() - treat odd pointers as NULL.
> * close_files() should, AFAICS, treat an odd pointer as "should never
> happen" (and that xchg() in there needs to go anyway - it's pointless, since
> we are freeing the array immediately afterwards).
> * do_close_on_exec() should probably treat them as "should never happen".
> * do_dup2() - odd value should be treated as -EBUSY.
>
> The interesting part, of course, is how to legitimize (or dispose of) such
> a beast. The former is your "fd_expose()" - parallel to fd_install(),
> AFAICS. The latter... another primitive that would
> grab ->files_lock
> pick_file() variant that *expects* an odd value
> drop ->files_lock
> clear LSB and pass to fput().
>
> It's doable, but AFAICS doesn't make callers all that happier...
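
For illustration only, the low-bit scheme described above can be
modelled in userspace roughly as follows. All toy_ names are invented
for this sketch and are not kernel interfaces; locking, RCU and
refcounting are left out, and it assumes struct pointers are at least
2-byte aligned so that the low bit is free to use as a tag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_FDS	8
#define TOY_NOT_READY	((uintptr_t)1)

struct toy_file { int refcount; };

static struct toy_file *toy_fdt[TOY_NR_FDS];

/* Park the file in the first free slot with the low bit set ("not ready"). */
static int toy_fd_reserve(struct toy_file *f)
{
	assert(((uintptr_t)f & TOY_NOT_READY) == 0);

	for (int fd = 0; fd < TOY_NR_FDS; fd++) {
		if (!toy_fdt[fd]) {
			toy_fdt[fd] = (struct toy_file *)((uintptr_t)f | TOY_NOT_READY);
			return fd;
		}
	}
	return -1;
}

/* Ordinary lookup: an odd pointer is treated exactly like NULL. */
static struct toy_file *toy_fd_lookup(int fd)
{
	uintptr_t v;

	if (fd < 0 || fd >= TOY_NR_FDS)
		return NULL;
	v = (uintptr_t)toy_fdt[fd];
	return (v & TOY_NOT_READY) ? NULL : (struct toy_file *)v;
}

/* "fd_expose()": legitimize a parked entry by clearing the low bit. */
static int toy_fd_expose(int fd)
{
	uintptr_t v;

	if (fd < 0 || fd >= TOY_NR_FDS)
		return -1;
	v = (uintptr_t)toy_fdt[fd];
	if (!(v & TOY_NOT_READY))
		return -1;	/* empty or already exposed */
	toy_fdt[fd] = (struct toy_file *)(v & ~TOY_NOT_READY);
	return 0;
}

/*
 * Disposal: expects an odd value, empties the slot and hands the untagged
 * file back so the caller can drop its reference.
 */
static struct toy_file *toy_fd_abort(int fd)
{
	uintptr_t v;

	if (fd < 0 || fd >= TOY_NR_FDS)
		return NULL;
	v = (uintptr_t)toy_fdt[fd];
	if (!(v & TOY_NOT_READY))
		return NULL;
	toy_fdt[fd] = NULL;
	return (struct toy_file *)(v & ~TOY_NOT_READY);
}
```

Even in this toy every reader of the table has to know about the tag
bit, which is the part that would have to be audited across all the
real fdtable users listed above.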

In the context of using pidfd for some networking stuff we had a similar
discussion because it's the same problem, only worse. Think of a
scenario where you need to allocate the fd and file early on and only
get to install them multiple function calls later. In that case you
need to drag that fd and file around everywhere so you can eventually
fd_install() it... Sure, we could do this "semi-install the fd into the
fdtable" thing but I think that's too much subtlety and the fdtable is
traumatic enough as it is.

But what about something like the following where we just expose a very
barebones api that allows and encourages callers to bundle fd and file?

Hell, you could even extend that proposal below to wrap the
put_user()...

        struct fd_file {
                struct file *file;
                int fd;
                int __user *fd_user;
        };

and

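        static inline int fd_publish_user(struct fd_file *fdf)
        {
                int ret = 0;

                /* Report the fd to userspace first, if requested. */
                if (fdf->fd_user)
                        ret = put_user(fdf->fd, fdf->fd_user);

                /* Only make the file visible if userspace got the fd. */
                if (ret)
                        fd_discard(fdf);
                else
                        fd_publish(fdf);

                return ret;
        }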

which is also a pretty common pattern...
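
A userspace toy model of that bundling idea might look like the sketch
below. It is only an illustration under stated assumptions: all toy_
names are hypothetical, toy_put_user() and the toy_fault switch stand in
for put_user() and a faulting user pointer, and the fd table and
refcount are plain arrays and ints without any locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_FDS 4

struct toy_file { int refs; };

static struct toy_file *toy_fdt[TOY_NR_FDS];	/* published files */
static bool toy_fd_reserved[TOY_NR_FDS];	/* reserved or published */
static bool toy_fault;				/* make toy_put_user() fail */

/* fd and file travel together; fd_user is optional. */
struct toy_fd_file {
	struct toy_file *file;
	int fd;
	int *fd_user;	/* stand-in for 'int __user *' */
};

/* Reserve an fd and take a reference on the file in one step. */
static int toy_fd_prepare(struct toy_fd_file *fdf, struct toy_file *file,
			  int *fd_user)
{
	assert(file != NULL);

	for (int fd = 0; fd < TOY_NR_FDS; fd++) {
		if (!toy_fd_reserved[fd]) {
			toy_fd_reserved[fd] = true;
			file->refs++;
			fdf->file = file;
			fdf->fd = fd;
			fdf->fd_user = fd_user;
			return 0;
		}
	}
	return -1;
}

/* Success path: the table takes over the bundle's file reference. */
static void toy_fd_publish(struct toy_fd_file *fdf)
{
	toy_fdt[fdf->fd] = fdf->file;
	fdf->file = NULL;
}

/* Failure path: undo both halves, whatever state the caller was in. */
static void toy_fd_discard(struct toy_fd_file *fdf)
{
	toy_fd_reserved[fdf->fd] = false;
	fdf->file->refs--;
	fdf->file = NULL;
	fdf->fd = -1;
}

static int toy_put_user(int val, int *dst)
{
	if (toy_fault)
		return -14;	/* -EFAULT */
	*dst = val;
	return 0;
}

static int toy_fd_publish_user(struct toy_fd_file *fdf)
{
	int ret = 0;

	if (fdf->fd_user)
		ret = toy_put_user(fdf->fd, fdf->fd_user);

	if (ret)
		toy_fd_discard(fdf);
	else
		toy_fd_publish(fdf);

	return ret;
}
```

The caller only ever holds one object and has exactly two ways to end
its life, publish or discard, so there is no per-failure-point undo list
to remember.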