Re: [GIT PULL] pidfd updates

From: Christian Brauner
Date: Tue Apr 25 2023 - 10:37:15 EST


On Tue, Apr 25, 2023 at 02:54:29PM +0100, Al Viro wrote:
> On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:
>
> > It is rife with misunderstandings just looking at what we did in
> > kernel/fork.c earlier:
> >
> > retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
> > [...]
> > pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
> > O_RDWR | O_CLOEXEC);
> >
> > seeing where both get_unused_fd_flags() and both *_getfile() take flag
> > arguments. Sure, for us this is pretty straightforward since we've seen
> > that code a million times. For others it's confusing why there's two
> > flag arguments. Sure, we could use one flags argument but it still be
> > weird to look at.
>
> First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and
> anon_inode_getfile() - about O_CLOEXEC. Duplicating the expression in
> places like that is cargo-culting.

I distinctly remember us having that conversation about how to do this
nicely back then and fwiw this is your patch... ;)
6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups")

So sure, that was my point: people are confused why there's two flag
arguments and what exactly has to go into them and just copy-paste
whatever other users already have.

>
> > But with this api we also force all users to remember that they need to
> > cleanup the fd and the file - but definitely one - depending on where
> > they fail.
> >
> > But conceptually the fd and the file belong together. After all it's the
> > file we're about to make that reserved fd refer to.
>
> Why? The common pattern is
> * reserve the descriptor
> * do some work that might fail
> * get struct file set up (which also might fail)
> * do more work (possibly including copyout, etc.)
> * install file into descriptor table
>
> We want to reserve the descriptor early, since it's about the easiest
> thing to undo. Why bother doing it next to file setup? That can be
> pretty deep in call chain and doing it that way comes with headache
> about passing the damn thing around. And cleanup rules with your
> variant end up being more complicated.
>
> Keep the manipulations of descriptor table as close to the surface
> as possible. The real work is with file references; descriptors
> belong in userland and passing them around kernel-side is asking
> for headache.

Imho, there's different use-cases.

There's definitely one where people allocate a file descriptor early on
and then sometimes maybe way later allocate a struct file and install
it. And that's where exposing and using get_unused_fd_flags() and
fd_install() is great and works fine.

But there's also users that do the reserve an fd and allocate a file at
the same time or receive both at the same time as the seccomp notifier,
scm creds, or pidfds. The receive_fd* stuff is basically a version of
the sketch I did without the ability to simply use a struct and not
having to open-code everything multiple times.

I never expected excitement around this as it was difficult enough to
get that receive_fd* thing going so I'm not arguing for this to be the
ultima ratio...