Re: [GIT PULL] pidfd updates

From: Al Viro
Date: Tue Apr 25 2023 - 09:54:38 EST


On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:

> It is rife with misunderstandings just looking at what we did in
> kernel/fork.c earlier:
>
> retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
> [...]
> pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
> O_RDWR | O_CLOEXEC);
>
> seeing where both get_unused_fd_flags() and both *_getfile() take flag
> arguments. Sure, for us this is pretty straightforward since we've seen
> that code a million times. For others it's confusing why there's two
> flag arguments. Sure, we could use one flags argument but it still be
> weird to look at.

First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and
anon_inode_getfile() - about O_CLOEXEC. Duplicating the expression in
places like that is cargo-culting.

> But with this api we also force all users to remember that they need to
> cleanup the fd and the file - but definitely one - depending on where
> they fail.
>
> But conceptually the fd and the file belong together. After all it's the
> file we're about to make that reserved fd refer to.

Why? The common pattern is
* reserve the descriptor
* do some work that might fail
* get struct file set up (which also might fail)
* do more work (possibly including copyout, etc.)
* install file into descriptor table

We want to reserve the descriptor early, since it's about the easiest
thing to undo. Why bother doing it next to file setup? That can be
pretty deep in call chain and doing it that way comes with headache
about passing the damn thing around. And cleanup rules with your
variant end up being more complicated.

Keep the manipulations of descriptor table as close to the surface
as possible. The real work is with file references; descriptors
belong in userland and passing them around kernel-side is asking
for headache.