Re: [PATCH v2 1/2] fs, close_range: add flag CLOSE_RANGE_CLOEXEC

From: Christian Brauner
Date: Tue Oct 20 2020 - 10:26:38 EST


On Mon, Oct 19, 2020 at 12:26:53PM +0200, Giuseppe Scrivano wrote:
> When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
> immediately close the files but it sets the close-on-exec bit.
>
> It is useful for e.g. container runtimes that usually install a
> seccomp profile "as late as possible" before execv'ing the container
> process itself. The container runtime could either do:
> 1 2
> - install_seccomp_profile(); - close_range(MIN_FD, MAX_INT, 0);
> - close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile();
> - execve(...); - execve(...);
>
> Both alternative have some disadvantages.
>
> In the first variant the seccomp_profile cannot block the close_range
> syscall, as well as opendir/read/close/... for the fallback on older
> kernels).
> In the second variant, close_range() can be used only on the fds
> that are not going to be needed by the runtime anymore, and it must be
> potentially called multiple times to account for the different ranges
> that must be closed.
>
> Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
> The runtime is able to use the open fds and the seccomp profile could
> block close_range() and the syscalls used for its fallback.

I see, so you want those fds to be closed after exec but still use them
before. Yeah, this is a good use-case. (I proposed this extension quite a
while ago when we started discussing this syscall. Thanks for working
ont this!)

>
> Signed-off-by: Giuseppe Scrivano <gscrivan@xxxxxxxxxx>
> ---
> fs/file.c | 44 ++++++++++++++++++++++++--------
> include/uapi/linux/close_range.h | 3 +++
> 2 files changed, 37 insertions(+), 10 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 21c0893f2f1d..0295d4f7c5ef 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -672,6 +672,35 @@ int __close_fd(struct files_struct *files, unsigned fd)
> }
> EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
>
> +static inline void __range_cloexec(struct files_struct *cur_fds,
> + unsigned int fd, unsigned int max_fd)
> +{
> + struct fdtable *fdt;
> +
> + if (fd > max_fd)
> + return;

Looks like formatting issues here.

> +
> + spin_lock(&cur_fds->file_lock);
> + fdt = files_fdtable(cur_fds);
> + bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1);

I think that this is ok and that there's no reason to make this anymore
complex unless we somehow really see performance issues which I doubt.

If Al is ok with doing it this way and doesn't see any obvious issues
I'll be taking this for some testing and would come back to ack this and
pick it up.

> + spin_unlock(&cur_fds->file_lock);
> +}
> +
> +static inline void __range_close(struct files_struct *cur_fds, unsigned int fd,
> + unsigned int max_fd)
> +{
> + while (fd <= max_fd) {
> + struct file *file;
> +
> + file = pick_file(cur_fds, fd++);
> + if (!file)
> + continue;
> +
> + filp_close(file, cur_fds);
> + cond_resched();
> + }
> +}
> +
> /**
> * __close_range() - Close all file descriptors in a given range.
> *
> @@ -687,7 +716,7 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
> struct task_struct *me = current;
> struct files_struct *cur_fds = me->files, *fds = NULL;
>
> - if (flags & ~CLOSE_RANGE_UNSHARE)
> + if (flags & ~(CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC))
> return -EINVAL;
>
> if (fd > max_fd)
> @@ -725,16 +754,11 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
> }
>
> max_fd = min(max_fd, cur_max);
> - while (fd <= max_fd) {
> - struct file *file;
>
> - file = pick_file(cur_fds, fd++);
> - if (!file)
> - continue;
> -
> - filp_close(file, cur_fds);
> - cond_resched();
> - }
> + if (flags & CLOSE_RANGE_CLOEXEC)
> + __range_cloexec(cur_fds, fd, max_fd);
> + else
> + __range_close(cur_fds, fd, max_fd);
>
> if (fds) {
> /*
> diff --git a/include/uapi/linux/close_range.h b/include/uapi/linux/close_range.h
> index 6928a9fdee3c..2d804281554c 100644
> --- a/include/uapi/linux/close_range.h
> +++ b/include/uapi/linux/close_range.h
> @@ -5,5 +5,8 @@
> /* Unshare the file descriptor table before closing file descriptors. */
> #define CLOSE_RANGE_UNSHARE (1U << 1)
>
> +/* Set the FD_CLOEXEC bit instead of closing the file descriptor. */
> +#define CLOSE_RANGE_CLOEXEC (1U << 2)
> +
> #endif /* _UAPI_LINUX_CLOSE_RANGE_H */
>
> --
> 2.26.2
>
> _______________________________________________
> Containers mailing list
> Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
> https://lists.linuxfoundation.org/mailman/listinfo/containers