Re: [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC

From: Jeff Xu
Date: Tue Jul 18 2023 - 15:04:39 EST


Hi David

Thanks email and patch for discussion.

On Fri, Jul 14, 2023 at 4:48 AM David Rheinsberg <david@xxxxxxxxxxxx> wrote:
>
> Add a new flag for memfd_create() called MFD_NOEXEC, which has the
> opposite effect as MFD_EXEC (i.e., it strips the executable bits from
> the inode file mode).
>
I previously thought about having the symmetric flags, such as
MFD_NOEXEC/MFD_EXEC/MFD_NOEXEC_SEAL/MFD_EXEC_SEAL, but decided against
it. The app shall decide beforehand what the memfd is created for, if
it is no-executable, then it should be sealed, such that it can't be
chmod to enable "X" bit.

> The default mode for memfd_create() has historically been to use 0777 as
> file modes. The new `MFD_EXEC` flag has now made this explicit, paving
> the way to reduce the default to 0666 and thus dropping the executable
> bits for security reasons. Additionally, the `MFD_NOEXEC_SEAL` flag has
> been added which allows this without changing the default right now.
>
> Unfortunately, `MFD_NOEXEC_SEAL` enforces `MFD_ALLOW_SEALING` and
> `F_SEAL_EXEC`, with no alternatives available. This leads to multiple
> issues:
>
> * Applications that do not want to allow sealing either have to use
> `MFD_EXEC` (which we don't want, unless they actually need the
> executable bits), or they must add `F_SEAL_SEAL` immediately on
> return of memfd_create(2) with `MFD_NOEXEC_SEAL`, since this
> implicitly enables sealing.
>
> Note that we explicitly added `MFD_ALLOW_SEALING` when creating
> memfd_create(2), because enabling seals on existing users of shmem
> without them knowing about it can easily introduce DoS scenarios.

The application that doesn't want MFD_NOEXEC_SEAL can use MFD_EXEC,
the kernel won't add MFD_ALLOW_SEALING implicitly. MFD_EXEC makes the
kernel behave the same as before, this is also why sysctl
vm.memfd_noexec=0 can work seamlessly.

> It
> is unclear why `MFD_NOEXEC_SEAL` was designed to enable seals, and
> this is especially dangerous with `MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL`
> set via sysctl, since it will silently enable sealing on every memfd
> created.
>
Without sealing, chmod(2) can modify the mfd to be executable, that is
the consideration that MFD_NOEXEC is not provided as an option.
Indeed, current design is "biased" towards promoting MFD_NOEXEC_SEAL
as the safest approach, and try to avoid the pitfall that dev
accidently uses "MFD_NOEXEC" without realizing it can still be
chmod().


> * Applications that do not want `MFD_EXEC`, but rely on
> `F_GET_SEALS` to *NOT* return `F_SEAL_EXEC` have no way of achieving
> this other than using `MFD_EXEC` and clearing the executable bits
> manually via fchmod(2). Using `MFD_NOEXEC_SEAL` will set
> `F_SEAL_EXEC` and thus break existing code that hard-codes the
> seal-set.
>
> This is already an issue when sending log-memfds to systemd-journald
> today, which verifies the seal-set of the received FD and fails if
> unknown seals are set. Hence, you have to use `MFD_EXEC` when
> creating memfds for this purpose, even though you really do not need
> the executable bits set.
>
> * Applications that want to enable the executable bit later on,
> currently have no way to create the memfd without it. They have to
> clear the bits immediately after creating it via fchmod(2), or just
> leave them set.
>
Is it OK to do what you want in two steps ? What is the concern there ? i.e.
memfd_create(MFD_EXEC), then chmod to remove the "X" bit.

I imagine this is probably already what the application does for
creating no-executable mfd before my patch, i.e.:
memfd_create(), then chmod() to remove "X" to remove "X" bit.

Thanks!
-Jeff



-Jeff

> By adding MFD_NOEXEC, user-space is now able to clear the executable
> bits on all memfds immediately, clearly stating their intention, without
> requiring fixups after creating the memfd. In particular, it is highly
> useful for existing use-cases that cannot allow new seals to appear on
> memfds.
>
> Signed-off-by: David Rheinsberg <david@xxxxxxxxxxxx>
> ---
> include/linux/pid_namespace.h | 4 ++--
> include/uapi/linux/memfd.h | 1 +
> mm/memfd.c | 29 ++++++++++++++---------------
> 3 files changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index c758809d5bcf..02f8acf94512 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -19,9 +19,9 @@ struct fs_pin;
> #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> /*
> * sysctl for vm.memfd_noexec
> - * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
> + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC
> * acts like MFD_EXEC was set.
> - * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
> + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC
> * acts like MFD_NOEXEC_SEAL was set.
> * 2: memfd_create() without MFD_NOEXEC_SEAL will be
> * rejected.
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 273a4e15dfcf..b9e13bd65817 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -12,6 +12,7 @@
> #define MFD_NOEXEC_SEAL 0x0008U
> /* executable */
> #define MFD_EXEC 0x0010U
> +#define MFD_NOEXEC 0x0020U
>
> /*
> * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/memfd.c b/mm/memfd.c
> index e763e76f1106..2afe49368fc5 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -266,7 +266,9 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
> #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> +#define MFD_ALL_FLAGS \
> + (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> + MFD_NOEXEC_SEAL | MFD_EXEC | MFD_NOEXEC)
>
> SYSCALL_DEFINE2(memfd_create,
> const char __user *, uname,
> @@ -289,11 +291,13 @@ SYSCALL_DEFINE2(memfd_create,
> return -EINVAL;
> }
>
> - /* Invalid if both EXEC and NOEXEC_SEAL are set.*/
> - if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
> + if (flags & MFD_NOEXEC_SEAL)
> + flags |= MFD_ALLOW_SEALING | MFD_NOEXEC;
> +
> + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC))
> return -EINVAL;
>
> - if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> + if (!(flags & (MFD_EXEC | MFD_NOEXEC))) {
> #ifdef CONFIG_SYSCTL
> int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
> struct pid_namespace *ns;
> @@ -366,20 +370,15 @@ SYSCALL_DEFINE2(memfd_create,
> file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> file->f_flags |= O_LARGEFILE;
>
> - if (flags & MFD_NOEXEC_SEAL) {
> - struct inode *inode = file_inode(file);
> + if (flags & MFD_NOEXEC)
> + file_inode(file)->i_mode &= ~0111;
>
> - inode->i_mode &= ~0111;
> - file_seals = memfd_file_seals_ptr(file);
> - if (file_seals) {
> + file_seals = memfd_file_seals_ptr(file);
> + if (file_seals) {
> + if (flags & MFD_ALLOW_SEALING)
> *file_seals &= ~F_SEAL_SEAL;
> + if (flags & MFD_NOEXEC_SEAL)
> *file_seals |= F_SEAL_EXEC;
> - }
> - } else if (flags & MFD_ALLOW_SEALING) {
> - /* MFD_EXEC and MFD_ALLOW_SEALING are set */
> - file_seals = memfd_file_seals_ptr(file);
> - if (file_seals)
> - *file_seals &= ~F_SEAL_SEAL;
> }
>
> fd_install(fd, file);
> --
> 2.41.0
>