Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas

From: Mike Rapoport
Date: Sun May 16 2021 - 03:31:46 EST


On Fri, May 14, 2021 at 11:25:43AM +0200, David Hildenbrand wrote:
> > #ifdef CONFIG_IA64
> > # include <linux/efi.h>
> > @@ -64,6 +65,9 @@ static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
> > #ifdef CONFIG_STRICT_DEVMEM
> > static inline int page_is_allowed(unsigned long pfn)
> > {
> > + if (pfn_valid(pfn) && page_is_secretmem(pfn_to_page(pfn)))
> > + return 0;
> > +
>
> 1. The memmap might be garbage. You should use pfn_to_online_page() instead.
>
> page = pfn_to_online_page(pfn);
> if (page && page_is_secretmem(page))
> return 0;
>
> 2. What about !CONFIG_STRICT_DEVMEM?
>
> 3. Someone could map physical memory before a secretmem page gets allocated
> and read the content after it got allocated and gets used. If someone would
> gain root privileges and would wait for the target application to (re)start,
> that could be problematic.
>
>
> I do wonder if enforcing CONFIG_STRICT_DEVMEM would be cleaner.
> devmem_is_allowed() should disallow access to any system ram, and thereby,
> any possible secretmem pages, avoiding this check completely.

I've been thinking a bit more about the /dev/mem case, and it seems I was
too quick on the trigger in adding that test for page_is_secretmem().

When CONFIG_STRICT_DEVMEM=y, access to RAM is forbidden anyway, and if the
user built a kernel with CONFIG_STRICT_DEVMEM=n, all of physical memory is
accessible by root anyway. In both cases the extra check changes nothing.

We might want to default STRICT_DEVMEM to "y" for all architectures and not
only arm64, ppc and x86, but this is not strictly related to this series.

> [...]
>
> > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > new file mode 100644
> > index 000000000000..1ae50089adf1
> > --- /dev/null
> > +++ b/mm/secretmem.c
> > @@ -0,0 +1,239 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright IBM Corporation, 2021
> > + *
> > + * Author: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/fs.h>
> > +#include <linux/swap.h>
> > +#include <linux/mount.h>
> > +#include <linux/memfd.h>
> > +#include <linux/bitops.h>
> > +#include <linux/printk.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/syscalls.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/secretmem.h>
> > +#include <linux/set_memory.h>
> > +#include <linux/sched/signal.h>
> > +
> > +#include <uapi/linux/magic.h>
> > +
> > +#include <asm/tlbflush.h>
> > +
> > +#include "internal.h"
> > +
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) "secretmem: " fmt
> > +
> > +/*
> > + * Define mode and flag masks to allow validation of the system call
> > + * parameters.
> > + */
> > +#define SECRETMEM_MODE_MASK (0x0)
> > +#define SECRETMEM_FLAGS_MASK SECRETMEM_MODE_MASK
> > +
> > +static bool secretmem_enable __ro_after_init;
> > +module_param_named(enable, secretmem_enable, bool, 0400);
> > +MODULE_PARM_DESC(secretmem_enable,
> > + "Enable secretmem and memfd_secret(2) system call");
> > +
> > +static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > +{
> > + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > + struct inode *inode = file_inode(vmf->vma->vm_file);
> > + pgoff_t offset = vmf->pgoff;
> > + gfp_t gfp = vmf->gfp_mask;
> > + unsigned long addr;
> > + struct page *page;
> > + int err;
> > +
> > + if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > + return vmf_error(-EINVAL);
> > +
> > +retry:
> > + page = find_lock_page(mapping, offset);
> > + if (!page) {
> > + page = alloc_page(gfp | __GFP_ZERO);
>
> We'll end up here with gfp == GFP_HIGHUSER (via the mapping below), correct?

Yes

> > + if (!page)
> > + return VM_FAULT_OOM;
> > +
> > + err = set_direct_map_invalid_noflush(page, 1);
> > + if (err) {
> > + put_page(page);
> > + return vmf_error(err);
>
> Would we want to translate that to a proper VM_FAULT_..., which would most
> probably be VM_FAULT_OOM when we fail to allocate a pagetable?

That's what vmf_error() does: it translates -ESOMETHING to VM_FAULT_XYZ.
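
As a rough userspace model of that translation (a sketch, not the kernel
source): -ENOMEM becomes VM_FAULT_OOM and other errors fall back to
VM_FAULT_SIGBUS. The name vmf_error_model() and the typedef and constant
definitions below are stand-ins introduced for this illustration; the real
definitions live in the kernel headers and may handle further error codes
depending on the kernel version.

```c
#include <errno.h>

/*
 * Stand-in definitions so the sketch builds in userspace. The kernel
 * provides its own vm_fault_t and VM_FAULT_* codes.
 */
typedef unsigned int vm_fault_t;
#define VM_FAULT_OOM    0x0001u
#define VM_FAULT_SIGBUS 0x0002u

/* Hypothetical model of vmf_error(): map a negative errno to a fault code. */
static inline vm_fault_t vmf_error_model(int err)
{
	if (err == -ENOMEM)
		return VM_FAULT_OOM;	/* e.g. failed page table allocation */
	return VM_FAULT_SIGBUS;		/* any other error in this model */
}
```

Under this model, a failure to allocate a page table while splitting a large
direct-map page would surface as VM_FAULT_OOM, while the -EINVAL used for a
fault beyond i_size would surface as VM_FAULT_SIGBUS.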

> > + }
> > +
> > + __SetPageUptodate(page);
> > + err = add_to_page_cache_lru(page, mapping, offset, gfp);
> > + if (unlikely(err)) {
> > + put_page(page);
> > + /*
> > + * If a split of large page was required, it
> > + * already happened when we marked the page invalid
> > + * which guarantees that this call won't fail
> > + */
> > + set_direct_map_default_noflush(page, 1);
> > + if (err == -EEXIST)
> > + goto retry;
> > +
> > + return vmf_error(err);
> > + }
> > +
> > + addr = (unsigned long)page_address(page);
> > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>
> Hmm, to me it feels like something like that belongs into the
> set_direct_map_invalid_*() calls? Otherwise it's just very easy to mess up
> ...

AFAIU the set_direct_map_*_noflush() helpers deliberately do not flush the
TLB and leave that to the caller, so that multiple updates of the direct map
can be gathered and followed by a single TLB flush.
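
The batching pattern can be illustrated with a small userspace model. This
is only a sketch under stated assumptions: struct dm_model and the model_*()
functions are hypothetical stand-ins that count operations instead of
touching page tables or the TLB, and they are not kernel APIs.

```c
#include <stddef.h>

/* Hypothetical counters standing in for real page-table and TLB work. */
struct dm_model {
	unsigned int invalidated;	/* direct-map entries marked invalid */
	unsigned int tlb_flushes;	/* TLB flushes issued */
};

/* Models a *_noflush() helper: update one entry, leave the TLB alone. */
static inline int model_invalid_noflush(struct dm_model *m)
{
	m->invalidated++;
	return 0;
}

/* Models the caller's flush: one flush covers all preceding updates. */
static inline void model_flush(struct dm_model *m)
{
	m->tlb_flushes++;
}

/* The caller gathers n updates and pays for a single flush at the end. */
static inline int model_invalidate_batch(struct dm_model *m, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = model_invalid_noflush(m);

		if (err)
			return err;
	}
	model_flush(m);
	return 0;
}
```

In this model, invalidating eight entries costs one flush rather than eight,
which is the saving that would be lost if every helper call flushed on its
own.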

> I'm certainly not a filesystem guy. Nothing else jumped at me.
>
>
> To me, the overall approach makes sense and I consider it an improved
> mlock() mechanism for storing secrets, although I'd love to have some more
> information in the log regarding access via root, namely that there are
> still fancy ways to read secretmem memory once root via
>
> 1. warm reboot attacks especially in VMs (e.g., modifying the cmdline)
> 2. kexec-style reboot attacks (e.g., modifying the cmdline)
> 3. kdump attacks
> 4. kdb most probably
> 5. "letting the process read the memory for us" via Kees if that still
> applies
> 6. ... most probably something else
>
> Just to make people aware that there are still some things to be sorted out
> when we fully want to protect against privilege escalations.
>
> (maybe this information is buried in the cover letter already, where it
> usually gets lost)

I believe that it belongs in the man page more than in the changelog, so
that the *users* are aware of secretmem limitations.

--
Sincerely yours,
Mike.