Re: [PATCH v10 1/4] random: add vgetrandom_alloc() syscall

From: Jason A. Donenfeld
Date: Wed Nov 30 2022 - 10:42:24 EST


Hi Florian,

On Wed, Nov 30, 2022 at 11:51:59AM +0100, Florian Weimer wrote:
> * Jason A. Donenfeld:
>
> > +#ifdef CONFIG_VGETRANDOM_ALLOC_SYSCALL
> > +/**
> > + * vgetrandom_alloc - allocate opaque states for use with vDSO getrandom().
> > + *
> > + * @num: on input, a pointer to a suggested hint of how many states to
> > + * allocate, and on output the number of states actually allocated.
>
> Should userspace call this system call again if it needs more states?
> The interface description doesn't make this clear.

Yes. And indeed that's what Adhemerval's patch does.

>
> > + * @size_per_each: the size of each state allocated, so that the caller can
> > + * split up the returned allocation into individual states.
> > + *
> > + * @flags: currently always zero.
> > + *
> > + * The getrandom() vDSO function in userspace requires an opaque state, which
> > + * this function allocates by mapping a certain number of special pages into
> > + * the calling process. It takes a hint as to the number of opaque states
> > + * desired, and provides the caller with the number of opaque states actually
> > + * allocated, the size of each one in bytes, and the address of the first
> > + * state.
> > +
> > + * Returns a pointer to the first state in the allocation.
> > + *
> > + */
>
> How do we deallocate this memory? Must it remain permanently allocated?

It can be deallocated with munmap.

> Can userspace use the memory for something else if it's not passed to
> getrandom?

I suspect the documentation answer here is, "no", even if technically it
might happen to work on this kernel or that kernel. I suppose this could
even be quasi-enforced by xoring the top bits with some vdso
compile-time constant, so you can't rely on being able to dereference
it yourself.

> The separate system call strongly suggests that the
> allocation is completely owned by the kernel, but there isn't
> documentation here how the allocation life-cycle is supposed to look
> like. In particular, it is not clear if vgetrandom_alloc or getrandom
> could retain a reference to the allocation in a future implementation of
> these interfaces.
>
> Some users might want to zap the memory for extra hardening after use,
> and it's not clear if that's allowed, either.

I don't think zapping that memory is supported, or even a sensible thing
to do. In the first place, I don't think we should suggest that the user
can dereference that pointer, at all. In that sense, maybe it's best to
call it a "handle" or something similar (a "HANDLE"! a "HWND"? a "HRNG"?
just kidding). In the second place, the fast erasure aspect of this
means that such hardening would have no effect -- the key is overwritten
after using for forward secrecy, anyway, and batched bytes are zeroed.
(There is a corner case that might make it interesting to wipe in the
parent, not just the child, on fork, but that's sort of a separate
matter and would ideally be handled by kernel space anyway.)

> If there's no registration of the allocation, it's not clear why we need
> a separate system call for this. From a documentation perspective, it
> may be easier to describe proper use of the getrandom vDSO call if
> ownership resides with userspace.

No, absolutely not, for the multiple reasons already listed in the
commit messages and cover letter and previous emails. But you seem
aware of this:

> But it will constrain future
> evolution of the implementation because you can't add registration
> (retaining a reference to the passed-in area in getrandom) after the
> fact. But I'm not sure if this is possible with the current interface,
> either. Userspace has to make some assumptions about the life-cycle to
> avoid a memory leak on thread exit.

It sounds like this is sort of a different angle on Rasmus' earlier
comment about how munmap leaks implementation details. Maybe there's
something to that after all? Or not? I see two approaches:

1) Keep munmap as the allocation function. If later on we do fancy
registration and in-kernel state tracking, or add fancy protection
flags, or whatever else, munmap should be able to identify these
pages and carry out whatever special treatment is necessary.

2) Convert vgetrandom_alloc() into a clone3-style syscall, as Christian
suggested earlier, which might allow for a bit more overloading
capability. That would be a struct that looks like:

struct vgetrandom_alloc_args {
__aligned_u64 flags;
__aligned_u64 states;
__aligned_u64 num;
__aligned_u64 size_of_each;
}

- If flags is VGRA_ALLOCATE, states and size_of_each must be zero on
input, while num is the hint, as is the case now. On output, states,
size_of_each, and num are filled in.

- If flags is VGRA_DEALLOCATE, states, size_of_each, and num must be as
they were originally, and then it deallocates.

I suppose (2) would alleviate your concerns entirely, without future
uncertainty over what it'd be like to add special cases to munmap(). And
it'd add a bit more future proofing to the syscall, depending on what we
do.

So maybe I'm warming up to that approach a bit.

> > + num_states = clamp_t(size_t, num_hint, 1, (SIZE_MAX & PAGE_MASK) / state_size);
> > + alloc_size = PAGE_ALIGN(num_states * state_size);
>
> Doesn't this waste space for one state if state_size happens to be a
> power of 2? Why do this SIZE_MAX & PAGE_MASK thing at all? Shouldn't
> it be PAGE_SIZE / state_size?

The first line is a clamp. That fixes num_hint between 1 and the largest
number that when multiplied and rounded up won't overflow.

So, if state_size is a power of two, let's say 256, and there's only one
state, here's what that looks like:

num_states = clamp(1, 1, (0xffffffff & (~(4096 - 1))) / 256 = 16777200) = 1
alloc_size = PAGE_ALIGN(1 * 256) = 4096

So that seems like it's working as intended, right? Or if not, maybe
it'd help to write out the digits you're concerned about?

> > + if (put_user(alloc_size / state_size, num) || put_user(state_size, size_per_each))
> > + return -EFAULT;
> > +
> > + pages_addr = vm_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE,
> > + MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0);
>
> I think Rasmus has already raised questions about MAP_LOCKED.
>
> I think the kernel cannot rely on it because userspace could call
> munlock on the allocation.

Then they're caught holding the bag? This doesn't seem much different
from userspace shooting themselves in general, like writing garbage into
the allocated states and then trying to use them. If this is something
you really, really are concerned about, then maybe my cheesy dumb xor
thing mentioned above would be a low effort mitigation here.

Jason