Re: [RFC PATCH v1 0/8] Introduce mseal() syscall

From: Matthew Wilcox
Date: Wed Oct 18 2023 - 11:18:06 EST


On Tue, Oct 17, 2023 at 08:18:47PM -0700, Jeff Xu wrote:
> In practice, libc could do the following:
> #define MM_IMMUTABLE \
>   (MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP)
> mseal(addr, len, MM_IMMUTABLE)
> It would be equivalent to BSD's mimmutable().

No, it wouldn't, because you've carefully listed the syscalls you're
blocking instead of understanding the _concept_ of what you need to
block.

> In Linux's case, I think mseal() will eventually have a bigger scope than
> BSD's mimmutable(). The VMA's metadata (vm_area_struct) contains a lot
> of control info; depending on the application's needs, mseal() can be
> expanded to seal individual pieces of control info.
>
> For example, in the madvise(2) case:
> As Jann pointed out in [1], and I quote:
> "you'd probably also want to block destructive madvise() operations
> that can effectively alter region contents by discarding pages and
> such, ..."
>
> Another example: if an application wants to keep a memory range always
> present in RAM, for whatever reason, it can seal the mlock().
>
> To handle those two new cases, mseal() could add two more bits:
> MM_SEAL_MADVICE, MM_SEAL_MLOCK.

Yes, thank you for demonstrating that you have no idea what you need to
block.

> It is practical to keep the syscall extensible when the business logic is the same.

I concur with Theo & Linus. You don't know what you're doing. I think
the underlying idea of mimmutable() is good, but how you've split it up
and how you've implemented it is terrible.

Let's start with the purpose. The point of mimmutable/mseal/whatever is
to fix the mapping of an address range to its underlying object, be it
a particular file mapping or anonymous memory. After the call succeeds,
it must not be possible to make any address in that virtual range point
into any other object.
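
To make "point into any other object" concrete, here is a minimal sketch of
one way an unsealed range can be redirected today: a MAP_FIXED mmap() over an
existing mapping. The function name map_fixed_replaces_object() is made up
for illustration, and the sketch only shows current, unsealed behaviour. It
does not call mseal() or any proposed interface.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if a MAP_FIXED mmap() replaced the object behind an address,
 * 0 if the old contents survived, and -1 on error.
 */
int map_fixed_replaces_object(void)
{
	long page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return -1;

	/* First object: one page of private anonymous memory. */
	char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memcpy(p, "original", 9);

	/*
	 * Second object: a fresh anonymous mapping placed at the same
	 * address.  MAP_FIXED silently discards whatever was mapped there.
	 */
	char *q = mmap(p, (size_t)page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (q == MAP_FAILED) {
		munmap(p, (size_t)page);
		return -1;
	}

	/* Same virtual address, but the old contents are gone. */
	int replaced = (q == p) && (memcmp(p, "original", 9) != 0);

	munmap(p, (size_t)page);
	return replaced;
}
```

The address never changes, yet the data behind it does. That is the kind of
change the primary purpose is meant to rule out, whichever syscall is used to
make it.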

The secondary purpose is to lock down permissions on that range.
Possibly to fix them where they are, possibly to allow RW->RO transitions.

With those purposes in mind, you should be able to deduce for any syscall
or any madvise(), ... whether it should be allowed.
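
As a rough sketch of that deduction, the two purposes can be written as a
single predicate instead of a list of syscalls. Everything here is
hypothetical: struct seal_op_effect and sealed_op_allowed() are illustrative
names, not kernel code, and the permission rule assumes the "allow RW->RO"
variant, where permissions may be dropped but never gained.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical summary of what an operation would do to a sealed range. */
struct seal_op_effect {
	/* Would any address in the range end up backed by another object? */
	bool changes_backing_object;
	int old_prot;	/* PROT_* bits before the operation */
	int new_prot;	/* PROT_* bits after the operation */
};

/* Decide from the effect of an operation, not from its syscall name. */
bool sealed_op_allowed(const struct seal_op_effect *op)
{
	/* Primary purpose: the address-to-object mapping is fixed. */
	if (op->changes_backing_object)
		return false;

	/* Secondary purpose (assumed variant): no new permission bits. */
	if (op->new_prot & ~op->old_prot)
		return false;

	return true;
}
```

Under these assumptions, an operation that leaves the object alone and takes
a range from PROT_READ|PROT_WRITE to PROT_READ passes, while the reverse
transition, or anything that swaps the backing object, is refused.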

Look, I appreciate this is only your second set of patches to Linux and
you've taken on a big job. But that's all the more reason you should
listen to people who are trying to help you.