Re: [RFC PATCH v1 0/8] Introduce mseal() syscall

From: Theo de Raadt
Date: Tue Oct 17 2023 - 14:26:53 EST

Next message: Greg Kroah-Hartman: "Re: [PATCH] staging: rtl8192u: remove entry from Makefile"
Previous message: Ryan Roberts: "Re: [PATCH v2 1/2] zswap: make shrinking memcg-aware"
In reply to: Linus Torvalds: "Re: [RFC PATCH v1 0/8] Introduce mseal() syscall"
Next in thread: Linus Torvalds: "Re: [RFC PATCH v1 0/8] Introduce mseal() syscall"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@xxxxxxxxxx> wrote:
> >
> > It is probably worth noting that I choose to check one and only
> > one sealing type per syscall. i.e. munmap(2) checks
> > MM_SEAL_MUNMAP only.
>
> Yeah, this is wrong.
>
> It's wrong exactly because other system calls will unmap things too.
>
> Using mmap() to over-map something will unmap the old one.
>
> Same goes for mremap() to move over an existing mapping.
>
> So the whole "do things by the name of the system call" is not workable.
>
> All that matters is what the system calls *do*, not what their name is.

I agree completely...

mseal() is a clone of mimmutable(2), but with an extremely
over-complicated API based upon dubious arguments.

I designed mimmutable(2) [1] in OpenBSD, it took about a year to get all
the components working correctly. There were many intermediate API
during development, but in the end the API is simply:

int mimmutable(void *addr, size_t len);

The kernel code for mimmutable() traverses the specified VA range. In
that range, it will find unmapped sub-regions (which are are ignored)
and mapped sub-regions. For these mapped regions, it does not care what
the permissions are, it just marks each sub-region as immutable.

Later on, when any VM operation request upon a VA range attempts to
(1) change the permissions
(2) to re-map on top
(3) or dispose of the mapping,
that operation is refused with errno EPERM. We don't care where the
request comes from (ie. what system call). It is a behaviour of the
VM system, when asked to act upon a VA sub-range mapping.

Very simple semantics.

The only case where the immutable marker is ignored is during address space
teardown as a result of process termination.

In his submission of this API, Jeff Xu makes three claims I find dubious;

> Also, Chrome wants to adopt this feature for their CFI work [2] and this
> patchset has been designed to be compatible with the Chrome use case.

I specifically designed mimmutable(2) with chrome in mind, and the
chrome binary running on OpenBSD is full of immutable mappings. All the
library regions automatically become immutable because ld.so can infer
it and do the mimmutable calls for the right subregions.

So this chrome work has already been done by OpenBSD, and it is dead
simple. During early development I thought mimmutable(2) would be
called by applications or libraries, but I was dead wrong: 99.9% of
calls are from ld.so, and no applications need to call it, these are the
two exceptions:

In OpenBSD, mimmutable() is used in libc malloc() to lock-down some data
structures at initialization time, so they canoot be attacked to create
an invariant for use in ROP return-to-libc style methods.

In Chrome, there is a v8_flags variable rounded out to a full page, and
placed in .data. Chrome initialized this variable, and wants to mprotect
PROT_READ, but .data has been made immutable by ld.so. So we force this
page into a new ELF section called "openbsd.mutable" which also behaves RW
like .data. Where chrome does the mprotect PROT_READ, it now also performs
mimmutable() on that page.

> Having a seal type per syscall type helps to add the feature incrementally.

Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our
entire base operating system and 10,000+ applications automatically receive
the benefits. In one year's effort. The only application which cared about
it was chrome, described in the previous paragraph.

I think Jeff's idea here is super dangerous. What will actually happen
is people will add a few mseal() sub-operations and think the job is done.
It isn't done. They need all the mseal() requests, or the mapping are
not safe.

It is very counterproductive to provide developers a complex API that has
insecure suboperations.

> Applications also know exactly what is sealed.

Actually applicatins won't know because there is no tooling to inspect this --
but I will argue further that applications don't need to know. Immutable
marking is a system decision, not a program decision.

I'll close by asking for a new look at the mimmutable(2) API we settled
on for OpenBSD. I think there is nothing wrong with it. I'm willing to
help guide glibc / ld.so / musl teams through the problems they may find
along the way, I know where the skeletons are buried. Two in
particular: -znow RELRO already today, and xonly-text in the future.

[1] https://man.openbsd.org/mimmutable.2

Next message: Greg Kroah-Hartman: "Re: [PATCH] staging: rtl8192u: remove entry from Makefile"
Previous message: Ryan Roberts: "Re: [PATCH v2 1/2] zswap: make shrinking memcg-aware"
In reply to: Linus Torvalds: "Re: [RFC PATCH v1 0/8] Introduce mseal() syscall"
Next in thread: Linus Torvalds: "Re: [RFC PATCH v1 0/8] Introduce mseal() syscall"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]