Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

From: Topi Miettinen
Date: Thu Oct 22 2020 - 06:39:31 EST


On 22.10.2020 10.54, Szabolcs Nagy wrote:
The 10/21/2020 22:44, Jeremy Linton wrote:
There is a problem with glibc+systemd on BTI-enabled systems. Systemd
has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
caught by the seccomp filter, resulting in service failures.
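
To make the conflict concrete, here is a minimal sketch in C that models the filter checks as plain predicates. It is not systemd's actual filter code: the function names are made up for illustration, and the mmap rule (deny only when both PROT_WRITE and PROT_EXEC are requested) is an assumption about how MemoryDenyWriteExecute behaves. PROT_BTI is 0x10 on arm64; the fallback define is only there so the sketch builds on other architectures.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; fallback so the sketch builds elsewhere */
#endif

/* Hypothetical model of the mprotect rule: any request with PROT_EXEC
 * set is denied, regardless of the mapping's current protection. */
static bool mdwx_denies_mprotect(int prot)
{
    return (prot & PROT_EXEC) != 0;
}

/* Hypothetical model of the mmap rule (assumed): only a mapping that is
 * both writable and executable is denied. */
static bool mdwx_denies_mmap(int prot)
{
    return (prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC);
}
```

Under this model, mdwx_denies_mprotect(PROT_READ | PROT_EXEC | PROT_BTI) is true even when the segment is already executable, while mdwx_denies_mmap() with the same flags is false.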

So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
This is obviously not desirable.

Various changes have been suggested: replacing the mprotect with mmap calls
having PROT_BTI set on the original mapping, re-mmapping the segments,
implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
and various modifications to seccomp to allow particular mprotect cases to
bypass the filters. In each case there seems to be an undesirable attribute
to the solution.

So, what's the best solution?

the easy fix in glibc is to ignore mprotect(PROT_BTI|PROT_EXEC)
failures, so programs work with seccomp filters, but bti gets
disabled (it's unreasonable to expect bti protection if mprotect
is filtered). it will be a nasty silent failure though.

Some may also want seccomp filters that immediately kill the process on a violation, and in that case ignoring the failure isn't an option, because the mprotect call never returns.
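
The "ignore the failure" approach could look roughly like the sketch below. This is not glibc's code: enable_bti_best_effort() and the two stand-in functions are hypothetical names, the mprotect implementation is injected only so both outcomes can be shown without a real filter, and EPERM is assumed as the errno a filter would return.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; fallback so the sketch builds elsewhere */
#endif

/* Same signature as mprotect(), so a stand-in can be injected. */
typedef int (*mprotect_fn)(void *addr, size_t len, int prot);

/* Stand-in for an mprotect() that a seccomp filter rejects (EPERM assumed). */
static int filtered_mprotect(void *addr, size_t len, int prot)
{
    (void)addr; (void)len; (void)prot;
    errno = EPERM;
    return -1;
}

/* Stand-in for an unfiltered mprotect() that succeeds. */
static int permitted_mprotect(void *addr, size_t len, int prot)
{
    (void)addr; (void)len; (void)prot;
    return 0;
}

/* Try to add PROT_BTI to a segment. On failure, carry on without BTI
 * instead of aborting. Returns true only if BTI was actually enabled. */
static bool enable_bti_best_effort(mprotect_fn mp, void *addr, size_t len,
                                   int prot)
{
    return mp(addr, len, prot | PROT_BTI) == 0;
}
```

In a real loader, mp would be mprotect itself. The fallback only helps when the filter returns an error; with a kill action there is nothing left to fall back from.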

and i'm also considering a fix that re-mmaps the executable
segment with PROT_BTI instead of mprotect since that is not
filtered. unfortunately the main exe is mmapped by the kernel
without PROT_BTI and the libc does not have the fd to re-mmap.
(bti can be left off for the main exe if mprotect fails and
later we can teach the kernel to add bti there.) currently
this is not a complete fix so i'm a bit hesitant about it.

as for a kernel side fix: if there is a way to only filter
PROT_EXEC mprotect on mappings that are not yet PROT_EXEC
that would solve this problem (but likely needs new syscall
or seccomp capability).
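
The rule described here can be written down as a small predicate. This is only a model with a hypothetical name: a seccomp filter sees just the syscall arguments, not the current protection of the target mapping, so current_prot below stands for state that only the kernel has.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; fallback so the sketch builds elsewhere */
#endif

/* Hypothetical state-aware rule: deny an mprotect() only when the mapping
 * would newly gain PROT_EXEC. current_prot is the mapping's protection
 * before the call, requested_prot is the argument to mprotect(). */
static bool exec_gain_denied(int current_prot, int requested_prot)
{
    bool already_exec = (current_prot & PROT_EXEC) != 0;
    bool wants_exec = (requested_prot & PROT_EXEC) != 0;

    return wants_exec && !already_exec;
}
```

With this rule, adding PROT_BTI to an already executable segment passes, while turning a writable mapping into an executable one is still denied.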

The problem with seccomp MDWX is that malicious programs can still circumvent the filter: call memfd_create(), fill the memory with the desired content, and then use mmap(,,PROT_EXEC) to make it executable without triggering seccomp. This can be mitigated by also filtering memfd_create(), but some programs want to use it. The protection can also be bypassed if the program can write to a file system that isn't mounted with "noexec". This can be mitigated with private mount namespaces and global mount options, but again some programs are written to expect W & X.
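
A minimal sketch of the memfd_create() sequence, to show which calls are involved. The function name and the "mdwx-demo" label are made up, the raw syscall is used only to avoid depending on a particular libc wrapper, and the bytes written are a placeholder that is never executed. Whether the final mmap succeeds depends on the system's policy, so the function just reports the outcome.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if memfd-backed memory that was written beforehand could be
 * mapped executable, 0 otherwise. Nothing is executed. */
static int memfd_exec_mapping_possible(void)
{
    static const char payload[] = "arbitrary bytes"; /* placeholder content */
    int ok = 0;

    int fd = (int)syscall(SYS_memfd_create, "mdwx-demo", 0U);
    if (fd < 0)
        return 0;

    /* Fill the file through write(), so no W+X mapping is ever needed. */
    if (write(fd, payload, sizeof(payload)) == (ssize_t)sizeof(payload)) {
        /* PROT_EXEC without PROT_WRITE, requested via mmap, not mprotect. */
        void *p = mmap(NULL, sizeof(payload), PROT_READ | PROT_EXEC,
                       MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            ok = 1;
            munmap(p, sizeof(payload));
        }
    }

    close(fd);
    return ok;
}
```

No mprotect(PROT_EXEC) call and no writable+executable mapping appears anywhere in this sequence, which is why a filter that looks only at those arguments does not catch it.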

But I think SELinux has a more complete solution (execmem), which can track the pages better than is possible with the seccomp solution, which has a very narrow field of view. Maybe this facility could be made available to non-SELinux systems, for example with prctl()? Then the in-kernel MDWX could allow mprotect(PROT_EXEC | PROT_BTI) if the backing file hasn't been modified, the source filesystem isn't writable by the calling process, and the file descriptor wasn't created with memfd_create().

-Topi