Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.

From: Ingo Molnar
Date: Thu May 07 2009 - 18:15:26 EST



(i've restored the Cc: line of the previous thread)

* Adam Langley <agl@xxxxxxxxxx> wrote:

> (This is a discussion email rather than a patch which I'm
> seriously proposing be landed.)
>
> In a recent thread[1] my colleague, Markus, mentioned that we
> (Chrome Linux) are investigating using seccomp to implement our
> rendering sandbox[2] on Linux.
>
> In the same thread, Ingo mentioned[3] that he thought a bitmap of
> allowed system calls would be reasonable. If we had such a thing,
> many of the acrobatics that we currently need could be avoided.
> Since we need to support the currently existing kernels, we'll
> need to have the code for both, but allowing signal handling,
> gettimeofday, epoll etc would save a lot of overhead for common
> operations.
>
> The patch below implements such a scheme. It's written on top of
> the current seccomp for the moment, although it looks like seccomp
> might be written in terms of ftrace soon[4].
>
> Briefly, it adds a second seccomp mode (2) where one uploads a
> bitmask. Syscall n is allowed if, and only if, bit n is true in
> the bitmask. If n is beyond the range of the bitmask, the syscall
> is denied.
>
> If prctl is allowed by the bitmask, then a process may switch to
> mode 1, or may set a new bitmask iff the new bitmask is a subset
> of the current one. (Possibly moving to mode 1 should only be
> allowed if read, write, sigreturn, exit are in the currently
> allowed set.)
>
> If a process forks/clones, the child inherits the seccomp state of
> the parent. (And hopefully I'm managing the memory correctly
> here.)
>
> Ingo subsequently floated the idea of a more expressive interface
> based on ftrace which could introspect the arguments, although I
> think the discussion had fallen off list at that point.
>
> He suggested using an ftrace parser which I'm not familiar with, but can
> be summed up with:
> seccomp_prctl("sys_write", "fd == 3") // allow writes only to fd 3

It's the ftrace filter parser and execution engine.

I.e. we first parse the filter expression when setting up a seccomp
context. Each syscall has the following attributes:

    on        # enabled unconditionally
    off       # disabled unconditionally
    filtered  # allowed only if the filter expression matches
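
A rough sketch of how such a three-state attribute could be represented and checked. Every name here (sc_policy, sc_rule, sc_allowed, fd_is_zero) is hypothetical and made up for illustration; none of it is an existing kernel interface, and the filter is stood in for by a plain function pointer:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; not an existing kernel interface. */
enum sc_policy { SC_OFF = 0, SC_ON, SC_FILTERED };

/* A filter sees only the syscall's argument values. */
typedef bool (*sc_filter_fn)(const long args[6]);

struct sc_rule {
    enum sc_policy policy;
    sc_filter_fn filter;    /* consulted only when policy == SC_FILTERED */
};

/* Example filter, equivalent to the expression "fd == 0". */
static bool fd_is_zero(const long args[6])
{
    return args[0] == 0;
}

/* Decide whether one syscall invocation is allowed under its rule. */
static bool sc_allowed(const struct sc_rule *rule, const long args[6])
{
    switch (rule->policy) {
    case SC_ON:
        return true;
    case SC_FILTERED:
        /* A filtered rule with no filter attached denies by default. */
        return rule->filter != NULL && rule->filter(args);
    case SC_OFF:
    default:
        return false;
    }
}
```

With a rule of { SC_FILTERED, fd_is_zero }, a call whose first argument is 0 passes and any other fd is refused; SC_ON and SC_OFF never look at the arguments at all.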

In the filtered case, the filter can be as simple as:

    "fd == 0"

which restricts sys_write() to a single fd (while still allowing
sys_read() from other fds).

Or as complex as:

    (fd == 4 || fd == 5) && (buf == 0x12340000) && (size <= 4096)

which restricts IO to two specific fds, restricts output to a
specific memory address, and restricts size to 4K or smaller.

This is how the filter engine works: we parse the string once and
save it into a binary expression structure (cache) that the engine
can later run quickly, without any string parsing or formatting
overhead in the validation fastpath.

The filter is thus evaluated in the sandbox task's context, without
the need for any context-switching. It's very, very fast. It is, I
think, faster than LSM rules, and it is also atomic and lockless
(RCU based).
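
A minimal sketch of the parse-once, evaluate-many idea. The complex filter above is shown already parsed into a small tree (built by hand here; the string parser is omitted), and eval() walks it with no string handling. All names (node, eval, write_filter, the ARG_* indices) are hypothetical, and this is not the actual ftrace filter code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pre-parsed filter representation, for illustration only. */
enum node_op { OP_AND, OP_OR, OP_EQ, OP_LE };

struct node {
    enum node_op op;
    const struct node *lhs, *rhs;   /* children, used by OP_AND / OP_OR */
    int field;                      /* argument index, used by OP_EQ / OP_LE */
    unsigned long value;            /* constant to compare against */
};

/* Assumed argument positions for the fields named in the filter string. */
enum { ARG_FD = 0, ARG_BUF = 1, ARG_SIZE = 2 };

/* Evaluate the tree against one syscall's argument values. */
static bool eval(const struct node *n, const unsigned long args[])
{
    switch (n->op) {
    case OP_AND: return eval(n->lhs, args) && eval(n->rhs, args);
    case OP_OR:  return eval(n->lhs, args) || eval(n->rhs, args);
    case OP_EQ:  return args[n->field] == n->value;
    case OP_LE:  return args[n->field] <= n->value;
    }
    return false;   /* unknown operator: deny */
}

/* (fd == 4 || fd == 5) && (buf == 0x12340000) && (size <= 4096) */
static const struct node fd4     = { OP_EQ, NULL, NULL, ARG_FD, 4 };
static const struct node fd5     = { OP_EQ, NULL, NULL, ARG_FD, 5 };
static const struct node fds     = { OP_OR, &fd4, &fd5, 0, 0 };
static const struct node buf     = { OP_EQ, NULL, NULL, ARG_BUF, 0x12340000UL };
static const struct node len     = { OP_LE, NULL, NULL, ARG_SIZE, 4096 };
static const struct node fds_buf = { OP_AND, &fds, &buf, 0, 0 };
static const struct node write_filter = { OP_AND, &fds_buf, &len, 0, 0 };
```

The per-call cost is a handful of comparisons on values that are already in hand, which is the property that matters for the fastpath. How the real engine lays out its cached expression may well differ from this tree.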

> In general, I believe that ftrace based solutions cannot safely
> validate arguments which are in user-space memory when multiple
> threads could be racing to change the memory between ftrace and
> the eventual copy_from_user. Because of this, many useful
> arguments (such as the sockaddr to connect, the filename to open
> etc) are out of reach. LSM hooks appear to be the best way to
> impose limits in such cases. (Which we are also experimenting
> with).

That assessment is incorrect: there is really no difference in
safety between the two approaches here.

LSM cannot magically inspect user-space memory either when multiple
threads may access it. The point would be to define filters for
system call _arguments_, which are inherently thread-local and safe.

> However, such a parser could be very useful in one particular
> case: socketcall on IA32. Allowing recvmsg and sendmsg, but not
> socket, connect etc is certainly something that we would be
> interested in.

There are two problems with the bitmap scheme, which I also
suggested in a previous thread but then found lacking:

1) enumeration: you define a bitmap indexed by syscall number. That
is problematic between compat and native 64-bit, because the two
have different syscall vectors.

2) flexibility. It's an on/off selection per syscall. With the
filter we have on, off, or filtered. That's a _whole_ lot more
flexible.

The filter expression based solution does not suffer from the
enumeration problem: it is string enumerated. "sys_read" means that
syscall, and we could specify whether it's the compat or the native
one.
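
To make the enumeration point concrete, here is a small sketch that contrasts a number-indexed bitmap with name-based lookup. The three x86 syscall numbers in the table are the real i386 and x86_64 values; everything else (sc_name, sc_resolve, bitmap_allows, the abi enum) is a hypothetical illustration, not a proposed interface:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum abi { ABI_NATIVE64, ABI_COMPAT32 };

struct sc_name {
    const char *name;
    int nr_native64;    /* x86_64 syscall number */
    int nr_compat32;    /* i386 syscall number */
};

/* The same three syscalls, numbered differently by each ABI. */
static const struct sc_name table[] = {
    { "sys_read",   0, 3 },
    { "sys_write",  1, 4 },
    { "sys_exit",  60, 1 },
};

/* Resolve a syscall name to its number for the given ABI, or -1. */
static int sc_resolve(const char *name, enum abi abi)
{
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (strcmp(table[i].name, name) == 0)
            return abi == ABI_NATIVE64 ? table[i].nr_native64
                                       : table[i].nr_compat32;
    return -1;
}

/* Number-indexed bitmap check: anything beyond the bitmap is denied. */
static bool bitmap_allows(const unsigned long *map, size_t nbits, unsigned nr)
{
    const size_t bits = sizeof(unsigned long) * CHAR_BIT;

    if (nr >= nbits)
        return false;
    return ((map[nr / bits] >> (nr % bits)) & 1UL) != 0;
}
```

A bitmap with only bit 1 set allows write for a native 64-bit task, but for a compat task that same bit allows exit and denies write. A rule keyed on "sys_write" resolves to 1 or 4 depending on the ABI, so it keeps meaning the same syscall in both cases.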

Ingo