Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

From: Andy Lutomirski
Date: Fri May 27 2016 - 19:21:03 EST


On May 27, 2016 3:38 PM, "Kees Cook" <keescook@xxxxxxxxxxxx> wrote:
>
> On Fri, May 27, 2016 at 12:52 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > On May 27, 2016 11:42 AM, "Kees Cook" <keescook@xxxxxxxxxxxx> wrote:
> >>
> >> On Thu, May 26, 2016 at 9:45 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >> > On Thu, May 26, 2016 at 7:41 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> >> >> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >> >>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> >> >>>> One problem with seccomp was that ptrace could be used to change a
> >> >>>> syscall after seccomp filtering had completed. This was a well documented
> >> >>>> limitation, and it was recommended to block ptrace when defining a filter
> >> >>>> to avoid this problem. This can be quite a limitation for containers or
> >> >>>> other places where ptrace is desired even under seccomp filters.
> >> >>>>
> >> >>>> Since seccomp filtering has been split into pre-trace and trace phases
> >> >>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
> >> >>>> after ptrace. This makes that change, and updates the test suite for
> >> >>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
> >> >>>
> >> >>> I like fixing the hole, but I don't like this fix.
> >> >>>
> >> >>> The two-phase seccomp mechanism is messy. I wrote it because it was a
> >> >>> huge speedup. Since then, I've made a ton of changes to the way that
> >> >>> x86 syscalls work, and there are two relevant effects: the slow path
> >> >>> is quite fast, and the phase-1-only path isn't really a win any more.
> >> >>>
> >> >>> I suggest that we fix the by simplifying the code instead of making it
> >> >>> even more complicated. Let's back out the two-phase mechanism (but
> >> >>> keep the ability for arch code to supply seccomp_data) and then just
> >> >>> reorder it so that seccomp happens after ptrace. The result should be
> >> >>> considerably simpler. (We'll still have to answer the question of
> >> >>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
> >> >>> maybe the answer is to just let it through -- after all,
> >> >>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
> >> >>> internal filtering.)
> >> >>
> >> >> I'm really against this. I think seccomp needs to stay first,
> >> >
> >> > Why? What use case is improved with it going first?
> >>
> >> I feel that the critical purpose of seccomp is to minimize attack
> >> surface. To that end, I am strongly against anything coming before it
> >> in the syscall path. I really do not want ptrace going first, I think
> >> it's just asking for bugs.
> >
> > I disagree in this case. There's no actual code surface opened up.
> > If seccomp allows even a single syscall through and there's a ptracer
> > attached, then the ptrace code is exposed. As far as ptrace is
> > concerned, the syscall number is just a number, and ptrace has
> > basically no awareness of the arguments.
>
> No, I completely disagree: there is a significant amount of surface
> exposed. With a tracer attached there is significantly more happening
> before the filter would be checked. Even less obvious things like
> signal delivery, etc get exposed. Seccomp must be first -- this is
> it's basic design principle. Bugs creep in, unexpected combinations
> creep in, etc. Seccomp must mitigate this and be first on the syscall
> path. The paranoia of this design principle must remain in place, even
> at the expense of some inelegant results.

But this only works if the filter is literally "deny everything". If
there is even a single syscall allowed and a ptracer is attached, then
the whole ptrace machinery is exposed anyway.

Users who are this paranoid about attack surface need to disable
ptrace, full stop. If you can do the ptrace(2) syscall, then you can
invoke all the nasty code paths by yourself, and there is nothing
seccomp can do about it. All seccomp can do is prevent ptrace from
generating a syscall that would otherwise be filtered out.

Let's look at the actual supposed attack surface:


if (unlikely(work & _TIF_SYSCALL_EMU))
ret = -1L;

if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
tracehook_report_syscall_entry(regs))
ret = -1L;

That's all.

The only way that TIF_SYSCALL_EMU or TIF_SYSCALL_TRACE gets set is if
a ptracer is attached and uses PTRACE_SYSEMU, PTRACE_SYSCALL or
similar. If that has happened, then there's very, very little that
seccomp can possibly do to reduce attack surface. First, literally
any syscall that results in SECCOMP_RET_OK will cause all of the same
kernel code paths to run. Second, most of those code paths can be
triggered without any syscall at all by using PTRACE_SINGLESTEP
instead.

So I challenge you to find a realistic scenario (i.e. something that a
real program might actually program into seccomp) in which running
seccomp before ptrace avoids even a single line of code worth of
attack surface.

On the flip side, changing the order has lots of benefits:

1. strace can actually be used to figure out why your program is
getting killed by seccomp. (IIRC there was a long article about
trying to debug something in the chromium graphics sandbox that was
resulting in a seccomp kill. strace would have make it easier if it
worked.)

2. The infamous ptrace hold gets plugged with much less code (and, if
we want to plug it for SECCOMP_RET_TRACE, that's still less code than
with your patch). I would argue that reducing the amount of code in
seccomp is more important from an attack surface perspective than
restricting the values that a bunch of integers that the kernel's
ptrace code doesn't even care about.

3. IMO it makes more sense. A debugger is trying to debug. I think
it makes more sense for it to see syscalls as requested by the issuing
process rather than syscalls as modified by seccomp.


> >
> > The logic could be:
> >
> > if (ptraced)
> > invoke ptrace;
> >
> > if (seccomp) {
> > evaluate the filter;
> > if (result == SECCOMP_RET_TRACE) {
> > evaluate filter again;
> > if (result != SECCOMP_RET_TRACE)
> > perform the action;
> > }
>
> We must run seccomp first. So:
>
> if (seccomp) {
> evaluate the filter;
> if (result == SECCOMP_RET_TRACE) {
> invoke ptrace;
> evaluate filter again;
> }
> }
>
> if (ptraced) {
> invoke ptrace;
> evaluate filter again;
> }

This is even more code, and I still disagree with your "must".

--Andy