Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

From: Jann Horn
Date: Tue Nov 24 2020 - 12:31:19 EST


On Tue, Nov 24, 2020 at 6:15 PM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Nov 24, 2020 at 06:06:38PM +0100, Jann Horn wrote:
> > +seccomp maintainers/reviewers
> > [thread context is at
> > https://lore.kernel.org/linux-api/87lfer2c0b.fsf@xxxxxxxxxxxxxxxxxxxxxxxxx/
> > ]
> >
> > On Tue, Nov 24, 2020 at 5:49 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > > On Tue, Nov 24, 2020 at 03:08:05PM +0100, Mark Wielaard wrote:
> > > > For valgrind the issue is statx which we try to use before falling back
> > > > to stat64, fstatat or stat (depending on architecture, not all define
> > > > all of these). The problem with these fallbacks is that under some
> > > > containers (libseccomp versions) they might return EPERM instead of
> > > > ENOSYS. This causes really obscure errors that are really hard to
> > > > diagnose.
> > >
> > > So find a way to detect these completely broken container run times
> > > and refuse to run under them at all. After all they've decided to
> > > deliberately break the syscall ABI. (and yes, we gave them the rope
> > > to do that with seccomp :().
> >
> > FWIW, if the consensus is that seccomp filters that return -EPERM by
> > default are categorically wrong, I think it should be fairly easy to
> > add a check to the seccomp core that detects whether the installed
> > filter returns EPERM for some fixed unused syscall number and, if so,
> > prints a warning to dmesg or something along those lines...
>
> Why? seccomp is saying "this syscall is not permitted", so -EPERM seems
> like the correct error to provide here. It's not -ENOSYS as the syscall
> is present.
>
> As everyone knows, there are other ways to have -EPERM be returned from
> a syscall if you don't have the correct permissions to do something.
> Why is seccomp being singled out here? It's doing the correct thing.

AFAIU from what the others have said, it's being singled out because
it means that for two semantically equivalent operations (e.g.
openat() vs open()), one can fail while the other works because the
filter doesn't know about one of the syscalls. Normally semantically
equivalent syscalls are supposed to be subject to the same checks, and
if one of them fails, trying the other one won't help.

But if you can't tell whether the more modern syscall failed because
of a seccomp filter, you may be forced to retry with an older syscall
even on systems where the new syscall works fine, and such a fallback
may reduce security or reliability if you're trying to use some flags
that only the new syscall provides for security, or something like
that. (As a contrived example, imagine being forced to retry any
tgkill() that fails with EPERM as a tkill() just in case you're
running under a seccomp filter.)
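
To make the contrived example concrete, here is a rough userspace
sketch of what such a fallback policy could look like. The helper names
(should_retry_legacy, send_thread_signal) and the suspect_seccomp_eperm
flag are made up for illustration; nothing like this exists in any
library I know of. It only assumes the raw tgkill()/tkill() syscalls
via syscall(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical policy helper: decide whether a failed modern syscall
 * should be retried with its legacy equivalent.
 *
 * ENOSYS unambiguously means "this kernel does not implement it".
 * EPERM is ambiguous: it may be a real permission failure, or a seccomp
 * filter that does not know the syscall. We only treat it as a reason
 * to fall back when the caller explicitly opts in.
 */
static bool should_retry_legacy(int err, bool suspect_seccomp_eperm)
{
    if (err == ENOSYS)
        return true;
    if (err == EPERM && suspect_seccomp_eperm)
        return true;
    return false;
}

/*
 * Hypothetical wrapper: signal thread `tid` in thread group `tgid`,
 * preferring tgkill(). Returns 0 on success, -1 with errno set on failure.
 */
static int send_thread_signal(pid_t tgid, pid_t tid, int sig,
                              bool suspect_seccomp_eperm)
{
    assert(sig >= 0);

    if (syscall(SYS_tgkill, tgid, tid, sig) == 0)
        return 0;

    if (!should_retry_legacy(errno, suspect_seccomp_eperm))
        return -1;

    /*
     * The fallback drops the tgid check that tgkill() performs, so a
     * recycled tid in another process could be signalled instead. This
     * is exactly the loss of safety the EPERM ambiguity forces on us.
     */
    return (int)syscall(SYS_tkill, tid, sig);
}
```

With suspect_seccomp_eperm set to false, a genuine EPERM is reported
to the caller and the weaker tkill() is never used; setting it to true
is what a program would be pushed into if it cannot tell a filter's
EPERM from a real one.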