Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters

From: Andy Lutomirski
Date: Sat May 11 2019 - 18:41:36 EST


> On May 11, 2019, at 10:21 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>> On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>
>> A better "spawn" API should fix this.
>
> Andy, stop with the "spawn would be better".

It doesn't have to be spawn per se. But the current situation sucks.

>
> Notice? None of the real problems are about execve or would be solved
> by any spawn API. You just think that because you've apparently been
> talking to too many MS people that think fork (and thus indirectly
> execve()) is bad process management.
>
>

I've literally never spoken to an MS person about it.

What container managers and init systems *want* is a way to drop
privileges, change namespaces, etc and then run something in a
controlled way so that the intermediate states aren't dangerous. An
API for this could be spawn-like or exec-like; that particular
distinction is beside the point. Having personally written code that
mucks with namespaces, I've wanted two particular abilities that are
both quite awkward:

a) Change all my UIDs and GIDs to match a container, enter that
container's namespaces, and run some binary in the container's
filesystem, all atomically enough that I don't need to worry about
accidentally leaking privileges into the container. A
super-duper-non-dumpable mode would kind of allow this, but I'd worry
that there's some other hole besides ptrace() and /proc/self.

b) Change all my UIDs and GIDs to match a container, enter that
container's namespaces, and run some binary that is *not* in the
container's filesystem. This happens, for example, if the container's
mount namespace has no exec mounts at all. We don't have a fantastic
way to do this at all right now due to /proc/self/exe.
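
To make the intermediate states concrete, here is a rough sketch of how
the sequence in (a) and (b) tends to look today. It is illustrative
only: the function name and parameters are hypothetical, os.setns()
assumes Python 3.12 or later, and the interaction with user namespaces
is glossed over. The point is that each call is a separate step, so the
process is observable in a half-transitioned state between them.

```python
import os


def enter_container_and_exec(ns_fds, uid, gid, argv):
    """Join namespaces, drop IDs, then exec (hypothetical helper).

    Nothing here is atomic: between any two calls the process holds a
    mix of host and container state.
    """
    for fd in ns_fds:
        # nstype 0 means "accept whatever namespace type this fd refers to".
        os.setns(fd, 0)
    # Drop supplementary groups, then GIDs, then UIDs, while the
    # process still has the privilege to do so.
    os.setgroups([])
    os.setresgid(gid, gid, gid)
    os.setresuid(uid, uid, uid)
    # The path is resolved in the mount namespace joined above, which
    # is why case (b), a binary outside the container, is awkward.
    os.execv(argv[0], argv)
```

A caller would open the /proc/<pid>/ns/* files of a process already in
the container, pass those descriptors as ns_fds, and run this as root.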

Regardless, the actual CVE at hand would have been nicely avoided if
writing to /proc/self/exe didn't work, and I see no reason we can't
make that happen.

I suppose we could also consider a change to disable /proc/self/exe if
it's not reachable from /proc/self/root. By "disable", I mean that
readlink() should maybe still work, but actually trying to open it
could probably fail safely.
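
As a rough userspace analogy for "readlink() works, open() fails", the
sketch below probes a link both ways. The helper name probe_link is
made up for illustration. An ordinary dangling symlink already behaves
this way; /proc/self/exe is a magic link rather than a normal symlink,
so getting the same result there would need the kernel change
suggested above.

```python
import os


def probe_link(path="/proc/self/exe"):
    """Return (link_target, open_status) for a symlink or magic link."""
    # readlink() only reports where the link claims to point.
    target = os.readlink(path)
    try:
        # open() actually follows the link to the underlying file.
        fd = os.open(path, os.O_RDONLY)
    except OSError as err:
        return target, "open failed: " + os.strerror(err.errno)
    os.close(fd)
    return target, "open succeeded"
```

On a current Linux system, probe_link() with the default path normally
reports that open succeeded. That is the behavior the proposal would
change when the binary is not reachable from /proc/self/root.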