Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE

From: Christian Brauner
Date: Sat Aug 14 2021 - 03:53:57 EST


On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote:
> On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
> > On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > And nobody really complained when we weakened it, so maybe removing it
> > > entirely might be acceptable.
> >
> > I guess we could just try it and see... Worst comes to worst, we'll
> > have to put it back, but at least we'd know what crazy thing still
> > wants it..
>
> Umm... I'll need to go back and look through the thread, but I'm
> fairly sure that there used to be suckers that did replacement of
> binary that way (try to write, count on exclusion with execve while
> it's being written to) instead of using rename. Install scripts
> of weird crap and stuff like that...

I'm not agains trying to remove it, but I think Al has a point.

Removing the write protection will also most certainly make certain
classes of attacks _easier_. For example, the runC container breakout
from last year using privileged containers issued CVE-2019-5736 would be
easier. I'm quoting from the commit I fixed this with:

The attack can be made when attaching to a running container or when starting a
container running a specially crafted image. For example, when runC attaches
to a container the attacker can trick it into executing itself. This could be
done by replacing the target binary inside the container with a custom binary
pointing back at the runC binary itself. As an example, if the target binary
was /bin/bash, this could be replaced with an executable script specifying the
interpreter path #!/proc/self/exe (/proc/self/exec is a symbolic link created
by the kernel for every process which points to the binary that was executed
for that process). As such when /bin/bash is executed inside the container,
instead the target of /proc/self/exe will be executed - which will point to the
runc binary on the host. The attacker can then proceed to write to the target
of /proc/self/exe to try and overwrite the runC binary on the host.

and then the write protection kicks in of course:

However in general, this will not succeed as the kernel will not
permit it to be overwritten whilst runC is executing.

which the attack can of course already overcome nowadays with minimal
smarts:

To overcome this, the attacker can instead open a file descriptor to
/proc/self/exe using the O_PATH flag and then proceed to reopen the
binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it
in a busy loop from a separate process. Ultimately it will succeed
when the runC binary exits. After this the runC binary is
compromised and can be used to attack other containers or the host
itself.

But with write protection removed you'd allow such attacks to succeed
right away. It's not a huge deal to remove it since we need to have
other protection mechanisms in place already:

To prevent this attack, LXC has been patched to create a temporary copy of the
calling binary itself when it starts or attaches to containers. To do this LXC
creates an anonymous, in-memory file using the memfd_create() system call and
copies itself into the temporary in-memory file, which is then sealed to
prevent further modifications. LXC then executes this sealed, in-memory file
instead of the original on-disk binary. Any compromising write operations from
a privileged container to the host LXC binary will then write to the temporary
in-memory binary and not to the host binary on-disk, preserving the integrity
of the host LXC binary. Also as the temporary, in-memory LXC binary is sealed,
writes to this will also fail.

Note: memfd_create() was added to the Linux kernel in the 3.17 release.

However, I still like to pich the upgrade mask idea Aleksa and we tried
to implement when we did openat2(). If we leave write-protection in
preventing /proc/self/exe from being written to:

we can take some time and upstream the upgrade mask patchset which was
part of the initial openat2() patchset but was dropped back then (and I
had Linus remove the last remants of the idea in [1]).

The idea was to add a new field to struct open_how "upgrade_mask" that
would allow a caller to specify with what permissions an fd could be
reopened with. I still like this idea a great deal and it would be a
very welcome addition to system management programs. The upgrade mask is
of course optional, i.e. the caller would have to specify the upgrade
mask at open time to restrict reopening (lest we regress the whole
world).

But, we could make it so that an O_PATH fd gotten from opening
/proc/<pid>/exe always gets a restricted upgrade mask set and so it
can't be upgraded to a O_WRONLY fd afterwards. For this to be
meaningful, write protection for /proc/self/exe would need to be kept.

[1]: commit 5c350aa11b441b32baf3bfe4018168cb8d10cef7
Author: Christian Brauner <christian.brauner@xxxxxxxxxx>
Date: Fri May 28 11:24:15 2021 +0200

fcntl: remove unused VALID_UPGRADE_FLAGS

We currently do not maky use of this feature and should we implement
something like this in the future it's trivial to add it back.

Link: https://lore.kernel.org/r/20210528092417.3942079-2-brauner@xxxxxxxxxx
Cc: Christoph Hellwig <hch@xxxxxx>
Cc: Aleksa Sarai <cyphar@xxxxxxxxxx>
Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: linux-fsdevel@xxxxxxxxxxxxxxx
Suggested-by: Richard Guy Briggs <rgb@xxxxxxxxxx>
Reviewed-by: Richard Guy Briggs <rgb@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx>