Re: Building a BSD-jail clone out of namespaces

From: Eric W. Biederman
Date: Thu Jun 06 2013 - 12:58:25 EST


Chris Webb <chris@xxxxxxxxxxxx> writes:

> Prompted by the new userns support merged in the 3.8/3.9 kernels, I've been
> playing with namespaces and trying to understand how I could use them to
> build containers to replace some of my uses of qemu-kvm virtual machines.
>
> I've successfully created a fakeroot-type container running as an
> unprivileged user by unsharing everything including CLONE_NEWUSER, and can
> map a block of host UIDs for that environment by writing to
> /proc/PID/[ug]id_map from a helper process running as root.
>
> However, what I'm hoping for in practice is to be able to create containers
> whose access to their filesystem subtree is untranslated, i.e. uid/gid N in
> the container maps to uid/gid N in a subdirectory of the filesystem, but
> which is still isolated from the rest of the host filesystem and can't do
> externally privileged things. This is pretty much what a BSD jail provides,
> for example.
>
> Is this possible to achieve securely using the mechanisms now available?
> (I'm assuming that parent directory permissions prevent unprivileged host
> users from getting at these container filesystems, exactly as is necessary
> to make BSD jails safe.)
>
>
> As a first step, I naively tried running as root and unsharing everything
> with
>
> unshare(CLONE_NEWIPC | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWPID
> | CLONE_NEWUTS | CLONE_NEWUSER);
>
> before execing a shell[1]. From another root process in the host namespace,
> I then wrote a pass-through mapping 0 0 4294967295 to /proc/PID/[ug]id_map.

That will work, but you really don't want to run with uid == 0 mapped to
uid == 0. Too many files in /proc, /sys and similar places grant access
to uid == 0.

> The result initially looks plausible, with the PID namespace preventing
> signals being sent from one container to another, despite those processes
> sharing the same user ID in the top-level user namespace.
>
> However, unfortunately I still have too many privileges with respect to the
> host. Whilst (for example) I can't mknod, I can mount a sysfs or procfs and
> apparently write to them with host root privileges to reconfigure the host
> kernel. I suspect there will be other things I haven't secured by this
> recipe too.

Yes. I recommend having a dedicated range of uids for your container to
prevent this kind of silliness, or at the very least mapping uid == 0 in
the container to a separate uid on the host.
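
A minimal sketch of what a dedicated range could look like, assuming the
helper still runs as root in the host namespace as described above. Each
line of /proc/PID/[ug]id_map has the form "first-id-inside
first-id-outside count". The function names and the host base of 100000
are hypothetical choices for illustration, not anything the kernel
requires.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Format one id_map line: "<inside> <outside> <count>\n".
 * Returns the line length, or -1 if buf is too small. */
int format_id_map(char *buf, size_t len, unsigned inside,
                  unsigned outside, unsigned count)
{
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write a single-range mapping to /proc/<pid>/<file>, where file is
 * "uid_map" or "gid_map". Returns 0 on success, -1 on failure. */
int write_id_map(pid_t pid, const char *file, unsigned inside,
                 unsigned outside, unsigned count)
{
    char path[64], line[64];
    int n = format_id_map(line, sizeof line, inside, outside, count);
    if (n < 0)
        return -1;
    int p = snprintf(path, sizeof path, "/proc/%ld/%s", (long)pid, file);
    if (p < 0 || (size_t)p >= sizeof path)
        return -1;
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    /* The kernel expects the whole map in one write(). */
    ssize_t w = write(fd, line, (size_t)n);
    close(fd);
    return w == n ? 0 : -1;
}
```

With this, write_id_map(pid, "uid_map", 0, 100000, 65536) would make
uid 0 in the container appear as the unprivileged uid 100000 on the
host, instead of the pass-through "0 0 4294967295".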

> I also tried tightening things up by dropping capabilities from my root user
> and preventing capability grant on exec by setting and locking SECBIT_NOROOT
> on before starting the container. However, I'm not sure this really makes
> any difference---does CLONE_NEWUSER drop all capabilities with respect to
> the parent namespace?

Yes. CLONE_NEWUSER drops all capabilities with respect to the parent
namespace.

> [1] In this description, I'm ignoring the part where I lock into a new root
> filesystem, but presumably the way to do this is by pivot_root into a bind
> mount?

Yes, pivot_root into a bind mount works.
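
A rough sketch of that sequence, assuming the caller has already
unshared CLONE_NEWNS, has the needed privileges in its namespace, and
that the mounts involved are not shared-propagation mounts (pivot_root
refuses those). The function name enter_new_root is hypothetical.
glibc provides no pivot_root() wrapper, so the raw syscall is used.

```c
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Switch the calling process's root to new_root.
 * Returns 0 on success, -1 at the first failing step (errno is set). */
int enter_new_root(const char *new_root)
{
    /* pivot_root needs new_root to be a mount point; bind-mounting
     * the directory onto itself makes it one. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) < 0)
        return -1;
    if (chdir(new_root) < 0)
        return -1;
    /* Passing "." for both arguments stacks the old root on top of
     * the new one, so no temporary put_old directory is needed. */
    if (syscall(SYS_pivot_root, ".", ".") < 0)
        return -1;
    /* Lazily detach the old root, which is now mounted over ".". */
    if (umount2(".", MNT_DETACH) < 0)
        return -1;
    return chdir("/");
}
```

Calling this before the exec of the shell leaves the process with no
path back to the old root, unlike a plain chroot.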

Eric
