Re: Building a BSD-jail clone out of namespaces

From: Eric W. Biederman
Date: Fri Jun 07 2013 - 00:07:29 EST


Chris Webb <chris@xxxxxxxxxxxx> writes:

> "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> writes:
>
>> Hmm. I guess it depends on how your VM is reading them. If it is
>> block-based access to the filesystem you have a problem. If the VM
>> is effectively NFS mounting the filesystem you can do all kinds of
>> things.
>>
>> It is possible to just change the user namespace and set up your mapping,
>> effectively running your VM in the user namespace, and that would allow
>> the VM to see your mapped uids.
>
> In some cases I was thinking of mounting a filesystem directly from a block
> device, but more often it would be directories in a local host filesystem.
> I use qemu's built-in virtio 9p-over-pci to pass these in at present.

Interesting. I hadn't seen that feature. That makes 9p much more
interesting than I thought it was.

> So in principle, that does mean I could store UIDs translated and wrap
> everything else I do at host level in a userns translation layer as well,
> but it's quite an intrusive thing to do and I imagine it would preclude
> lightweight throwaway containers where I share the host filesystem read-only
> into a container.

Not being able to share the host filesystem into a container is a
downside of the current implementation. In principle you can have an
overlay style filesystem that munges the uids and removes this
limitation, but that doesn't currently exist.

> This is why I was quite keen to avoid mangled ownerships in the host
> filesystems at all, but from what you say, that goal sounds like it might
> be rather tricky to achieve.

If you don't try to share the host root filesystem you can achieve the
sharing pretty easily by just running qemu in a user namespace, so that
qemu, or whatever else serves the 9p protocol, sees the filesystem with all
of the uids and gids translated.
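
As a rough illustration of the translation described above, here is a
minimal Python sketch of how a user namespace mapping relates ids inside
the namespace to ids outside it. It assumes the three-column layout of
/proc/<pid>/uid_map (first id inside, first id outside, length). The names
Extent, parse_uid_map, to_outside and to_inside are hypothetical helpers
invented for this example, and the range 100000-165535 is an arbitrary
choice, not something taken from this thread.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    """One line of a uid_map/gid_map file (hypothetical helper type)."""
    inside_first: int   # first id as seen inside the user namespace
    outside_first: int  # first id as seen in the parent namespace
    count: int          # number of consecutive ids covered


def parse_uid_map(text: str) -> List[Extent]:
    """Parse text laid out like /proc/<pid>/uid_map into extents."""
    extents = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        inside, outside, count = (int(f) for f in fields)
        extents.append(Extent(inside, outside, count))
    return extents


def to_outside(extents: List[Extent], inside_id: int) -> Optional[int]:
    """Translate an id seen inside the namespace to the parent's view."""
    for e in extents:
        if e.inside_first <= inside_id < e.inside_first + e.count:
            return e.outside_first + (inside_id - e.inside_first)
    return None  # the id has no mapping


def to_inside(extents: List[Extent], outside_id: int) -> Optional[int]:
    """Translate an id from the parent's view to the view inside."""
    for e in extents:
        if e.outside_first <= outside_id < e.outside_first + e.count:
            return e.inside_first + (outside_id - e.outside_first)
    return None  # the id has no mapping


if __name__ == "__main__":
    # Assumed example mapping: ids 0..65535 inside correspond to
    # ids 100000..165535 outside.
    mapping = parse_uid_map("0 100000 65536\n")
    print(to_outside(mapping, 0))
    print(to_inside(mapping, 101000))
```

With a mapping like this, a file owned by uid 101000 on the host would
appear as uid 1000 to a process in the namespace, and host uid 0 would
have no mapping at all.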

>> There are too many things in /proc and /sys and similar that
>> grant access to uid == 0.
>
> Ah yes, I can see why this is a thorny one. Is it just the synthetic
> filesystems like /proc and /sys that are the problem, or are there loads of
> other places in the kernel that assume uid == 0 implies privilege? I.e. is
> it 'just' a matter of somehow securing access to procfs and sysfs, or a much
> wider issue?

It is a wider issue. Capabilities cover most of the places in the kernel
where the kernel tests if you have privilege but there are other
filesystems like devtmpfs, and the occasional silly piece of kernel
code that should be using capabilities but is not. Beyond the kernel
there are files like /etc/shadow that only root is allowed to read.

Which all boils down to the fact that for the inconvenience of using a
separate range of uids a lot of other problems just go away.
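
To make the difference between a uid == 0 test and a capability test
concrete, here is a small Python sketch that decodes the CapEff line of
text laid out like /proc/<pid>/status and checks individual capability
bits. The bit numbers are the ones from <linux/capability.h>; the helper
names effective_caps and has_cap, and the sample status text, are made up
for the example.

```python
# Capability bit numbers as defined in <linux/capability.h>.
CAP_DAC_OVERRIDE = 1
CAP_SYS_ADMIN = 21


def effective_caps(status_text: str) -> int:
    """Return the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The mask is printed as a hexadecimal number.
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap: int) -> bool:
    """Test whether capability number `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)


if __name__ == "__main__":
    # Assumed sample: a process running as uid 0 with no capabilities.
    sample = "Name:\tsh\nUid:\t0\t0\t0\t0\nCapEff:\t0000000000000000\n"
    mask = effective_caps(sample)
    print(has_cap(mask, CAP_SYS_ADMIN))
```

A process like the one in the sample would fail a capability check but
still pass any check that only compares the uid against 0, which is the
kind of code path discussed above.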

Eric