Re: [PATCHv3 0/2] capability controlled user-namespaces

From: Serge E. Hallyn
Date: Tue Jan 09 2018 - 17:29:16 EST


Quoting Mahesh Bandewar (maheshb@xxxxxxxxxx):
> On Mon, Jan 8, 2018 at 10:36 AM, Serge E. Hallyn <serge@xxxxxxxxxx> wrote:
> > Quoting Mahesh Bandewar (maheshb@xxxxxxxxxx):
> >> On Mon, Jan 8, 2018 at 10:11 AM, Serge E. Hallyn <serge@xxxxxxxxxx> wrote:
> >> > Quoting Mahesh Bandewar (maheshb@xxxxxxxxxx):
> >> >> On Mon, Jan 8, 2018 at 7:47 AM, Serge E. Hallyn <serge@xxxxxxxxxx> wrote:
> >> >> > Quoting James Morris (james.l.morris@xxxxxxxxxx):
> >> >> >> On Mon, 8 Jan 2018, Serge E. Hallyn wrote:
> >> >> >> I meant in terms of "marking" a user ns as "controlled" type -- it's
> >> >> >> unnecessary jargon from an end user point of view.
> >> >> >
> >> >> > Ah, yes, that was my point in
> >> >> >
> >> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/01845.html
> >> >> > and
> >> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/02276.html
> >> >> >
> >> >> >> This may happen internally but don't make it a special case with a
> >> >> >> different name and don't bother users with internal concepts: simply
> >> >> >> implement capability whitelists with the default having equivalent
> >> >
> >> > So the challenge is to have unprivileged users be contained, while
> >> > allowing trusted workloads in containers created by a root user to
> >> > bypass the restriction.
> >> >
> >> > Now, the current proposal actually doesn't support a root user starting
> >> > an application that it doesn't quite trust in such a way that it *is*
> >> > subject to the whitelist.
> >>
> >> Well, this is not hard since a root process can spawn another process
> >> and lose privileges before creating the user-ns to be controlled by the
> >> whitelist.
> >
> > It would have to drop cap_sys_admin for the container to be marked as
> > "controlled", which may prevent the container runtime from properly starting
> > the container.
> >
> Yes, but that's a conflict of trusted operations (that requires
> SYS_ADMIN) and untrusted processes it may spawn.

Not sure I understand what you're saying, but

I guess that in any case the task which is doing unshare(CLONE_NEWUSER)
can drop cap_sys_admin first. Though that is harder if using clone,
and it is awkward because it's not the container manager, but the user,
who will judge whether the container workload should be restricted.
So the container driver will add a flag like "run-controlled", and
the driver will convert that to dropping a capability, which again
is weird. It would seem nicer to introduce a userns flag, 'caps-controlled'.
For an unprivileged userns, it is always set to 1, and root cannot
change it. For a root-created userns, it stays 0, but root can set it
to 1 (using a /proc file?). In this way either a container runtime or just an
admin script can say "no wait, I want this container to still be controlled".
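
To make the flag semantics above concrete, here is a minimal userspace
model in C. It is only a sketch: struct userns_model and both helper
functions are hypothetical names made up for illustration, not kernel
code or anything from the patch set. It also assumes that clearing the
flag back to 0 is refused, which the description above leaves open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the proposed flag; nothing here exists in the kernel. */
struct userns_model {
	bool created_by_root;	/* creator held CAP_SYS_ADMIN when creating the ns */
	bool caps_controlled;	/* the proposed 'caps-controlled' flag */
};

/* A new userns starts controlled (1) unless it was created by root (0). */
static void userns_model_init(struct userns_model *ns, bool creator_is_root)
{
	assert(ns != NULL);
	ns->created_by_root = creator_is_root;
	ns->caps_controlled = !creator_is_root;
}

/*
 * Model of a write to the flag (e.g. through a /proc file).
 * Returns 0 on success, -1 if the write is refused.
 */
static int userns_model_set_controlled(struct userns_model *ns,
				       bool writer_is_root, bool value)
{
	assert(ns != NULL);
	if (!writer_is_root)
		return -1;	/* only root may touch the flag */
	if (!ns->created_by_root)
		return -1;	/* unprivileged userns: fixed at 1, even for root */
	if (!value)
		return -1;	/* assumption: the flag can be raised, not cleared */
	ns->caps_controlled = true;
	return 0;
}
```

A container runtime honouring a "run-controlled" option would then call
the equivalent of userns_model_set_controlled(ns, true, true) once after
creating the namespace, instead of dropping cap_sys_admin beforehand.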

Or we could instead add a second sysctl to decide whether all or only
'controlled' user namespaces should be controlled. That's not pretty though.
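
For comparison, the second-sysctl alternative boils down to a single
predicate. Again just a sketch: the enum and the function name are
invented for illustration, and no such sysctl exists.

```c
#include <stdbool.h>

/* Hypothetical values for the second sysctl; these names are made up. */
enum whitelist_scope {
	WHITELIST_CONTROLLED_ONLY = 0,	/* only 'controlled' userns are restricted */
	WHITELIST_ALL_USERNS = 1,	/* every userns is restricted */
};

/* Does the capability whitelist apply to a given user namespace? */
static bool whitelist_applies(enum whitelist_scope scope, bool ns_is_controlled)
{
	return scope == WHITELIST_ALL_USERNS || ns_is_controlled;
}
```

The knob here is system-wide, so an admin could not restrict one
root-created container without restricting all of them.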