Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

From: Eric W. Biederman
Date: Fri Jul 22 2016 - 14:58:26 EST


Colin Walters <walters@xxxxxxxxxx> writes:

> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote:
>>
>> This patchset addresses two use cases:
>> - Implement a sane upper bound on the number of namespaces.
>> - Provide a way for sandboxes to limit the attack surface from
>> namespaces.
>
> Perhaps this is obvious, but since you didn't quite explicitly state it;
> do you see this as obsoleting the existing downstream patches
> mentioned in:
> https://lwn.net/Articles/673597/
> It seems conceptually similar to Kees' original approach, right?

Similar, yes, and I expect it fills the need. The primary difference is
that my approach starts from the assumption that user namespaces and
other namespaces are no buggier than any other piece of kernel code,
and that people will use them.

I don't see these limits making sense from a perspective that user
namespaces are flawed and distro kernels should not have enabled them in
the first place. That was my perception, right or wrong, of Kees' patches
and the related patches that landed in Ubuntu and Debian.

With Kees' approach I could not see how to handle the case where some
applications on the system want user namespaces and others don't,
which made it very nasty for the future evolution and wider deployment
of user namespaces. Being per user namespace, these limits can be used
to sandbox applications without affecting the rest of the system.

> The high level makes sense to me...most interesting is
> per-userns sysctls. I'll note most current container managers
> mount /proc/sys read-only, and Docker specifically drops
> CAP_SYS_RESOURCE by default, so they'd likely need to learn
> how to undo that if one wanted to support recursive container usage.
> We'd probably need to evaluate the safety of having /proc/sys
> writable generally. (Also it's rather common to filter out CLONE_NEWUSER
> via seccomp, but that's easy to undo)

Just using a user namespace replaces most of those precautions.

> But that's the flip side - if we're aiming primarily for an upstreamable
> way to *limit* namespace usage, it seems sane to me.

Yes. The primary target is to stop applications that have gone buggy
and allocated a crazy number of namespaces. The secondary target
is to allow sandboxes to disable creation of additional namespaces.
Just set the limit to 0 and drop caps, or similarly set the limit
to 1 and create another fresh set of nested namespaces.
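
As a rough illustration of the "set the limit to 0" step, here is a
minimal sketch in Python. The sysctl path is an assumption: it is the
name found under /proc/sys/user on later mainline kernels, and the file
names in this version of the patchset may differ. The helper names
set_namespace_limit and read_namespace_limit are hypothetical, and
dropping capabilities afterwards is left out of the sketch.

```python
# Assumption: sysctl path as found on later mainline kernels; this
# patchset may expose the limit under a different name.
USERNS_LIMIT = "/proc/sys/user/max_user_namespaces"


def set_namespace_limit(limit, path=USERNS_LIMIT):
    """Write a namespace-count limit to the given sysctl file.

    Hypothetical helper. Writing the real file needs sufficient
    privilege in the user namespace that owns the limit.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    with open(path, "w") as f:
        f.write(f"{limit}\n")


def read_namespace_limit(path=USERNS_LIMIT):
    """Read the current limit back as an integer."""
    with open(path) as f:
        return int(f.read().strip())
```

A sandbox would call set_namespace_limit(0) from inside its own user
namespace and then drop its capabilities so the limit cannot be raised
again; with a limit of 1 it could still create one more nested user
namespace first.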

Eric