Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics

From: Tejun Heo
Date: Thu Jun 01 2017 - 11:35:40 EST


Hello, Peter.

On Thu, Jun 01, 2017 at 05:10:45PM +0200, Peter Zijlstra wrote:
> I've not had time to look at any of this. But the question I'm most
> curious about is how cgroup-v2 preserves the container invariant.
>
> That is, each container (namespace) should look like a 'real' machine.
> So just like userns allows to have a uid-0 (aka root) for each container
> and pidns allows a pid-1 for each container, cgroupns should provide a
> root group for each container.
>
> And cgroup-v2 has this 'exception' (aka wart) for the root group which
> needs to be replicated for each namespace.

The goal has never been that a container must be indistinguishable
from a real machine. Some things simply don't have exact equivalents
due to sharing (memory stats or journal writes, for example), and
those are exactly why people prefer containers over VMs for certain
use cases. If one wants full replication, a VM would be the way to
go.

The goal is to preserve enough of the container invariant that
appropriate workloads can be contained and co-exist in useful ways.
This also means that the contained workload is usually either a bit
illiterate w.r.t. the system details (doesn't care) or makes some
adjustments for running inside a container (most quasi-full-system
ones already do).

System root is inherently different from all other nested roots.
Making some exceptions for the root isn't about taking away from other
roots but about reflecting the inherent differences - there are things
which are inherently system / bare-metal.

Thanks.

--
tejun