Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

From: Serge E. Hallyn
Date: Fri Mar 13 2009 - 12:36:28 EST


Quoting Cedric Le Goater (legoater@xxxxxxx):
>
> > No, what you're suggesting does not suffice.
>
> probably. I'm still trying to understand what you mean below :)
>
> Man, I hate these hierarchical pid_ns. One level would have been enough,
> just one vpid attribute in 'struct pid*'

Well I don't mind - temporarily - saying that nested pid namespaces
are not checkpointable. It's just that if we're going to need a new
syscall anyway, then why not go ahead and address the whole problem?
It's not hugely more complicated, and seems worth it.

> > Call
> > (5591,3,1) the task known as 5591 in the init_pid_ns, 3 in a child pid
> > ns, and 1 in a grandchild pid_ns created from there. Now assume we are
> > checkpointing tasks T1=(5592,1), and T2=(5594,3,1).
> >
> > We don't care about the first number in the tuples, so they will be
> > random numbers after the recreate.
>
> yes.
>
> > But we do care about the second numbers.
>
> yes, very much, and we need a way to set these numbers in alloc_pid()
>
> > But specifying CLONE_NEWPID while recreating the process tree
> > in userspace does not allow you to specify the 3 in (5594,3,1).
>
> I haven't looked closely at hierarchical pid namespaces, but as we're
> using an array of pids indexed by the pidns level, I don't see why
> it shouldn't be possible. You might be right.
>
> Anyway, I think that some CLONE_NEW* flags should be forbidden. Daniel
> should soon send a little patch for the ns_cgroup restricting the clone
> flags that can be used in a container.

Uh, that feels a bit over the top. We want to make this
uncheckpointable (if it remains so), not prevent the whole action.
After all I may be running a container which I don't plan on ever
checkpointing, and inside that container running a job which I do
want to migrate.

So depending on whether we're doing it the Dave or the rest-of-the-world
way :), we either clear_bit(pidns->may_checkpoint) on the parent
pid_ns when a child is created, or we walk every task being
checkpointed and make sure they are all in the same pid_ns. Doesn't
that suffice?
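
To make the two options concrete, here is a rough userspace sketch.
It is only a model: struct pid_ns, struct task, pid_ns_init_child()
and tasks_share_pid_ns() are hypothetical names made up for
illustration, not the kernel's struct pid_namespace or any existing
API, and may_checkpoint is a plain bool here rather than a bit cleared
with clear_bit().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, not the kernel's struct pid_namespace. */
struct pid_ns {
    struct pid_ns *parent;   /* NULL for the initial namespace */
    unsigned int level;      /* 0 for the initial namespace */
    bool may_checkpoint;     /* assumed flag, after pidns->may_checkpoint */
};

/* Hypothetical stand-in for a task: only the pid_ns it lives in matters. */
struct task {
    struct pid_ns *ns;
};

/*
 * Option 1: when a child pid_ns is created, mark the parent as no
 * longer checkpointable. The child itself starts out checkpointable.
 */
static void pid_ns_init_child(struct pid_ns *child, struct pid_ns *parent)
{
    assert(parent != NULL);
    child->parent = parent;
    child->level = parent->level + 1;
    child->may_checkpoint = true;
    parent->may_checkpoint = false;
}

/*
 * Option 2: at checkpoint time, walk the tasks being checkpointed and
 * verify that every one of them lives in the same pid_ns as the first.
 * An empty or single-task set trivially passes.
 */
static bool tasks_share_pid_ns(const struct task *tasks, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (tasks[i].ns != tasks[0].ns)
            return false;
    }
    return true;
}
```

With T1=(5592,1) in the child pid_ns and T2=(5594,3,1) in the
grandchild, option 1 leaves the child pid_ns flagged uncheckpointable
as soon as the grandchild is created, and option 2 rejects a
checkpoint set containing both T1 and T2. Neither prevents the clone
itself.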

-serge