Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

From: Daniel Lezcano
Date: Mon Oct 19 2009 - 16:35:06 EST


Sukadev Bhattiprolu wrote:
Daniel Lezcano [daniel.lezcano@xxxxxxx] wrote:
Sukadev Bhattiprolu wrote:
Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call

To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed. When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.

This patchset implements a new system call, clone3() that lets a process
specify the pids of the child process.

Patches 1 through 7 are helper patches, needed for choosing a pid for the
child process.

PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
documentation on the new system call, some/all of which will eventually
go into a man page.
Sorry for jumping into the discussion so late; my remarks may well be
pointless...

If this syscall is only for checkpoint / restart, why shouldn't this be
handled by a future generic sys_restart syscall ?

As I tried to explain in PATCH 0/9, the ability to choose a pid is only
for C/R but we are also trying to extend clone-flags so we won't need yet
another variant of clone() fairly soon.

Otherwise, wouldn't it be more convenient to have something usable by
everyone, let's say:

cloneat(pid_t pid, pid_t desiredpid, ...);

Where 'desiredpid' is a hint to the kernel for the pid to be
allocated (zero means the kernel will choose one for us) and the newly
allocated task is the child of 'pid'.

Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
P2 ? I did not know we had a requirement for that. Can you explain the
use-case more ? IOW, why can't P2 create the child P3 by itself ?
I forgot to mention a constraint on the specified pid: P2 has to be a child of P1.
In other words, you cannot pass cloneat a pid that is neither yourself nor one of your descendants.
With this constraint I think there are no security issues.
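
A minimal user-space sketch of that constraint, for illustration only. It is not kernel code and not part of the patchset: 'struct task_model', find_task() and cloneat_target_allowed() are hypothetical names, and the process tree is modelled as a plain array of (pid, ppid) pairs. The check walks from the target up through its parents and accepts only if it reaches the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model of one process: its pid and its parent's pid.
 * ppid == 0 marks the root of the modelled tree. */
struct task_model {
    pid_t pid;
    pid_t ppid;
};

/* Linear lookup of a pid in the modelled process table. */
static const struct task_model *find_task(const struct task_model *tasks,
                                          size_t n, pid_t pid)
{
    for (size_t i = 0; i < n; i++)
        if (tasks[i].pid == pid)
            return &tasks[i];
    return NULL;
}

/* Returns true if 'target' is 'caller' itself or a descendant of it.
 * The walk is bounded by n + 1 steps so a malformed table with a
 * parent cycle cannot loop forever. */
static bool cloneat_target_allowed(const struct task_model *tasks, size_t n,
                                   pid_t caller, pid_t target)
{
    pid_t cur = target;

    for (size_t steps = 0; steps <= n; steps++) {
        if (cur == caller)
            return true;

        const struct task_model *t = find_task(tasks, n, cur);
        if (t == NULL || t->ppid == 0)
            return false;   /* unknown pid, or reached the root */

        cur = t->ppid;      /* climb one level towards the root */
    }
    return false;
}
```

With a table such as {1,0}, {2,1}, {3,2}, {4,1}, pid 1 may target pid 3 (a grandchild) and pid 2 may target itself, but pid 2 may not target pid 4 (a sibling) and pid 3 may not target pid 1 (an ancestor).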

As for forking on behalf of another process, we can consider that it is up to the caller / programmer to know what they are doing. If a process in the process hierarchy has exec'ed a program, we cloneat that process, and the program then fails because of an "unexpected error", well, we should not have done that. A similar case is when IPC objects are removed while other processes are still using them.

Here is an interesting use case:
* if you created a pid namespace and, let's say, booted a system container where the container init is the "init" process, then with this call you can enter the container at any time by doing a cloneat() followed by an exec of your command. I think that was a requirement when there were discussions around "sys_hijack".

Another point: it is another way to extend the exhausted clone flags, since cloneat can be called in a backward-compatible way with cloneat(getpid(), 0, ... )

Note also that 'desiredpid' must be a list of pids (one for each pid
namespace that the child will belong to) and hence we need 'nr_pids'
to specify the length of the list. Given that we are limited to 6
parameters to the syscall, such parameters must be stuffed into
'struct clone_args'.
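
To illustrate the packing described above, here is a hedged user-space sketch. The layout is an assumption made for this example and is not the 'struct clone_args' from the patchset: 'struct clone_args_model', MODEL_MAX_PID_NS_LEVEL and clone_args_model_valid() are hypothetical names. It also assumes that a 0 entry means "let the kernel choose a pid at that level", mirroring the 'desiredpid' hint above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Assumed cap on pid-namespace nesting, chosen for this sketch only. */
#define MODEL_MAX_PID_NS_LEVEL 32

/* Hypothetical argument block: the pid list and its length travel in
 * one structure, so they occupy a single syscall parameter slot. */
struct clone_args_model {
    uint32_t nr_pids;           /* number of entries in desired_pids */
    const pid_t *desired_pids;  /* one requested pid per namespace level */
};

/* Sanity-check the block the way a syscall entry point might:
 * a bounded count, a non-NULL list when the count is non-zero, and no
 * negative pids. A 0 entry is accepted (assumed to mean "kernel chooses"). */
static bool clone_args_model_valid(const struct clone_args_model *args)
{
    if (args == NULL)
        return false;
    if (args->nr_pids == 0)
        return true;            /* no pids requested at all */
    if (args->nr_pids > MODEL_MAX_PID_NS_LEVEL || args->desired_pids == NULL)
        return false;

    for (uint32_t i = 0; i < args->nr_pids; i++)
        if (args->desired_pids[i] < 0)
            return false;
    return true;
}
```

Keeping the count next to the list lets a single pointer argument describe any nesting depth, which is the point of moving these fields out of the syscall's own parameter list.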

So we should do something like:

sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
pid_t *desired_pids)

or (to match the name and parameters, move 'pid' parameter into clone_args)
Well, hiding multiple clones in one clone call is ... weird. AFAIR, there was a debate about whether the proctree should be created from the kernel or from userspace, but with this call it looks like it is done from the kernel.

I don't really see a difference from sys_restart(pid_t pid, int fd, long flags), where pid is the topmost process in the hierarchy, fd is a file descriptor to a structure "pid_t * + struct clone_args *" and flags is "PROCTREE".

IMHO, it is nicer to restore the process tree recursively for the nested pid namespaces; that would really be a userspace process tree creation, and cloneat will be your friend here :)

That looks more consistent with the "<syscall>at" family, 'openat',
'faccessat', 'readlinkat', etc ... and usable for something else than
the checkpoint / restart.

The subtle difference though is that openat() does not open a file on
behalf of another process and so the 'at' suffix would not apply ?
Yes and no, depending on where you draw the line. If you take the 'at' suffix to mean a process context, then I agree with you: there is a difference, because cloneat acts outside the current process context. But if you take the 'at' suffix to mean a context in general, so that openat means "relative to a file descriptor" and cloneat means "relative to a pid namespace", then the 'at' suffix may apply. But I agree that we are so used to calling the POSIX "fork" that cloneat sounds scary :)

Thanks
-- Daniel