Re: [RFC v17][PATCH 17/60] pids 7/7: Define clone_with_pids syscall

From: Sukadev Bhattiprolu
Date: Tue Jul 28 2009 - 20:42:32 EST


Ccing Oleg Nesterov, Eric Biederman, Mike Waychinson, Ying Han

Note that this is a variant of an earlier clone_with_pids() interface, sent
in Mar 2009 (http://lkml.org/lkml/2009/3/13/359). Linus' major objection
was about security, as unprivileged tasks could read /var/run data and try
to run with cached pids.

This variant addresses that concern by requiring CAP_SYS_ADMIN to specify
pids. This makes sense since CAP_SYS_ADMIN is required to create a new
pid namespace anyway.

Sukadev

Oren Laadan [orenl@xxxxxxxxxxx] wrote:
| From: Sukadev Bhattiprolu <sukadev@xxxxxxxxxxxxxxxxxx>
|
| Container restart requires that a task have the same pid it had when it was
| checkpointed. When containers are nested the tasks within the containers
| exist in multiple pid namespaces and hence have multiple pids to specify
| during restart.
|
| clone_with_pids(), intended for use during restart, is the same as clone(),
| except that it takes a 'target_pid_set' paramter. This parameter lets caller
| choose specific pid numbers for the child process, in the process's active
| and ancestor pid namespaces. (Descendant pid namespaces in general don't
| matter since processes don't have pids in them anyway, but see comments
| in copy_target_pids() regarding CLONE_NEWPID).
|
| Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
| prevent unprivileged processes from misusing this interface.
|
| Call clone_with_pids as follows:
|
| pid_t pids[] = { 0, 77, 99 };
| struct target_pid_set pid_set;
|
| pid_set.num_pids = sizeof(pids) / sizeof(int);
| pid_set.target_pids = &pids;
|
| syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);
|
| If a target-pid is 0, the kernel continues to assign a pid for the process in
| that namespace. In the above example, pids[0] is 0, meaning the kernel will
| assign next available pid to the process in init_pid_ns. But kernel will assign
| pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
| 77 or 99 are taken, the system call fails with -EBUSY.
|
| If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
| the system call fails with -EINVAL.
|
| Its mostly an exploratory patch seeking feedback on the interface.
|
| NOTE:
| Compared to clone(), clone_with_pids() needs to pass in two more
| pieces of information:
|
| - number of pids in the set
| - user buffer containing the list of pids.
|
| But since clone() already takes 5 parameters, use a 'struct
| target_pid_set'.
|
| TODO:
| - Gently tested.
| - May need additional sanity checks in do_fork_with_pids().
|
| Changelog[v3]:
| - (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
| in the target_pids[] list and setting it 0. See copy_target_pids()).
| - (Oren Laadan) Specified target pids should apply only to youngest
| pid-namespaces (see copy_target_pids())
| - (Matt Helsley) Update patch description.
|
| Changelog[v2]:
| - Remove unnecessary printk and add a note to callers of
| copy_target_pids() to free target_pids.
| - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
| - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
| 'num_pids == 0' (fall back to normal clone()).
| - Move arch-independent code (sanity checks and copy-in of target-pids)
| into kernel/fork.c and simplify sys_clone_with_pids()
|
| Changelog[v1]:
| - Fixed some compile errors (had fixed these errors earlier in my
| git tree but had not refreshed patches before emailing them)
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@xxxxxxxxxxxxxxxxxx>
| ---
| arch/x86/include/asm/syscalls.h | 2 +
| arch/x86/include/asm/unistd_32.h | 1 +
| arch/x86/kernel/entry_32.S | 1 +
| arch/x86/kernel/process_32.c | 21 +++++++
| arch/x86/kernel/syscall_table_32.S | 1 +
| kernel/fork.c | 108 +++++++++++++++++++++++++++++++++++-
| 6 files changed, 133 insertions(+), 1 deletions(-)
|
| diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
| index 372b76e..df3c4a8 100644
| --- a/arch/x86/include/asm/syscalls.h
| +++ b/arch/x86/include/asm/syscalls.h
| @@ -40,6 +40,8 @@ long sys_iopl(struct pt_regs *);
|
| /* kernel/process_32.c */
| int sys_clone(struct pt_regs *);
| +int sys_clone_with_pids(struct pt_regs *);
| +int sys_vfork(struct pt_regs *);
| int sys_execve(struct pt_regs *);
|
| /* kernel/signal.c */
| diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
| index 732a307..f65b750 100644
| --- a/arch/x86/include/asm/unistd_32.h
| +++ b/arch/x86/include/asm/unistd_32.h
| @@ -342,6 +342,7 @@
| #define __NR_pwritev 334
| #define __NR_rt_tgsigqueueinfo 335
| #define __NR_perf_counter_open 336
| +#define __NR_clone_with_pids 337
|
| #ifdef __KERNEL__
|
| diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
| index c097e7d..c7bd1f6 100644
| --- a/arch/x86/kernel/entry_32.S
| +++ b/arch/x86/kernel/entry_32.S
| @@ -718,6 +718,7 @@ ptregs_##name: \
| PTREGSCALL(iopl)
| PTREGSCALL(fork)
| PTREGSCALL(clone)
| +PTREGSCALL(clone_with_pids)
| PTREGSCALL(vfork)
| PTREGSCALL(execve)
| PTREGSCALL(sigaltstack)
| diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
| index 59f4524..9965c06 100644
| --- a/arch/x86/kernel/process_32.c
| +++ b/arch/x86/kernel/process_32.c
| @@ -443,6 +443,27 @@ int sys_clone(struct pt_regs *regs)
| return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
| }
|
| +int sys_clone_with_pids(struct pt_regs *regs)
| +{
| + unsigned long clone_flags;
| + unsigned long newsp;
| + int __user *parent_tidptr;
| + int __user *child_tidptr;
| + void __user *upid_setp;
| +
| + clone_flags = regs->bx;
| + newsp = regs->cx;
| + parent_tidptr = (int __user *)regs->dx;
| + child_tidptr = (int __user *)regs->di;
| + upid_setp = (void __user *)regs->bp;
| +
| + if (!newsp)
| + newsp = regs->sp;
| +
| + return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
| + child_tidptr, upid_setp);
| +}
| +
| /*
| * sys_execve() executes a new program.
| */
| diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
| index d51321d..879e5ec 100644
| --- a/arch/x86/kernel/syscall_table_32.S
| +++ b/arch/x86/kernel/syscall_table_32.S
| @@ -336,3 +336,4 @@ ENTRY(sys_call_table)
| .long sys_pwritev
| .long sys_rt_tgsigqueueinfo /* 335 */
| .long sys_perf_counter_open
| + .long ptregs_clone_with_pids
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 64d53d9..29c66f0 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -1336,6 +1336,97 @@ struct task_struct * __cpuinit fork_idle(int cpu)
| }
|
| /*
| + * If user specified any 'target-pids' in @upid_setp, copy them from
| + * user and return a pointer to a local copy of the list of pids. The
| + * caller must free the list, when they are done using it.
| + *
| + * If user did not specify any target pids, return NULL (caller should
| + * treat this like normal clone).
| + *
| + * On any errors, return the error code
| + */
| +static pid_t *copy_target_pids(void __user *upid_setp)
| +{
| + int j;
| + int rc;
| + int size;
| + int unum_pids; /* # of pids specified by user */
| + int knum_pids; /* # of pids needed in kernel */
| + pid_t *target_pids;
| + struct target_pid_set pid_set;
| +
| + if (!upid_setp)
| + return NULL;
| +
| + rc = copy_from_user(&pid_set, upid_setp, sizeof(pid_set));
| + if (rc)
| + return ERR_PTR(-EFAULT);
| +
| + unum_pids = pid_set.num_pids;
| + knum_pids = task_pid(current)->level + 1;
| +
| + if (!unum_pids)
| + return NULL;
| +
| + if (unum_pids < 0 || unum_pids > knum_pids)
| + return ERR_PTR(-EINVAL);
| +
| + /*
| + * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
| + * and set it to 0. This last entry in target_pids[] corresponds to the
| + * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
| + * specified. If CLONE_NEWPID was not specified, this last entry will
| + * simply be ignored.
| + */
| + target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
| + if (!target_pids)
| + return ERR_PTR(-ENOMEM);
| +
| + /*
| + * A process running in a level 2 pid namespace has three pid namespaces
| + * and hence three pid numbers. If this process is checkpointed,
| + * information about these three namespaces are saved. We refer to these
| + * namespaces as 'known namespaces'.
| + *
| + * If this checkpointed process is however restarted in a level 3 pid
| + * namespace, the restarted process has an extra ancestor pid namespace
| + * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
| + *
| + * During restart, the process requests specific pids for its 'known
| + * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
| + *
| + * Since the requested-pids correspond to 'known namespaces' and since
| + * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
| + * namespaces', copy requested pids to the back-end of target_pids[]
| + * (i.e before the last entry for CLONE_NEWPID mentioned above).
| + * Any entries in target_pids[] not corresponding to a requested pid
| + * will be set to zero and kernel assigns a pid in those namespaces.
| + *
| + * NOTE: The order of pids in target_pids[] is oldest pid namespace to
| + * youngest (target_pids[0] corresponds to init_pid_ns). i.e.
| + * the order is:
| + *
| + * - pids for 'unknown-namespaces' (if any)
| + * - pids for 'known-namespaces' (requested pids)
| + * - 0 in the last entry (for CLONE_NEWPID).
| + */
| + j = knum_pids - unum_pids;
| + size = unum_pids * sizeof(pid_t);
| +
| + rc = copy_from_user(&target_pids[j], pid_set.target_pids, size);
| + if (rc) {
| + rc = -EFAULT;
| + goto out_free;
| + }
| +
| + return target_pids;
| +
| +out_free:
| + kfree(target_pids);
| + return ERR_PTR(rc);
| +}
| +
| +/*
| * Ok, this is the main fork-routine.
| *
| * It copies the process, and if successful kick-starts
| @@ -1352,7 +1443,7 @@ long do_fork_with_pids(unsigned long clone_flags,
| struct task_struct *p;
| int trace = 0;
| long nr;
| - pid_t *target_pids = NULL;
| + pid_t *target_pids;
|
| /*
| * Do some preliminary argument and permissions checking before we
| @@ -1386,6 +1477,17 @@ long do_fork_with_pids(unsigned long clone_flags,
| }
| }
|
| + target_pids = copy_target_pids(pid_setp);
| +
| + if (target_pids) {
| + if (IS_ERR(target_pids))
| + return PTR_ERR(target_pids);
| +
| + nr = -EPERM;
| + if (!capable(CAP_SYS_ADMIN))
| + goto out_free;
| + }
| +
| /*
| * When called from kernel_thread, don't do user tracing stuff.
| */
| @@ -1453,6 +1555,10 @@ long do_fork_with_pids(unsigned long clone_flags,
| } else {
| nr = PTR_ERR(p);
| }
| +
| +out_free:
| + kfree(target_pids);
| +
| return nr;
| }
|
| --
| 1.6.0.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/