Re: [PATCH] procfs: Do not release pid_ns->proc_mnt too early

From: Louis Rilling
Date: Mon Jun 21 2010 - 10:14:28 EST


On 21/06/10 5:58 -0700, Eric W. Biederman wrote:
> Louis Rilling <Louis.Rilling@xxxxxxxxxxx> writes:
>
> > On 18/06/10 18:27 +0200, Oleg Nesterov wrote:
> >> On 06/18, Louis Rilling wrote:
> >> >
> >> > On 17/06/10 23:36 +0200, Oleg Nesterov wrote:
> >> > > On 06/17, Eric W. Biederman wrote:
> >> > > >
> >> > > > The task->children isn't changed until __unhash_process() which runs
> >> > > > after flush_proc_task().
> >> > >
> >> > > Yes. But this is only the current implementation detail.
> >> > > It would be nice to cleanup the code so that EXIT_DEAD tasks are
> >> > > never sit in ->children list.
> >> > >
> >> > > > So we should be able to come up with
> >> > > > a variant of do_wait() that zap_pid_ns_processes can use that does
> >> > > > what we need.
> >> > >
> >> > > See above...
> >> > >
> >> > > Even if we modify do_wait() or add the new variant, how the caller
> >> > > can wait for EXIT_DEAD tasks? I don't think we want to modify
> >> > > release_task() to do __wake_up_parent() or something similar.
> >> >
> >> > Indeed, I was thinking about calling __wake_up_parent() from release_task()
> >> > once parent->children becomes empty.
> >> >
> >> > Not sure about the performance impact though. Maybe some WAIT_NO_CHILDREN flag
> >> > in parent->signal could limit it. But if EXIT_DEAD children are removed from
> >> > ->children before release_task(), I'm afraid that this becomes impossible.
> >>
> >> Thinking more, even the current do_wait() from zap_pid_ns_processes()
> >> is not really good. Suppose that some none-init thread is ptraced, then
> >> zap_pid_ns_processes() will hange until the tracer does do_wait() or
> >> exits.
> >
> > Is this really a bad thing? If somebody ptraces a task in a pid namespace, that
> > sounds reasonable to have this namespace (and it's init task) pinned.
>
> Louis. Have you seen this problem hit without my setns patch?

Yes. I hit it with Kerrighed patches. I also have an ugly reproducer on
2.6.35-rc3 (see attachments). Ugly because I introduced artifical delays
in release_task(). I couldn't trigger the bug without it, probably because the
scheduler is too kind :)

I'm using memory poisoining (SLAB and DEBUG_SLAB) to make it easy to observe the
bug.

Example:
# ./proc_flush_task-bug-reproducer 1

>
> I'm pretty certain that this hits because there are processes do_wait
> does not wait for, in particular processes in a disjoint process tree.

Indeed do_wait() misses EXIT_DEAD children.

>
> So at this point I am really favoring killing the do_wait and making
> this all asynchronous.

Any idea about how to do it?

Thanks,

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <signal.h>
#include <linux/sched.h>

int pipefd[2];

int init(void *arg)
{
int nr, i, err;
sighandler_t sigret;
char c;

close(pipefd[0]);

err = setsid();
if (err < 0) {
perror("setsid");
abort();
}

sigret = signal(SIGCHLD, SIG_IGN);
if (sigret == SIG_ERR) {
fprintf(stderr, "signal\n");
abort();
}

nr = atoi(arg);
for (i = 0; i < nr; i++) {
err = fork();
if (err < 0) {
perror("fork");
abort();
} else if (err == 0) {
printf("%d before\n", getpid());
fflush(stdout);
pause();
printf("%d after\n", getpid());
fflush(stdout);
return 0;
}
}

err = write(pipefd[1], &c, 1);
if (err != 1) {
perror("write");
abort();
}

pause();

return 0;
}

int main(int argc, char *argv[])
{
long stack_size = sysconf(_SC_PAGESIZE);
void *stack = alloca(stack_size) + stack_size;
pid_t pid;
char c;
int ret;

ret = pipe(pipefd);
if (ret) {
perror("pipe");
abort();
}

ret = clone(init, stack, CLONE_NEWPID | SIGCHLD, argv[1]);
if (ret < 0) {
perror("clone");
abort();
}
pid = ret;
printf("%d\n", pid);
fflush(stdout);

close(pipefd[1]);
ret = read(pipefd[0], &c, 1);
if (ret != 1) {
if (ret) {
perror("read");
abort();
} else {
sleep(5);
}
}

printf("killing %d\n", pid);
fflush(stdout);
ret = kill(-pid, SIGKILL);
if (ret) {
perror("kill");
abort();
}

return 0;
}
commit 7b7cae6ae5c543b8e9cc84fc041d9bce36e7b674
Author: Louis Rilling <louis.rilling@xxxxxxxxxxx>
Date: Wed Jun 16 16:20:02 2010 +0200

proc_flush_task() debug

diff --git a/kernel/exit.c b/kernel/exit.c
index ceffc67..be8cdb0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -169,6 +169,14 @@ repeat:
atomic_dec(&__task_cred(p)->user->processes);
rcu_read_unlock();

+ if (task_pid(p)->level > 0) {
+ if (!thread_group_leader(p) || !is_container_init(p)) {
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule_timeout(10 * HZ);
+ }
+ printk("release_task: %d/%d\n", p->pid, task_pid(p)->numbers[1].nr);
+ }
+
proc_flush_task(p);

write_lock_irq(&tasklist_lock);

Attachment: signature.asc
Description: Digital signature