Re: [PATCH 3/3] rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()

From: Frederic Weisbecker
Date: Wed Dec 07 2022 - 15:03:06 EST


On Tue, Dec 06, 2022 at 05:49:28PM +0100, Oleg Nesterov wrote:
> On 11/30, Eric W. Biederman wrote:
> >
> > 2) I keep thinking zap_pid_ns_processes() should be changed so that
> > after it sends SIGKILL to all of the relevant processes to not wait,
>
> At least I think it should not wait for the tasks injected into this ns.
>
> Because this looks like a kernel bug even if we forget about this deadlock.
>
> Say we create a task P using clone(CLONE_NEWPID), then inject a task T into
> P's pid-namespace via setns/fork. This makes the process P "unkillable", it
> will hang in zap_pid_ns_processes() "forever" until T->parent reaps a zombie
> task T killed by P.

I think this was made that way on purpose, see the comment in
zap_pid_ns_processes():

/*
 * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE
 * process whose parents processes are outside of the pid
 * namespace. Such processes are created with setns()+fork().
 *
 * If those EXIT_ZOMBIE processes are not reaped by their
 * parents before their parents exit, they will be reparented
 * to pid_ns->child_reaper. Thus pidns->child_reaper needs to
 * stay valid until they all go away.
 *
 * The code relies on the pid_ns->child_reaper ignoring
 * SIGCHILD to cause those EXIT_ZOMBIE processes to be
 * autoreaped if reparented.
 *
 * Semantically it is also desirable to wait for EXIT_ZOMBIE
 * processes before allowing the child_reaper to be reaped, as
 * that gives the invariant that when the init process of a
 * pid namespace is reaped all of the processes in the pid
 * namespace are gone.
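
To make the rule in that comment concrete, here is a rough toy model in
userspace C. It is not kernel code: every name in it (toy_task,
toy_init_wait() and so on) is made up for illustration, and it assumes the
simplification that the namespace's init can directly reap any zombie whose
parent lives inside the namespace. The autoreap-on-reparenting path is not
modelled. It only shows why P stays stuck until the outside parent of T
reaps it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; all identifiers are hypothetical, none exist in the kernel. */
enum toy_state { TOY_RUNNING, TOY_ZOMBIE, TOY_REAPED };

struct toy_task {
	enum toy_state state;
	bool parent_in_ns;	/* false: injected from outside via setns()+fork() */
};

/* Models SIGKILL being sent to every task in the namespace. */
static void toy_kill_all(struct toy_task *t, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (t[i].state == TOY_RUNNING)
			t[i].state = TOY_ZOMBIE;
}

/* Models the wait loop: init only sees zombies whose parent is in the ns. */
static void toy_init_wait(struct toy_task *t, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (t[i].state == TOY_ZOMBIE && t[i].parent_in_ns)
			t[i].state = TOY_REAPED;
}

/* Models the outside parent finally calling wait() on its zombie child. */
static void toy_outside_parent_reaps(struct toy_task *t)
{
	if (t->state == TOY_ZOMBIE && !t->parent_in_ns)
		t->state = TOY_REAPED;
}

/* The invariant: init may go away only once every other task is gone. */
static bool toy_init_may_exit(const struct toy_task *t, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (t[i].state != TOY_REAPED)
			return false;
	return true;
}

/* Oleg's scenario: one ordinary descendant of P plus the injected task T. */
static void toy_scenario(void)
{
	struct toy_task tasks[] = {
		{ TOY_RUNNING, true  },	/* ordinary descendant of P */
		{ TOY_RUNNING, false },	/* T, parent outside the namespace */
	};
	size_t n = sizeof(tasks) / sizeof(tasks[0]);

	toy_kill_all(tasks, n);
	toy_init_wait(tasks, n);
	assert(!toy_init_may_exit(tasks, n));	/* P hangs: T is still a zombie */

	toy_outside_parent_reaps(&tasks[1]);
	assert(toy_init_may_exit(tasks, n));	/* now P can be reaped */
}
```

Calling toy_scenario() walks through both steps. The first assert is the
"unkillable" window Oleg describes, the second is the invariant the comment
wants to preserve.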

I can't say I like the fact that a parent not belonging to a new namespace
can create more than one child within that namespace, but anyway this all looks
like an ABI that can't be reverted now.

Thanks.