More waitpid issues with CLONE_DETACHED/CLONE_THREAD

From: Daniel Jacobowitz
Date: Sat Jan 31 2004 - 22:26:28 EST


This may be related to the python bug reported today...

I've been playing around with gdbserver support for NPTL threading all day
today. Right now it works, except that when I say "kill" in the GDB client,
gdbserver hangs. The problem is that we kill the child, and wait for it,
but wait never returns it.

write(2, "Killing inferior\n", 17) = 17
ptrace(PTRACE_CONT, 8454, 0, SIG_0) = 0
tkill(8454, SIGKILL) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
waitpid(8454, 0xbfffec04, WNOHANG) = 0
waitpid(8454, 0xbfffec04, WNOHANG|__WCLONE) = -1 ECHILD (No child processes)
nanosleep({0, 1000000}, 0) = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
setup() = 0
waitpid(8454, 0xbfffec04, WNOHANG) = 0
waitpid(8454, 0xbfffec04, WNOHANG|__WCLONE) = -1 ECHILD (No child processes)

and so on (looping on waitpid). At this point, process 8454 is marked as a
zombie, and nothing can reap it. After gdbserver is killed, it reparents to
init:

Name: linux-dp
State: Z (zombie)
SleepAVG: 91%
Tgid: 8454
Pid: 8454
PPid: 1
TracerPid: 0

but init can't reap it either.

8454 was the original (i.e. non-CLONE_DETACHED) thread. Same behavior if I
use ptrace_kill.

GDB doesn't suffer from the same problem. A little time with strace and I
found out why: GDB PTRACE_KILL's the detached threads, PTRACE_KILL's the
parent thread, waitpid's the detached threads, and then waitpid's the parent
thread. No design, just different order of items on the linked list.

If I change gdbserver to do "kill thread; wait for thread; kill next thread;
wait for next thread; kill parent last; wait for parent last" then it
terminates and I don't get an unkillable zombie.

ptrace(PTRACE_KILL, 18348, 0, 0) = 0
waitpid(18348, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes)
--- SIGCHLD (Child exited) @ 0 (0) ---
waitpid(18348, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18348
ptrace(PTRACE_KILL, 18349, 0, 0) = 0
waitpid(18349, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes)
waitpid(18349, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18349
--- SIGCHLD (Child exited) @ 0 (0) ---
ptrace(PTRACE_KILL, 18350, 0, 0) = 0
waitpid(18350, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes)
waitpid(18350, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18350
ptrace(PTRACE_KILL, 18351, 0, 0) = 0
waitpid(18351, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes)
waitpid(18351, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18351
ptrace(PTRACE_KILL, 18352, 0, 0) = 0
waitpid(18352, 0xbfffec04, WNOHANG) = -1 ECHILD (No child processes)
waitpid(18352, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG|__WCLONE) = 18352
--- SIGCHLD (Child exited) @ 0 (0) ---
ptrace(PTRACE_KILL, 18329, 0, 0) = 0
waitpid(18329, [WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL], WNOHANG) = 18329
exit_group(0) = ?

So it looks like something gets very confused if the parent is SIGKILLed
before the children. What should happen?

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/