Bug in waitpid()

Fri, 1 Nov 1996 03:11:02 +0100 (CET)


While debugging a problem I had with a version of xterm that still has
logging, I discovered the following behaviour:

-Xterm sets a default children reaper, which on a SIGCHLD does a wait and
-When you turn on logging, it forks a process to open a file with certain
permissions. In the meantime the parent does a waitpid for this child.

When the child dies, the waitpid returns, but SIGCHLD is still generated,
and so the reaper calls wait, but of course nothing ever happens, since that
child has already been waited for, so the program hangs.

(all this is under kernel 2.1.6, libc 5.4.7)

Since I have no access to the POSIX standards, I am not sure what the correct
semantics are, but from books I get the impression that SIGCHLD means
something more like "a child that still has to be waited for died" rather than
"a child died". If my interpretation is correct, that would indicate a
kernel bug. The other interpretation would be very annoying: you would be
forced to temporarily turn off the SIGCHLD handler, but other children could
exit before you started the wait, so you could miss exits. On the other hand,
POSIX is definitely not afraid of annoying semantics and race conditions.

The relevant part of an strace:
[pid 612] fork() = 620
[pid 612] wait4(620, <unfinished ...> <- parent waits for child
[pid 620] setgid(0) = 0
[pid 620] setuid(0) = 0
[pid 620] open("XtermLog.a00612", O_WRONLY|O_APPEND|O_CREAT, 0644) = 9
[pid 620] close(9) = 0
[pid 620] _exit(0) = ?
[pid 612] <... wait4 resumed> NULL, 0, NULL) = 620 <- parent gets child return (this is really a waitpid(pid, NULL, 0))
[pid 612] --- SIGCHLD (Child exited) --- (SIGCHLD is generated)
[pid 612] wait4(-1, (reaper waits and never returns)
(this is a wait(NULL))