wait4() problem

Raul Miller (rdm@tad.micro.umn.edu)
18 Mar 1996 03:52:27 GMT


I've been trying to get lpd to work on my machine, without success:
it drops into a deadlock.

Now, I'm reconfiguring lpd, and I've written my own output filter, so
there's ample opportunity for pilot error here. Also, I've had
problems with strace, so maybe my data is suspect. However, the gross
behavior seems to be the same with or without strace. And what
strace shows sure looks like a kernel bug.

At this point, I've got lpd compiled with -g, with a sleep (8, 7, and
6 seconds for the three forks respectively) on the child side of every
fork so strace won't lose track (and so I can go in with gdb if I feel
like it). With the sleeps, there's no significant race. The failure
mode seems to be the same with my hacked lpd as with the unmodified
one, so I think I'm on the right track here.

Here's what I think are the relevant excerpts from an strace.

[pid 1615] fork() = 1618
...
[pid 1615] write(8, "\31\1", 2) = 2
[pid 1615] wait4(-1, <unfinished ...>

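This is the last sign of life I see from 1615; it never comes back
from that wait4. (The source calls wait3(), which libc implements via
the wait4 system call, hence the name in the trace.) Here's the source
code for the spot where it hangs: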

        write(ofd, "\031\1", 2);
        while ((pid =
            wait3((int *)&status, WUNTRACED, 0)) > 0 && pid != ofilter)
                ;

[pid 1618] fork() = 1619
...
[pid 1618] kill(1618, SIGSTOP) = 0
[pid 1618] --- SIGSTOP (Stopped (signal)) ---
[pid 1618] read(0, <unfinished ...>

1618's fd 0 is the other side of a pipe from 1615's fd 8, so of course
1618 stalls here.

What I'm wondering is: why didn't 1615 wake up when 1618 stopped
itself? And why did 1618 wake up (it goes on to issue the read)?
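
A minimal stand-alone test of the behavior I'd expect might look like
the sketch below. It is not taken from lpd: parent_sees_child_stop()
is a made-up name, and it uses waitpid(-1, ..., WUNTRACED), which waits
for the same events as wait3(..., WUNTRACED, 0) but does not return
resource usage. The child stops itself exactly as the filter does in
the trace, and the parent runs the same loop shape as the lpd source
above.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of lpd.
 * Forks a child that sends itself SIGSTOP, then waits for it with
 * WUNTRACED.  Returns 1 if the wait reported the child as stopped by
 * SIGSTOP, 0 if it reported something else, and -1 on error.
 */
int parent_sees_child_stop(void)
{
    int status = 0;
    int stopped;
    pid_t child, pid;

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Child: the same call the filter makes in the trace. */
        kill(getpid(), SIGSTOP);
        /* Only reached once the parent sends SIGCONT. */
        _exit(0);
    }

    /* Parent: same loop shape as lpd, waiting for one specific child. */
    while ((pid = waitpid(-1, &status, WUNTRACED)) > 0 && pid != child)
        ;
    if (pid != child)
        return -1;

    stopped = WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP;

    /* Clean up: let the child run to its exit, then reap it. */
    kill(child, SIGCONT);
    waitpid(child, &status, 0);

    return stopped ? 1 : 0;
}
```

Calling this from a small main() and printing the result, both with and
without strace -f, would show whether the parent is woken in the simple
case. A result of 1 outside strace would suggest the basic WUNTRACED
path works and that the hang depends on something more specific to the
lpd setup or to tracing.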

Kernel is 1.3.75 (with ncp support turned on and in use).

libc is the debian libc5-5.2.18-1

gcc is the debian gcc-2.7.2-5

I can supply more configuration details if anyone wants, including the
source to my output filter if anyone thinks that's important.

Wondering whether the problem is triggered by the filter forking a
child, I just rewrote the output filter so it doesn't fork the child
until it needs it. [This is cleaner, too.] The problem still occurs:

[pid 2128] fork() = 2131
...
[pid 2128] open("dfA004Aa01477", O_RDONLY) = 9
[pid 2128] write(8, "\31\1", 2) = 2
[pid 2128] wait4(-1, <unfinished ...>

Again, this is the last sign of activity from this fork of lpd.

[pid 2131] personality(0) = 0
[pid 2131] read(0, "\31\1", 8192) = 2
[pid 2131] getpid() = 2131
[pid 2131] kill(2131, SIGSTOP) = 0
[pid 2131] --- SIGSTOP (Stopped (signal)) ---
[pid 2131] read(0,

And, of course, it hangs here. The only visible difference is that,
now that the filter has no child process of its own, this is the last
line that strace displays (unless I interrupt things manually).
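
For reference, the filter side of the handshake seen in the traces
(read "\31\1" from fd 0, then kill(getpid(), SIGSTOP)) can be sketched
roughly as follows. This is not the source of my filter:
find_stop_request() and filter_loop() are made-up names, the sketch
assumes the two request bytes always arrive in the same read(), and it
ignores short writes.

```c
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: offset of the "\031\1" request in buf, or -1. */
long find_stop_request(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i++)
        if (buf[i] == '\031' && buf[i + 1] == '\1')
            return (long)i;
    return -1;
}

/*
 * Hypothetical filter body: copy fd `in` to fd `out`, and stop the
 * process each time the "\031\1" request is seen.  Execution resumes
 * after the kill() once the parent sends SIGCONT.
 * Returns 0 at end of input, -1 on a read or write error.
 */
int filter_loop(int in, int out)
{
    unsigned char buf[8192];
    ssize_t n;

    while ((n = read(in, buf, sizeof buf)) > 0) {
        size_t len = (size_t)n, pos = 0;
        long at;

        while ((at = find_stop_request(buf + pos, len - pos)) >= 0) {
            /* Flush everything before the request, then stop. */
            if (write(out, buf + pos, (size_t)at) < 0)
                return -1;
            kill(getpid(), SIGSTOP);
            pos += (size_t)at + 2;   /* skip the two request bytes */
        }
        if (write(out, buf + pos, len - pos) < 0)
            return -1;
    }
    return n < 0 ? -1 : 0;
}
```

A real filter would call something like filter_loop(0, 1) from main().
The point of the sketch is only that the read() after the SIGSTOP
should not be reached until the stop has been noticed by lpd and
answered with SIGCONT.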

-- 
Raul