select() not returning though pipe became readable

From: Lutz Vieweg
Date: Thu Mar 24 2005 - 10:54:17 EST


Hi everyone,

I'm currently investigating the following problem, which seems to indicate
a misbehaviour of the kernel:

A server program we implemented has been sporadically "hanging" in a select()
call ever since we upgraded from kernel 2.4 to (currently) 2.6.9. (We have to
wait for 2.6.12 before we can upgrade again, because of the
shared-mem-not-dumped-into-core-files problem addressed there.)

What's suspicious is that whenever we attach gdb to such a hanging process,
we can see that a pipe whose file descriptor is definitely included in the
fd_set "readfds" (and "n" is also high enough) has a byte available for
reading. Simply detaching gdb again is enough to let the server continue just
fine.
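
For illustration, the check we did by hand in gdb could also be sketched in
code. This is only a rough sketch, not code from our server: the helper names
pending_bytes(), select_says_readable() and readiness_mismatch() are made up
for this mail, and it assumes that ioctl(FIONREAD) is supported on pipes (it is
on Linux). It compares the number of bytes queued in the pipe with what a
zero-timeout select() reports for the same descriptor.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <sys/select.h>
#include <unistd.h>

/* Number of unread bytes queued on fd, or -1 on error.
 * Assumption: FIONREAD works on pipes (true on Linux). */
static int pending_bytes(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return n;
}

/* Ask select() with a zero timeout whether fd is readable right now.
 * Returns 1 if readable, 0 if not, -1 on error. */
static int select_says_readable(int fd)
{
    fd_set readfds;
    struct timeval zero = { 0, 0 };
    int rc;

    assert(fd >= 0 && fd < FD_SETSIZE);   /* FD_SET is undefined otherwise */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    rc = select(fd + 1, &readfds, NULL, NULL, &zero);
    if (rc == -1)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}

/* 1 if data is queued but select() does not report the fd as readable. */
static int readiness_mismatch(int fd)
{
    return pending_bytes(fd) > 0 && select_says_readable(fd) == 0;
}
```

Note that a fresh zero-timeout select() is not the same as a select() that is
already sleeping, so this only compares the two views of the pipe at one point
in time; it does not by itself reproduce the hang.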

We use that pipe, which is known only to this one process, to make select()
return immediately when a signal (SIGUSR1) has been delivered to the process
by another process. A signal handler is installed that does nothing but a
(non-blocking) write of 1 byte to the writing end of the pipe.

This mechanism worked fine before kernel 2.6, and it still works in 99.99% of
cases, but under heavy load, every few hours, we see the hanging select()
described above.

I noticed a recent thread on lkml about poll() and pipes, but that seems to
address a different issue, where more events are reported than occurred. What
we see is quite the opposite: select() does not return even though the pipe
has become readable.

Any ideas?
Any hints on what to do to investigate the problem further?

Regards,

Lutz Vieweg

