Re: Strange 'zombie' problem both in 2.4 and 2.6

From: Denis Vlasenko
Date: Thu Apr 01 2004 - 11:11:37 EST


> > > It looks like at some moment kernel looses the abitily to inform
> > > process that their threads are over. AKAIK, this is done by SIGCHLD?
> > > Anyway, manual sending SIGCHLD to the parent of zombies does not help.
> >
> > Did you try stracing parent process? It can receive SIGCHLD but
> > ignore/mishandle it.
>
> I tried to use strace -f, so all threads exist in the output. No signals
> arrive, expect those send manually by kill().
> Stracing same binary on another host shows that SIGRT_1 arrives to the
> parent.
> I may send the strace logs, but they are somewhat large.
> So kernel really stops devivering signals.

Post reasonably small pieces of them.

> As far as I understand, in case of threads SIGRT_1 is used instead of
> SIGCHLD.
> So I tried to send SIGRT_1 to the parent manually. And zombies disappeared!
> However, new zombies appear soon. They may still be removed by manual
> SIGRT_1, but it is not a solution for a kernel bug :).

Maybe. Maybe not. I am no expert, I'd try to learn out how SIGRT_1
is generated in normal case (I suppose kernel does not distinguish
between threads and processes, maybe it's done by threading libs?)

> > Probably they get reparented to init and it wait()'s for them,
> > ending their afterlife. So SIGCHLD works (at least in this case).
>
> Seems that signal passing works only after reparenting zombies.
>
> > > Unfortunately, this did not eliminate the problem: it happened today
> > > again. The difference is that when running in 2.6, most binaries use
> > > NPTL libs from /lib/i686/cmov/, and seem not to be affected by the
> > > problem (i.e. no zombies from them). However, users need to run some
> > > statically-linked binaries (without source available) that have
> > > non-NPTL libs statically linked and so still use linuxthreads; those
> > > are affected (i.e. do create zombies). So problem is not rendering
> > > server unusable (so it no longer that critical), but it still exists
> > > in the 2.6 kernel.
> >
> > Sounds like userspace problem in threading libraries.
> > What version of glibc/linuxthreads was in use before?
> > Maybe post your report on linuxthreads mailing list.
>
> I doubt it is a userspace problem.
> It happens with the same userspace libs and binaries (or even same running
> processes) with which it did not happen sometime ago.
> It happens at the same moment with different processes running from
> different accounts.
> Restarting processes doesn't help.
> It is not reprodusable on other hosts.
> Manual signal send (kill -33 <parentpid>) removes already existing zombies.
> I can hardly imagine a userspace problem that behaves like this.

I won't argue. One thing is clear: not enough info at this time :(

Try to instrument (printk("...")) parts of kernel responsible for
handling exit() etc.
--
vda
>
> Nikita

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/