kerneld/multicast bug (tickled by gated)

Brian Candler (B.Candler@pobox.com)
Sun, 15 Jun 1997 21:16:51 +0800 (SGT)


Greetings,

I am writing this from Malaysia, where I am very privileged to be working
alongside Sue Hares (of gated fame) and other experts at the Inet'97
developing countries workshop.

This year, for a number of reasons, we have decided to use Linux as the
teaching platform for Track 1 (as opposed to FreeBSD in previous years). We
have a lab of 23 Linux boxes, running Red Hat 4.1 + some packages from 4.2 +
a custom 2.0.30 kernel with Multicast enabled. A nice side effect is that I
have been able to persuade Sue to make gated compile and run cleanly under
Linux :-)

In running gated on this network, we have discovered a Linux kernel/kerneld
bug, and I wonder if anyone on this list might be able to shed some light on
it (or even propose a fix).

The symptoms are as follows:

1. gated sometimes hangs.

2. When it is in the hung state, it can still be made to dump core
('gdc COREDUMP'), and gdb shows that it was waiting in setsockopt() for
an IP_ADD_MEMBERSHIP request.

3. At the same time as gated hangs, an extra process appears in the process
table:
    request-route <zombie>
which is a child of kerneld.

4. If you kill kerneld, gated suddenly wakes up again (if you didn't make
it dump core first, that is).

Our workaround for the workshop is simple: we run the whole system without
kerneld, and everything is fine. Perhaps we could instead delete or rename
/sbin/request-route. It would be nice to get to the root of this problem,
though, and for me there are a number of open questions:

- why is the kernel telling kerneld to invoke a userland routing script
when you change the interface multicast group list?

- why is kerneld not reaping its child?

- why is setsockopt blocking on kerneld?

Red Hat 4.1 has the modules-2.0.0 package. As far as I can see, kerneld
forks before execlp()ing the script, so I have no idea how this can cause
the blocking.
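
For illustration, the generic fork / execlp() / reap pattern can be sketched
as below. This is not kerneld's code: run_helper() is a hypothetical name,
and it waits synchronously for simplicity, whereas a daemon would more
likely reap its children from a SIGCHLD handler or with WNOHANG. The point
is only that a child which has exited but has not been collected with
wait() or waitpid() stays in the process table as a zombie.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` (looked up via PATH) in a child process
 * and reap it. Returns the child's exit status, or -1 on failure. */
int run_helper(const char *path)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: replace ourselves with the helper program. */
        execlp(path, path, (char *)NULL);
        _exit(127); /* only reached if execlp() failed */
    }
    /* Parent: collect the child's status. Skipping this step is what
     * leaves a <zombie> entry behind once the child exits. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this pattern the parent and child run as separate processes, so a slow
or stuck helper should not by itself block an unrelated caller.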

Thanks for any ideas you have...

Regards,

Brian Candler.