Re: nfsd hang and kernel bug in 2.6.35-rc3

From: Chris Vine
Date: Thu Jun 17 2010 - 06:38:50 EST


On Wed, 16 Jun 2010 20:44:15 -0400
Jeff Layton <jlayton@xxxxxxxxxx> wrote:
[snip]
> I stand corrected then. That's pretty close to the nfsd that I've been
> testing. I pulled down the nfsd init script and the only thing that
> looks substantially different is that it sends signals to nfsd to shut
> it down rather than just running "rpc.nfsd 0". That should work fine,
> however.
>
> Still I think the problem is basically something like what I've
> described. You ended up somehow with sockets on the sv_permsocks list
> that didn't hold lockd references. The way I described is one way that
> could occur. Another seems to be __write_ports_addxprt (which I think
> is clearly broken in light of this)...
>
> The root cause of this however is likely to be related to this
> problem:
>
> > Jun 15 16:07:18 laptop kernel: svc: failed to register lockdv3 RPC service (errno 110).
> > Jun 15 16:07:18 laptop kernel: lockd_up: makesock failed, error=-110
>
> ...which means that the kernel couldn't talk to portmap or rpcbind.
> Maybe it wasn't up at the time? Or a problem with firewalling?

My initial reaction was "of course it is up", but your mention of
portmap sent me investigating, with interesting results. I was going
to say "of course it is up" because the standard start-up script for
nfsd (rc.nfsd) checks whether rpc.portmap and rpc.statd are running,
starts them if they are not, and then starts exportfs, rpc.rquotad,
rpc.nfsd and rpc.mountd.

However, if I start portmap and statd early on, so that they do not
depend on the nfsd start-up script, then nfsd starts fine. So it seems
to be a timing issue, even though they are all started (at user level)
sequentially and from the same thread/process.
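To illustrate the workaround in another way: rather than relying on
start order alone, a start-up script could wait until the portmapper
actually accepts connections before launching rpc.nfsd. This is a
minimal sketch under stated assumptions: wait_for_port is a
hypothetical helper (not part of nfs-utils or any distribution's
rc.nfsd), and it assumes portmap listens on TCP port 111 on localhost.
An accepting TCP socket is only a rough proxy for portmap being ready
to handle the kernel's RPC registrations.

```python
import socket
import time

# Assumption: portmap/rpcbind listens on the standard port 111.
PORTMAP_PORT = 111


def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.2) -> bool:
    """Poll until a TCP connect to (host, port) succeeds.

    Returns True as soon as a connection is accepted, or False once
    `timeout` seconds have elapsed without a successful connect.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Hypothetical readiness check: a completed TCP handshake.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A script would call wait_for_port("127.0.0.1", PORTMAP_PORT) before
starting rpc.nfsd. A False result suggests that lockd registration
is likely to fail with a timeout like the errno 110 in the log above.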

The timing problem does not arise on kernel-2.6.34 and earlier. Nor
does it arise on my Pentium uniprocessor machine with kernel
2.6.35-rc3, so it could well be related to multiple cores/threads. It
looks as if something changed in the kernel in 2.6.35 that provokes
the kernel bug report if the timing is wrong. (If the timing is wrong,
and if this is a deficiency in the user tools rather than in the
kernel, on which I express no view, then I suppose it probably still
needs to be handled more gracefully in the kernel.)

Chris

