NFS Server suspected!

Steven N. Hirsch (shirsch@ibm.net)
Sun, 14 Sep 1997 16:38:49 -0400 (EDT)


All,

Some gathered information on my ongoing NFS problems.

My test setup involves two machines:

cy:

6x86 Cyrix P150+
32M memory
1GB Fast-Wide SCSI drive
Adaptec 2940UW SCSI controller
NetGear 100Base adapter (Tulip 21140 chipset)

amd:

5x86 AMD 133MHz
24M memory
500MB Fast SCSI2 drive
NCR/Symbios clone SCSI controller
NetGear 100Base adapter

BOTH:

kernel 2.1.55 w/ NFS client & server compiled in
Donald Becker's latest beta Tulip driver (0.79)
machines are directly connected with a crossover cable.
libc 5.4.33
binutils 2.8.1.0.1
ld-linux.so.1.9.2
gcc 2.7.2 (w/ specs modified for -fno-strength-reduce)

SETUP:

Kernel sources for 2.1.55 are on cy in /usr/src/linux-2.1.55
All test builds are performed on amd

***************************************************************

Datapoint 1:

To eliminate autofs as a perturbing factor, cy:/ is manually mounted
on amd as /net/cy. A symlink named "linux" is created in amd's
/usr/src directory pointing at /net/cy/usr/src/linux-2.1.55.

Starting with a source tree that has undergone "make mrproper"
(executed on the server, cy), I do

make dep ; make compressed

on amd.

This build will proceed to completion, and produces a working kernel.
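
For anyone wanting to reproduce the Datapoint 1 setup on the client,
a rough sketch follows. It only prints the commands unless
NFS_SETUP_RUN=1 is set (a hypothetical switch added for this sketch).
The mkdir step and the absence of mount options are assumptions; the
options actually used are not recorded here.

```shell
#!/bin/sh
# Sketch of the Datapoint 1 client setup on amd.
# Dry run by default: commands are printed, not executed.
# Set NFS_SETUP_RUN=1 (hypothetical switch) and run as root to execute.
run() {
    if [ "${NFS_SETUP_RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$@"
    fi
}

setup_client() {
    # Assumption: the mount point may not exist yet.
    run mkdir -p /net/cy
    # Manual mount of the server's root, bypassing autofs.
    run mount -t nfs cy:/ /net/cy
    # Symlink "linux" in /usr/src pointing at the NFS-mounted tree.
    run ln -s /net/cy/usr/src/linux-2.1.55 /usr/src/linux
}

setup_client
```

The build is then started from /usr/src/linux on amd.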

***************************************************************

Datapoint 2:

Eliminate the symlink indirection by making
/net/cy/usr/src/linux-2.1.55 the current working directory on the
client (amd).

Now the fun begins!

First, running "make clean" fails when attempting to execute:

rm -f `find modules/ -type f -print`

The error message is from "find". It cheerfully lists all the
symlinks in the modules subdirectory one to a line, each followed by a
complaint that the file does not exist. The directory appears empty
when ls is run manually. Moving to the server, we find a directory
where all these entries exist as dangling symlinks (the target
files were removed in an earlier cleanup step).

To help the process along, I remove the bad links manually at the server
console. A re-run of "make clean" then succeeds.
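
The dangling links can also be located mechanically rather than by
eye. A minimal sketch, meant to be run on the server from the top of
the source tree; the function name list_dangling is made up for
illustration, and this is not necessarily how I removed them:

```shell
#!/bin/sh
# List symlinks under directory $1 whose targets do not exist.
# "test -e" follows symlinks, so it fails for a dangling link and
# the negated -exec lets find print that link.
list_dangling() {
    find "$1" -type l ! -exec test -e {} \; -print
}

# Example (on the server):
#   list_dangling modules/
```

Once the list looks right, the same find expression with
"-exec rm -f {} \;" in place of "-print" removes the links.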

Next, "make dep" is run on the client machine (amd). This completes
normally.

Finally, "make compressed". The actual build fails 100% of the time.
It will chug along for a while, then die with a message of:

foo.o: No space left on device
(standard input): Assembler messages:
(standard input):3186: FATAL: Can't write foo.o: No such file or directory
make[2]: ** [foo.o] Error 1

etc., etc.

The file named in place of "foo" varies from run to run (and
naturally the server isn't really out of space).

A quick ls of the build directory from the client shows that about
half of the subdirectories under the root of the source tree are
inaccessible; ls reports "Stale NFS file handle" for each of them.

If I umount and re-mount the server, they spring back into existence.
However, a restarted build will eventually fail again with the same
symptoms.

I've yet to have one complete.
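
The check for inaccessible subdirectories after a failed build can be
scripted. A small sketch, assuming that ls fails on any entry whose
handle has gone stale; the function name list_inaccessible is
hypothetical:

```shell
#!/bin/sh
# Print every entry directly under directory $1 that "ls" cannot read.
# An entry with a stale NFS file handle makes ls fail, so it is listed.
list_inaccessible() {
    for entry in "$1"/*; do
        # An unmatched glob is left as the literal pattern; skip it.
        [ "$entry" = "$1/*" ] && continue
        ls "$entry" >/dev/null 2>&1 || echo "$entry"
    done
}

# Example (on the client, after a failed build):
#   list_inaccessible /net/cy/usr/src/linux-2.1.55
```

Running "df -k ." in the same directory on the server is a quick way
to confirm that the "No space left on device" message is spurious.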

******************************************************************

Datapoint 3:

Reboot cy under 2.0.31-pre9 with user-space NFS server.

Absolutely zero problems. I can build and rebuild to my heart's
content.

*******************************************************************

Datapoint 4:

Run _both_ machines under 2.0.31-pre9, with the user-space NFS server
on cy and the old NFS client on amd.

Absolutely no problems, but noticeably slower.

********************************************************************

Datapoint 5:

The server is running 2.1.55, and the client (amd) is running
2.0.3x with the original NFS client.

With the server mounted as /net/cy, I cd to /net/cy/tmp and run the
iozone benchmark on "auto". It blows up 100% of the time (at random
points), producing an endless cycle of these messages at the server:

Sep 12 20:23:20 cy kernel: nfs: RPC call returned error 111
Sep 12 20:23:20 cy kernel: RPC: task of released request still queued!
Sep 12 20:23:20 cy kernel: RPC: (task is on xprt_pending)

At the client, I see:

Error writing block nnn
iozone: Connection refused

interspersed with echoed server complaints.

At this point, things are so fouled up that both boxes must be
rebooted to restore operation.
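
On Linux/x86, errno 111 is ECONNREFUSED, which is consistent with the
"Connection refused" seen at the client. To get a feel for how the
server messages above break down, they can be tallied from the log.
A sketch, assuming the kernel messages land in a syslog-format file
such as /var/log/messages (the path is an assumption, and the
function name count_rpc_messages is made up):

```shell
#!/bin/sh
# Count kernel nfs/RPC complaints in syslog-format file $1,
# grouped by message text, most frequent first.
count_rpc_messages() {
    grep -E 'kernel: (nfs|RPC): ' "$1" |
        sed 's/^.*kernel: //' |   # drop timestamp, host and "kernel: "
        sort | uniq -c | sort -rn
}

# Example (on the server):
#   count_rpc_messages /var/log/messages
```

Each output line is a count followed by the message text.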

**********************************************************************

Datapoint 6:

As above, but with the server (cy) running the old user-space NFS
server.

Nary a problem...

**********************************************************************

Conclusions:

By elimination, I would conclude that something is broken in the
2.1.55 kernel-based NFS server.

For the umpteenth time:

This stuff is 100% repeatable. If there are any debugging hooks you would
like me to put in, or any patches to try, please say the word!

In the hope that this is of help.

Steve