Re: Revamped NFS async IO code

Mike Castle (mcastle@umr.edu)
Sun, 5 May 1996 18:02:37 -0500 (CDT)


Amazingly enough Olaf Kirch said:
> I've just uploaded a modified version of the NFS client code to
> ftp://ftp.mathematik.th-darmstadt.de/pub/linux/okir/dontuse .
> It features the following enhancements over the current version:

Hmmm. I seem to be having problems with 0.3.12 as a module.

Kernel 1.3.98
Compiled with gcc 2.7.2.l.3 withOUT -fno-strength-reduce

I try to use nvi to edit a file on an NFS mounted partition
(mounted off of a similar setup with server 2.2beta14 if that
makes a difference). I did turn on NFS client debugging by doing
'ls xyzzy' on nfs partition. I suddenly get the following:

May 5 08:05:15 thune kernel: NFS call lookup a.c
May 5 08:05:15 thune kernel: NFS reply lookup
May 5 08:05:15 thune kernel: NFS call getattr
May 5 08:05:15 thune kernel: NFS reply getattr
May 5 08:05:15 thune kernel: NFS reqst read 4096 @ 0
May 5 08:05:15 thune kernel: Unable to handle kernel NULL pointer dereference at virtual address c0000000
May 5 08:05:15 thune kernel: current->tss.cr3 = 003ac000, \r3 = 003ac000
May 5 08:05:15 thune kernel: *pde = 00102067
May 5 08:05:15 thune kernel: *pte = 00000027
May 5 08:05:15 thune kernel: Oops: 0000
May 5 08:05:15 thune kernel: CPU: 0
May 5 08:05:15 thune kernel: EIP: 0010:[<01016258>]
May 5 08:05:15 thune kernel: EFLAGS: 00010246
May 5 08:05:15 thune kernel: eax: 00000000 ebx: 00002000 ecx: 00000000 edx: 003e8890
May 5 08:05:15 thune kernel: esi: 00098d04 edi: 0036ae28 ebp: 00090d04 esp: 00362de8
May 5 08:05:15 thune kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
May 5 08:05:15 thune kernel: Process vi (pid: 124, process nr: 18, stackpage=00362000)
May 5 08:05:15 thune kernel: Stack: 00000000 00090ce4 00360b24 001958c0 00000000 001aba38 00346018 00362e28
May 5 08:05:15 thune kernel: 00090ce4 003e8810 00000005 00000000 00000246 00346034 01015368 0034609c
May 5 08:05:15 thune kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
May 5 08:05:15 thune kernel: Call Trace: [<01015368>]
May 5 08:05:15 thune kernel: Code: 8b 00 89 42 10 f6 42 2c 01 75 27 8b 4c 24 24 8b 81 50 06 00
May 5 08:05:17 thune kernel: int3: 0000
May 5 08:05:17 thune kernel: CPU: 0
May 5 08:05:17 thune kernel: EIP: 0010:[<01014009>]
May 5 08:05:17 thune kernel: EFLAGS: 00000a03
May 5 08:05:17 thune kernel: eax: 001958c0 ebx: 00000004 ecx: 0001fecc edx: 0101bdcc
May 5 08:05:17 thune kernel: esi: 0001ff10 edi: 002af4c4 ebp: 0001ff30 esp: 0001fe90
May 5 08:05:17 thune kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
May 5 08:05:17 thune kernel: Process bash (pid: 94, process nr: 4, stackpage=0001f000)
May 5 08:05:17 thune kernel: Stack: 01018105 001958c0 002af56c 0001fea8 0001ff10 0001fecc 746f6f72 00000000
May 5 08:05:17 thune kernel: 002af4c4 0001ffb4 0001ff7c 0001fecc 0001ff10 0001fea8 0001feb0 00000000
May 5 08:05:17 thune kernel: 318ca97f 00000000 318ca97f 00000000 70025599 00f67a02 00000000 00000000
May 5 08:05:17 thune kernel: Call Trace: [<01018105>] [lookup+217/240] [follow_link+72/104] [permission+98/148] [lookup+217/240] [_namei+77/180] [namei+44/68]
May 5 08:05:17 thune kernel: [sys_newstat+39/84] [system_call+89/160]
May 5 08:05:17 thune kernel: Code: 30 19 00 90 24 00 00 18 d1 1b 00 6e 82 16 00 4c 34 19 00 03

After the NULL dereference, everything on the system is unstable.
Many file accesses (esp by root) fail, and often kill the shell
(ie, I try to 'more filename' and the shell dies). The
application dying, I could see. But to kill the parent process?
This was the cause of the second dump above.

At anyrate, location <01015368> *appears* to correspond to
nfs_rpc_header+68. However, following one of Linus' recent post
on using this info (ie, the Code: section), I couldn't find
anything resembling that in the assmbler output.

If I compile nfs into the kernel, this does not happen.
If I kill the nfsiod BEFORE trying to run vi, this does not
happen.

It only appears to happen when an application tries to lock the
file on the server (which I assume fails due to lack of lockd
anywhere?). This comes from running strace on the process. This
seems to happen with either fcntl() or flock(), though may be
wrong here. All the recent changes that affect strace makes me a
bit unsure about the output.

Meanwhile, I'm going to go study for tomorrow's final. After
that, I hope to do a few things:

1) Figure out how to make kerneld call insmod with -m and put it
somewhere useful (I think Bj0rn posted something about this, so
need to check linux-kernel archives).

2) Get sysklogd patches that try to work with modules as well as
regular kernel causing dumps.

3) Recompile nvi and step through it with gdb. I want to try to
determine if the problem occurs on the first read, or what (I
assume the nfsiod doesn't do any read ahead on just an open, but
rather waits until the first read?)

4) Recompile nfs client WITH -fno-strength-reduce. 2.7.2.l.3 is
*supposed* to have that fixed, but one never knows. May also try
2.7.2.

This is reproducible, so it should be trackable. Since it's a
NULL dereference, I suppose I'll put in a lot of printk's in
nfs_rpc_header and see if anything else triggers it.

Anything else you think I should try?

mrc

-- 
Mike Castle .-=NEXUS=-.  Life is like a clock:  You can work constantly
  mcastle@cs.umr.edu     and be right all the time, or not work at all
   mcastle@umr.edu       and be right at least twice a day.  -- mrc
    We are all of us living in the shadow of Manhattan.  -- Watchmen