More on NFS problems in 2.1.56 - oops enclosed.

David Woodhouse (D.W.Woodhouse@nortel.co.uk)
Wed, 24 Sep 1997 15:14:03 +0100


kernel 2.1.56, no changes that should affect NFS or file locking (tunnel
drivers, extra SNMP stats, Joliet, 3c59x.c v0.46A)

linux-nfs-0.4.21, recompiled yesterday against the new kernel.
gcc-2.7.2.1
libc-5.4.33

No activity, NFS or otherwise, at the time (that I am aware of), just...

Unable to handle kernel NULL pointer dereference at virtual address 00000004
current->tss.cr3 = 00101000, (r3 = 00101000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c01762bf>]
EFLAGS: 00010202
eax: 00000002 ebx: c3d5e000 ecx: 00000000 edx: 00000307
esi: c389fcb4 edi: c389fd34 ebp: c389fc00 esp: c3d5ffbc
ds: 0018 es: 0018 ss: 0018
Process lockd (pid: 10589, process nr: 72, stackpage=c3d5f000)
Stack: c3d5e000 c3d5e000 c389fc00 00000000 c3d5e000 00000000 c014774a 00000000
c389fc00 00000100 c3abbe78 c389fc00 c084d180 c0174584 c389fc00 c0147648
00000001
Call Trace: [<c014774a>] [<c0174584>] [<c0147648>]
Code: 8b 59 04 85 db 74 1e 89 ce 83 c6 04 8b 03 8b 53 04 39 c3 74

Using `/boot/System.map' to map addresses to symbols.

>>EIP: c01762bf <svc_recv+af/338>
Trace: c014774a <lockd+102/1cc>
Trace: c0174584 <svc_create_thread+c0/f8>
Trace: c014774a <lockd+102/1cc>

Code: c01762bf <svc_recv+af/338>
Code: c01762bf <svc_recv+af/338> 8b 59 04 movl 0x4(%ecx),%ebx
Code: c01762c2 <svc_recv+b2/338> 85 db testl %ebx,%ebx
Code: c01762c4 <svc_recv+b4/338> 74 1e je c01762e4 <svc_recv+d4/338>
Code: c01762c6 <svc_recv+b6/338> 89 ce movl %ecx,%esi
Code: c01762c8 <svc_recv+b8/338> 83 c6 04 addl $0x4,%esi
Code: c01762d1 <svc_recv+c1/338> 8b 03 movl (%ebx),%eax
Code: c01762d3 <svc_recv+c3/338> 8b 53 04 movl 0x4(%ebx),%edx
Code: c01762d6 <svc_recv+c6/338> 39 c3 cmpl %eax,%ebx
Code: c01762d8 <svc_recv+c8/338> 74 00 je c01762d4 <svc_recv+c4/338>
Code: c01762e0 <svc_recv+d0/338> 90 nop
Code: c01762e1 <svc_recv+d1/338> 90 nop
Code: c01762e2 <svc_recv+d2/338> 90 nop

0xc017629f <svc_recv+143>: movl 0xc01cbe74,%eax
0xc01762a4 <svc_recv+148>: leal 0x1(%eax),%ecx
0xc01762a7 <svc_recv+151>: movl %ecx,0xc01cbe74
0xc01762ad <svc_recv+157>: movl %edx,0xc01b0390
0xc01762b3 <svc_recv+163>: addl $0x2,%eax
0xc01762b6 <svc_recv+166>: movl %eax,0xc01cbe74
0xc01762bb <svc_recv+171>: movl 0x1c(%esp,1),%ecx
0xc01762bf <svc_recv+175>: movl 0x4(%ecx),%ebx <--- oops here.

svc_sock_dequeue appears to be called with serv == NULL (ecx is 0 and the
faulting address is 00000004, i.e. offset 4 from a NULL pointer), which means
that svc_recv itself has been called with serv == NULL.
I don't know enough about how this code works to investigate much further.
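
A minimal sketch of why a NULL serv would fault at virtual address 00000004
rather than 00000000. The struct below is a hypothetical stand-in with a
32-bit field at offset 4; it is not the real struct svc_serv layout, and the
names fake_serv, queue_head and fault_address are made up for illustration.
The only thing taken from the oops is the instruction movl 0x4(%ecx),%ebx
with ecx == 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT the real struct svc_serv. */
struct fake_serv {
    uint32_t first_field;  /* offset 0 */
    uint32_t queue_head;   /* offset 4: the field a load like
                              movl 0x4(%ecx),%ebx would read */
};

/* Compile-time check that the illustrative field sits at offset 4. */
static_assert(offsetof(struct fake_serv, queue_head) == 4,
              "queue_head is expected at offset 4");

/*
 * Address the CPU would try to read for base->queue_head.
 * Computed arithmetically, so nothing is actually dereferenced.
 */
static uintptr_t fault_address(uintptr_t base)
{
    return base + offsetof(struct fake_serv, queue_head);
}
```

With base == 0 (a NULL pointer), fault_address returns 4, which matches the
address reported in the first line of the oops.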

Incidentally, I noticed that svc_sock_dequeue calls enable_bh(NET_BH), but
svc_recv calls it again later, after doing some other work. Is this
intentional?

disable_bh(NET_BH);
if ((svsk = svc_sock_dequeue(serv)) != NULL) {
....
} else {
- NET_BH is re-enabled here already. Does it matter?

/* No data pending. Go to sleep */
rqstp->rq_sock = NULL;
rqstp->rq_wait = NULL;
svc_serv_enqueue(serv, rqstp);

current->state = TASK_UNINTERRUPTIBLE;
add_wait_queue(&rqstp->rq_wait, &wait);
enable_bh(NET_BH);
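
A toy model of why a second enable could matter, under the assumption that
the disable/enable pair is reference-counted. That assumption has not been
checked against the 2.1.56 source, and every name here (bh_model,
model_disable, model_enable, replay_double_enable) is hypothetical. It only
replays the call order described above: one disable, one enable inside the
dequeue helper, and one more enable in the caller.

```c
#include <assert.h>

/* Toy model only; not the kernel's implementation. */
struct bh_model {
    int disable_count;  /* number of outstanding disables */
    int enabled;        /* 1 when the handler may run */
};

static void model_disable(struct bh_model *m)
{
    m->enabled = 0;
    m->disable_count++;
}

static void model_enable(struct bh_model *m)
{
    /* Re-enable only when the count drops back to exactly zero. */
    if (--m->disable_count == 0)
        m->enabled = 1;
}

/* Replays disable, enable, enable and returns the final count. */
static int replay_double_enable(void)
{
    struct bh_model m = { 0, 1 };

    model_disable(&m);        /* caller disables */
    model_enable(&m);         /* helper re-enables: count back to 0 */
    assert(m.enabled == 1);
    model_enable(&m);         /* caller enables again: count goes negative */
    return m.disable_count;
}
```

In this model the count ends at -1, so a later disable/enable pair would
never bring it back to zero. Whether the real code behaves this way depends
on how enable_bh is actually implemented.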

P.S.
Why has the second "cr3" in the second line of the oops changed to "(r3"? I
thought it was just a memory problem when I saw it once before, but this is a
different machine. Is something stomping on random bits of kernel memory?

-- 
David Woodhouse,	CB3 9AN		http://dwmw2.robinson.cam.ac.uk/
	dwmw2@cam.ac.uk 		 Tel: 0976 658355        
      ( D.W.Woodhouse@nortel.co.uk	 Tel: 01279 402332 )

(Use the former; I'm going back to College next week.)