Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

From: Jeff Layton
Date: Sun Jan 13 2008 - 08:27:48 EST


On Wed, 9 Jan 2008 17:35:42 +0000
Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:

> On Tue, Jan 08, 2008 at 02:33:15PM -0500, Jeff Layton wrote:
> > lockd_start_done is a global var that can be reused if lockd is
> > restarted, but it's never reinitialized. On all but the first use,
> > wait_for_completion isn't actually waiting on it since it has
> > already completed once.
>
> I don't think we'll need lockd_start_done anymore after the kthread
> conversion. When kthread_run returns the thread it created is
> guaranteed to have run until it scheduled away.
>

Christoph,
I've been hitting an intermittent null pointer dereference ever
since I've made this change:

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000038
printing eip: e09ddee1 *pde = 1f377067 *pte = 00000000
Oops: 0000 [#1] SMP
Modules linked in: nfsd nfs_acl auth_rpcgss exportfs rfcomm l2cap bluetooth autofs4 lockd sunrpc nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 loop dm_multipath pcspkr 8139cp 8139too mii joydev i2c_piix4 i2c_core sr_mod sg cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd

Pid: 1946, comm: rpc.nfsd Not tainted (2.6.24-0.138.rc7.kthread.2.fc9 #1)
EIP: 0060:[<e09ddee1>] EFLAGS: 00010202 CPU: 0
EIP is at find_socket+0xa/0x3f [lockd]
EAX: 00000000 EBX: 00000006 ECX: 00000000 EDX: 00000011
ESI: 00000000 EDI: 00000011 EBP: df358ec4 ESP: df358eb8
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process rpc.nfsd (pid: 1946, ti=df358000 task=df38ae10 task.ti=df358000)
Stack: 00000006 00000000 00000006 df358ee0 e09de0cb 22222222 22222222 00000006
00000008 00000801 df358f04 e09de337 00000000 00000000 00000000 00000000
00000801 00000008 00000801 df358f1c e0aa05ac 00000000 df356200 00000008
Call Trace:
[<c040649a>] show_trace_log_lvl+0x1a/0x2f
[<c040654a>] show_stack_log_lvl+0x9b/0xa3
[<c04065f9>] show_registers+0xa7/0x178
[<c04067ff>] die+0x135/0x220
[<c063ff1b>] do_page_fault+0x553/0x631
[<c063e5a2>] error_code+0x72/0x78
[<e09de0cb>] make_socks+0x27/0xbe [lockd]
[<e09de337>] lockd_up+0x3b/0x148 [lockd]
[<e0aa05ac>] nfsd_svc+0xf2/0x107 [nfsd]
[<e0aa0b19>] write_svc+0x1a/0x20 [nfsd]
[<e0aa0d10>] nfsctl_transaction_write+0x39/0x63 [nfsd]
[<c04b97cf>] sys_nfsservctl+0x11f/0x160
[<c0405252>] syscall_call+0x7/0xb
=======================
Code: 89 d8 5b 5d c3 55 89 e5 c7 05 48 a4 9e e0 01 00 00 00 e8 a7 ff ff ff 8b 15 00 0c 76 c0 5d 01 d0 c3 55 89 e5 57 89 d7 56 89 c6 53 <8b> 48 38 83 e9 08 eb 15 8b 41 14 0f b6 40 29 39 f8 75 07 b8 01
EIP: [<e09ddee1>] find_socket+0xa/0x3f [lockd] SS:ESP 0068:df358eb8
---[ end trace 7d509b4c18b144aa ]---

The problem is that make_socks is occasionally getting called with a
NULL nlmsvc_serv pointer. I think the problem occurs here in lockd_up().

if (nlmsvc_task) {
if (proto)
error = make_socks(nlmsvc_serv, proto);
goto out;
}

You pointed out earlier that this should really be checking that
nlmsvc_serv is non-NULL. I can and will make this change, and that will
likely fix this particular oops, but the fact that I'm hitting it here
suggests that kthread_run is returning before lockd has a chance
to set nlmsvc_serv. It shouldn't be according to your statement above.

Are you sure that kthread_run is working correctly? It seems like it
might not be doing the right thing here...

--
Jeff Layton <jlayton@xxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/