Re: [PATCH] nfs4: skip locks_lock_inode_wait() in nfs4_locku_done if FL_ACCESS is set

From: Vasily Averin
Date: Sun Dec 05 2021 - 05:43:45 EST


On 05.12.2021 13:12, Vasily Averin wrote:
> In 2006 Trond Myklebust added support for the FL_ACCESS flag in
> commit 01c3b861cd77 ("NLM,NFSv4: Wait on local locks before we put RPC
> calls on the wire"); as a result, _nfs4_proc_setlk() began to execute
> _nfs4_do_setlk() with a modified request->fl_flags in which the
> FL_ACCESS flag was set.
>
> It was not important until 2015, when commit c69899a17ca4 ("NFSv4:
> Update of VFS byte range lock must be atomic with the stateid update")
> added a do_vfs_lock() call into nfs4_locku_done().
> nfs4_locku_done() in this case uses the calldata->fl of nfs4_unlockdata,
> which is copied from struct nfs4_lockdata, which in turn uses the
> fl_flags copied from the request->fl_flags provided by _nfs4_do_setlk(),
> i.e. with the FL_ACCESS flag set.
>
> The FL_ACCESS flag is removed in nfs4_lock_done() in the non-cancelled
> case; however, the rpc task can be cancelled earlier.
>
> As a result flock_lock_inode() can be called with request->fl_type set
> to F_UNLCK and the FL_ACCESS flag set in fl_flags.
> Such a request is processed incorrectly: instead of the expected search
> for and removal of existing flocks, it jumps to the "find_conflict"
> label and can call the locks_insert_block() function.
>
On kernels before 2018 (i.e. before commit 7b587e1a5a6c
("NFS: use locks_copy_lock() to copy locks.")) this caused a BUG in
__locks_insert_block(), because the copied fl had an incorrectly linked fl_block.
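
To make the control flow concrete, below is a minimal userspace model of the
relevant decision points in flock_lock_inode(). The flag values match
include/linux/fs.h, but everything else is simplified for illustration; this
is not the real fs/locks.c code.

/* Simplified model of the flock_lock_inode() entry checks -- for
 * illustration only, not the actual fs/locks.c implementation. */
#include <stdio.h>

#define FL_FLOCK  0x02  /* flock(2)-style lock */
#define FL_ACCESS 0x08  /* caller only probes for conflicts */
#define FL_SLEEP  0x80  /* blocking request */

#define F_UNLCK 2

struct file_lock {
    unsigned int fl_flags;
    int fl_type;
};

/* Mirrors the order of the checks at the top of flock_lock_inode():
 * the FL_ACCESS test is taken BEFORE the F_UNLCK handling. */
static const char *flock_lock_inode_path(const struct file_lock *request)
{
    if (request->fl_flags & FL_ACCESS)
        return "find_conflict: may call locks_insert_block()";
    if (request->fl_type == F_UNLCK)
        return "unlock: search for and remove existing flocks";
    return "set lock";
}

int main(void)
{
    /* The flags seen in the crashdump below: 0x8a =
     * FL_SLEEP | FL_ACCESS | FL_FLOCK. */
    struct file_lock leaked = {
        .fl_flags = FL_SLEEP | FL_ACCESS | FL_FLOCK,
        .fl_type  = F_UNLCK,
    };
    struct file_lock clean = {
        .fl_flags = FL_SLEEP | FL_FLOCK,
        .fl_type  = F_UNLCK,
    };

    printf("fl_flags=0x%02x: %s\n", leaked.fl_flags,
           flock_lock_inode_path(&leaked));
    printf("fl_flags=0x%02x: %s\n", clean.fl_flags,
           flock_lock_inode_path(&clean));
    return 0;
}

So an F_UNLCK request with a leaked FL_ACCESS never reaches the removal loop;
if another flock conflicts, the request is queued as a blocked waiter instead
of removing the lock.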

Originally it was found during processing of real customers' bug reports on a
RHEL7-based OpenVz7 kernel:
kernel BUG at fs/locks.c:612!
CPU: 7 PID: 1019852 Comm: kworker/u65:43 ve: 0 Kdump: loaded Tainted: G W O ------------ 3.10.0-1160.41.1.vz7.183.5 #1 183.5
Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.3 05/23/2018
Workqueue: rpciod rpc_async_schedule [sunrpc]
task: ffff9d50e5de0000 ti: ffff9d3c9ec10000 task.ti: ffff9d3c9ec10000
RIP: 0010:[<ffffffffbe0d590a>] [<ffffffffbe0d590a>] __locks_insert_block+0xea/0xf0
RSP: 0018:ffff9d3c9ec13c78 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff9d529554e180 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffff9d51d2363a98 RDI: ffff9d51d2363ab0
RBP: ffff9d3c9ec13c88 R08: 0000000000000003 R09: ffff9d5f5b8dfcd0
R10: ffff9d5f5b8dfd08 R11: ffffbb21594b5a80 R12: ffff9d51d2363a98
R13: 0000000000000000 R14: ffff9d50e5de0000 R15: ffff9d3da03915f8
FS: 0000000000000000(0000) GS:ffff9d55bfbc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f93d65ee1e8 CR3: 00000029a04d6000 CR4: 00000000000607e0
Call Trace:
[<ffffffffbe0d5939>] locks_insert_block+0x29/0x40
[<ffffffffbe0d6d5b>] flock_lock_inode_wait+0x2bb/0x310
[<ffffffffc01c7470>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
[<ffffffffbe0d6dce>] locks_lock_inode_wait+0x1e/0x40
[<ffffffffc0c9f5c0>] nfs4_locku_done+0x90/0x190 [nfsv4]
[<ffffffffc01bb750>] ? call_decode+0x1f0/0x880 [sunrpc]
[<ffffffffc01c7470>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
[<ffffffffc01c74a1>] rpc_exit_task+0x31/0x90 [sunrpc]
[<ffffffffc01c9654>] __rpc_execute+0xe4/0x470 [sunrpc]
[<ffffffffc01c99f2>] rpc_async_schedule+0x12/0x20 [sunrpc]
[<ffffffffbdec1b25>] process_one_work+0x185/0x440
[<ffffffffbdec27e6>] worker_thread+0x126/0x3c0
[<ffffffffbdec26c0>] ? manage_workers.isra.26+0x2a0/0x2a0
[<ffffffffbdec9e31>] kthread+0xd1/0xe0
[<ffffffffbdec9d60>] ? create_kthread+0x60/0x60
[<ffffffffbe5d2eb7>] ret_from_fork_nospec_begin+0x21/0x21
[<ffffffffbdec9d60>] ? create_kthread+0x60/0x60
Code: 48 85 d2 49 89 54 24 08 74 04 48 89 4a 08 48 89 0c c5 c0 ee 09 bf 49 89 74 24 10 5b 41 5c 5d c3 90 49 8b 44 24 28 e9 80 ff ff ff <0f> 0b 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 41 54 49 89 f4 53
RIP [<ffffffffbe0d590a>] __locks_insert_block+0xea/0xf0
RSP <ffff9d3c9ec13c78>

In the crashdump I've found that nfs4_unlockdata and the (already freed but
not yet reused) nfs4_lockdata both have fl->fl_flags = 0x8a, i.e. with
FL_SLEEP, FL_ACCESS and FL_FLOCK set.

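The fix named in the subject line would presumably amount to guarding the
locks_lock_inode_wait() call in the success branch of nfs4_locku_done(). A
minimal sketch of that idea, assuming the unlock path looks roughly like the
mainline code of that era (a sketch only, not the posted diff):

    case 0:
        renew_lease(calldata->server, calldata->timestamp);
        /* FL_ACCESS means "probe for conflicts only"; if it leaked
         * in from _nfs4_do_setlk(), skip the local VFS unlock so
         * flock_lock_inode() is not entered via find_conflict. */
        if (!(calldata->fl.fl_flags & FL_ACCESS))
            locks_lock_inode_wait(calldata->lsp->ls_state->inode,
                                  &calldata->fl);
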
Thank you,
Vasily Averin