Re: [GIT PULL] bcachefs

From: Jens Axboe
Date: Tue Jun 27 2023 - 23:16:42 EST


On 6/27/23 2:15 PM, Kent Overstreet wrote:
>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
>> failing because it assumed it was XFS.
>>
>> I suspected this was just a timing issue, and it looks like that's
>> exactly what it is. Looking at the test case, it'll randomly kill -9
>> fsstress, and if that happens while we have io_uring IO pending, then we
>> process completions inline (for a PF_EXITING current). This means they
>> get pushed to fallback work, which runs out of line. If we hit that case
>> AND the timing is such that it hasn't been processed yet, we'll still be
>> holding a file reference under the mount point and umount will -EBUSY
>> fail.
>>
>> As far as I can tell, this can happen with aio as well, it's just harder
>> to hit. If the fput happens while the task is exiting, then fput will
>> end up being delayed through a workqueue as well. The test case assumes
>> that once it's reaped the exit of the killed task that all files are
>> released, which isn't necessarily true if they are done out-of-line.
>
> Yeah, I traced it through to the delayed fput code as well.
>
> I'm not sure delayed fput is responsible here; what I learned when I was
> tracking this down has mostly fallen out of my brain, so take anything I
> say with a large grain of salt. But I believe I tested with delayed_fput
> completely disabled, and found another thing in io_uring with the same
> effect as delayed_fput that wasn't being flushed.

I'm not saying it's delayed_fput(); I'm saying it's the delayed putting
that io_uring can end up doing. But yes, delayed_fput() is another candidate.

>> For io_uring specifically, it may make sense to wait on the fallback
>> work. The below patch does this, and should fix the issue. But I'm not
>> fully convinced that this is really needed, as I do think this can
>> happen without io_uring as well. It just doesn't right now as the test
>> does buffered IO, and aio will be fully sync with buffered IO. That
>> means there's either no gap where aio will hit it without O_DIRECT, or
>> it's just small enough that it hasn't been hit.
>
> I just tried your patch and I still have generic/388 failing - it
> might've taken a bit longer to pop this time.

Yep, seeing the same here. Didn't have time to look into it after sending
that email today; I just took a quick stab at writing a reproducer and
ended up crashing bcachefs:

[ 1122.384909] workqueue: Failed to create a rescuer kthread for wq "bcachefs": -EINTR
[ 1122.384915] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 1122.385814] Mem abort info:
[ 1122.385962] ESR = 0x0000000096000004
[ 1122.386161] EC = 0x25: DABT (current EL), IL = 32 bits
[ 1122.386444] SET = 0, FnV = 0
[ 1122.386612] EA = 0, S1PTW = 0
[ 1122.386842] FSC = 0x04: level 0 translation fault
[ 1122.387168] Data abort info:
[ 1122.387321] ISV = 0, ISS = 0x00000004
[ 1122.387518] CM = 0, WnR = 0
[ 1122.387676] user pgtable: 4k pages, 48-bit VAs, pgdp=00000001133da000
[ 1122.388014] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 1122.388363] Internal error: Oops: 0000000096000004 [#1] SMP
[ 1122.388659] Modules linked in:
[ 1122.388866] CPU: 4 PID: 23129 Comm: mount Not tainted 6.4.0-02556-ge61c7fc22b68-dirty #3647
[ 1122.389389] Hardware name: linux,dummy-virt (DT)
[ 1122.389682] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 1122.390118] pc : bch2_free_pending_node_rewrites+0x40/0x90
[ 1122.390466] lr : bch2_free_pending_node_rewrites+0x28/0x90
[ 1122.390815] sp : ffff80002481b770
[ 1122.391030] x29: ffff80002481b770 x28: ffff0000e1d24000 x27: 00000000fffff7b7
[ 1122.391475] x26: 0000000000000000 x25: ffff0000e1d00040 x24: dead000000000122
[ 1122.391919] x23: dead000000000100 x22: ffff0000e1d031b8 x21: ffff0000e1d00040
[ 1122.392366] x20: 0000000000000000 x19: ffff0000e1d031a8 x18: 0000000000000009
[ 1122.392860] x17: 3a22736665686361 x16: 6362222071772072 x15: 6f66206461657268
[ 1122.393622] x14: 746b207265756373 x13: 52544e49452d203a x12: 0000000000000001
[ 1122.395170] x11: 0000000000000001 x10: 0000000000000000 x9 : 00000000000002d3
[ 1122.396592] x8 : 00000000000003f8 x7 : 0000000000000000 x6 : ffff8000093c2e78
[ 1122.397970] x5 : ffff000209de4240 x4 : ffff0000e1d00030 x3 : dead000000000122
[ 1122.399263] x2 : 00000000000031a8 x1 : 0000000000000000 x0 : 0000000000000000
[ 1122.400473] Call trace:
[ 1122.400908] bch2_free_pending_node_rewrites+0x40/0x90
[ 1122.401783] bch2_fs_release+0x48/0x24c
[ 1122.402589] kobject_put+0x7c/0xe8
[ 1122.403271] bch2_fs_free+0xa4/0xc8
[ 1122.404033] bch2_fs_alloc+0x5c8/0xbcc
[ 1122.404888] bch2_fs_open+0x19c/0x430
[ 1122.405781] bch2_mount+0x194/0x45c
[ 1122.406643] legacy_get_tree+0x2c/0x54
[ 1122.407476] vfs_get_tree+0x28/0xd4
[ 1122.408251] path_mount+0x5d0/0x6c8
[ 1122.409056] do_mount+0x80/0xa4
[ 1122.409866] __arm64_sys_mount+0x150/0x168
[ 1122.410904] invoke_syscall.constprop.0+0x70/0xb8
[ 1122.411890] do_el0_svc+0xbc/0xf0
[ 1122.412596] el0_svc+0x74/0x9c
[ 1122.413343] el0t_64_sync_handler+0xa8/0x134
[ 1122.414148] el0t_64_sync+0x168/0x16c
[ 1122.414863] Code: f2fbd5b7 d2863502 91008af8 8b020273 (f85d8695)
[ 1122.415939] ---[ end trace 0000000000000000 ]---

> I wonder if there might be a better way of solving this though? For aio,
> when a process is exiting we just synchronously tear down the ioctx,
> including waiting for outstanding iocbs.

aio is pretty trivial, because the only async IO it supports is O_DIRECT
on regular files, which always completes in finite time. io_uring has to
cancel requests etc., so we need to do a lot more.
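
Just to illustrate that point: with buffered IO, a libaio submission is
effectively done by the time io_submit() returns, so a zero-timeout
io_getevents() already sees the completion. Untested sketch, file path
and sizes are arbitrary, build with -laio:

/*
 * Untested sketch showing the "aio is sync for buffered IO" point: a
 * buffered pwrite submitted via libaio completes inline, so reaping with
 * a zero timeout already returns the event.
 */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	struct timespec zero = { 0, 0 };
	char buf[4096];
	int fd, got;

	/* arbitrary scratch file, any buffered (non-O_DIRECT) fd will do */
	fd = open("/tmp/aio-buffered-test", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0 || io_setup(8, &ctx) < 0)
		return 1;

	memset(buf, 0x5a, sizeof(buf));
	io_prep_pwrite(&cb, fd, buf, sizeof(buf), 0);
	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	/* no waiting at all: for buffered IO the event is already there */
	got = io_getevents(ctx, 0, 1, &ev, &zero);
	printf("completions immediately available: %d\n", got);

	io_destroy(ctx);
	close(fd);
	unlink("/tmp/aio-buffered-test");
	return 0;
}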

The concept of my patch should still be fine, but I think we must be
missing a case, which is why I started writing a small reproducer
instead. I'll pick it up again tomorrow and see what is going on here.
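
Roughly what I have in mind for the reproducer is something along these
lines - untested sketch, liburing based; the file name, queue depth, and
sleep range are arbitrary, and it needs a wrapper that remounts between
runs:

/*
 * Sketch of the race described above: keep buffered io_uring writes in
 * flight on a file under the mount point, kill -9 the submitter, reap it,
 * and immediately try to unmount. If the final fput got punted to
 * out-of-line work and hasn't run yet, umount should fail with -EBUSY.
 * Build with -luring.
 */
#include <errno.h>
#include <fcntl.h>
#include <liburing.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

static void submitter(const char *dir)
{
	struct io_uring ring;
	char path[4096];
	char buf[4096];
	int fd;

	/* arbitrary file name under the mount point being tested */
	snprintf(path, sizeof(path), "%s/io_uring_busy", dir);
	fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		exit(1);

	memset(buf, 0xaa, sizeof(buf));
	/* keep buffered writes in flight until we get SIGKILLed */
	for (;;) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe) {
			struct io_uring_cqe *cqe;

			io_uring_wait_cqe(&ring, &cqe);
			io_uring_cqe_seen(&ring, cqe);
			continue;
		}
		io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
		io_uring_submit(&ring);
	}
}

int main(int argc, char **argv)
{
	const char *mnt;
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	mnt = argv[1];

	pid = fork();
	if (!pid)
		submitter(mnt);

	/* let some IO get in flight, then kill -9 like generic/388 does */
	usleep(1000 + (random() % 5000));
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);

	/* the test assumes all file refs are gone once the child is reaped */
	if (umount2(mnt, 0) < 0 && errno == EBUSY) {
		printf("hit it: umount returned EBUSY after reaping child\n");
		return 1;
	}
	printf("no EBUSY this round, remount and try again\n");
	return 0;
}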

> delayed_fput, even though I believe not responsible here, seems sketchy
> to me because there doesn't seem to be a straightforward way to flush
> delayed fputs for a given _process_ - there's a single global work item,
> and we can only flush globally.

Yep, as mentioned, I don't think it's delayed_fput at all. And yeah, we can
only flush that globally, not per task/files_struct.
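
For reference, the shape of that path is roughly the below - heavily
simplified from memory, not the actual fs/file_table.c code, details will
differ. The point is that there's one global llist drained by one global
work item, so the only flush primitive is "drain everything":

/*
 * Simplified sketch of the delayed fput machinery, from memory - not the
 * real fs/file_table.c. One system-wide list, one system-wide work item.
 */
static LLIST_HEAD(delayed_fput_list);	/* global, not per task */

static void delayed_fput(struct work_struct *unused)
{
	struct llist_node *node = llist_del_all(&delayed_fput_list);
	struct file *f, *t;

	llist_for_each_entry_safe(f, t, node, f_llist)
		__fput(f);
}

static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);

void flush_delayed_fput(void)
{
	/* drains every pending delayed fput in the system, no per-task variant */
	delayed_fput(NULL);
	flush_delayed_work(&delayed_fput_work);
}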

--
Jens Axboe