Re: [GIT PULL] bcachefs

From: Jens Axboe
Date: Thu Aug 10 2023 - 22:40:15 EST


On 8/10/23 5:47 PM, Linus Torvalds wrote:
> On Thu, 10 Aug 2023 at 15:39, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>>
>> FWIW I recently fixed all my stupid debian package dependencies so that
>> I could actually install liburing again, and rebuilt fstests. The very
>> next morning I noticed a number of new test failures in /exactly/ the
>> way that Kent said to expect:
>>
>> fsstress -d /mnt & <sleep then simulate fs crash>; \
>> umount /mnt; mount /dev/sda /mnt
>>
>> Here, umount exits before the filesystem is really torn down, and then
>> mount fails because it can't get an exclusive lock on the device.
>
> I agree that that obviously sounds like mount is just returning either
> too early. Or too eagerly.
>
> But I suspect any delayed fput() issues (whether from aio or io_uring)
> are then just a way to trigger the problem, not the fundamental cause.
>
> Because even if the fput() is delayed, the mntput() part of that
> delayed __fput action is the one that *should* have kept the
> filesystem mounted until it is no longer busy.
>
> And more importantly, having some of the common paths synchronize
> *their* fput() calls only affects those paths.
>
> It doesn't affect the fundamental issue that the last fput() can
> happen in odd contexts when the file descriptor was used for something
> a bit stranger.
>
> So I do feel like the fput patch I saw looked more like a "hide the
> problem" than a real fix.

The fput patch was not pretty, nor is it needed. What happens on the
io_uring side is that pending requests (which can hold files referenced)
are canceled on exit. But we don't wait for the references to go away,
which then introduces this race.

I've used this to trigger it:

#!/bin/bash

DEV=/dev/nvme0n1
MNT=/data
ITER=0

while true; do
echo loop $ITER
sudo mount $DEV $MNT
fio --name=test --ioengine=io_uring --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --thread=1 --output=/dev/null &
Y=$(($RANDOM % 3))
X=$(($RANDOM % 10))
VAL="$Y.$X"
sleep $VAL
ps -e | grep fio > /dev/null 2>&1
while [ $? -eq 0 ]; do
killall -9 fio > /dev/null 2>&1
wait > /dev/null 2>&1
ps -e | grep "fio " > /dev/null 2>&1
done
sudo umount /data
if [ $? -ne 0 ]; then
break
fi
((ITER++))
done

and can make it happen pretty easily, within a few iterations.

Contrary to how it was otherwise presented in this thread, I did take a
look at this a month ago and wrote up some patches for it. Just rebased
them on the current tree:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-exit-cancel

Since we have task_work involved for both the completions and the
__fput(), ordering is a concern which is why it needs a bit more effort
than just the bare bones stuff. The way the task_work list works, we
llist_del_all() and run all items. But we do encapsulate that in
io_uring anyway, so it's possible to run our pending local items and
avoid that particular snag.

WIP obviously, the first 3-4 prep patches were posted earlier today, but
I'm not happy with the last 3 yet in the above branch. Or at least not
fully confident, so will need a bit more thinking and testing. Does pass
the above test case, and the regular liburing test/regression cases,
though.

--
Jens Axboe