Re: [GIT PULL] bcachefs

From: Jens Axboe
Date: Wed Jun 28 2023 - 17:23:11 EST


On 6/28/23 2:44 PM, Jens Axboe wrote:
> On 6/28/23 11:52 AM, Kent Overstreet wrote:
>> On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote:
>>> I discussed this with Christian offline. I have a patch that is pretty
>>> simple, but it does mean that you'd wait for delayed fput flush off
>>> umount. Which seems kind of iffy.
>>>
>>> I think we need to back up a bit and consider if the kill && umount
>>> really is sane. If you kill a task that has open files, then any fput
>>> from that task will end up being delayed. This means that the umount may
>>> very well fail.
>>>
>>> It'd be handy if we could have umount wait for that to finish, but I'm
>>> not at all confident this is a sane solution for all cases. And as
>>> discussed, we have no way to even identify which files we'd need to
>>> flush out of the delayed list.
>>>
>>> Maybe the test case just needs fixing? Christian suggested lazy/detach
>>> umount and wait for sb release. There's an fsnotify hook for that,
>>> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems
>>> to me that this would be the way to make it more reliable when killing
>>> of tasks with open files are involved.
>>
>> No, this is a real breakage. Any time we introduce unexpected
>> asynchrony there's the potential for breakage: case in point, there was
>> a filesystem that made rm asynchronous, and then there were scripts
>> out there that deleted files until df dropped below some threshold...
>> whoops.
>
> This is nothing new - any fput done from an exiting task will end up
> being deferred. The window may be a bit wider now or a bit different,
> but it's the same window. If an application assumes it can kill && wait
> on a task and be guaranteed that the files are released as soon as wait
> returns, it is mistaken. That is NOT the case.
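
To make that concrete, here's a minimal sketch of the pattern being
relied on (not from any test suite, and whether this exact window is
wide enough to hit reliably will vary):

#!/bin/bash
# SIGKILL a task holding a file open on the mount, wait for it, then
# umount right away. The exiting task's fput is deferred, so umount
# can see the mount as busy even though wait has already returned.
MNT=/data

touch $MNT/foo
sleep 30 < $MNT/foo &           # any task with an open file on $MNT
kill -9 $!
wait $! 2> /dev/null            # returns once the task has exited...
sudo umount $MNT                # ...but a deferred fput may still pin it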

Case in point, I just changed my reproducer to use aio instead of
io_uring. Here's the full script:

#!/bin/bash

DEV=/dev/nvme1n1
MNT=/data
ITER=0

while true; do
    echo loop $ITER
    sudo mount $DEV $MNT
    fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo \
        --size=1g --buffered=1 --overwrite=0 --numjobs=12 \
        --minimal --rw=randread --output=/dev/null &
    # let fio run for a random 0.0-2.9 seconds before killing it
    Y=$(($RANDOM % 3))
    X=$(($RANDOM % 10))
    VAL="$Y.$X"
    sleep $VAL
    # SIGKILL fio and loop until no fio processes remain
    while ps -e | grep fio > /dev/null 2>&1; do
        killall -9 fio > /dev/null 2>&1
        echo will wait
        wait > /dev/null 2>&1
        echo done waiting
    done
    # the fio tasks are gone at this point, but their deferred fputs
    # may not have run yet - this is what races with the umount
    sudo umount $MNT
    if [ $? -ne 0 ]; then
        break
    fi
    ((ITER++))
done

and if I run that, it fails on the first umount attempt in that loop:

axboe@m1max-kvm ~> bash test2.sh
loop 0
will wait
done waiting
umount: /data: target is busy.
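
Note that once this happens there's nothing to point a finger at,
either: the pinning file isn't in any task's fd table anymore, so I'd
expect the usual suspects to come up empty while the mount stays busy
(assuming nothing else is using the mount point):

sudo lsof +f -- /data
sudo fuser -vm /data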

So yeah, this is _nothing_ new. I really don't think trying to address
this in the kernel is the right approach; it'd be a lot saner to harden
the xfstests side to handle the umount more gracefully. There are
obviously tons of other ways that a mount could get pinned, though that
isn't too relevant here since the bdev and mount point are basically
exclusive to the test being run. But the kill-and-delayed-fput case
alone is enough to make that argument, imho.
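
For the test side, a sketch of what that hardening could look like
(helper name and retry budget made up; the lazy fallback is along the
lines of what Christian suggested):

# retry the umount for a bounded time to ride out deferred fputs,
# then fall back to a lazy detach and let the sb go away on its own
umount_retry() {
    local mnt=$1
    for i in $(seq 1 50); do
        umount "$mnt" 2> /dev/null && return 0
        sleep 0.1       # give any deferred fput a chance to run
    done
    umount -l "$mnt"    # detach now; sb is released on the last fput
}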

--
Jens Axboe