Re: [GIT PULL] bcachefs

From: Christian Brauner
Date: Wed Jun 28 2023 - 13:33:39 EST


On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote:
> On 6/28/23 8:58?AM, Jens Axboe wrote:
> > On 6/27/23 10:01?PM, Kent Overstreet wrote:
> >> On Tue, Jun 27, 2023 at 09:16:31PM -0600, Jens Axboe wrote:
> >>> On 6/27/23 2:15?PM, Kent Overstreet wrote:
> >>>>> to ktest/tests/xfstests/ and run it with -bcachefs, otherwise it kept
> >>>>> failing because it assumed it was XFS.
> >>>>>
> >>>>> I suspected this was just a timing issue, and it looks like that's
> >>>>> exactly what it is. Looking at the test case, it'll randomly kill -9
> >>>>> fsstress, and if that happens while we have io_uring IO pending, then we
> >>>>> process completions inline (for a PF_EXITING current). This means they
> >>>>> get pushed to fallback work, which runs out of line. If we hit that case
> >>>>> AND the timing is such that it hasn't been processed yet, we'll still be
> >>>>> holding a file reference under the mount point and umount will -EBUSY
> >>>>> fail.
> >>>>>
> >>>>> As far as I can tell, this can happen with aio as well, it's just harder
> >>>>> to hit. If the fput happens while the task is exiting, then fput will
> >>>>> end up being delayed through a workqueue as well. The test case assumes
> >>>>> that once it's reaped the exit of the killed task that all files are
> >>>>> released, which isn't necessarily true if they are done out-of-line.
> >>>>
> >>>> Yeah, I traced it through to the delayed fput code as well.
> >>>>
> >>>> I'm not sure delayed fput is responsible here; what I learned when I was
> >>>> tracking this down has mostly fell out of my brain, so take anything I
> >>>> say with a large grain of salt. But I believe I tested with delayed_fput
> >>>> completely disabled, and found another thing in io_uring with the same
> >>>> effect as delayed_fput that wasn't being flushed.
> >>>
> >>> I'm not saying it's delayed_fput(), I'm saying it's the delayed putting
> >>> io_uring can end up doing. But yes, delayed_fput() is another candidate.
> >>
> >> Sorry - was just working through my recollections/initial thought
> >> process out loud
> >
> > No worries, it might actually be a combination and this is why my
> > io_uring side patch didn't fully resolve it. Wrote a simple reproducer
> > and it seems to reliably trigger it, but is fixed with an flush of the
> > delayed fput list on mount -EBUSY return. Still digging...
>
> I discussed this with Christian offline. I have a patch that is pretty
> simple, but it does mean that you'd wait for delayed fput flush off
> umount. Which seems kind of iffy.
>
> I think we need to back up a bit and consider if the kill && umount
> really is sane. If you kill a task that has open files, then any fput
> from that task will end up being delayed. This means that the umount may
> very well fail.

That's why we have MNT_DETACH:

umount2("/mnt", MNT_DETACH)

will succeed even if fds are still open.

>
> It'd be handy if we could have umount wait for that to finish, but I'm
> not at all confident this is a sane solution for all cases. And as
> discussed, we have no way to even identify which files we'd need to
> flush out of the delayed list.
>
> Maybe the test case just needs fixing? Christian suggested lazy/detach
> umount and wait for sb release. There's an fsnotify hook for that,
> fsnotify_sb_delete(). Obviously this is a bit more involved, but seems
> to me that this would be the way to make it more reliable when killing
> of tasks with open files are involved.

You can wait on superblock destruction today in multiple ways. Roughly
from the shell this should work:

root@wittgenstein:~# cat sb_wait.sh
#! /bin/bash

echo "WARNING WARNING: I SUCK AT SHELL SCRIPTS"

echo "mount fs"
sudo mount -t tmpfs tmpfs /mnt
touch /mnt/bla

echo "pin sb by random file for 5s"
sleep 5 > /mnt/bla &

echo "establish inotify watch for sb destruction"
inotifywait -e unmount /mnt &
pid=$!

echo "regular umount - will fail..."
umount /mnt

findmnt | grep "/mnt"

echo "lazily umount - will succeed"
umount -l /mnt

findmnt | grep "/mnt"

echo "and now we wait"
wait $!

echo "done"

Can also use a tiny bpf lsm, fanotify in the future as we plans for
that.