Re: [GIT PULL] bcachefs

From: Kent Overstreet
Date: Thu Aug 10 2023 - 22:47:18 EST


On Thu, Aug 10, 2023 at 07:52:05PM +0200, Jan Kara wrote:
> On Thu 10-08-23 11:54:53, Kent Overstreet wrote:
> > > And there clearly is something very strange going on with superblock
> > > handling
> >
> > This deserves an explanation because sget() is a bit nutty.
> >
> > The way sget() is conventionally used for block device filesystems, the
> > block device open _isn't actually exclusive_ - sure, FMODE_EXCL is used,
> > but the holder is the fs type pointer, so it won't exclude with other
> > opens of the same fs type.
> >
> > That means the only protection from multiple opens scribbling over each
> > other is sget() itself - but if the bdev handle ever outlives the
> > superblock we're completely screwed; that's a silent data corruption bug
> > that we can't easily catch, and if the filesystem teardown path has any
> > asynchronous stuff going on (and of course it does) that's not a hard
> > mistake to make. I've observed at least one bug that looked suspiciously
> > like that, but I don't think I quite pinned it down at the time.
>
> This is just being changed - check Christian's VFS tree. There are patches
> that make sget() use superblock pointer as a bdev holder so the reuse
> you're speaking about isn't a problem anymore.

So then the question is what do you use for identifying the superblock,
and you're switching to the dev_t - interesting.

Are we 100% sure that will never break, that a dev_t will always
identify a unique block_device? Namespacing has been changing things.

> > It also forces the caller to separate opening of the block devices from
> > the rest of filesystem initialization, which is a bit less than ideal.
> >
> > Anyways, bcachefs just wants to be able to do real exclusive opens of
> > the block devices, and we do all filesystem bringup with a single
> > bch2_fs_open() call. I think this could be made to work with the way
> > sget() wants to work, but it'd require reworking the locking in
> > sget() - it does everything, including the test() and set() calls, under
> > a single spinlock.
>
> Yeah. Maybe the current upstream changes aren't enough to make your life
> easier for bcachefs, btrfs does its special thing as well after all because
> mount also involves multiple devices for it. I just wanted to mention that
> the exclusive bdev open thing is changing.

I like the mount_bdev() approach in your patch a _lot_ better than the
old code, I think the approach almost works for multi device
filesystems - at least for bcachefs where we always pass in the full
list of devices we want to open, there's no kernel side probing like in
btrfs.

What changes is we'd have to pass a vector of dev_t's to sget(), and
set() needs to be able to stash them in super_block (not s_fs_info, we
need that for bch_fs later and that doesn't exist yet). But that's a
minor detail.

Yeah, this could work.