Re: [GIT PULL] bcachefs

From: Kent Overstreet
Date: Thu Aug 10 2023 - 14:02:55 EST


On Thu, Aug 10, 2023 at 09:40:08AM -0700, Linus Torvalds wrote:
> > > Some of the other oddity is around the this_cpu ops, but I suspect
> > > that is at least partly then because we don't have acquire/release
> > > versions of the local cpu ops that the code looks like it would want.
> >
> > You mean using full barriers where acquire/release would be sufficient?
>
> Yes.
>
> That code looks like it should work, but be hugely less efficient than
> it might be. "smp_mb()" tends to be expensive everywhere, even x86.

do_six_unlock_type() doesn't need a full barrier, but I'm not sure we
can avoid the one in __do_six_trylock(), in the percpu reader path.

> Of course, I might be missing some other cases. That percpu reader
> queue worries me a bit just because it ends up generating ordering
> based on two different things - the lock word _and_ the percpu word.
>
> And I get very nervous if the final "this gets the lock" isn't some
> obvious "try_cmpxchg_acquire()" or similar, just because we've
> historically had *so* many very subtle bugs in just about every single
> lock we've ever had.

kernel/locking/percpu-rwsem.c uses the same idea. The difference is that
percpu-rwsem avoids the memory barrier on the read side in the fast path
at the cost of requiring an rcu barrier on the write side... and all the
crazyness that entails.

But __percpu_down_read_trylock() uses the same algorithm I'm using,
including the same smp_mb(): we need to ensure that the read of the lock
state happens after the store to the percpu read count, and I don't know
how to that without a smp_mb() - smp_store_acquire() isn't a thing.

> > Matthew was planning on sending the iov_iter patch to you - right around
> > now, I believe, as a bugfix, since right now
> > copy_page_from_iter_atomic() silently does crazy things if you pass it a
> > compound page.
> >
> > Block layer patches aside, are there any _others_ you really want to go
> > via maintainers?
>
> It was mainly just the iov and the block layer.
>
> The superblock cases I really don't understand why you insist on just
> being different from everybody else.
>
> Your exclusivity arguments make no sense to me. Just open the damn
> thing. No other filesystem has ever had the fundamental problems you
> describe. You can do any exclusivity test you want in the
> "test()/set()" functions passed to sget().

When using sget() in the conventional way it's not possible for
FMODE_EXCL to protect against concurrent opens scribbling over each
other because we open the block device before checking if it's already
mounted, and we expect that open to succeed.

> You say that it's a problem because of a "single spinlock", but it
> hasn't been a problem for anybody else.

The spinlock means you can't do the actual open in set(), which is why
the block device has to be opened in not-really-exclusive mode.

I think it's be possible to change the locking in sget() so that the
set() callback could do the open, but I haven't looked closely at it.

> and basically sent your first pull request as a fait-accompli.

When did I ever do that?