Re: [GIT PULL] bcachefs

From: Darrick J. Wong
Date: Thu Aug 10 2023 - 18:39:48 EST


On Thu, Aug 10, 2023 at 11:54:53AM -0400, Kent Overstreet wrote:
> Adding Jens to the CC:

<snip to the parts I care most about>

> > and the whole crazy discussion about fput being delayed. It
> > is what it is, and the patches I saw in this thread to not delay them
> > were bad.
>
> Jens claimed AIO was broken in the same way as io_uring, but it turned
> out that it's not - the test he posted was broken.
>
> And io_uring really is broken here. Look, the tests that are breaking
> because of this are important ones (generic/388 in particular), and
> those tests are no good to us if they're failing because of io_uring
> crap and Jens is throwing up his hands and saying "trainwreck!" when we
> try to get it fixed.

FWIW I recently fixed all my stupid debian package dependencies so that
I could actually install liburing again, and rebuilt fstests. The very
next morning I noticed a number of new test failures in /exactly/ the
way that Kent said to expect:

fsstress -d /mnt & <sleep then simulate fs crash>; \
umount /mnt; mount /dev/sda /mnt

Here, umount exits before the filesystem is really torn down, and then
mount fails because it can't get an exclusive lock on the device. As a
result, I can't test crash recovery or corrupted metadata shutdowns
because of this delayed fput thing or whatever. It all worked before
(even with libaio in use) I turned on io_uring.

Obviously, I "fixed" this by modifying fsstress to require explicit
enabling of io_uring operations; everything went back to green after
that.

I'm not familiar enough with the kernel side of io_uring to know what
the solution here is; I'm merely here to provide a second data point.

<snip again>

> > The thing that actually bothers me most about this all is the personal
> > arguments I saw. That I don't know what to do about. I don't actually
> > want to merge this over the objections of Christian, now that we have
> > a responsible vfs maintainer.
>
> I don't want to do that to Christian either, I think highly of the work
> he's been doing and I don't want to be adding to his frustration. So I
> apologize for losing my cool earlier; a lot of that was frustration
> from other threads spilling over.
>
> But: if he's going to be raising objections, I need to know what his
> concerns are if we're going to get anywhere. Raising objections without
> saying what the concerns are shuts down discussion; I don't think it's
> unreasonable to ask people not to do that, and to try and stay focused
> on the code.

Yeah, I'm also really happy that we have a new/second VFS maintainer. I
figure it's going to take us a while to help Christian to get past his
fear and horror at the things lurking in fs/ but that's something worth
doing.

(I'm not presuming to know what Christian feels about the VFS; 'fear and
horror' is what *I* feel every time I have to go digging down there.
I'm extrapolating about what I would need, were I a new maintainer, to
get myself to the point where I would have an open enough mind to engage
with new or unfamiliar concepts so that a review cycle for something as
big as bcachefs/online fsck/whatever would be productive.)

> He's got an open invite to the bcachefs meeting, and we were scheduled
> to talk Tuesday but he was out sick - anyways, I'm looking forward to
> hearing what he has to say.
>
> More broadly, it would make me really happy if we could get certain
> people to take a more constructive, "what do we really care about here
> and how do we move forward" attitude

...and "what are all the supporting structures that we need to have in
place to maximize the chances that we'll accomplish those goals"?

> instead of turning every
> interaction into an opportunity to dig their heels in on process and
> throw up barriers.
>
> That burns people out, fast. And it's getting to be a problem in
> -fsdevel land;

Past-participle, not present. :/

I've said this previously, and I'll say it again: we're severely
under-resourced. Not just XFS, the whole fsdevel community. As a
developer and later a maintainer, I've learnt the hard way that a very
large amount of non-coding work is necessary to build a good
filesystem. There's enough not-really-coding work for several people.
Instead, we lean hard on maintainers to do all that work. That might've
worked acceptably for the first 20 years, but it doesn't now.

Nowadays we have all these people running bots and AIs throwing a steady
stream of bug reports and CVE reports at Dave [Chinner] and me. Most of
these people *do not* help fix the problems they report. Once in a
while there's an actual *user* report about data loss, but those
(thankfully) aren't the majority of the reports.

However, every one of these reports has to be triaged, analyzed, and
dealt with. As soon as we clear one, at least one more rolls in. You
know what that means? Dave and I are both in a permanent state of
heightened alert, fear, and stress. We never get to settle back down to
calm. Every time someone brings up syzbot, CVEs, or security? I feel
my own stress response ramping up. I can no longer have "rational"
conversations about syzbot because those discussions push my buttons.

This is not healthy!

Add to that the many demands to backport this and that to dozens of LTS
kernels and distro kernels. Why do the participation modes for that
seem to be (a) take on an immense amount of backporting work that you
didn't ask for; or (b) let a non-public ML thing pick patches and get
yelled at when it does the wrong thing? Nobody ever asked me if I
thought the XFS community could support such-and-such LTS kernel.

As the final insult, other people pile on by offering useless opinions
about the maintainers being far behind and unhelpful suggestions that we
engage in a major codebase rewrite. None of this is helpful.

Dave and I are both burned out. I'm not sure Dave ever got past the
2017 burnout that led to his resignation. Remarkably, he's still
around. Is this (extended burnout) where I want to be in 2024? 2030?
Hell no.

I still have enough left that I want to help us adapt our culture
to solve these problems. I tried to get the conversation started with
the maintainer entry profile for XFS that I recently submitted, but that
alone cannot be the final product:
https://lore.kernel.org/linux-xfs/169116629797.3243794.7024231508559123519.stgit@frogsfrogsfrogs/T/#m74bac05414cfba214f5cfa24a0b1e940135e0bed

Being maintainer feels like a punishment, and that cannot stand.
We need help.

People see the kinds of interpersonal interactions going on here and
decide to pursue any other career path. I know so; some have told me
themselves.

You know what's really sad? Most of my friends work for small
companies, nonprofits, and local governments. They report the same
problems with overwork, pervasive fear and anger, and struggle to
understand and adapt to new ideas that I observe here. They see the
direct connection between their org's lack of revenue and the
under-resourcedness.

They /don't/ understand why the hell the same happens to me and my
workplace proximity associates, when we all work for companies that
each clear hundreds of billions of dollars in revenue per year.

(Well, they do understand: GREED. They don't get why we put up with
this situation, or why we don't advocate louder for making things
better.)

> I've lost count of the times I've heard Eric Sandeen
> complain about how impossible it is to get things merged,

A group dynamic that I keep observing around here is that someone tries
to introduce some unfamiliar (or even slightly new) concept, because
they want the kernel to do something it didn't do before. The author
sends out patches for review, and some of the reviewers who show up
sound like they're so afraid of ... something ... that they throw out
vague arguments that something might break.

[I have had people tell me in private that while they don't have any
specific complaints about online fsck, "something" is wrong and I need
to stop and consider more thoroughly. Consider /what/?]

Or, worse, no reviewers show up. The author merges it, and a month
later there's a freakout because something somewhere else broke. Angry
threads spread around fsdevel because now there's pressure to get it
fixed before -rc8 (in the good case) or ASAP (because now it's
released). Did the author have an incomplete understanding of the code?
Were there potential reviewers who might've said something but bailed?
Yes and yes.

What do we need to reduce the amount of fear and anger around here,
anyway? 20 years ago when I started my career in Linux I found the work
to be challenging and enjoyable. Now I see a lot more anger, and I am
sad, because there /are/ still enjoyable challenges to be undertaken.
Can we please have that conversation?

People and groups do not do well when they feel like they're under
constant attack, like they have to brace themselves for whatever
bullshit is coming next. That is how I feel most weeks, and I choose
not to do that anymore.

> and I _really_
> hope people are taking notice about Darrick stepping away from XFS and
> asking themselves what needs to be sorted out.

Me too. Ted expressed similar laments about ext4 after I announced my
intention to reduce my own commitments to XFS.

> Darrick writes
> meticulous, well documented code; when I think of people who slip by
> hacks other people are going to regret later, he's not one of them.

I appreciate the compliment. ;)

From what I can tell (because I lolquit and finally had time to start
scanning the bcachefs code) I really like the thought that you've put
into indexing and record iteration in the filesystem. I appreciate the
amount of work you've put into making it easy and fast to run QA on
bcachefs, even if we don't quite agree on whether or not I should rip
and replace my 20yo Debian crazyquilt.

> And yet, online fsck for XFS has been pushed back repeatedly because
> of petty bullshit.

A broader dynamic here is that I ask people to review the code so that I
can merge it; they say they will do it; and then an entire cycle goes by
without any visible progress.

When I ask these people why they didn't follow through on their
commitments, the responses I hear are pretty uniform -- they got buried
in root cause analysis of a real bug report but lol there were no other
senior people available; their time ended up being spent on backports or
arguing about backports; or they got caught up in that whole freakout
thing I described above.

> Scaling laws being what they are, that's a feature we're going to need,
> and more importantly XFS cannot afford to lose more people - especially
> Darrick.

While I was maintainer I lobbied managers at Oracle and Google and RH to
hire new people to grow the size of the XFS community, and they did.
That was awesome! It's not so hard to help managers come up with
business justifications for headcount for critical pieces of their
products*.

But.

For 2023 XFS is already down 2 people + whatever the hell I was doing
that isn't "trying to get online fsck merged". We're still at +1, but
still who's going to replace us oldtimers?

--D

* But f*** impossible to get that done when it's someone's 20% project
causing a lot of friction on the mailing lists.

> To speak a bit to what's been driving _me_ a bit nuts in these
> discussions, top of my list is that the guy who's been the most
> obstinate and argumentative _to this day_ refuses to CC me when touching
> code I wrote - and as a result we've had some really nasty bugs (memory
> corruption, _silent data corruption_).
>
> So that really needs to change. Let's just please have a little more
> focus on not eating people's data, and being more responsible about
> bugs.
>
> Anyways, I just want to write the best code I can. That's all I care
> about, and I'm always happy to interact with people who share that goal.
>
> Cheers,
> Kent