Re: [GIT PULL] gfs2 fix

From: Linus Torvalds
Date: Wed Apr 27 2022 - 18:20:32 EST


On Wed, Apr 27, 2022 at 2:26 PM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote:
>
> Well, POSIX explicitly mentions those atomicity expectations, e.g.,
> for read [1]:

Yes. I'm aware. And my point is that we've never done _that_ kind of atomicity.

It's also somewhat ambiguous what it actually means, since what it
then talks about is "all bytes that started out together end
together" and "interleaving".

That all implies that it's about the *position* of the reads and
writes being atomic, not the *data* of the reads and writes.

That, btw, was something we honored even before we had the locking
around f_pos accesses - a read or write system call would get its own
local *copy* of the file position, the read or write would then do its
IO based on that copied position - so that things that "started out
together end together" - and then after the operation is done it
would *update* the file position atomically.

Note that that is exactly so that data would end up "together". But it
would mean that two concurrent reads using the same file position
might read the *same* area of the file.

Which still honors that "the read is atomic wrt the range", but
obviously the actual value of "f_pos" is basically random after the
read (ie is it the end of the first read, or the end of the second
read?).
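
Roughly, as a userspace sketch of that "copy, do the IO, then publish"
pattern (the names are made up for illustration - this is not the
actual kernel code, just the shape of it):

    /*
     * Hypothetical userspace analogy of the pre-locking behavior described
     * above: snapshot the shared position, do the IO against the snapshot,
     * and only publish the new position afterwards.  Two threads calling
     * do_read() at the same time can both see the same starting position,
     * read the *same* area, and the final position is whichever store
     * happens to land last.
     */
    #include <stdatomic.h>
    #include <stdio.h>
    #include <string.h>

    static _Atomic long shared_pos;                 /* stands in for file->f_pos */
    static const char file_data[] = "hello, world\n";

    static long do_read(char *buf, long count)
    {
            long pos = atomic_load(&shared_pos);    /* local copy of the position */
            long left = (long)sizeof(file_data) - 1 - pos;

            if (left < 0)
                    left = 0;
            if (count > left)
                    count = left;
            memcpy(buf, file_data + pos, count);    /* IO based on the copy */

            atomic_store(&shared_pos, pos + count); /* publish the new position */
            return count;
    }

    int main(void)
    {
            char buf[8];
            long n = do_read(buf, sizeof(buf));

            printf("read %ld bytes: %.*s\n", n, (int)n, buf);
            return 0;
    }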

The same paragraph also explicitly mentions pipes and FIFOs, despite
an earlier paragraph dismissing them, which is all just a sign of
things being very confused.

Anyway, I'm not objecting very strenuously to making it very clear in
some documentation that this "data atomicity" is not what Linux has
ever done. If you do overlapping IO, you get what you deserve.

But I do have objections.

On one hand, it's not all that different from some of the other notes
we have in the man-pages (ie documenting that whole "just under 2GB"
limit on the read size, although that's actually using the wrong
constant: it's not 0x7ffff000 bytes, it's MAX_RW_COUNT, which is
"INT_MAX & PAGE_MASK", and that constant in the man-page is as such
only true on a system with a 4kB page size).
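
For reference, the arithmetic (just a standalone check, assuming a 4kB
page size; PAGE_MASK here mimics the kernel's ~(PAGE_SIZE - 1)):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
            long page_size = 4096;                   /* assume 4kB pages */
            long page_mask = ~(page_size - 1);       /* like the kernel's PAGE_MASK */
            long max_rw_count = INT_MAX & page_mask; /* like the kernel's MAX_RW_COUNT */

            /* prints 0x7ffff000 - the man-page number - only for 4kB pages */
            printf("MAX_RW_COUNT = %#lx\n", max_rw_count);
            return 0;
    }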

BUT! I'm 100% convinced that NOBODY HAS EVER given the kind of
atomicity guarantees that you would see from reading that document as
a language-lawyer.

For example, that section "2.9.7 Thread Interactions with Regular File
Operations" says that "fstat()" is atomic wrt "write()", and that you
should see "all or nothing".

I *GUARANTEE* that no operating system ever has done that, and I
further claim that reading it the way you read it is not only against
reality, it's against sanity.

Example: if I do a big write to a file that I just created, do you
really want "fstat()" in another thread or process to not even be able
to see how the file grows as the write happens?
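
To make that concrete, here's a throwaway test (the file name and
sizes are arbitrary, and exactly what sizes you see depends on the
kernel and filesystem - but you'll typically watch st_size grow while
the single write() is still in flight):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            const char *path = "bigfile.tmp";
            size_t len = 1UL << 30;                 /* one 1GiB write() */
            char *buf = calloc(1, len);
            int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);

            if (!buf || fd < 0) {
                    perror("setup");
                    return 1;
            }

            if (fork() == 0) {                      /* the observer */
                    struct stat st;

                    for (int i = 0; i < 20; i++) {
                            if (stat(path, &st) == 0)
                                    printf("observed size: %lld\n",
                                           (long long)st.st_size);
                            usleep(10000);
                    }
                    _exit(0);
            }

            /* one big write(); the observer typically sees it grow */
            if (write(fd, buf, len) < 0)
                    perror("write");

            wait(NULL);
            close(fd);
            unlink(path);
            free(buf);
            return 0;
    }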

It's not what anybody has *EVER* done, I'm pretty sure.

So I really think

(a) you are mis-reading the standard by attributing too strong a
logic to paperwork that is English prose and not all that exact

(b) documenting Linux as not doing what you mis-read it to require is
only encouraging others to mis-read it too

The whole "arbitrary writes have to be all-or-nothing wrt all other
system calls" is simply not realistic, and has never been. Not just
not in Linux, but in *ANY* operating system that POSIX was meant to
describe.

And equally importantly: if some crazy person were to actually try to
implement such "true atomicity" things, the end result would be
objectively worse. Because you literally *want* to see a big write()
updating the file length as the write happens.

The fact that the standard then doesn't take those kinds of details
into account is simply because the standard isn't meant to be read as
a language lawyer, but as a "realistically .."

Linus