Re: Linux 2.6.29

From: Theodore Tso
Date: Fri Mar 27 2009 - 01:14:04 EST


On Fri, Mar 27, 2009 at 03:47:05AM +0000, Matthew Garrett wrote:
> Oh, for the love of a whole range of mythological figures. ext3 didn't
> train application programmers that they could be careless about fsync().
> It gave them functionality that they wanted, ie the ability to do things
> like rename a file over another one with the expectation that these
> operations would actually occur in the same order that they were
> generated. More to the point, it let them do this *without* having to
> call fsync(), resulting in a significant improvement in filesystem
> usability.

Matthew,

There were plenty of applications that were written for Unix *and*
Linux systems before ext3 existed, and they worked just fine. Back
then, people were drilled into the fact that they needed to use
fsync(), and fsync() wasn't expensive, so it wasn't a big deal in
terms of usability. The fact that fsync() became expensive is precisely
because of ext3's data=ordered problem. Writing files safely meant
that you had to check error returns from fsync() *and* close().
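
[Editor's illustration, not part of the original message.] A minimal
sketch of the "write safely" pattern described above, assuming a POSIX
system. The function name write_file_safely() and its tmp_path
parameter are hypothetical, chosen only for this example. It writes to
a temporary file, checks the return values of write(), fsync() *and*
close(), and only then renames the temporary file over the target:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `buf`,
 * staging the data in `tmp_path` (assumed to be on the same filesystem).
 * Returns 0 on success, -1 on failure with errno set. */
int write_file_safely(const char *path, const char *tmp_path,
                      const void *buf, size_t len)
{
    int saved;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until done. */
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail_close;
        }
        p += n;
        left -= (size_t)n;
    }

    /* Force the data to stable storage; an I/O error shows up here. */
    if (fsync(fd) < 0)
        goto fail_close;

    /* close() can also report an error, so check it too. */
    if (close(fd) < 0)
        goto fail_unlink;

    /* Only now replace the old file with the fully written new one. */
    if (rename(tmp_path, path) < 0)
        goto fail_unlink;

    return 0;

fail_close:
    saved = errno;
    close(fd);
    errno = saved;
fail_unlink:
    saved = errno;
    unlink(tmp_path);   /* best-effort cleanup of the temporary file */
    errno = saved;
    return -1;
}
```

A caller might use it as write_file_safely("config", "config.tmp",
data, len) and report strerror(errno) if it returns -1. This sketch
does not fsync() the containing directory, which some applications
also do after the rename.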

In fact, if you care about making sure that data doesn't get lost due
to disk errors, you *must* call fsync(). Pavel may have complained
that fsync() can sometimes drop errors if some other process also has
the file open and calls fsync() --- but if you don't, and you rely on
ext3 to magically write the data blocks out as a side effect of the
commit in data=ordered mode, there's no way to signal the write error
to the application, and you are *guaranteed* to lose the I/O error
indication.

I can tell you quite authoritatively that we didn't implement
data=ordered to make life easier for application writers, and
application writers didn't come to ext3 developers asking for this
convenience. It may have *accidentally* given them convenience that
they wanted, but it also made fsync() slow.

> I'm utterly and screamingly bored of this "Blame userspace" attitude.

I'm not blaming userspace. I'm blaming ourselves, for implementing an
attractive nuisance, and not realizing that we had implemented an
attractive nuisance; which, years later, is also responsible for these
latency problems, both with and without fsync() --- *and* which has
also trained people into believing that fsync() is always expensive,
and must be avoided at all costs --- which had not previously been
true!

If I had to do it all over again, I would have argued with Stephen
about making data=writeback the default, which would have provided
behaviour on crash just like ext2, except that we wouldn't have to
fsck the partition afterwards. Back then, people lived with the
potential security exposure on a crash, and they lived with the fact
that you had to use fsync(), or manually type "sync", if you wanted to
guarantee that data would be safely written to disk. And you know
what? Things had been this way with Unix systems for 31 years before
ext3 came on the scene, and things worked pretty well during those
three decades.

So again, let me make it clear, I'm not "blaming userspace". I'm
blaming ext3 data=ordered mode. But it's trained application writers
to program systems a certain way, and it's trained them to assume that
fsync() is always evil, and they outnumber us kernel programmers, and
so we are where we are. And data=ordered mode is also responsible for
these write latency problems which seem to make Ingo so cranky ---
and rightly so. It all comes from the same source.

- Ted