Re: Linux 2.6.29

From: Jeff Garzik
Date: Sat Mar 28 2009 - 21:20:18 EST


Mark Lord wrote:
> The better solution seems to be the rather obvious one:
>
>    the filesystem should commit data to disk before altering metadata.
>
> Much easier and more reliable to centralize it there, rather than
> rely (falsely) upon thousands of programs each performing numerous
> performance-killing fsync's.

Firstly, the FS data/metadata write-out order says nothing about when the write-out is started by the OS. It only implies consistency in the face of a crash during write-out. Hooray for BSD soft-updates.

If the write-out is started immediately during or after write(2), congratulations, you are on your way to reinventing synchronous writes.

If the write-out does not start immediately, then you have a many-seconds window for data loss. And it should be self-evident that userland application writers will have some situations where design requirements dictate minimizing or eliminating that window.
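
For illustration, here is a minimal sketch of how an application closes that window itself, assuming a POSIX system. The helper name write_durably() is hypothetical and not taken from any real program; the only calls used are write(2) and fsync(2).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf to fd, then block until the
 * kernel reports that write-out has completed.
 * Returns 0 on success, -1 on error (errno is set). */
int write_durably(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written */
            return -1;
        }
        p += n;                 /* short write: advance and keep going */
        len -= (size_t)n;
    }

    /* Without this call the data may sit in the page cache for seconds. */
    return fsync(fd);
}
```

The cost is exactly the tradeoff described above: the caller blocks until the flush returns, in exchange for not depending on whenever the OS decides to start write-out.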


Secondly, this email sub-thread is not talking about thousands of programs; it is talking about Firefox behavior. Firefox is a multi-OS portable application with a design requirement that user data must be protected against crashes (the same concept as your word processor's auto-save feature).

The author of such a portable application must ensure their app saves data against Windows Vista kernel crashes, HP-UX kernel crashes, OS X window system crashes, X11 window system crashes, application crashes, etc.

Can a portable app really rely on what Linux kernel hackers think the underlying filesystem _should_ do?

No. It is either (a) not going to care at all, or (b) going to use fsync(2) or FlushFileBuffers() because of the guarantees those calls provide across the OS spectrum, in light of the myriad OS filesystem caching, flushing, and ordering algorithms.
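
A sketch of what option (b) tends to look like in portable C. The wrapper name portable_flush() is made up for this example; it assumes the file was opened through the C runtime so that a file descriptor is available on both platforms.

```c
#ifdef _WIN32
#include <io.h>
#include <windows.h>
#else
#include <unistd.h>
#endif

/* Hypothetical wrapper: flush one file descriptor to disk using whatever
 * the host OS provides. Returns 0 on success, -1 on failure. */
int portable_flush(int fd)
{
#ifdef _WIN32
    /* Map the CRT descriptor to the underlying Win32 handle. */
    HANDLE h = (HANDLE)_get_osfhandle(fd);
    if (h == INVALID_HANDLE_VALUE)
        return -1;
    return FlushFileBuffers(h) ? 0 : -1;
#else
    return fsync(fd);
#endif
}
```

Nothing in the wrapper depends on what a particular filesystem does about data/metadata ordering, which is why a portable app reaches for it.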



Was the BSD soft-updates idea of FS data-before-metadata a good one? Yes. Obviously.

It is the cornerstone of every SANE journalling-esque database or filesystem out there -- don't leave a window where your metadata is inconsistent. "Duh" :)

But that says nothing about the case where a userland app's design requirements include ordered writes+flushes of its own application data. That is the common case when a userland app like Firefox uses a transactional database such as sqlite or db4.

Thus it is the height of silliness to think that FS data/metadata write-out order permits elimination of fsync(2) for the class of application that must care about ordered writes/flushes of its own application data.
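
To make "ordered writes/flushes of its own application data" concrete, here is a simplified write-ahead sketch, assuming POSIX. It is not how sqlite or db4 actually implement their journals; commit_update() and its parameters are invented for illustration.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical commit step for a write-ahead journal.
 * The journal record must be durable before the main file is modified,
 * so that a crash in between can be replayed from the journal.
 * Short writes are treated as failures to keep the sketch brief. */
int commit_update(int journal_fd, int data_fd, off_t offset,
                  const void *rec, size_t len)
{
    /* 1. Append the intended change to the journal. */
    if (write(journal_fd, rec, len) != (ssize_t)len)
        return -1;

    /* 2. Barrier: do not touch the main file until the journal is on disk. */
    if (fsync(journal_fd) != 0)
        return -1;

    /* 3. Apply the change in place. */
    if (pwrite(data_fd, rec, len, offset) != (ssize_t)len)
        return -1;

    /* 4. Make the in-place change durable before the journal is reused. */
    return fsync(data_fd);
}
```

The flush in step 2 orders two of the application's own files against each other. Filesystem data-before-metadata ordering works within a single file's update and cannot express that dependency.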

That upstream sqlite replaced fsync(2) with fdatasync(2) makes it obvious that FS data/metadata write-out order is irrelevant to Firefox.
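
For reference, fdatasync(2) flushes the file's data plus only the metadata needed to read that data back (a changed file size, for example), and may skip things like timestamps. A small sketch, with sync_data_only() as a hypothetical name, that falls back to fsync(2) where the system does not advertise fdatasync():

```c
#include <unistd.h>

/* Hypothetical helper: flush file contents without forcing out metadata
 * that is not needed to retrieve them. Returns 0 on success, -1 on error. */
int sync_data_only(int fd)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
    return fdatasync(fd);
#else
    return fsync(fd);   /* fdatasync() not advertised here: use the full flush */
#endif
}
```

Either way the application still issues an explicit flush at each commit point; only the amount of metadata dragged along changes.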

The issue with transactional databases is more simply a design tradeoff -- level of fsync punishment versus performance etc. Tweaking the OS filesystem doesn't help at all with those design choices.

Jeff


