fsync_range_with_flags() - improving sync_file_range()

From: Jamie Lokier
Date: Thu Apr 23 2009 - 16:44:32 EST


Theodore Tso wrote:
> > sync_file_range() itself is just too weird to use. Reading the man
> > page many times, I still couldn't be sure what it does or is meant to
> > do until asking on l-k a few years ago. My guess, from reading the
> > man page, turned out to be wrong. The recommended way to use it for a
> > database-like application was quite convoluted and required the app to
> > apply its own set of mm-style heuristics. I never did find out if it
> > commits data-locating metadata and file size after extending a file or
> > filling a hole. It never seemed to emit I/O barriers.
>
> Have you looked at the man page for sync_file_range()? It's gotten a
> lot better. My version says it was last updated 2008-05-27, and it
> now answers your question about whether it commits data-locating
> metadata (it doesn't). It now has a bunch of examples of how to use
> the flags in combination.

Yes that's the page I've read and didn't find useful :-)
The data-locating metadata is explained thus:

None of these operations write out the file's metadata. Therefore,
unless the application is strictly performing overwrites of already-
instantiated disk blocks, there are no guarantees that the data will be
available after a crash.

First, "the file's metadata". On many OSes, fdatasync() and O_DSYNC
are documented to not write out "the file's metadata" _but_ that often
means inode attributes such as time and mode, not data-locating
metadata which is written, including file size if it increases. Some
are explicit about that (e.g. VxFS), some don't make it clear.
Clearly that is a useful behaviour for fdatasync(), and not writing
data-locating metadata is a lot less useful.

So, given what I know of fdatasync() documentation on other OSes, does
"metadata" mean fdatasync() and/or sync_file_range() exclude all
metadata, or just non-data-locating metadata? And what about size changes?
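
To make the question concrete, here's a minimal sketch (the filename is
made up): the write grows the file, so committing the data alone is
useless unless the new size reaches disk too.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Grow the file, then fdatasync(). For synchronized I/O data
	 * integrity completion, POSIX includes any metadata needed to
	 * retrieve the data - which covers the new size when the file
	 * grew - but several OS man pages leave this unclear. */
	int fd = open("appendlog", O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "new record\n", 11) != 11) {
		perror("write");
		return 1;
	}
	if (fdatasync(fd) != 0) {
		perror("fdatasync");
		return 1;
	}
	return close(fd);
}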

But it's not quite that bad. The sync_file_range() page says a bit more,
about overwrites of already-instantiated data blocks.

Or does it? What about filesystems which don't overwrite all
instantiated data in place? There are a few of those: ext3 with
data=journal, all flash filesystems, nilfs. Btrfs I'm not sure
about, but it seems likely. They all involve _some_ kind of metadata
update just to overwrite already-instantiated data.

Does the text mean sync_file_range() might be unreliable for
crash-resistant commits on those filesystems, or do _they_ have
another kind of metadata that falls outside the excluded "the file's
metadata"?

I can't tell from the man page what happens on those filesystems.

But a kernel thread from Feb 2008 revealed the truth:
sync_file_range() _doesn't_ commit data on such filesystems.

So sync_file_range() is basically useless as a data integrity
operation. It's not a substitute for fdatasync(). So why would you
ever use it?

> In terms of making it easier to use, some predefined bitfield
> combinations is all that's necessary.

See above. It doesn't work on some filesystems for integrity. It may
still be useful for application-directed writeout scheduling, but I
don't imagine many apps using it for that when we have
fadvise/madvise.

The flag examples aren't as helpful as they first look. Unless you
already know about Linux dirty page writeouts, it's far from clear why
SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER is not a data
integrity sync.

(After all, an aio_fsync_range + aio_suspend _would_ be a data
integrity sync.)
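
For comparison, here's roughly that pattern with today's POSIX AIO;
aio_fsync_range() doesn't exist, so plain aio_fsync() stands in for it
here and syncs the whole file rather than a range:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>

/* Queue a data integrity sync of everything already written to fd,
 * then wait for it to complete. Link with -lrt on glibc. */
static int aio_datasync_and_wait(int fd)
{
	struct aiocb cb;
	const struct aiocb *list[1] = { &cb };

	memset(&cb, 0, sizeof cb);
	cb.aio_fildes = fd;

	if (aio_fsync(O_DSYNC, &cb) != 0)
		return -1;

	while (aio_error(&cb) == EINPROGRESS)
		aio_suspend(list, 1, NULL);

	return aio_return(&cb);
}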

It's obvious when you know about the kernel's queued writeout concept
and pages dirtied after beginning writeout, but that's missing from
the man page.
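
To illustrate (the helper name is mine, not from the man page), this is
the full three-flag combination the man page suggests as a
write-for-data-integrity operation:

#define _GNU_SOURCE
#include <fcntl.h>

/* Flush a byte range with all three flags. WAIT_BEFORE waits for
 * any writeout already in flight, so pages dirtied after an earlier
 * queued writeout aren't missed; WRITE queues new writeout;
 * WAIT_AFTER waits for it to complete. No file metadata is
 * committed, so even this combination is not a crash-safe commit
 * on its own. */
static int flush_range(int fd, off64_t offset, off64_t nbytes)
{
	return sync_file_range(fd, offset, nbytes,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER);
}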

Also, it does too much writing: if a page has been queued for
writeout, then dirtied again, _but_ that queued writeout had not
reached the I/O driver when the page was re-dirtied, it will write
the page twice when once would be enough. With long writeout queues
in the kernel and a small file being regularly updated, or simply
with normal background writeout occurring as well, those unnecessary
double writes are a realistic scenario.

So in a nutshell:

The man page could do with an explanation of the writeout queue,
and why SYNC_FILE_RANGE_WAIT_BEFORE is needed for
dirtied-after-queued-for-writeout pages.

The man page could do with defining "metadata", and the
implementation should follow.

Preferably change it to behave like useful fdatasync()
implementations claim to: Exclude inode metadata such as times and
permissions, and include all relevant data-locating metadata
including file size (if in the range), indirection blocks in simple
filesystems, and tree/journal updates in modern ones.

Until the implementation follows, the man page should note that on
COW filesystems it currently guarantees nothing.

The implementation blocks too much when SYNC_FILE_RANGE_WAIT_BEFORE
has to wait for a single page, when it could be queuing writeout
for others in the meantime.

With a large range and SYNC_FILE_RANGE_WRITE only - which looks
like it could be asynchronous - does it block because the writeout
queue is full? Or is there no limit on the writeout queue size?
If it can block arbitrarily - how is that any better than
fsync_range() with obvious semantics?

It might be less efficient than fdatasync(), if setting all three
flags means it writes a page twice when that page was dirtied after
the last writeout was queued but before the writeout hit the I/O
driver. The implementation _might_ be smarter than that (or
fdatasync() might have the same inefficiency), but the optimisation
is not allowed if you treat the man page as a specification.

I'll be more than happy to submit man page improvements if we can
agree what it should really do :-)

Btw, I've Cc'd Nick Piggin, who started a good thread proposing
fsync_range a few months ago. Imho, fsync_range with a flags
argument, and its AIO equivalent, would be great.
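
As a sketch only - these names are illustrative, not an agreed
interface, loosely modeled on NetBSD's fsync_range(2) - it might look
like:

#include <sys/types.h>

/* Illustrative only - none of this exists in Linux. Flag names are
 * made up for this sketch. */
#define FSYNC_DATA	0x1	/* data + data-locating metadata */
#define FSYNC_INODE	0x2	/* inode attributes too: times, mode, ... */
#define SYNC_HARD	0x4	/* finish with a disk cache barrier */

int fsync_range_with_flags(int fd, off_t start, off_t length, int flags);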

> As far as extending the implementation so it calls into filesystem to
> commit data-locating metadata, and other semantics such as "flush on
> next commit", or "flush-when-upcoming-metadata-changes-such-as-a-rename",
> we might need to change the implementation somewhat (or a lot).

Yes, I agree. It must be allowed to return ENOTSUP for unsupportable
or unimplemented combinations - after doing all that it can anyway.

For example, SYNC_HARD (disk cache barrier) won't be supportable if
the disk doesn't do barriers, on some device-mapper devices, or over
NFS. Unless, say, you configured the filesystem to lie, or told it
the NFS server does hardware-level commit.
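
Under that rule a caller could cope like this (purely hypothetical,
using the names sketched above); since ENOTSUP would mean the call
already did everything except the unsupported part, the fallback is
just to note that barriers are unavailable:

#include <errno.h>
#include <sys/types.h>

static int hard_barriers_ok = 1;

static int commit_range(int fd, off_t start, off_t length)
{
	int r = fsync_range_with_flags(fd, start, length,
				       FSYNC_DATA | SYNC_HARD);
	if (r != 0 && errno == ENOTSUP) {
		/* Data was still committed as far as possible; only
		 * the barrier is unsupportable on this stack. */
		hard_barriers_ok = 0;
		r = 0;
	}
	return r;
}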

> But the interface does make a lot of sense. (But maybe that's because
> I've spent too much time staring at all of the page writeback call
> paths, and compared to that even string theory is pretty simple. :-)

Yeah, sounds like you have studied both and gained the proper perspective :-)

I suspect all the fsync-related uncertainty about whether it really
works, including interactions with filesystem quirks and real or
potential bugs in filesystems, would be much easier to get right if
only we had a way to test it repeatably.

Just like other filesystem stress/regression tests.

I'm thinking that running a kernel inside a VM which is invoked and
then stopped/killed/branched is the only realistic way to test that
all data is committed properly, with/without the necessary I/O
barriers, and that it recovers properly after a crash and resume.
Fortunately we have good VMs now, so such a test seems very doable.
It would help with testing journalling & recovery behaviour too.

Is there such a test or related tool already?

-- Jamie