Re: [sqlite] light weight write barriers

From: Vladislav Bolkhovitin
Date: Sat Nov 17 2012 - 00:03:13 EST



Chris Friesen, on 11/15/2012 05:35 PM wrote:
>> The easiest way to implement this fsync would involve three things:
>> 1. Schedule writes for all dirty pages in the fs cache that belong to
>> the affected file, wait for the device to report success, issue a cache
>> flush to the device (or request ordering commands, if available) to make
>> it tell the truth, and wait for the device to report success. AFAIK this
>> already happens, but without taking advantage of any request ordering
>> commands.
>> 2. The requesting thread returns as soon as the kernel has identified
>> all data that will be written back. This is new, but pretty similar to
>> what AIO already does.
>> 3. No write is allowed to enqueue any requests at the device that
>> involve the same file, until all outstanding fsync complete [3]. This is
>> new.
>
> This sounds interesting as a way to expose some useful semantics to userspace.
>
> I assume we'd need to come up with a new syscall or something since it doesn't
> match the behaviour of posix fsync().

This is how I would export the cache sync and request ordering abstractions to user space:

For async IO (io_submit() and friends) I would extend struct iocb with flags that set the required capabilities per iocb: whether the request is FUA or a full cache sync, immediate [1] or not, ORDERED or not, or any combination of these.
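
As a rough illustration of the iocb idea, the sketch below fills in a write request and sets capability bits in the existing aio_flags field. The IOCB_FLAG_HYP_* names, their bit values and the helper prep_pwrite_with_caps() are made up for this example and are not part of any kernel ABI; only struct iocb and IOCB_CMD_PWRITE come from <linux/aio_abi.h>.

```c
#include <assert.h>
#include <linux/aio_abi.h>   /* struct iocb, IOCB_CMD_PWRITE */
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL capability bits: none of these exist in the kernel ABI. */
#define IOCB_FLAG_HYP_FUA        (1u << 16)  /* this request is FUA */
#define IOCB_FLAG_HYP_CACHE_SYNC (1u << 17)  /* full cache sync */
#define IOCB_FLAG_HYP_IMMEDIATE  (1u << 18)  /* the sync is "immediate" */
#define IOCB_FLAG_HYP_ORDERED    (1u << 19)  /* this request is ORDERED */

/* Fill in one pwrite iocb and attach the requested capabilities to it.
 * Nothing is submitted here; the caller would pass cb to io_submit(). */
static void prep_pwrite_with_caps(struct iocb *cb, int fd, const void *buf,
                                  size_t len, long long off, uint32_t caps)
{
    assert(cb != NULL);
    memset(cb, 0, sizeof(*cb));
    cb->aio_lio_opcode = IOCB_CMD_PWRITE;
    cb->aio_fildes = (uint32_t)fd;
    cb->aio_buf = (uint64_t)(uintptr_t)buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
    cb->aio_flags |= caps;       /* per-iocb capabilities, any combination */
}
```

A commit record could then be prepared with caps set to IOCB_FLAG_HYP_FUA | IOCB_FLAG_HYP_ORDERED, while ordinary data writes pass 0 and behave as they do today.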

For regular read()/write() I would add one more flag to the "flags" parameter of sync_file_range(), specifying whether the sync is immediate or not.
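
A minimal sketch of how the sync_file_range() variant might be used follows. SYNC_FILE_RANGE_HYP_IMMEDIATE, its bit value and the two helpers are assumptions made for illustration; the other three flags and the sync_file_range() call itself are the existing Linux interface.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes sync_file_range() and its flags */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* HYPOTHETICAL flag: not defined by any kernel or libc. */
#define SYNC_FILE_RANGE_HYP_IMMEDIATE (1u << 8)

/* Build the flags for a "write out and wait" range sync, optionally
 * marked as immediate via the hypothetical bit. */
static unsigned int range_sync_flags(int immediate)
{
    unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE |
                         SYNC_FILE_RANGE_WRITE |
                         SYNC_FILE_RANGE_WAIT_AFTER;
    if (immediate)
        flags |= SYNC_FILE_RANGE_HYP_IMMEDIATE;
    return flags;
}

/* Sync a byte range of fd; returns what sync_file_range() returns. */
static int sync_range(int fd, off_t offset, off_t nbytes, int immediate)
{
    assert(fd >= 0);
    return sync_file_range(fd, offset, nbytes, range_sync_flags(immediate));
}
```

On a current kernel the hypothetical bit is an invalid flag, so sync_range(fd, 0, 0, 1) is expected to fail with EINVAL rather than perform an immediate sync; with immediate set to 0 the call is the ordinary one.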

To enforce ordering rules I would add one more command to fcntl(). It would make the most recently submitted write on this fd ORDERED.
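
The fcntl() part could look roughly like the sketch below from the caller's side. F_HYP_ORDER_LAST_WRITE and its numeric value are invented for this example, as is the helper write_ordered(); no such fcntl() command exists.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* HYPOTHETICAL fcntl() command: the name and value are placeholders. */
#define F_HYP_ORDER_LAST_WRITE 0x4f52

/* Write len bytes, then ask that this latest write on fd be ORDERED.
 * Returns the result of write(); *ordered is set to 1 only if the
 * kernel accepted the (hypothetical) ordering command, else 0. */
static ssize_t write_ordered(int fd, const void *buf, size_t len, int *ordered)
{
    assert(ordered != NULL);
    *ordered = 0;

    ssize_t n = write(fd, buf, len);
    if (n < 0)
        return n;                       /* nothing was written to order */

    if (fcntl(fd, F_HYP_ORDER_LAST_WRITE) == 0)
        *ordered = 1;                   /* kernel knows the command */
    return n;
}
```

A kernel that does not recognize the command rejects it, so *ordered stays 0 and the application can fall back to fsync() or a similar existing barrier.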

Altogether, these should provide the requested functionality in a simple, effective, unambiguous and backward-compatible manner.

Vlad

1. See my other e-mail from today about what an immediate cache sync is.