Re: msync() behaviour broken for MS_ASYNC, revert patch?

From: Nick Piggin
Date: Fri Feb 10 2006 - 12:57:18 EST


Linus Torvalds wrote:

On Sat, 11 Feb 2006, Nick Piggin wrote:

No, you are thinking about what the kernel does. Subtle difference. A
smart user wants to:

- start writing this
- start writing that
- start writing that-other-thing
- make sure this that and the other have reached backing store

OK so in effect it is the same thing, but it is better to export the
interface that reflects how the user interacts with pagecache.

WRITE_SYNC obviously does the "wait for them all" (aka ensure they
hit backing store) thing too, right? It performs exactly the same
role that WRITE_WAIT would do in the above example.


NOOOOOO!

Think about it for a second. Think about the usage case you yourself were quoting.

The "magic" in IO is "overlapping IO". If you don't get overlapping IO, your interfaces are broken. End of story.

And WRITE_SYNC _cannot_ do overlapping IO.


What do you mean by overlapping?

fadvise(fd, 100, 200, FADV_WRITE_ASYNC);
fadvise(fd, 300, 400, FADV_WRITE_ASYNC);
fadvise(fd, 100, 200, FADV_WRITE_SYNC);
fadvise(fd, 300, 400, FADV_WRITE_SYNC);

Will do exactly the same as Andrew's

fadvise(fd, 100, 200, FADV_ASYNC_WRITE);
fadvise(fd, 300, 400, FADV_ASYNC_WRITE);
fadvise(fd, 100, 200, FADV_WRITE_WAIT);
fadvise(fd, 300, 400, FADV_WRITE_WAIT);


It's entirely possible that somebody else (or that very same program) has dirtied the same pages that you started write-out on earlier. And that is when "wait for writes to finish" and "WRITE_SYNC" _differ_.


Yeah they do differ but if you are using sync writes then you obviously
have some data integrity requirements and you _know_ who is writing to
your file and when. That's my point. You're thinking kernel mode. The
userspace requirement for sync writes is "this has reached backing store".

If you want synchronous writes, use synchronous writes. But if you want asynchronous writes, you do _not_ implement them as "start writes now" and "write synchronously". You implement them as "start writes now" and "wait for the writes to have finished".


_You_ do, yes. You are a kernel hacker. You implement synchonous writes.
Implementing synchronous writes is what you do.

Userspace does not care. They use synchronous writes to guarantee it has
hit backing store. They've managed quite nicely up until now without having
your implementation details exposed to them (when is a page dirty? when is
it "under writeout"? who cares? I just want to know if it is on backing
store or not).

There's another very specific and important difference: "wait for the writes" is fundamentally an interruptible and pollable operation, which means that it's a lot easier to integrate into any system that has to do other things too. In contrast, WRITE_SYNC is _neither_ easily interruptible nor pollable.


It is just as easy as WRITE_WAIT to do both. In the pollable case you
just need another flag to say you don't want to block, same as would
be required for WRITE_WAIT.

Seems like you're clutching for straws here.

So WRITE_SYNC has clearly different behaviour. There's a good reason the kernel internally has "start write" + "wait for write", and I'll repeat: none of those reasons go away just because you move to user space.


My proposal isn't really different to Andrew's in terms of functionality
(unless I've missed something), but it is more consistent because it
does not introduce this completely new concept to our userspace API but
rather uses the SYNC/ASYNC distinction like everything else.


Your proposal has two _huge_ downsides:


I was still talking about new additions to fadvise here, not the msync stuff.

- it changes semantics, and you have absolutely _no_ idea of who depends on the performance semantics of the old behaviour. In contrast, I can tell you that we did it once before, and we reverted it.

- it's not at all consistent. The _current_ behaviour is consistent, and matches 100% the current behaviour of sync vs async write().

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com -
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/