Re: Linux 2.6.29

From: Alan Cox
Date: Fri Mar 27 2009 - 13:21:26 EST

Next message: Eric Lacombe: "[x86_64] /dev/kmem"
Previous message: Jeremy Fitzhardinge: "Re: [PATCH 0/5] swiotlb: changes for powerpc/highmem"
In reply to: Matthew Garrett: "Re: Linux 2.6.29"
Next in thread: Linus Torvalds: "Re: Linux 2.6.29"

> If user applications should always check errors, and if errors can't be
> reliably produced unless you fsync() before close(), then the correct
> behaviour for the kernel is to always flush buffers to disk before
> returning from close(). The reason we don't is that it would be an

You make a few assumptions here

Unfortunately:
- close() occurs many times on a file
- the kernel cannot tell which close() calls need to commit data
- there are many cases where it is genuinely acceptable to lose written
data in a crash, provided media failure is rare (e.g. log files in many
situations - not banks, obviously)

The kernel cannot tell them apart, while fsync/close() as a pair allows
the user to correctly indicate their requirements.
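As a rough illustration of the fsync/close() pair, here is a minimal C sketch. The helper name save_durably is hypothetical, and the open flags and file mode are assumptions chosen for the example, not anything prescribed in this thread. The point is only that the application decides when the data must be committed, and checks every return value.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write buf to path and return 0 only if the data
 * was written, fsync() succeeded and close() succeeded. */
int save_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry the write */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;                     /* handle short writes */
        len -= (size_t)n;
    }

    /* The application, not the kernel, says "this one must hit the disk". */
    if (fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* close() can still report an error, so its result is returned too. */
    return close(fd);
}
```

A log writer that can tolerate losing data in a crash would simply skip the fsync() step and keep the rest.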

Even "fsync on last close" can backfire horribly if you happen to have a
handle that is inherited by a child task or kept for reading for a long
period.

For an event driven app you really want some kind of threaded or async
fsync then close (fbarrier isn't quite enough because you don't get told
when the barrier is passed). That could be implemented using threads in
the relevant desktop libraries, with the thread doing

fsync()
poke event thread
exit

(or indeed for most cases as part of the more general
write-file-interact-with-user-etc call)
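The threaded fsync-then-close idea could be sketched roughly as below, assuming POSIX threads and assuming the event loop watches the read end of a pipe. The names sync_job, sync_worker and fsync_close_async are hypothetical, and a real desktop library would more likely hook this into its own main loop and thread pool.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical job record handed to the worker thread. */
struct sync_job {
    int fd;         /* file to commit; the worker closes it */
    int notify_fd;  /* write end of a pipe the event loop polls (assumed) */
};

static void *sync_worker(void *arg)
{
    struct sync_job *job = arg;
    int result = 0;                 /* 0 on success, else an errno value */

    if (fsync(job->fd) < 0)         /* fsync() */
        result = errno;
    if (close(job->fd) < 0 && result == 0)
        result = errno;

    /* Poke the event thread: send it the result over the pipe. */
    ssize_t n;
    do {
        n = write(job->notify_fd, &result, sizeof result);
    } while (n < 0 && errno == EINTR);

    free(job);
    return NULL;                    /* exit */
}

/* Start a detached thread that fsyncs and closes fd, then reports on
 * notify_fd. Returns 0 if the thread was started, -1 otherwise. */
int fsync_close_async(int fd, int notify_fd)
{
    struct sync_job *job = malloc(sizeof *job);
    if (job == NULL)
        return -1;
    job->fd = fd;
    job->notify_fd = notify_fd;

    pthread_t tid;
    int err = pthread_create(&tid, NULL, sync_worker, job);
    if (err != 0) {
        free(job);
        errno = err;
        return -1;
    }
    pthread_detach(tid);
    return 0;
}
```

The event loop reads one int from the pipe per completed job, so the caller finds out when the data is committed without blocking in fsync() itself.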

> If every application that does a clobbering rename has to call
> fbarrier() first, then the kernel should just guarantee to do so on the

Rename is a different problem - and a nastier one. Unfortunately even in
POSIX, fsync says nothing about how metadata updating is handled or what
the ordering rules are between two fsync() calls on different files.

There were problems with trying to order rename against data writeback.
fsync ensures the file data and metadata are valid but doesn't (and
cannot) connect this with the directory state. So if you need to implement

write data
ensure it is committed
rename it
after the rename is committed then ...

you can't do that in POSIX. Linux extends fsync() so you can fsync a
directory handle but that is an extension to fix the problem rather than
a standard behaviour.

(Also helpful here would be fsync_range, fdatasync_range and
fbarrier_range)

> application's behalf. ext3, ext4 and btrfs all effectively do this, so
> we should just make it explicit that Linux filesystems are expected to
> behave this way.

> If people want to make their code Linux specific then that's their problem, not the kernel's.

Agreed - which is why close should not happen to do an fsync(). That's
their problem for writing code that's specific to some random may-happen
behaviour on certain Linux releases - and unfortunately with no obvious
cheap cure.

--
"Alan, I'm getting a bit worried about you."
-- Linus Torvalds
