Re: True fsync() in Linux (on IDE)

From: Hans Reiser
Date: Mon Mar 22 2004 - 14:34:09 EST


Matthias Andree wrote:

Jens Axboe schrieb am 2004-03-22:



There's no such thing as atomic writes bigger than a sector really, we
just pretend there is. Timing usually makes this true.


Can you explain about the timing?


If there is no such atomicity (except maybe in ext3fs data=journal or
the upcoming reiserfs4 - isn't there?), then nobody should claim so.

Well, nobody is going to use anything except reiser4 are they?;-).....

I think that we are able to guarantee that the write is fully atomic regardless of what the block layer does, so long as the block layer respects our ordering and does not cache it where it should not.

zam, you are watching this thread about flushing the ide cache I hope....

If
the kernel cannot 100.00000000% guarantee the write is atomic, claiming
otherwise is plain fraud and nothing else.

Some people bet their whole business/company and hence a fair deal of
their belongings on a single data base, and making them believe facts
that simply aren't reality is dangerous. These people will have very
little understanding for sloppiness here. Linux has no obligation to be
fast or reliable, but it MUST PROPERLY AND TRUTHFULLY state what it can
guarantee and what it cannot guarantee.



For bigger atomic writes, 2.4 SUSE kernel had some nasty hack (called
blk-atomic) to prevent reordering by the io scheduler to avoid partial
blocks from databases.



That does not make a write atomic if the scheduled blocks are still
written one at a time (and I believe tagged command queueing won't help
to unroll partial writes either).

If the hardware support is missing, it is prudent to say just that and
not make any bogus promises about platter inertia and "it usually
works". (who says that the filter curves adjust to the decreasing
platter speed and the electronics are sustained for long enough? how
about write verify and remapping broken blocks?)

So we only write one hardware block size atomically, usually 512 bytes
on ATA and SCSI disk drives (MO might do 2048 at a time, but why
introduce complexity). That's a data point in this whole fsync()
discussion.





--
Hans

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/