Re: ext3-2.4-0.9.4

From: Matthias Andree (matthias.andree@stud.uni-dortmund.de)
Date: Thu Jul 26 2001 - 09:32:23 EST


On Thu, 26 Jul 2001, Alan Cox wrote:

> Rik is right. It isn't just about premature notification - it's about
> atomicity. At the point you are notified the data has been queued for disk
> I/O. Even on traditional BSD ufs with synchronous metadata you still had
> points where a crash left the rename partially complete and nothing but a
> log or an atomic update system is going to fix that.

No. Atomic update systems and logs can by no means fix premature
acknowledgements:

Proof:

Assume the OS has a phase tree, a log, or something similar that requires
just a single-block write for an atomic rename.

Assume an MTA calls rename(), and the OS by whatever means notifies it of
completion, but actually, the data is only queued, not written.

Assume the MTA receives the acknowledgement (e. g. the rename call
returned) and sends a "250 mail action complete" packet across the network.

Assume the machine sends the network packet, but not the queued disk
block, and then crashes.

--> The single block is lost, the rename operation is lost, but the
operation had been acknowledged. Consequence: the mail is lost. q. e. d.
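
The scenario above can be replayed with a toy model. This is only a sketch:
the Disk class, its volatile queue and the deliver() helper are hypothetical
names invented for illustration, not any real kernel or MTA interface. It
shows that the acknowledged rename survives a crash only if the block was
committed before the acknowledgement went out.

```python
class Disk:
    """Toy model (hypothetical): writes sit in a volatile queue until flushed."""

    def __init__(self):
        self.committed = {}  # contents that survive a crash
        self.queued = []     # contents that are lost on a crash

    def write(self, key, value):
        # The write is only queued here, not yet on stable storage.
        self.queued.append((key, value))

    def flush(self):
        # Move everything queued onto stable storage.
        for key, value in self.queued:
            self.committed[key] = value
        self.queued.clear()

    def crash(self):
        # Power loss: everything still queued is gone.
        self.queued.clear()


def deliver(disk, premature_ack):
    """Return True once the MTA has acknowledged the mail ("250")."""
    disk.write("rename", "tmp/msg -> new/msg")  # the single-block atomic rename
    if not premature_ack:
        disk.flush()  # acknowledge only after the block is on disk
    return True       # rename() returned, so the MTA sends "250"


for premature_ack in (True, False):
    disk = Disk()
    acked = deliver(disk, premature_ack)
    disk.crash()
    survived = "rename" in disk.committed
    print(f"premature_ack={premature_ack}: acked={acked}, rename survived={survived}")
```

With premature_ack=True the mail is acknowledged but the rename does not
survive the crash; with premature_ack=False it does.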

All this boils down to:

1. The OS _MUST_ know when a write operation has been physically
committed to non-volatile storage.

2. The OS _MUST_ _NOT_ acknowledge the (assumedly synchronous) operation
any earlier. (This may well include switching off drive write
buffering.)

If the OS cannot fulfill these two basic requirements, I can spare myself all
the log or FS atomicity efforts, because they don't get me anywhere.

The problem is not that the operation can fail, the problem IS premature
acknowledgement. Even with atomic updates, as shown above.

Note, of course, there is no premature acknowledgement for the
Linux-default asynchronous directory update. There IS for -o sync or
chattr +S -- and that's what MTAs do to guarantee data integrity, and
that's why I'm still suggesting dirsync or something to remedy the
poor data write performance of full-sync.

If the OS tells me "write completed" when it means "I queued your data
for writing", it is BROKEN.

That's my point.

And since the common POSIX OS lacks a dedicated notification feature for
e. g. rename, MTAs have no other choice than to rely on "has completed
when the syscall returns".
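
A minimal sketch of the delivery sequence this implies, assuming a spool
directory on a file system whose directory updates are synchronous (-o sync
or chattr +S). The function name deliver_message, the "tmp." prefix and the
451 replies are assumptions made for illustration, not taken from any
particular MTA. The point is that the only completion signal available for
the rename is the call returning without error.

```python
import errno
import os
import tempfile


def deliver_message(spool_dir, name, data):
    """Store data as spool_dir/name and return an SMTP-style reply string."""
    tmp_path = os.path.join(spool_dir, "tmp." + name)  # hypothetical naming scheme
    final_path = os.path.join(spool_dir, name)
    try:
        fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        with os.fdopen(fd, "wb") as spool_file:
            spool_file.write(data)
            spool_file.flush()
            os.fsync(spool_file.fileno())  # ask the OS to commit the file contents
        # There is no separate completion notification for rename(); the MTA
        # can only treat "the call returned without error" as "it is on disk".
        os.rename(tmp_path, final_path)
    except OSError as exc:
        if exc.errno == errno.EIO:
            return "451 local I/O error"
        return "451 local error: " + os.strerror(exc.errno)
    return "250 mail action complete"


# Usage: deliver one message into a throwaway spool directory.
with tempfile.TemporaryDirectory() as spool:
    print(deliver_message(spool, "msg1", b"Subject: test\n\nhello\n"))
```

If rename() returns before the directory block is on disk, the "250" reply in
this sketch is exactly the premature acknowledgement described above.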

BTW, my Linux rename(2) man page doesn't document the EIO condition; FreeBSD
4.3-STABLE and SUS v2 do.

-- 
Matthias Andree