Re: libata FUA revisited

From: Ric Wheeler
Date: Thu Feb 22 2007 - 17:42:25 EST


Jens Axboe wrote:
On Wed, Feb 21 2007, Tejun Heo wrote:
[cc'ing Ric, Hannes and Dongjun, Hello. Feel free to drag other people in.]

Robert Hancock wrote:
Jens Axboe wrote:
But we can't really change that, since you need the cache flushed before
issuing the FUA write. I've been advocating for an ordered bit for
years, so that we could just do:

3. w/FUA+ORDERED

normal operation -> barrier issued -> write barrier FUA+ORDERED
-> normal operation resumes

So we don't have to serialize everything both at the block and device
level. I would have made FUA imply this already, but apparently it's not
what MS wanted FUA for, so... The current implementations take the FUA
bit (or WRITE FUA) as a hint to boost it to head of queue, so you are
almost certainly going to jump ahead of already queued writes. Which we
of course really do not.
Yeah, I think if we have tagged write command and flush tagged (or
barrier tagged) things can be pretty efficient. Again, I'm much more
comfortable with separate opcodes for those rather than bits changing
the behavior.

ORDERED+FUA NCQ would still be preferable to an NCQ enabled flush
command, though.

Another idea Dongjun talked about while drinking in LSF was ranged
flush. Not as flexible/efficient as the previous option but much less
intrusive and should help quite a bit, I think.

But that requires extensive tracking, I'm not so sure the implementation
of that for barriers would be very clean. It'd probably be good for
fsync, though.


If we could invent any mechanism, it would seem that it would be nicest if we could have independent sequences of IO requests (say with a distinct tag per sequence) and an ability to issue a per sequence flush request. That might tie into the QOS support, but would still have issues when you try to map it back up the stack through the journal and into application level promises of data integrity.

For example, in data journal mode, we would probably need to flush not only the transaction level data, but also all data sequences that had IO's in that transaction first.

Pretty rapidly, we start to get into the database notions of nested transactions and so on ;-)

ric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/