Re: [PATCH 1/7] block: Add block_flush_device()

From: Ric Wheeler
Date: Tue Mar 31 2009 - 11:46:34 EST


Linus Torvalds wrote:

On Tue, 31 Mar 2009, Ric Wheeler wrote:
Now you are just being silly. The drive and the write cache - without barriers
or similar tagged operations - will almost certainly reorder all of the IO's
internally.

You do realize that the "drive" may not be a drive at all?

But apparently you don't. You really seem to see just your own case, and have blinders on for everything else.

That "drive" may be some virtualized device. It may be some super-fancy memory mapped and largely undocumented random flash thing. It might be a network block device, it may be somebody's IO trace dummy layer, it may be anything at all.

Of course I realize that.

Most SSD devices, including ones that don't speak normal S-ATA/SCSI/etc, have a write cache and will combine and re-order IO's.

Some of them have non-volatile write caches and those don't need barriers (flush, fua, whatever) because of batteries, capacitors or other magic hardware people came up with.

For the ones that do have a volatile write cache and can reorder IO's, transactions will still need the ordering primitives to survive a power failure reliably.

If you don't need or want to pay the price of ordering, you can today easily disable this by mounting without barriers.

As Mark pointed out, most S-ATA/SAS drives will flush the write cache when they see a bus reset so even without barriers, the cache will be preserved (or flushed) after a reboot or panic. Power outages are the problem barriers/flushes are meant to help with.


Your filesystem doesn't know. It damn well should not even _try_ to know, because it isn't the low-level driver.

The low-level driver - which you don't have a friggin clue about - may say that it doesn't support barrier IO for any random reason that has absolutely _nothing_ to do with any write caches or anything else. Maybe the device has the same ordering semantics as an Intel CPU has: writes are always seen in order on the disk, and reads are always speculated but will snoop in write buffers, and there is no way to not do that.

See? EOPNOTSUPP means just that - it means that the driver doesn't support the notion of ordered IO. But that does not necessarily mean that the writes aren't always in order. It may well just mean that the drive is a thin shimmy layer over something else (for example, just a user level pipe), and the driver has NO IDEA what the end result is, and the protocol is simplistic and is just 'read' and 'write' and absolutely nothing else.

But you seem to NOT UNDERSTAND THIS.

I'm not interested in your inane drivel. Let's just say that your lack of understanding just means that your input is irrelevant, and leave it at that. Ok? Until you can see the bigger picture, just don't bother.

Linus


If the low level device returns EOPNOTSUPP on a barrier op, that is fine. Running a transactional file system on that storage might or might not be a good idea, but at least we can log that and move on.
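
To make the "log that and move on" policy concrete, here is a minimal user-space sketch in Python rather than kernel C. The class `JournalDevice`, the callables `submit_barrier` and `submit_plain`, and the flag `barriers_enabled` are all hypothetical names made up for illustration and do not correspond to any kernel interface; only `errno.EOPNOTSUPP` is real.

```python
import errno
import logging

log = logging.getLogger("journal-sketch")


class JournalDevice:
    """Hypothetical wrapper around a block device used by a journal."""

    def __init__(self, submit_barrier, submit_plain):
        # Assumption for this sketch: submit_barrier raises
        # OSError(EOPNOTSUPP) when the driver cannot express ordered IO,
        # and submit_plain is an ordinary write with no ordering guarantee.
        self._submit_barrier = submit_barrier
        self._submit_plain = submit_plain
        self.barriers_enabled = True

    def commit(self, record):
        """Write a commit record, with a barrier if the device accepts one."""
        if self.barriers_enabled:
            try:
                self._submit_barrier(record)
                return "ordered"
            except OSError as exc:
                if exc.errno != errno.EOPNOTSUPP:
                    raise  # real IO errors are not swallowed
                # Log once, stop asking, and carry on without ordering.
                log.warning("barriers not supported, disabling")
                self.barriers_enabled = False
        self._submit_plain(record)
        return "unordered"
```

Note that the fallback says nothing about whether the device actually keeps writes in order; it only records that the driver could not be asked.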

I agree with Chris that what happens when the device does not support the primitives is not the core issue.

The question is really what we do when you have a storage device in your box with a volatile write cache that does support flush or fua or similar. Using barriers & ordered transactions for these types of devices will give you a more reliable file system - less fsck time needed and better data integrity support for the (few?) applications that use fsync properly.
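
For reference, "using fsync properly" when replacing a file usually means roughly the sequence below. This is a user-space Python sketch and not something proposed in this thread; the helper name `durable_replace` and the `.tmp` suffix are hypothetical. Whether the data really reaches stable media still depends on the filesystem getting a flush through to a device with a volatile write cache, which is the point under discussion.

```python
import os


def durable_replace(path, data):
    """Replace path with data so a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp = path + ".tmp"  # hypothetical temp-file naming

    # 1. Write the new contents to a temporary file and fsync it.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may be a short write
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)

    # 2. Atomically rename the temporary file over the target.
    os.replace(tmp, path)

    # 3. fsync the containing directory so the rename itself is persisted.
    dirfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Skipping step 1's fsync is the common mistake: the rename can then reach the disk before the file data does.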


Ric
