characteristics of a "raw" device (scsi sharing, databases)

Steven S. Dick (ssd@nevets.oau.org)
Sat, 29 Mar 1997 03:16:49 EST


I'm not an expert in this field, but I think I have a general
understanding of why raw devices might be useful to a database. Also, I
think raw devices would be beneficial to SCSI device sharing, as
discussed in another thread. At the same time, I feel that raw devices
are kinda blecherous, and I really don't like the idea in general.
I think there should be an acceptable middle ground.

So, I'd like to propose a compromise. Here's what I see as a list of
the behaviors a semi-raw device should provide in the various
circumstances where a raw device would be used if one were available...

First, I'll cover the case of read caching on a buffered device.
I believe this is the easier and more obvious case.
I think that a normal disk device could be used here, with one or
more ioctl's added to alter read caching behavior as follows:

r1. In the case of a shared SCSI device, there needs to be a mechanism
to tell the OS to throw away read-cached blocks as invalid, since the
other masters on the SCSI bus may have written new data to those blocks.
If there is higher-level caching going on too, then invalidating
that cache needs to be negotiated in that layer as well.
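
For what it's worth, Linux already has something close to this for the
whole-device case: the BLKFLSBUF ioctl flushes and invalidates the
buffers cached for a block device. A minimal sketch (the device name is
just a placeholder, and note that this throws away the cache for the
whole device rather than for specific blocks, which is coarser than the
per-block invalidation described above):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>           /* BLKFLSBUF */

    int main(void)
    {
        /* Hypothetical shared disk; another SCSI master may have
         * rewritten blocks behind our back. */
        int fd = open("/dev/sdb", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Throw away everything the kernel has cached for this device
         * so the next read goes back to the hardware. */
        if (ioctl(fd, BLKFLSBUF, 0) < 0)
            perror("BLKFLSBUF");

        close(fd);
        return 0;
    }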

r2. In the case of a database or other complex application accessing
the disk, the application should be able to tell the buffer cache
to move specific blocks to the top of the list of memory pages to
be reused, so that access patterns the OS doesn't expect don't
cause worst-case cache behavior. madvise() is probably not enough
here unless it can address specific pages or ranges of pages (there
is a sketch of this after r3 below).

r3. If the application wants to cache read pages itself, it should be
possible for the OS and the application to share the same pages in
cache, possibly via mmap or some similar mechanism. (This is
already implemented, I believe.)
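
To make r2 and r3 concrete, here's a rough sketch using calls that
already exist. mmap() with MAP_SHARED lets the application and the
kernel's cache share the same physical pages (r3), and madvise() does
take an address and a length, so it can at least name a specific range
of pages to be given up first (r2). Whether MADV_DONTNEED is a precise
enough hint for a database is exactly the open question (it discards
the pages outright rather than merely aging them), and the device name,
sizes, and offsets are only placeholders:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        long pg = sysconf(_SC_PAGESIZE);
        size_t len = 1024 * (size_t)pg;        /* map 1024 pages */

        int fd = open("/dev/sdb", O_RDONLY);   /* placeholder device */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* MAP_SHARED: these are the same pages the kernel caches, so
         * the application isn't double-buffering its reads (r3). */
        char *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* ... scan pages the application knows it won't revisit ... */

        /* Name a specific 64-page range as the first candidate for
         * reuse (r2). */
        if (madvise(base + 16 * pg, 64 * (size_t)pg, MADV_DONTNEED) < 0)
            perror("madvise");

        munmap(base, len);
        close(fd);
        return 0;
    }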

I don't see a need in any of these cases for a pure raw read device.
I think that mechanisms to invalidate or control the aging of read
cache data should be enough.

If there are other read cases I've not considered, please address them.

Now, the case of semi-raw writing is much nastier...

w1. In the case of SCSI device sharing, the blocks being written to
will probably need to be negotiated for by some mechanism at a
higher level, possibly over a different medium (ethernet?
high-speed serial? direct host-to-host SCSI communication?), and
locked once negotiated, to prevent conflicting writes or premature
reads of stale data. Once the blocks to be modified are negotiated
for, I don't think it matters when they actually get written,
except that in some circumstances the other machine may be waiting
for the write to complete, so it should be flushed ASAP to prevent
deadlocks or performance bottlenecks.
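
Just to make the negotiation concrete: the traffic itself could be
something very simple sent over whatever side channel is chosen.
Everything in this sketch (the message layout, the opcodes) is invented
purely for illustration; it is not an existing protocol:

    #include <stdint.h>

    /* Hypothetical lock-negotiation message exchanged between the two
     * hosts over ethernet/serial/whatever, *before* either one touches
     * the shared disk.  Invented for illustration only. */
    enum lock_op {
        LOCK_REQUEST = 1,   /* "I want to write blocks N..N+count-1" */
        LOCK_GRANT   = 2,   /* "go ahead, I've invalidated my cache" */
        LOCK_RELEASE = 3,   /* "written and flushed; reread at will" */
    };

    struct lock_msg {
        uint32_t op;            /* one of enum lock_op      */
        uint32_t device_id;     /* which shared disk        */
        uint64_t start_block;   /* first block in the range */
        uint32_t block_count;   /* how many blocks          */
    };

    /* The writer sends LOCK_REQUEST and must not issue the write until
     * it sees LOCK_GRANT; LOCK_RELEASE tells the other host that any
     * cached copies of the range are stale and the disk is up to date. */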

The case of a database write, however, is far more complex, and, at
least to me, it seems like it might actually be easier to implement real
raw writes than to build a queuing system that would be acceptable to a
database. However, if such a system were built, I imagine it would need
at least the following semantics (anyone with database expertise may be
able to point out cases I've missed here...)

w2. Some metadata blocks are critical and should be written immediately.
For instance, in Solaris 2, the superblock has a 'dirty' flag.
When all dirty filesystem blocks have been written to disk, the dirty
flag is cleared, even if the filesystem is still mounted. However, as
soon as new dirty blocks are generated (or perhaps just before a
partial write is started), the dirty flag is set again and the
superblock is immediately written back out.

Of course, there may be more than one of these "write immediately"
blocks, so they would have to be queued or prioritized.
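
The "write immediately" half of this already has a rough analogue in
the existing interfaces: a descriptor opened with O_SYNC (or an
explicit fsync()) forces the write out before returning, while writes
on an ordinary descriptor stay lazy. A sketch, with the device name and
offsets as placeholders; what this doesn't give you is the queued,
prioritized version where the kernel itself knows which blocks are the
critical ones:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Placeholder device and offsets. */
        int lazy = open("/dev/sdb", O_WRONLY);             /* ordinary buffered writes */
        int sync_fd = open("/dev/sdb", O_WRONLY | O_SYNC); /* "write immediately" path */
        if (lazy < 0 || sync_fd < 0) {
            perror("open");
            return 1;
        }

        char superblock[512] = { 0 };   /* pretend: superblock with dirty flag set */
        char ordinary[512]   = { 0 };   /* pretend: ordinary data block            */

        /* Critical block: pwrite() does not return until it is on disk. */
        if (pwrite(sync_fd, superblock, sizeof superblock, 0) < 0)
            perror("pwrite superblock");

        /* Ordinary block: left to the buffer cache to write when it likes;
         * an fsync(lazy) later would be the per-descriptor flush. */
        if (pwrite(lazy, ordinary, sizeof ordinary, 8192) < 0)
            perror("pwrite data");

        close(lazy);
        close(sync_fd);
        return 0;
    }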

w3. Some kinds of metadata and data can be delayed indefinitely, but
need to be written in a specific order. Perhaps there are chunks
of this type of data within which the order is not important.
Some form of tagged or prioritized queuing might work
here, allowing ordered markers between internally unordered chunks,
and possibly allowing new data to be added to unwritten chunks.
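
As a strawman for what the interface might look like, the ordering
marker could just be one more ioctl on the buffered device. The command
name and number below are invented for illustration only; no such ioctl
exists:

    #include <sys/ioctl.h>

    /* Invented, illustration-only command -- not a real ioctl. */
    #define SEMIRAW_BARRIER 0x6201

    /* Usage sketch: the database queues chunk A (internally unordered),
     * drops a barrier, then queues chunk B.  The buffer cache may
     * schedule writes within a chunk however it likes, but no block of
     * B may reach the disk before all of A has. */
    int write_barrier(int fd)
    {
        return ioctl(fd, SEMIRAW_BARRIER, 0);
    }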

w4. Of course, it is necessary to have a "write everything now" call,
such as sync().

w5. Going back to shared SCSI devices...
If, for instance, we have negotiated for a block we will eventually
write to, and the other system originally decided that it didn't
need the block... and then the other system later changes its
mind, there needs to be a way to say "OK, write that delayed block
ASAP", again to prevent deadlocks and performance bottlenecks.
The actual negotiation for blocks on shared devices should probably
occur at a higher level than the buffer cache--either at user level
or at the filesystem level; however, there still needs to be a
buffer-cache-level mechanism to change the priority of, or immediately
flush, either single blocks or entire groups of blocks.
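
Again purely as a strawman (none of these names exist anywhere), the
buffer-cache-level hook might take a block range and an action, which
would also cover the r2 case from the read side:

    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Invented, illustration-only interface -- not real ioctls. */
    struct semiraw_range {
        uint64_t start_block;   /* first block of the range */
        uint32_t block_count;   /* how many blocks          */
    };

    #define SEMIRAW_FLUSH_RANGE  0x6202  /* write these delayed blocks out ASAP      */
    #define SEMIRAW_DEMOTE_RANGE 0x6203  /* make these cached blocks first for reuse */

    /* When the other host changes its mind and asks for a block we have
     * negotiated for but not yet written, the user-level (or
     * filesystem-level) negotiation code would call something like: */
    int flush_negotiated_range(int fd, uint64_t start, uint32_t count)
    {
        struct semiraw_range r = { start, count };
        return ioctl(fd, SEMIRAW_FLUSH_RANGE, &r);
    }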

If done carefully, these types of behaviors could probably be implemented
with a few ioctl's or system calls on top of the existing buffered disk
devices.

Does this make sense? Would this kind of behavior be an acceptable
compromise between those who don't want raw devices and those who say
that kernel buffering is not currently acceptable? Are there any
situations I've left out?

Steve
ssd@nevets.oau.org