Re: Sharing SCSI disks

Ingo Molnar
Fri, 21 Mar 1997 10:22:37 +0100 (MET)

On Thu, 20 Mar 1997, David S. Miller wrote:

> The right solution is simply to add raw devices to the Linux
> kernel, it doesn't look too bad or alternatively to add a raw
> semantic to the existing ones (which is basically how you'd do
> either)
> I brought up this with Linus once, the conversation was in reference
> to how we thought we could do on database benchmarks etc. which is
> essentially all over raw block devices these days. (which we both
> agreed was entirely stupid, the kernel should be doing buffer caching,
> not some silly Oracle disk I/O layer) In any event, it is just
> essentially page flipping every request to dma out to the disk. The
> cpu never touches any of this stuff (in the kernel that is).

our current mmap+page cache semantics and behaviour do pretty much this.
you do an msync() and the CPU never even touches that memory, it's DMA-ed
out to disk directly.
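
A minimal sketch of that mmap()+msync() path, for illustration only. The
function name write_via_mmap is made up here; the calls themselves
(open, ftruncate, mmap, msync) are the standard POSIX ones. User space
stores into the mapping, and msync(MS_SYNC) asks the kernel to write the
dirty pages back and wait for completion.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: store `len` bytes into the file at `path` through
 * a shared mapping, then flush them with msync(). Returns 0 or -1. */
int write_via_mmap(const char *path, const char *src, size_t len)
{
    assert(src != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* The file must be at least as large as the mapping. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, src, len);              /* plain memory stores, no write() */
    int rc = msync(map, len, MS_SYNC);  /* write back and wait */

    munmap(map, len);
    close(fd);
    return rc;
}
```

Calling write_via_mmap("/tmp/example.bin", buf, n) should leave the n
bytes of buf in the file once it returns 0.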

[hmm that RAW device thread warming up again ;)]

the only thing i think that's missing IMHO is correct synchronization
between writeouts. RDBMSs pretty much rely on certain data touching
persistent storage in a certain order. (it doesn't have to be totally
ordered, but some level of ordering is needed). Say Linux decides to
write to the data device first and then to the log device, and we crash
shortly after data touches the data device: the server cannot recover.

the io writeout and log daemons and the LRU cache in Oracle ensure that
there is some (loose) ordering between log and data IO. data can be
arbitrarily ordered with data, and log with log, but log and data need
strict cross-ordering.
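
For comparison, this is roughly how user space enforces that
cross-ordering today with explicit IO: the log record is written and
fsync()-ed before the data file is touched. It is a sketch only and does
not claim to be how Oracle does it; open_store, write_all and
commit_update are names invented for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) a backing file read/write. */
int open_store(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Write the whole buffer at `off`, retrying on short writes. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Log first, make it durable, only then write the data. */
int commit_update(int log_fd, off_t log_off,
                  int data_fd, off_t data_off,
                  const void *rec, size_t len)
{
    assert(rec != NULL && len > 0);

    if (write_all(log_fd, rec, len, log_off) != 0)
        return -1;
    if (fsync(log_fd) != 0)     /* the ordering point between log and data */
        return -1;
    return write_all(data_fd, rec, len, data_off);
}
```

Nothing orders one data write against another here, which matches the
"data with data is unordered" part; only the fsync() between the two
files is strict.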

if you have the whole thing mmap()-ed, you cannot really guarantee which
data gets written out when, and in what order. mlock() isn't too cool.
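
To make the problem concrete: with two plain shared mappings, the only
way i know for user space to get log-before-data is to msync() the log
mapping before it even dirties the data page, as sketched here. The
names map_area and logged_store are invented for this example. Note that
this reintroduces a system call per update, which is exactly the cost
the rest of this mail tries to avoid.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of `path` shared and writable.
 * Returns NULL on failure. */
void *map_area(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    void *p = MAP_FAILED;
    if (ftruncate(fd, (off_t)len) == 0)
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : p;
}

/* Store the record in the log mapping, force the log out, and only
 * then dirty the data mapping. */
int logged_store(void *log_map, size_t log_len,
                 char *data_dst, const void *rec, size_t len)
{
    assert(log_map != NULL && data_dst != NULL && len <= log_len);

    memcpy(log_map, rec, len);
    if (msync(log_map, log_len, MS_SYNC) != 0)
        return -1;              /* log not durable: leave data untouched */
    memcpy(data_dst, rec, len); /* written back whenever the kernel likes */
    return 0;
}
```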

there are RDBMS architectures that fit into our 'simple, one level caching
virtual memory' concept better (log-structured RDBMSs), but for me they
are in the same category as microkernels.

What i think could be done is the following technique:

- each vmarea is totally unordered in itself
- but we can define 'ordering' between vmareas.

an example. Let the log area be vmarea1, the data area vmarea2. We tell
the kernel via sys_order(vmarea1,vmarea2,VM_ORDERING_STRICT) that
vmarea1 is the 'first', vmarea2 is the 'second'.

what the kernel does: it ensures that IO requests from vmarea1 never get
'overtaken' by IO requests from vmarea2. [see about races and deadlocks
later on]

The RDBMS does the following when doing a transaction (much simplified
...): writes to vmarea1 (log data), writes to vmarea2 (tables are here).

THAT'S all. No write(), no fsync(), no msync(). No IO semantics needed.
Pure virtual memory. User-space only has to order >memory operations<. The
rest is handled by the Linux VM layer.

now, the burden is on the kernel. What does it do? It balances dirty/clean
memory, by issuing writes occasionally. The 'only' thing it has to watch
for is that it has to completely clean vmarea1 before doing >any< write
from vmarea2. Think of vmarea1 and vmarea2 as one vmarea (but with parts
which have different characteristics). To prevent low memory conditions
and races, vmarea2 can be swapped out to a real swap device (which then
serves as a virtual log device ... and if we want to be tricky we could do
a device-to-device direct DMA transfer, to transfer a block from the swap
device to the data device ...). To make writeouts atomic enough, we might
want to add BOW capability to vmareas. (Block On Write).
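
A toy user-space model of that writeout rule, just to pin down the
intended behaviour. sys_order() does not exist; everything here
(struct vmarea_model, model_sys_order, writeout_one) is hypothetical, and
an area is reduced to a handful of dirty bits. The one rule modelled: a
page of the 'second' area is never written while the 'first' area still
has dirty pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define AREA_PAGES 8

/* Hypothetical model of a vmarea: dirty bits plus one ordering link. */
struct vmarea_model {
    bool dirty[AREA_PAGES];
    struct vmarea_model *must_be_clean_first;  /* NULL if unordered */
};

/* Model of the proposed sys_order(first, second, VM_ORDERING_STRICT). */
void model_sys_order(struct vmarea_model *first, struct vmarea_model *second)
{
    assert(first != NULL && second != NULL && first != second);
    second->must_be_clean_first = first;
}

/* Index of the lowest dirty page, or -1 if the area is clean. */
static int first_dirty(const struct vmarea_model *a)
{
    for (int i = 0; i < AREA_PAGES; i++)
        if (a->dirty[i])
            return i;
    return -1;
}

/* The kernel wants to clean one page of `want`. If the area ordered
 * before it is still dirty, a page of that area is written instead.
 * Returns the page index written (or -1) and reports the area used. */
int writeout_one(struct vmarea_model *want, struct vmarea_model **written)
{
    struct vmarea_model *a = want;
    struct vmarea_model *first = want->must_be_clean_first;

    if (first != NULL && first_dirty(first) >= 0)
        a = first;

    int page = first_dirty(a);
    if (page >= 0)
        a->dirty[page] = false;     /* stands in for the real disk write */
    *written = a;
    return page;
}
```

With the log dirty in pages 1 and 3 and the data area dirty in page 0,
three calls asking to clean the data area write log page 1, log page 3,
and only then data page 0.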

additionally, we might want to add special writeout-policy flags to
vmareas ... for example, for log areas we could add 'VM_IO_CIRCULAR',
which tells the kernel that the area should be written in a circular
fashion.
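
One possible reading of such a circular policy, as a toy model:
writeout remembers where it stopped and always continues from there,
wrapping around at the end of the area. This interpretation is an
assumption, and struct circ_area and circ_writeout_one are invented
names, not an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a log area with a circular writeout cursor. */
struct circ_area {
    bool  *dirty;    /* one flag per page */
    size_t npages;
    size_t cursor;   /* page where the next scan starts */
};

/* Write the next dirty page at or after the cursor, wrapping around.
 * Returns the page index written, or -1 if the area is clean. */
long circ_writeout_one(struct circ_area *a)
{
    assert(a != NULL && a->dirty != NULL && a->npages > 0);

    for (size_t i = 0; i < a->npages; i++) {
        size_t page = (a->cursor + i) % a->npages;
        if (a->dirty[page]) {
            a->dirty[page] = false;             /* stands in for the write */
            a->cursor = (page + 1) % a->npages; /* continue after this page */
            return (long)page;
        }
    }
    return -1;
}
```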

i'm not sure yet what the 'right' solution is, but something like the
above is doable IMHO. We shouldn't restrict the ordering feature to a
simple vmarea-vmarea relation; it should be a generic ordering tree, which
is then walked at swapout time (i.e. if the swapout code finds some
vmarea to be swapped out, it has to ensure that the vmareas 'above' it are
all clean).
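
The tree walk could look roughly like this toy model. Each area points
at the area directly 'above' it, and flushing an area first flushes all
of its ancestors, root first. struct order_node and flush_with_ancestors
are hypothetical names, the real swapout code is not shown, and the
sketch assumes the links contain no cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node of the ordering tree. */
struct order_node {
    const char        *name;
    struct order_node *parent;      /* area 'above' this one, NULL at root */
    unsigned           dirty_pages;
};

/* Clean every dirty ancestor of `n` (root first), then `n` itself.
 * The names of the flushed areas are recorded in `order` in flush
 * order. Returns how many areas were flushed. */
size_t flush_with_ancestors(struct order_node *n,
                            const char **order, size_t cap)
{
    if (n == NULL)
        return 0;

    /* Areas above us must be clean before we may write anything. */
    size_t used = flush_with_ancestors(n->parent, order, cap);

    if (n->dirty_pages > 0) {
        assert(used < cap);
        order[used++] = n->name;
        n->dirty_pages = 0;         /* stands in for the real writeout */
    }
    return used;
}
```

For a chain log -> index -> data where only log and data are dirty,
flushing data records "log" and then "data"; the clean index area is
skipped.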

If all the above works out right, then we can have RDBMSs through virtual
memory only. The IO management complexity would be in the kernel. The
RDBMS core only has to parse SQL queries, has to optimize them and has to
crunch away. No system calls necessary at all. (for the IO part).

and in a networked database, it's even better to do it in the kernel. We
could put the abstraction barrier into the kernel, user-space only sees
some kind of virtual memory.

-- mingo