Re: Raw devices (Was:Re: NTFS, FAT32, etc.)

Tim Hollebeek (
Thu, 8 May 1997 16:49:59 -0400 (EDT)

On Thu, 8 May 1997, Martin von Loewis responded to:

> > What heavy iron databases seem to want is the ability to schedule
> > all their own I/O using AIO in blocksized buffers with the kernel
> > side doing the I/O direct to/from the given buffer in user space.
> > No buffer copying. No kernel memory wasted in buffering. No read
> > ahead other than that specifically done by the program.


> OK, I understand the issue of the unnecessary copies - although I doubt
> that systems with 'raw devices' directly 'DMA to user space' when writing
> to an SCSI disk.
> However, I thought that databases are interested in guaranteed completion,
> i.e. once write(2) returns, the system should guarantee that the data is
> really on disk. This is necessary for the transactional properties. Without
> such a guarantee, you can pretty much forget about transactional recoveries
> after a crash. Wouldn't O_SYNC give you the same properties? As for real
> implementations: Does anybody know whether the Adabas or Postgres Linux
> ports do use O_SYNC?
> Thanks,
> Martin
databases are definitely interested in guaranteed completion. however,
when the device chooses to let you know that something has gone awry is
really between the device, its control unit, and the control unit's
brain, which, i suggest, is the driver.
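
for reference, the O_SYNC approach raised in the quoted question can be sketched as below. this is a minimal illustration, not how any particular database does it: durable_append and osync_demo are hypothetical names, and the /tmp path is an assumption. O_SYNC asks the kernel not to return from write(2) until the data has been handed to the storage device; whether the drive's own cache has flushed at that point is still up to the device and its driver.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: append one record with synchronous-write semantics.
 * Returns the number of bytes written, or -1 on error. */
ssize_t durable_append(const char *path, const void *rec, size_t len)
{
    assert(path != NULL && rec != NULL);

    /* O_SYNC: write() blocks until the kernel reports the data as written. */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, rec, len);

    if (close(fd) != 0)
        return -1;
    return n;
}

/* Hypothetical demo: append a 7-byte record to a scratch file.
 * Returns 0 on success, -1 on failure. */
int osync_demo(void)
{
    char path[] = "/tmp/osync_demo_XXXXXX";   /* assumed scratch location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    close(fd);

    ssize_t n = durable_append(path, "commit\n", 7);
    unlink(path);
    return (n == 7) ? 0 : -1;
}
```

note that the calling thread is blocked for the whole duration of each write, which is exactly the stall the rest of this message argues against.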

please excuse the ignorance of linux which the following generics
will demonstrate. i think the fundamentals may make sense.

in the large iron database environment, whatever language you
may choose to access data from a device, there always exists a
method by which you can avoid your process becoming non-dispatchable
as a consequence of an io to a (perhaps not so intelligent) device.

within a multithreaded database, it is undesirable to use up all
the threads waiting on something to happen that is outside
the purview of the database itself. when all the physical engines
are tied up, so are you. if the wait is within the database,
you have an opportunity to do something about it.

the trick, IMO, is to tie an event to the io which you can check on
periodically without waiting on something outside of userland.
waiting on events beyond your view seems to be stall prone.
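
one way to sketch this idea on a POSIX system is with the aio_write / aio_error / aio_return calls: queue the io, then check on it from userland without blocking. this is only an illustration under stated assumptions: submit_write, poll_io and aio_poll_demo are hypothetical names, the /tmp path is an assumption, and on older glibc the program needs to be linked with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: queue a write and return at once.
 * buf must stay valid until the io completes. */
int submit_write(struct aiocb *cb, int fd, void *buf, size_t len, off_t off)
{
    assert(cb != NULL && buf != NULL);
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
    cb->aio_sigevent.sigev_notify = SIGEV_NONE;  /* we poll, no signal */
    return aio_write(cb);
}

/* Hypothetical helper: non-blocking completion check.
 * Returns 0 while the io is in flight, 1 once it has finished.
 * On finish, *result is the byte count (or -1) and *err is 0 or an errno. */
int poll_io(struct aiocb *cb, ssize_t *result, int *err)
{
    int status = aio_error(cb);
    if (status == EINPROGRESS)
        return 0;
    if (status == -1) {            /* aio_error itself failed */
        *err = errno;
        *result = -1;
        return 1;
    }
    *err = status;
    *result = aio_return(cb);      /* call exactly once per request */
    return 1;
}

/* Hypothetical demo: write one 512-byte block and count the checks made
 * before completion. Returns 0 on success, -1 on failure. */
int aio_poll_demo(void)
{
    char path[] = "/tmp/aio_demo_XXXXXX";   /* assumed scratch location */
    static char block[512] = "record";
    struct aiocb cb;
    ssize_t n = -1;
    int err = 0;
    long checks = 0;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    if (submit_write(&cb, fd, block, sizeof block, 0) != 0) {
        close(fd);
        return -1;
    }

    /* A real engine would service other threads' work between checks. */
    while (!poll_io(&cb, &n, &err))
        checks++;

    printf("checked %ld times before completion\n", checks);
    close(fd);
    return (err == 0 && n == (ssize_t)sizeof block) ? 0 : -1;
}
```

the point of the split is that submit_write never waits, so the thread that issued the io stays dispatchable and decides for itself when to look at the outcome.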

who hasn't noticed that when pppd is using chat to dial your
telephone, not too much else happens?

when the device discovers what the ultimate fate of the io is,
it usually, by convention, will notify the controller, which will,
assuming arbitrary coding conventions in the driver, post the
event upon which your thread is waiting. check the memory word.
if all is cool, move forward. otherwise, the common convention across
devices as to what the result descriptor of the io looks like
gives you the opportunity to check bits here and there and know
exactly what happened without stalling your process, which might
be the main database process.
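
a rough userland analogue of "post the event, check the memory word, then read the result descriptor" can be sketched with POSIX AIO and SIGEV_THREAD notification. everything named here (struct io_event, post_event, start_read, check_event, event_demo) is hypothetical, the /tmp path is an assumption, and the "result descriptor" is just the status and byte count from aio_error and aio_return rather than real device status bits. older glibc needs -lrt and -pthread at link time.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum io_state { IO_PENDING, IO_OK, IO_SHORT, IO_FAILED };

/* Hypothetical result descriptor: one posted word plus the details. */
struct io_event {
    struct aiocb cb;
    atomic_int posted;   /* 0 = in flight, 1 = completion posted */
    int status;          /* 0 or an errno value from aio_error() */
    ssize_t bytes;       /* value from aio_return() */
};

/* Runs in a notification thread once the io has finished. */
static void post_event(union sigval sv)
{
    struct io_event *ev = sv.sival_ptr;
    ev->status = aio_error(&ev->cb);
    ev->bytes = aio_return(&ev->cb);
    atomic_store(&ev->posted, 1);    /* publish after the details are set */
}

/* Hypothetical helper: queue a read whose completion posts ev. */
int start_read(struct io_event *ev, int fd, void *buf, size_t len, off_t off)
{
    assert(ev != NULL && buf != NULL);
    memset(&ev->cb, 0, sizeof ev->cb);
    atomic_init(&ev->posted, 0);
    ev->status = EINPROGRESS;
    ev->bytes = -1;
    ev->cb.aio_fildes = fd;
    ev->cb.aio_buf = buf;
    ev->cb.aio_nbytes = len;
    ev->cb.aio_offset = off;
    ev->cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    ev->cb.aio_sigevent.sigev_notify_function = post_event;
    ev->cb.aio_sigevent.sigev_value.sival_ptr = ev;
    return aio_read(&ev->cb);
}

/* Check the memory word; never blocks. */
enum io_state check_event(struct io_event *ev, size_t wanted)
{
    if (!atomic_load(&ev->posted))
        return IO_PENDING;
    if (ev->status != 0)
        return IO_FAILED;
    return ((size_t)ev->bytes == wanted) ? IO_OK : IO_SHORT;
}

/* Hypothetical demo: write a record, read it back asynchronously.
 * Returns 0 on success, -1 on failure. */
int event_demo(void)
{
    char path[] = "/tmp/io_event_XXXXXX";   /* assumed scratch location */
    static const char rec[] = "0123456789abcdef";
    static char buf[sizeof rec];
    static struct io_event ev;              /* must outlive the callback */
    enum io_state st;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, rec, sizeof rec) != (ssize_t)sizeof rec ||
        start_read(&ev, fd, buf, sizeof buf, 0) != 0) {
        close(fd);
        return -1;
    }

    /* A real engine would do other work here instead of spinning. */
    while ((st = check_event(&ev, sizeof buf)) == IO_PENDING)
        ;

    close(fd);
    return (st == IO_OK && memcmp(buf, rec, sizeof rec) == 0) ? 0 : -1;
}
```

check_event only ever reads memory the process owns, so the main database process can call it as often as it likes without risk of stalling.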

scsi tends to do many things well at once.

i see it as the weapon of choice for people who need high
performance databases and can't afford RAID solutions.

give me a port to the data and a way to drive the device asynchronously
and i may not move the world but i'll probably move the heads.