Re: "raw" block devices?

Alessandro Suardi (asuardi@uninetcom.it)
Thu, 17 Oct 1996 04:39:16 +0200


Ingo Molnar wrote:
>
> On Thu, 17 Oct 1996, Linus Torvalds wrote:
>
[snip]
> >
> > Sure, there are old-fashioned databases that think they can do a better job
> > of it than the kernel does. They are usually wrong, I suspect. They are using
> > raw devices more for historical reasons than anything else, and they could
> > just as well use a filesystem.
>
> [yes, raw devices are a hack, still RDBMS ppl use it because:]
>
> one not-so obvious problem is that an RDBMS >has< to implement a
> write-cache for itself. Thus if the block device would be buffered too (in
> the kernel), then we had double buffering. [as it is buffered now]
>

[snipped good args about kernel writecache and RDBMSs]

Since 7.1.6 (1993, if I remember correctly) Oracle has shipped a document
called "Unix Tuning Tips", or something like that, with the software. One
interesting thing in it is the argument that 'newer filesystem implementations
reduce the gain of bypassing the buffer cache via raw device'. So Linus, it
was admittedly becoming a historical reason some time ago ;)

There are actually a few more reasons why Oracle may prefer raw devices.

1. Few Unices support AIO on filesystems. AIO usually delivers noticeable
gains for large databases; it also eliminates the potential synchronization
problems one may have when using sync IO with multiple db writer processes.

2. Oracle Parallel Server (which runs on clustered machines and MPPs) opens
a set of data files from different nodes. This happens in strict cooperation
with the Unix vendors' extensions to their OSs, since it basically means
allowing shared disk access from different nodes via high bandwidth
network devices. These extensions come with support for raw devices only
(I have only seen the IBM SP/2, Sun's PDB SPARCCluster and smaller clusters
made up of two HP9000s or two RS6000s). There must also be a distributed
locking policy (no more than one Oracle instance can have a given Oracle
block in its own buffer cache), which is implemented either in user space
using IPC shmem or in kernel space. Eventually all the DLMs (Distributed
Lock Managers) will be provided by Oracle and will be implemented in user
space to avoid kernel bloat.
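
To make the locking rule in point 2 concrete, here is a toy sketch of "no more
than one instance can have a block in its buffer cache". It is a single-process
model only; the class and the node names are made up for illustration and say
nothing about how Oracle's DLM or any vendor's lock manager is really built.

```python
# Toy, single-process model of the "one holder per block" rule.
# All names here are hypothetical; this is not Oracle's DLM.

class BlockLockTable:
    def __init__(self):
        self.holder = {}  # block number -> instance currently caching it

    def acquire(self, instance, block):
        """Grant the block to `instance` unless another instance holds it."""
        current = self.holder.get(block)
        if current is not None and current != instance:
            return False  # caller has to wait until the holder releases
        self.holder[block] = instance
        return True

    def release(self, instance, block):
        """Drop the block; only the current holder may release it."""
        if self.holder.get(block) != instance:
            raise ValueError(f"{instance} does not hold block {block}")
        del self.holder[block]


locks = BlockLockTable()
granted_a = locks.acquire("node_a", 42)        # first request is granted
granted_b = locks.acquire("node_b", 42)        # refused while node_a holds it
locks.release("node_a", 42)                    # node_a lets go of the block
granted_b_retry = locks.acquire("node_b", 42)  # now node_b can cache it
```

A real DLM has to do the same bookkeeping across nodes, which is where the
shmem or kernel space implementation comes in.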

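Back on point 1, the idea behind AIO is that one writer queues many block
writes and only then waits for the completions, instead of several db writer
processes each blocking on one write at a time. The sketch below only imitates
that pattern: it uses a thread pool as a stand-in for kernel AIO, a temporary
file as a stand-in for a raw device, and a block size and function names that
are my own assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # illustrative block size, not an Oracle value

def write_block(fd, block_no, data):
    """Write one block at its own offset; pwrite does not move the file position."""
    return os.pwrite(fd, data, block_no * BLOCK_SIZE)

def flush_blocks(fd, dirty, workers=4):
    """Queue every dirty block first, then wait for all the completions."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = [pool.submit(write_block, fd, n, d) for n, d in dirty.items()]
        return sum(f.result() for f in pending)

# A temporary file stands in for the data file or raw device.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
fd = os.open(tmp.name, os.O_RDWR)

dirty = {n: bytes([n]) * BLOCK_SIZE for n in range(8)}  # eight dirty blocks
written = flush_blocks(fd, dirty)
os.fsync(fd)  # make sure the data has reached the disk before reporting
size = os.fstat(fd).st_size

os.close(fd)
os.unlink(tmp.name)
```

Since a single process owns the whole queue of writes, there is nothing to
coordinate between writers, which is the synchronization gain mentioned above.
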
As for why Linux has brilliant IO performance on ext2 and commercial OSs a
little less: we followed the IO ordering algo debate a while ago.
I mailed the Oracle8 Beta webmaster 3 weeks ago, asking politely about the
chance of a Linux port... never got any reply :(

--alessandro (who is not speaking for his employer, you may guess who)