Re: "raw" block devices?

Linus Torvalds (torvalds@cs.helsinki.fi)
Thu, 17 Oct 1996 21:29:14 +0300 (EET DST)


[ Last message from me on this for a while: I'm going to zonk out. But
it's an interesting topic ]

On Thu, 17 Oct 1996, Ingo Molnar wrote:
>
> > You can handle write ordering by using a log-based database (never overwrite
> > any old data, so write ordering doesn't matter), and do a "fsync()" on the
> > file when you commit. [...]
>
> [ i really don't want to flame ... IMHO it's a very interesting topic
> which should be cleared up ]
>
> This brings up problems like locality. A log-based RDBMS has to give up
> locality only because the kernel can't guarantee ordering? Log-based
> filesystems and RDBMSs write fast and read slow. [this is an access
> pattern thing. A typical RDBMS application does more reads than writes]

Oh, see, I personally think that the _on_disk_ organization doesn't
necessarily have to reflect the actual organization of the data in the
application (in this case the database).

Yes, with the log-based setup, the on-disk stuff is not "nicely" organized.
However, how often do you actually need to worry about that? Because the
database isn't centered around the physical location of the blocks of data
(that's why we did the log structure in the first place), the physical disk
location is much less of an issue.

Instead of doing a raw disk access, you follow a pointer in your address
space. Yes, that's kind of oversimplified, but I wouldn't call it completely
unrealistic, especially on machines with more than 40 bits of virtual memory.
And we all have alphas, don't we? ;)

(yes, it gets a lot more complex if you can't assume that all the
database can be mapped in at one time, but people know how to address
those kinds of limitations)

With the above kind of mindset, the physical database file on disk is more
like a "backup for memory" than a database. The _real_ information is in
memory, and the only worry we have about the database is that if the
application (or machine) crashes, we have to be able to reconstruct it from
the disk image.

> So we have two conflicting constraints [if we accept the current
> non-ordered write-cache as our only cache]: locality and ordering. I would
> say rather let's change the cache behaviour, and let's force ordering at
> that level. And this is how Oracle works [i might be wrong: i have never
> seen their code, i can only judge based on documented things].

See, my opinion is that the caches handle the locality problem. Locality is
what caches are good for, after all. In this context, think of the whole
physical memory as a "cache" for a database that is likely to be an order of
magnitude or more larger than the physical memory and possibly (but hopefully
not) larger than the virtual memory, not just a few disk block caches. So the
"cache" for the file is really all of the physical pages that are currently
mapped in the process.

And the ordering is obviously handled by the log file.

Now, you obviously want to re-organize the log file every once in a while,
to "defragment" it etc. But that's something you'd do anyway (it's called
"making backups" with traditional databases); you just make sure that the
new copy is written out in a saner manner than the original fragmented
file ;)

Again, I'd very much like to point out that databases are NOT my area of
expertise. I might have missed something fundamental, but I don't see the
error of my ways, at least not immediately.

Linus