Re: The Central Mystery

Larry McVoy (
Thu, 24 Jul 1997 10:30:34 -0700

I don't know how it works in Linux. I can tell you how it works in
SunOS. I think some of the ideas are the same, though Linux' model is or
has been more of a dual track thing with data coming in differently via
the read/write and mmap interfaces.

As an aside, it is very useful to build up a bunch of ctags for the
kernel, including hand crafted ones that take you through the object
function pointers. I.e., suppose the VFS has an interface vop_read().
Pick a file system, like ext2fs, and stick in a tag that goes from vop_read
-> ext2fs_read. The details are wrong here but it is useful, extremely
useful, to be able to tag all the way down from sys_read() to the disk
subsystem and back. That is, by the way, how I learned about SunOS and
how I learn each new OS. You walk the I/O paths until you know them
by heart. If it is any consolation, the file I/O path is quite a bit
less complex than the networking I/O path, IMHO.
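To see why the hand-crafted tags are needed, here is a minimal sketch of the kind of function-pointer dispatch being described. Everything in it is an assumption made for illustration: the struct layouts, vfs_read(), and the in-memory v_data field are hypothetical and much simpler than any real kernel's. The point is only that the generic layer calls through a pointer, so an automatic tag generator has no way to get from the call site to ext2fs_read().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct vnode;

/* Hypothetical, cut-down operations vector; real ones have many entries. */
struct vnodeops {
    long (*vop_read)(struct vnode *vp, char *buf, size_t len);
};

struct vnode {
    const struct vnodeops *v_op;   /* per-file-system operations */
    const char            *v_data; /* stand-in for the file contents */
};

/* File-system-specific implementation (illustrative only). */
static long ext2fs_read(struct vnode *vp, char *buf, size_t len)
{
    size_t n = strlen(vp->v_data);

    if (n > len)
        n = len;
    memcpy(buf, vp->v_data, n);
    return (long)n;
}

static const struct vnodeops ext2fs_vnodeops = { .vop_read = ext2fs_read };

/*
 * Generic entry point. The call below names only "vop_read", which is
 * why a hand-written tag from vop_read to ext2fs_read is useful.
 */
static long vfs_read(struct vnode *vp, char *buf, size_t len)
{
    assert(vp != NULL && vp->v_op != NULL && vp->v_op->vop_read != NULL);
    return vp->v_op->vop_read(vp, buf, len);
}
```

A caller would set v_op to &ext2fs_vnodeops when the vnode is created and then only ever call vfs_read(); the extra tag records by hand the hop that the pointer hides.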

In SunOS, the VM rewrite made read/write obsolete. All I/O came in
through page faults (VNODE interface VOP_GETPAGE()) and went out through
inverse pagefaults, known as putpage(). To implement read/write, the
kernel actually mmap-ed the vnode into its own address space and then
did a bcopy. This is architecturally elegant but a performance loss
in general.
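The "map it, then copy" idea can be tried from user space. The sketch below is only an analogy under stated assumptions: read_via_mmap() is a hypothetical helper, it uses the POSIX mmap() call rather than anything from the SunOS kernel, and it maps and unmaps on every call where the kernel kept its mappings cached in segmap.

```c
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: pread()-like behavior built from mmap + memcpy. */
static ssize_t read_via_mmap(int fd, void *buf, size_t len, off_t off)
{
    struct stat st;
    long pagesz;
    off_t base;
    size_t delta, maplen;
    char *p;

    assert(buf != NULL);
    if (off < 0 || fstat(fd, &st) < 0)
        return -1;
    if (off >= st.st_size || len == 0)
        return 0;                         /* at or past EOF */
    if (len > (size_t)(st.st_size - off))
        len = (size_t)(st.st_size - off); /* clamp to EOF */

    /* mmap offsets must be page aligned, so map from the page start. */
    pagesz = sysconf(_SC_PAGESIZE);
    base   = off - (off % pagesz);
    delta  = (size_t)(off - base);
    maplen = len + delta;

    p = mmap(NULL, maplen, PROT_READ, MAP_PRIVATE, fd, base);
    if (p == MAP_FAILED)
        return -1;

    /* The copy touches the mapping; missing pages arrive via page faults. */
    memcpy(buf, p + delta, len);
    munmap(p, maplen);
    return (ssize_t)len;
}
```

The map/unmap on every call is the obvious cost here, and it hints at why the approach can lose to a direct copy path for small reads.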

OK, so how did I/O come into the system? In SunOS, all data was named data,
named by <inode, offset> pairs; a pair like this got you to a page if it was
in the cache. So suppose you did a read(). That turns into something like

    VOP_READ() -> ufs_read()
        mmap of the vnode into segmap (special kernel mapping area that was
        heavily cached)
        uiomove() (basically a bcopy)

The uiomove will page fault on a page not in the cache, and that was:

    figure out it is a page fault
    go to the address space containing the fault
    go to the segment handling this part of the address space
    ask the VFS for the page

This had two paths, minor and major page fault:

    if (p = page_find(...)) {    # minor: it was in the page cache w/o mapping
        return (p);
    }
    # major: the page has to come from disk
    off = bmap the offset
    ufs_strategy(off, buffer, etc)
    return (p);

On the way back out, segvn gets the page[s] and inserts the TLB
entries for these page[s].
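The minor/major split can be sketched with a toy cache keyed by <inode, offset>. This is a rough illustration, not SunOS code: the direct-mapped table, getpage(), fake_disk_read(), and the tiny page payload are all assumptions made up for the example, and only the name page_find() comes from the trace above.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE   4096L
#define CACHE_SLOTS 64

struct page {
    int  in_use;
    long inode;     /* first half of the page's name */
    long offset;    /* second half, page aligned */
    char data[16];  /* tiny payload, for illustration only */
};

static struct page cache[CACHE_SLOTS];
static int disk_reads;  /* counts simulated major faults */

/* Direct-mapped slot choice; a real cache would hash and chain. */
static size_t slot_for(long inode, long offset)
{
    unsigned long key = (unsigned long)inode * 31UL
                      + (unsigned long)(offset / PAGE_SIZE);
    return (size_t)(key % CACHE_SLOTS);
}

/* Return the cached page named <inode, offset>, or NULL if absent. */
static struct page *page_find(long inode, long offset)
{
    struct page *p = &cache[slot_for(inode, offset)];

    if (p->in_use && p->inode == inode && p->offset == offset)
        return p;
    return NULL;
}

/* Stand-in for "bmap the offset, then call the strategy routine". */
static void fake_disk_read(struct page *p)
{
    disk_reads++;
    p->data[0] = (char)('A' + (p->offset / PAGE_SIZE) % 26);
}

static struct page *getpage(long inode, long offset)
{
    struct page *p;

    assert(offset >= 0 && offset % PAGE_SIZE == 0);

    p = page_find(inode, offset);
    if (p != NULL)
        return p;                     /* minor fault: already cached */

    /* Major fault: reuse the slot (evicting its old page) and do I/O. */
    p = &cache[slot_for(inode, offset)];
    p->in_use = 1;
    p->inode  = inode;
    p->offset = offset;
    fake_disk_read(p);
    return p;
}
```

Calling getpage() twice with the same <inode, offset> bumps disk_reads only the first time, which is the whole difference between the two paths.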

So there's SunOS. I would be interested in a similar brain dump for