Re: The Central Mystery

Larry McVoy (lm@neteng.engr.sgi.com)
Thu, 24 Jul 1997 10:30:34 -0700


I don't know how it works in Linux. I can tell you how it works in
SunOS. I think some of the ideas are the same, though Linux' model is or
has been more of a dual track thing with data coming in differently via
the read/write and mmap interfaces.

As an aside, it is very useful to build up a bunch of ctags for the
kernel, including hand crafted ones that take you through the object
function pointers. I.e., suppose the VFS has an interface vop_read().
Pick a file system, like ext2fs, and stick in a tag that goes from vop_read
-> ext2fs_read. The details are wrong here but it is useful, extremely
useful, to be able to tag all the way down from sys_read() to the disk
subsystem and back. That is, by the way, how I learned about SunOS and
how I learn each new OS. You walk the I/O paths until you know them
by heart. If it is any consolation, the file I/O path is quite a bit
less complex than the networking I/O path, IMHO.
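
To make the tags point concrete, here is a minimal sketch of the kind of
function pointer dispatch that a plain ctags run cannot follow. All of the
names (struct vnodeops, ext2fs_vnodeops, the vop_read() wrapper) are made up
for illustration and, as said above, the details are wrong for any real
kernel. A tag on vop_read lands you on the wrapper; the hand-crafted tag is
what takes you on to ext2fs_read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct vnode;

/* Per-file-system operations table: the "object function pointers". */
struct vnodeops {
    long (*vop_read)(struct vnode *vp, char *buf, size_t len);
};

struct vnode {
    const struct vnodeops *v_op;  /* which file system owns this vnode */
    const char *v_data;           /* stand-in for the file contents */
};

/* Hypothetical ext2fs implementation of the read operation. */
static long ext2fs_read(struct vnode *vp, char *buf, size_t len)
{
    size_t n = strlen(vp->v_data);

    if (n > len)
        n = len;
    memcpy(buf, vp->v_data, n);
    return (long)n;
}

static const struct vnodeops ext2fs_vnodeops = { .vop_read = ext2fs_read };

/*
 * Generic entry point. An automatically generated tag stops here, because
 * the real target is only known at run time through v_op.
 */
static long vop_read(struct vnode *vp, char *buf, size_t len)
{
    assert(vp != NULL && vp->v_op != NULL && vp->v_op->vop_read != NULL);
    return vp->v_op->vop_read(vp, buf, len);
}
```

With a vnode whose v_op points at ext2fs_vnodeops, a call to vop_read()
ends up in ext2fs_read(), which is exactly the hop the extra tag records.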

In SunOS, the VM rewrite made read/write obsolete. All I/O came in
through page faults (VNODE interface VOP_GETPAGE()) and went out through
inverse pagefaults, known as putpage(). To implement read/write, the
kernel actually mmap-ed the vnode into its own address space and then
did a bcopy. This is architecturally elegant but a performance loss
in general.
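
The same idea can be sketched from user space: implement a read by mapping
the file and copying out of the mapping. This is only an analogy using the
POSIX mmap() interface, not the SunOS kernel code; read_via_mmap() is a
made-up helper name, and the kernel used a cached mapping area rather than
a fresh mmap per call.

```c
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to len bytes starting at offset off by
 * mapping the file and copying, instead of calling read(2).
 * Returns the number of bytes copied, 0 at end of file, -1 on error.
 */
static ssize_t read_via_mmap(int fd, void *buf, size_t len, off_t off)
{
    struct stat st;
    long pagesz = sysconf(_SC_PAGESIZE);
    off_t base;
    size_t skew;
    char *map;

    assert(pagesz > 0);
    if (off < 0 || fstat(fd, &st) < 0)
        return -1;
    if (off >= st.st_size)
        return 0;                          /* at or past end of file */
    if ((off_t)len > st.st_size - off)
        len = (size_t)(st.st_size - off);  /* clamp to end of file */
    if (len == 0)
        return 0;

    /* mmap offsets must be page aligned, so map from the page boundary. */
    base = off - (off % pagesz);
    skew = (size_t)(off - base);

    map = mmap(NULL, skew + len, PROT_READ, MAP_SHARED, fd, base);
    if (map == MAP_FAILED)
        return -1;

    /* The copy is what touches the pages; absent ones are faulted in. */
    memcpy(buf, map + skew, len);
    munmap(map, skew + len);
    return (ssize_t)len;
}
```

Mapping and unmapping on every call is part of why this approach tends to
lose on performance unless the mappings themselves are cached.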

OK, so how did I/O come into the system? In SunOS, all data was named data,
named by <vnode, offset> pairs; a pair like this got you to a page if it was
in the cache. So suppose you did a read(). That turns into something like

    read()
      vno_read()
        VOP_READ() -> ufs_read()
          mmap of the vnode into segmap (special kernel mapping area that was
          heavily cached)
          uiomove() (basically a bcopy)
          ...

The uiomove will pagefault on a page not in the cache and that was:

    trap()
      figure out it is a page fault
      as_fault()
        go to the address space containing the fault
        segvn_fault()
          go to the segment handling this part of the as
          VOP_GETPAGE()
            ask the VFS for the page
            ufs_getpage()
              This had two paths, minor and major page fault
              if (p = page_find(...)) { # it was in the page cache w/o mapping
                  return (p);
              }

              off = bmap the offset
              ufs_strategy(off, buffer, etc)
              biowait(buf);
              ....
              return (p);

On the way back out, segvn gets the page[s] and inserts the TLB
entries for these page[s].
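
The two paths through getpage can be sketched as a toy page cache keyed by
a (file identity, offset) pair. This is a user-space simulation under my
own assumptions, not SunOS code: the hash function, the fixed-size pool,
and fake_disk_read() (standing in for the bmap/strategy/biowait sequence)
are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NBUCKETS  64
#define NPAGES    16

struct page {
    const void *p_file;        /* identity of the file the page belongs to */
    long p_offset;             /* page-aligned offset within that file */
    struct page *p_next;       /* hash chain */
    char p_data[PAGE_SIZE];
};

static struct page *page_hash[NBUCKETS];
static struct page page_pool[NPAGES];
static size_t pages_used;
static int major_faults;       /* counts trips to the (fake) disk */

static unsigned page_hashfn(const void *file, long off)
{
    return (unsigned)(((size_t)file >> 4) ^ (size_t)(off / PAGE_SIZE)) % NBUCKETS;
}

/* Look the name up in the cache; NULL if the page is not resident. */
static struct page *page_find(const void *file, long off)
{
    struct page *p;

    for (p = page_hash[page_hashfn(file, off)]; p != NULL; p = p->p_next)
        if (p->p_file == file && p->p_offset == off)
            return p;
    return NULL;
}

/* Stand-in for "bmap the offset, start the I/O, wait for it". */
static void fake_disk_read(struct page *p)
{
    memset(p->p_data, 'x', PAGE_SIZE);
    major_faults++;
}

/* Assumes off >= 0. Minor fault: cache hit. Major fault: go to disk. */
static struct page *getpage(const void *file, long off)
{
    struct page *p;
    unsigned h;

    off -= off % PAGE_SIZE;            /* round down to the page boundary */
    if ((p = page_find(file, off)) != NULL)
        return p;                      /* minor fault, no I/O needed */

    assert(pages_used < NPAGES);       /* toy pool, no page replacement */
    p = &page_pool[pages_used++];
    p->p_file = file;
    p->p_offset = off;
    fake_disk_read(p);                 /* major fault */

    h = page_hashfn(file, off);
    p->p_next = page_hash[h];
    page_hash[h] = p;
    return p;
}
```

Asking twice for offsets in the same page of the same file bumps
major_faults only once; the second request is the minor fault path.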

So there's SunOS. I would be interested in a similar brain dump for
Linux...