Re: questions on page cache and buffers

Colin Plumb
Thu, 7 Aug 97 01:52:20 MDT

I was figuring this out myself. I'm in the process of writing this
up properly (in Documentation/vfs.txt), but here's a quick overview...

There are a number of function pointers that a file system can
implement as part of a read() call, and generic_ versions of
them that most file systems use.

The buffer cache is a "physically indexed" cache; the buffers
(512 bytes to 4K bytes, each sharing a page with other buffers of
the same size and each with a buffer_head structure) are indexed
by device and block number. Generally only meta-data pages (inodes,
bitmaps, superblock, etc.) are here.

The page cache is "virtually indexed" based on inode and offset.
File data is generally stored here. I/O to the physical device
checks the buffer cache and uses it if the data is there, but 99%
of the time, the data is not there.
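
To make the two indexing schemes concrete, here is a small user-space
sketch of the lookup keys. It is a toy model, not kernel code: every
name in it (toy_buffer_key, toy_page_key, toy_page_key_for, and the
4096-byte TOY_PAGE_SIZE) is made up for illustration.

```c
#include <assert.h>

/* Toy model only: none of these names exist in the kernel. */

#define TOY_PAGE_SIZE 4096UL

/* Buffer cache key: identifies data by where it lives on disk. */
struct toy_buffer_key {
    unsigned int dev;       /* device number */
    unsigned long block;    /* block number on that device */
};

/* Page cache key: identifies data by where it lives in a file. */
struct toy_page_key {
    unsigned long ino;      /* stands in for the inode */
    unsigned long offset;   /* page-aligned byte offset in the file */
};

/* Build the page cache key covering an arbitrary file position. */
static struct toy_page_key toy_page_key_for(unsigned long ino,
                                            unsigned long pos)
{
    struct toy_page_key k;

    /* The mask below only works for a power-of-two page size. */
    assert((TOY_PAGE_SIZE & (TOY_PAGE_SIZE - 1)) == 0);
    k.ino = ino;
    k.offset = pos & ~(TOY_PAGE_SIZE - 1);  /* round down to a page */
    return k;
}

static int toy_page_key_equal(struct toy_page_key a, struct toy_page_key b)
{
    return a.ino == b.ino && a.offset == b.offset;
}

static int toy_buffer_key_equal(struct toy_buffer_key a,
                                struct toy_buffer_key b)
{
    return a.dev == b.dev && a.block == b.block;
}
```

Two file positions inside the same page map to the same page cache key,
whatever disk blocks back them. The buffer cache key says nothing about
which file (if any) owns the block.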

There's a file->read pointer, which most file systems point at
generic_file_read (in mm/filemap.c). That function does some hairy
read-ahead manipulation and ends up calling inode->readpage().
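
The shape of that call chain can be sketched in user-space C. This is a
toy model under loud assumptions: the structs and functions (toy_file,
toy_inode, toy_generic_file_read, toy_readpage) are hypothetical, the
page is 16 bytes so the arithmetic is easy to follow, and there is no
read-ahead and no caching at all. It shows only the layering: a read
pointer that loops over pages and delegates each one to readpage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16   /* tiny page, purely for illustration */

struct toy_inode {
    size_t size;           /* file length in bytes */
    const char *backing;   /* pretend on-disk contents */
    /* Fill one page, starting at a page-aligned file offset. */
    int (*readpage)(struct toy_inode *inode, size_t offset, char *page);
    int readpage_calls;    /* bookkeeping for the demonstration */
};

struct toy_file {
    struct toy_inode *inode;
    size_t pos;            /* current file position */
    long (*read)(struct toy_file *file, char *buf, size_t count);
};

/* File-system-specific part: produce the data for one page. */
static int toy_readpage(struct toy_inode *inode, size_t offset, char *page)
{
    size_t n;

    assert(offset % TOY_PAGE_SIZE == 0);
    memset(page, 0, TOY_PAGE_SIZE);       /* zero-fill past end of file */
    if (offset < inode->size) {
        n = inode->size - offset;
        if (n > TOY_PAGE_SIZE)
            n = TOY_PAGE_SIZE;
        memcpy(page, inode->backing + offset, n);
    }
    inode->readpage_calls++;
    return 0;
}

/* Generic part: split the request into pages and call readpage. */
static long toy_generic_file_read(struct toy_file *file, char *buf,
                                  size_t count)
{
    struct toy_inode *inode = file->inode;
    size_t done = 0;

    while (done < count && file->pos < inode->size) {
        char page[TOY_PAGE_SIZE];
        size_t page_start = file->pos - file->pos % TOY_PAGE_SIZE;
        size_t in_page = file->pos - page_start;
        size_t n = TOY_PAGE_SIZE - in_page;

        if (n > count - done)
            n = count - done;             /* caller wants less */
        if (n > inode->size - file->pos)
            n = inode->size - file->pos;  /* stop at end of file */

        if (inode->readpage(inode, page_start, page) != 0)
            return -1;
        memcpy(buf + done, page + in_page, n);
        done += n;
        file->pos += n;
    }
    return (long)done;
}
```

A file system in this model only has to supply toy_readpage; pointing
the read member at toy_generic_file_read gives it the rest for free,
which is the division of labour the generic_ functions aim for.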

Most file systems also use the generic_readpage function, which
is implemented in terms of inode->bmap(), which they *do* provide.
generic_readpage breaks the page up into buffers of the appropriate
size for the file system and calls inode->bmap() to find the
location on disk of each of them. There are three cases:
- The block number is 0, meaning it should be zero-filled.
  (Currently, all-zero pages don't share storage with the
  system all-zero page, but this could be changed.)
- The block is in the buffer cache, in which case it is copied over.
- The block is nowhere to be found (the usual case). For all of these
  blocks, generic_readpage (and generic_writepage, for that
  matter) allocates a temporary buffer_head and uses it to start the I/O.
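
The three cases above can be sketched as a user-space toy. Everything
here is an assumption made for illustration: the names (toy_fs,
toy_bmap, toy_readpage), the 4-byte blocks and 16-byte page, the
one-entry "buffer cache", and above all the synchronous memcpy() that
stands in for starting asynchronous I/O through a temporary
buffer_head. The file is also limited to a single page.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLOCK_SIZE 4
#define TOY_PAGE_SIZE 16
#define TOY_BLOCKS_PER_PAGE (TOY_PAGE_SIZE / TOY_BLOCK_SIZE)
#define TOY_DISK_BLOCKS 8

struct toy_fs {
    /* File block index -> disk block number; 0 means a hole. */
    unsigned int map[TOY_BLOCKS_PER_PAGE];
    char disk[TOY_DISK_BLOCKS][TOY_BLOCK_SIZE];
    /* A one-entry "buffer cache" is enough for the demonstration. */
    unsigned int cached_block;          /* 0 means nothing is cached */
    char cached_data[TOY_BLOCK_SIZE];
    int io_started;                     /* blocks that needed disk I/O */
};

/* Stand-in for the file system's bmap(): where is this block on disk? */
static unsigned int toy_bmap(const struct toy_fs *fs, unsigned int file_block)
{
    assert(file_block < TOY_BLOCKS_PER_PAGE);
    return fs->map[file_block];
}

/* Fill the file's only page, one block-sized buffer at a time. */
static void toy_readpage(struct toy_fs *fs, char *page)
{
    unsigned int i;

    for (i = 0; i < TOY_BLOCKS_PER_PAGE; i++) {
        char *dst = page + i * TOY_BLOCK_SIZE;
        unsigned int blk = toy_bmap(fs, i);

        if (blk == 0) {
            /* Case 1: a hole, so zero-fill. */
            memset(dst, 0, TOY_BLOCK_SIZE);
        } else if (blk == fs->cached_block) {
            /* Case 2: already in the buffer cache, so copy it over. */
            memcpy(dst, fs->cached_data, TOY_BLOCK_SIZE);
        } else {
            /* Case 3: not cached; the toy reads the disk directly. */
            assert(blk < TOY_DISK_BLOCKS);
            memcpy(dst, fs->disk[blk], TOY_BLOCK_SIZE);
            fs->io_started++;
        }
    }
}
```

Note that case 2 takes the cached copy even when the disk holds
something older, which is why the buffer cache has to be consulted
before going to the device.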

When the last buffer_head finishes I/O, the buffer_head chain is
deallocated and the page is marked present (see free_async_buffers()).

Most of these shenanigans take place in fs/buffer.c.

Note that while changes made to the page cache data get reflected to the
buffer cache eventually, the reverse is NOT guaranteed to happen.
Accessing a raw device (through the buffer cache) while it is mounted
(and parts of its data are in the page cache) is generally NOT a good
idea. (It IS safe as long as you stick to meta-data and avoid
any actual file data. I'm not sure about directories.)
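
The one-way nature of that coherency can be shown with a deliberately
oversimplified user-space model of a single block. All names here
(toy_block, toy_file_read, toy_file_write, toy_raw_write) are
hypothetical, and "eventually" is modelled as "immediately" to keep
the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 4

struct toy_block {
    char device_copy[TOY_BLOCK_SIZE];  /* what raw device access sees */
    char page_copy[TOY_BLOCK_SIZE];    /* what file access sees */
    int page_valid;                    /* has page_copy been filled? */
};

/* File read: fill the page copy once, then always serve from it. */
static void toy_file_read(struct toy_block *b, char *out)
{
    assert(b != NULL && out != NULL);
    if (!b->page_valid) {
        memcpy(b->page_copy, b->device_copy, TOY_BLOCK_SIZE);
        b->page_valid = 1;
    }
    memcpy(out, b->page_copy, TOY_BLOCK_SIZE);
}

/* File write: update the page copy and push it down to the device. */
static void toy_file_write(struct toy_block *b, const char *in)
{
    assert(b != NULL && in != NULL);
    memcpy(b->page_copy, in, TOY_BLOCK_SIZE);
    b->page_valid = 1;
    memcpy(b->device_copy, in, TOY_BLOCK_SIZE);  /* page -> device */
}

/* Raw device write: nothing tells the page copy that it is stale. */
static void toy_raw_write(struct toy_block *b, const char *in)
{
    assert(b != NULL && in != NULL);
    memcpy(b->device_copy, in, TOY_BLOCK_SIZE);  /* device only */
}
```

In this model, a toy_raw_write() after the first toy_file_read() is
invisible to later file reads, while a toy_file_write() always reaches
the device copy. That asymmetry is the hazard described above.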

Anyway, read(), write(), and mmap() all use the page cache for the
"main" data, at least in the usual case.

If you're implementing something, be sure to choose carefully the level
at which you hook into the read() path.

For a memory/swap file system, I'm not quite clear on what you need,
but I think it's a combination of two things: the "private" mapping
code (with the file_private_mmap vm_operations_struct, so pages get
sent to swap space) and having your pages indexed in the page cache.
That may require some tweaking of the swap_in code.

I think you just want everything in the page cache, some completely
anonymous pages for metadata, and some indexed pages for file data.
Doing that, 99% of file accesses will be hits in the page cache and
you don't have to worry about I/O to them at all.