Buffer and page cache

braam@cs.cmu.edu
Tue, 02 Nov 1999 08:15:36 -0700


Hi,

I'm working on a file system which talks to an "inode disk"; the storage
industry calls these object based disks. A simulated object based disk
can be constructed from the lower half of ext2 (or of any other file
system, for that matter).

The file system has no knowledge of disk blocks, and solely uses the
page cache.

I'd like these pages to age a little before handing them over to the
"inode disk", because the "write_one_page" function called by
generic_file_write would incur significant latency if the inode disk is
"real", i.e. not simulated in the same system.

So we have a page cache for the inodes in the file system where the
pages become dirty - but no buffers are attached. It reminds me of a
shared mapping, but there is no vma for the pages.

What appears to be needed is the following. Probably the gaps are mostly
in my understanding, but I'd appreciate advice on how to attack these
points:

- a bit to keep shrink_mmap away from the page. When the file system
writes in this page, we need to change its state so that it doesn't get
thrown out afterwards. We could "get" the page for this purpose.
Locking is not good, since we may need to write to the page again.

- a bit for a struct page that indicates the page needs to be written.
From block_write_full_page one could think that the PageUptoDate bit is
maybe the one to use. But does that bit really say that the page is
"dirty", the way the dirty flag does for buffers?

- some indication of aging: we would like a pgflush daemon to walk the
dirty pages of the file system and write them back _after_ a little
while.
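
To make the three points above concrete, here is a rough user-space
sketch in C of the state I have in mind. It is only a model, not kernel
code: struct sim_page, sim_mark_dirty, sim_can_reclaim, sim_flush_pass
and SIM_FLUSH_AGE are all names made up for this illustration, and time
is just a tick counter passed in by the caller. The assumption is that
a reference count can serve as the "keep shrink_mmap away" bit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define SIM_FLUSH_AGE 3UL   /* ticks a page must stay dirty before writeback */

struct sim_page {
    int count;              /* reference count; > 0 keeps the reclaimer away */
    int dirty;              /* page holds data not yet sent to the object disk */
    unsigned long dirtied;  /* tick at which the page first became dirty */
};

/* The file system wrote into the page: mark it dirty and pin it once. */
void sim_mark_dirty(struct sim_page *p, unsigned long now)
{
    if (!p->dirty) {
        p->dirty = 1;
        p->dirtied = now;   /* later rewrites do not reset the age */
        p->count++;         /* the "get" that keeps the page cached */
    }
}

/* Stand-in for a shrink_mmap-like reclaimer: only unpinned pages may go. */
int sim_can_reclaim(const struct sim_page *p)
{
    return p->count == 0;
}

/*
 * One pass of a pgflush-like daemon: write back every page that has
 * been dirty for at least SIM_FLUSH_AGE ticks. Returns the number of
 * pages flushed.
 */
int sim_flush_pass(struct sim_page *pages, size_t n, unsigned long now)
{
    int flushed = 0;

    for (size_t i = 0; i < n; i++) {
        struct sim_page *p = &pages[i];

        if (p->dirty && now - p->dirtied >= SIM_FLUSH_AGE) {
            /* a real version would hand the page to the object disk here */
            p->dirty = 0;
            assert(p->count > 0);
            p->count--;     /* drop the pin taken in sim_mark_dirty */
            flushed++;
        }
    }
    return flushed;
}
```

A page that is dirtied at tick 10 and written again at tick 12 stays
pinned and keeps its original age, so a flush pass at tick 12 leaves it
alone and a pass at tick 13 writes it back and unpins it. Whether a
plain reference is the right way to hold the page is exactly what I am
unsure about.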

The construction should hopefully be capable of supporting Stephen's
journaling extensions too, but I can't take it all in at once
(he probably can).

Any advice would be appreciated!

Now, why are we doing this?

Effectively we have split Ext2 into an upper half (the file system) and
a lower half (the object based device driver).

For cluster file systems it does seem an attractive division of labor to
let the drive do the allocation and have the clustered file system only
share inode metadata and data blocks. So the block and inode allocation
metadata is not spread around the cluster. This saves locks and traffic
and, perhaps most importantly, complexity.

You can find some preliminary code at:
ftp://carissimi.coda.cs.cmu.edu/pub/obd, but currently it writes through
to the disk and doesn't cluster yet. Hence this message.

- Peter -
