Why cachefs lives directly on a block device

From: David Howells
Date: Sat Aug 28 2004 - 06:38:19 EST



Andrew Morton <akpm@xxxxxxxx>:
> Maybe I'm being stoopid, but I don't see why this whole cachefs thing
> cannot operate by creating one regular file per netfs file on top of some
> existing underlying filesystem.
>
> Why'd you design it this way?

There are several reasons:

(1) User interface.

I can use mount to add a cache that all interested filesystems can then
just take immediate advantage of. Otherwise I need to find some other way
of doing so. Admittedly, this is a minor point - I could just add a new
pair of syscalls.

(2) Performance.

If I go through another filesystem, then I can't do DMA directly into the
netfs's pages. Everything _has_ to be copied. Yes, I know directIO
now exists, but it's a userspace feature that I'm not sure I can
use.

It was also made clear to me that I wouldn't be able to change this - ie:
to add operations that start BIOs to read or write from a netfs page
instead of the discfs's page in the pagecache. I can see why it might be
hard to do for most discfs's: they are designed to use buffer heads and
would have to keep track of BIOs in progress to blocks without having
pages around to attach the information to.

Furthermore, we end up going through two lots of readahead calculations,
not one - the netfs does one and the discfs does one.

(3) Memory.

Having to copy to/from discfs pages also has potential implications for
memory usage and memory pressure. You end up using a lot more memory at
certain points - you have to get an extra page to do a read or a write;
so if the VM is trying to dispose of a page that hasn't yet been written
to the cache, it has to get a second page to be able to update the cache,
and it _has_ to update the cache or punch a hole.

Furthermore, every netfs inode in the cache has to have at least an inode
in memory all the time, and on a discfs you'd also have to have a struct
dentry and probably a struct file too. Cachefs only has to keep the inode
in memory, not the dentry or file structs. Theoretically,
I could probably dispense with the inode too, but it's probably more work
than it's worth.

(4) Holes.

The discfs must support holes. The cache must be able to detect the holes
and report to the netfs that it hasn't yet downloaded the data for that
page. I suppose I could possibly use inode->i_op->bmap()...

I can also punch holes in cachefs files, something that can't be done on
other discfs's at the moment.

(5) Data Consistency.

Cachefs uses a pair of journals to keep track of the state of the cache
and all the pages contained therein. This means that I don't get an
inconsistent state in the on-disc cache and I don't lose disc space.

One place where I take especial care is between the allocation of a block
and its splicing into the usual on-disc pointer tree and the data having
been written to disc. If power is interrupted and then restored, I can
replay the journal and see that a block was allocated but not written and
then punch it out. Being backed by a discfs, I'm not certain what will
happen.

It may well be possible to mark the discfs's journal, if it has one, but
how does the discfs deal with those marks?

Knowing that your cache is in a good state is vitally important if you,
say, put /usr on AFS. Someone we deal with puts everything barring /etc,
/sbin, /lib and /var on AFS and has a humungous cache on every
computer. Imagine if the power goes out and renders every cache
inconsistent, requiring all the computers to nuke their caches when the
power comes back on.
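
To make the replay rule above concrete (a block that was allocated but never
written gets punched out), here is a minimal sketch in C. It is not cachefs's
journal code: the record types, struct jrec and replay_journal() are all
hypothetical names, and the real on-disc format is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical journal record types (assumption, not the cachefs format). */
enum jrec_type { JREC_ALLOC, JREC_WRITTEN };

struct jrec {
	enum jrec_type type;
	unsigned block;		/* block number this record refers to */
};

/*
 * Walk the journal in order and collect every block that was allocated
 * but whose data was never recorded as written. Those are the blocks a
 * recovery pass would punch out. Returns the number of entries stored
 * in to_punch[], which must have room for max_out entries.
 */
size_t replay_journal(const struct jrec *log, size_t n,
		      unsigned *to_punch, size_t max_out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (log[i].type == JREC_ALLOC) {
			assert(count < max_out);
			to_punch[count++] = log[i].block;
			continue;
		}

		/* JREC_WRITTEN: the data reached disc, so drop the block
		 * from the pending list (order is not preserved). */
		for (size_t j = 0; j < count; j++) {
			if (to_punch[j] == log[i].block) {
				to_punch[j] = to_punch[--count];
				break;
			}
		}
	}
	return count;
}
```

The point of the sketch is only that the decision can be made from the
journal alone. With a discfs underneath, the equivalent information would be
in someone else's journal.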

(6) Recycling.

Recycling is simple on cachefs. I can just scan the metadata index to
look for inodes that require reclamation/recycling; and I can also build
up a list of the oldest inodes so that I can nuke them to make space.

Doing this on a discfs would require a search going down through a nest
of directories, and would probably have to be done in userspace.
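
As an illustration of why a flat metadata index makes this cheap, the sketch
below does one linear pass and keeps the N oldest in-use entries. It is an
assumption-laden example, not the cachefs recycler: struct index_entry, its
fields and find_oldest() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory view of one metadata index slot. */
struct index_entry {
	unsigned ino;		/* inode number */
	unsigned long atime;	/* last-access stamp; smaller means older */
	int in_use;		/* 0 = free slot */
};

/*
 * Scan the index once and keep pointers to the `want` oldest in-use
 * entries in oldest[], sorted oldest first. Returns how many were found
 * (at most `want`).
 */
size_t find_oldest(const struct index_entry *idx, size_t n,
		   const struct index_entry **oldest, size_t want)
{
	size_t found = 0;

	assert(idx != NULL && oldest != NULL);

	for (size_t i = 0; i < n; i++) {
		if (!idx[i].in_use)
			continue;

		/* Find where this entry belongs in the sorted list. */
		size_t pos = found;
		while (pos > 0 && oldest[pos - 1]->atime > idx[i].atime)
			pos--;
		if (pos >= want)
			continue;	/* newer than everything we keep */

		/* Shift newer entries down; the last one falls off if
		 * the list is already full. */
		size_t last = found < want ? found : want - 1;
		for (size_t j = last; j > pos; j--)
			oldest[j] = oldest[j - 1];
		oldest[pos] = &idx[i];
		if (found < want)
			found++;
	}
	return found;
}
```

No directory walk or name lookup is involved; the cost is one pass over a
contiguous table.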

(7) Disc Space.

I'd want to set a maximum size to the cache, but I can't guarantee being
able to reach that maximum size on a discfs.

If the recycler starts to nuke cache files to make space, the freed
blocks may just be eaten directly by userspace programs, potentially
resulting in the entire cache being nuked. Alternatively, netfs
operations may end up being held up because the cache can't get blocks on
which to store the data.

With cachefs, I can guarantee that I have access to every block.

(8) Users.

Users can't go into cachefs and run amok. The worst they can do is cause
bits of the cache to be recycled early. With a discfs-backed cache, they
can do all sorts of bad things to the files belonging to the cache, and
they can do this quite by accident.


There would be some advantages to using a file-based cache rather than a
blockdev-based cache:

(1) Writing to the cache.

Having to copy to or from a discfs's page means that a netfs can just
make the copy and then assume its own page is ready to go.

(2) Doesn't require its own blockdev.

You just nominate a directory and go from there; you don't have to
repartition or install an extra drive to make use of cachefs in an
existing system.

(3) Can use xattrs to store netfs data about a file.

Cachefs requires the netfs to store a key in any pertinent index entry,
and it also permits arbitrary data to be stored there.

A discfs could be requested to store the netfs's data in xattrs, and the
filename could be used to store the key, though the key would have to be
rendered as text not binary. Likewise indexes could be rendered as
directories with xattrs.
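
One way the "key as text" rendering could look is plain hex encoding, as in
the sketch below. This is only an assumption about a possible encoding;
key_to_filename() is a hypothetical helper, not part of cachefs.h.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Render a binary netfs key as a lowercase hex string usable as a
 * filename. out[] must hold at least 2 * len + 1 bytes. Returns 0 on
 * success or -1 if the buffer is too small.
 */
int key_to_filename(const unsigned char *key, size_t len,
		    char *out, size_t outsz)
{
	static const char hex[] = "0123456789abcdef";

	assert(key != NULL && out != NULL);

	if (outsz < 2 * len + 1)
		return -1;

	for (size_t i = 0; i < len; i++) {
		out[2 * i]     = hex[key[i] >> 4];	/* high nibble */
		out[2 * i + 1] = hex[key[i] & 0x0f];	/* low nibble */
	}
	out[2 * len] = '\0';
	return 0;
}
```

Hex doubles the key length, so with the usual 255-byte filename limit a
scheme like this would cap keys at 127 bytes unless longer keys were split
across directory levels.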

(4) You can easily make your cache bigger if the discfs has plenty of space.


One good point, though: I've tried to develop the cachefs-netfs interface
(cachefs.h) so that it is agnostic with respect to how underlying caches
work. This means that if the underlying mechanism changes radically, any
netfs that uses it won't have to change.

It should also be possible to change cachefs's interface such that caches of
different types can be mixed. fs/cachefs/interface.c doesn't really care, and
it could be split from cachefs entirely.

If you're going to insist it becomes file-backed, then will you be willing to
lend your support if I want to make discfs's change to make this easier?

David