Re: [PATCH] AFS filesystem for Linux (2/2)

From: David Howells (dhowells@cambridge.redhat.com)
Date: Thu Oct 03 2002 - 16:46:05 EST


Hi Jan,

Do I take it you were (partially) responsible for Coda development? I have to
admit I don't know much about Coda.

> So you want to eventually link kerberos into the kernel to get the
> security right?

That's unnecessary judging by OpenAFS. AFAICT, only the ticket needs to be
cached in the kernel (this is obtained by means of a userspace program), and
then the ticket is passed through the security challenge/response mechanism
provided by RxRPC.
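
Roughly, the only per-cell state the kernel seems to need to hold is
something like the following (a hypothetical layout for illustration, not
the structures from the patch; the userspace helper would be whatever
klog-like tool obtains the ticket):

/* Hypothetical cached AFS ticket as handed in by a userspace helper.
 * The kernel only replays this through the RxRPC challenge/response;
 * it never needs to talk to the KDC itself. */
struct afs_cached_ticket {
	time_t	expiry;				/* when the ticket lapses */
	u8	session_key[8];			/* kaserver-style DES session key */
	u16	ticket_len;			/* length of the opaque blob below */
	u8	ticket[AFS_MAX_TICKET_LEN];	/* AFS_MAX_TICKET_LEN is made up too */
};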

Otherwise, the entire network side of OpenAFS would have to be in userspace
too, I suspect.

It may be possible to offload the security aspects to userspace. I'll have to
think about that.

Besides, I get the impression that NFSv4 may require some level of kerberos
support in the kernel.

> Coda 'solves' the page-aliasing issues by passing the kernel the same file
> descriptor as it is using itself to put the data into the container (cache)
> file. You could do the same and tell the kernel what the 'expected size' is,
> it can then block or trigger further fetches when that part of the file
> isn't available yet.

I presume Coda uses a 1:1 mapping between Coda files and cache files stored
under a local filesystem (such as EXT3). If so, how do you detect holes in the
file, given that the underlying fs doesn't permit you to differentiate between
a hole and a block of zeros?
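
(To be concrete about the ambiguity I mean - a trivial userspace
illustration, nothing to do with the patch itself:)

/* Two cache files that read back identically: the first contains a hole,
 * the second a real block of zeros.  read() returns 4096 zero bytes from
 * either, and nothing in the ordinary VFS interface tells the cache
 * manager which case it is looking at. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char zeros[4096] = { 0 };
	int hole  = open("hole",  O_RDWR | O_CREAT | O_TRUNC, 0600);
	int solid = open("solid", O_RDWR | O_CREAT | O_TRUNC, 0600);

	ftruncate(hole, 4096);			/* block never fetched => hole */
	write(solid, zeros, sizeof(zeros));	/* block fetched, happens to be zeros */
	return 0;
}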

> We don't need to do it at such a granularity because of the disconnected
> operation. It is more reliable as we can return a stale copy when we lose
> the network halfway during the fetch.

OTOH, if you have a copy that you know is now out of date, then one could
argue that you shouldn't let the user/application see it, as anything they
then do is based on known "bad" data.

Should I also take it that Coda keeps the old file around until it has fetched
a revised copy? If so, then surely you can't update a file unless your cache
can find room for the entire revised copy. Another consequence of this is that
the practical maximum file size you can deal with is half the size of your
cache (with a 512MB cache, say, the old and new copies of a 256MB file already
fill it).

> Hmm, a version of AFS that doesn't adhere to AFS semantics, interesting.
> Are you going to emulate the same broken behaviour as transarc AFS on
> O_RDWR? Basically when you open a file O_RDWR and write some data, and
> anyone else 'commits' an update to the file before you close the
> filehandle. Your client writes back the previously committed data, which it
> has proactively fetched, but with the local metadata (i.e. i_size). So you
> end up with something that closely resembles neither of the actual versions
> that were written.

What I'm intending to do is have the write VFS method attempt to write the new
data directly to the server and to the cache simultaneously where possible. If
the volume is not available for some reason, I have a number of choices
(roughly sketched in the code after the list):

 (1) Make the write block until the volume becomes available again.

 (2) Immediately(-ish) fail with an error.

 (3) Store the write in the cache and try to sync up with the volume when it
     becomes available again.
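
Something along these lines, very roughly (all the helper names here are
made up for the sketch, not code from the patch):

/* Hypothetical sketch of choosing between the three behaviours above;
 * afs_volume_available(), afs_store_data(), afs_wait_for_volume() and
 * afs_cache_defer_write() are stand-ins for whatever the real operations
 * turn out to be. */
enum afs_write_policy { AFS_WRITE_BLOCK, AFS_WRITE_FAIL, AFS_WRITE_DEFER };

static ssize_t afs_do_write(struct afs_vnode *vnode, const char *buf,
			    size_t count, loff_t pos,
			    enum afs_write_policy policy)
{
	if (afs_volume_available(vnode))
		/* normal case: write through to server and cache together */
		return afs_store_data(vnode, buf, count, pos);

	switch (policy) {
	case AFS_WRITE_BLOCK:		/* (1) wait for the volume to return */
		return afs_wait_for_volume(vnode, buf, count, pos);
	case AFS_WRITE_FAIL:		/* (2) fail more or less immediately */
		return -ENETUNREACH;
	case AFS_WRITE_DEFER:		/* (3) cache now, sync up later */
		return afs_cache_defer_write(vnode, buf, count, pos);
	}
	return -EIO;
}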

However, with shared writable mappings, this isn't necessarily possible as we
can only really get the data when the VM prods our writepage(s) method. In
this case, we have another choice:

 (4) "Diff" the page in the pagecache against a copy stored in the cache and
     try to send the changes to the server.
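
The diffing itself would amount to little more than this (a sketch only):

/* Hypothetical: compare the page being written back against the copy held
 * in the cache and report the span that changed, so that only that byte
 * range need go over the wire.  Returns non-zero if anything differed. */
static int afs_page_diff(const u8 *new, const u8 *old, size_t len,
			 size_t *first, size_t *last)
{
	size_t i;

	*first = len;
	*last = 0;
	for (i = 0; i < len; i++) {
		if (new[i] != old[i]) {
			if (*first == len)
				*first = i;
			*last = i;
		}
	}
	return *first < len;
}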

Using disconnected operation doesn't actually make this any easier. The
problem of how and when write conflicts are resolved still arises.

There is a fifth option, and that is to try to lock the target file against
other accessors whilst we are trying to write to it (prepare/commit write
maybe).

> Different underlying filesystems will lay out their data differently; who
> says that ext3 with the dirindex hashes, or reiserfs, or foofs will not
> suddenly break your solution and still work reliably (and faster) from
> userspace.

Because (and I may not have made this clear) you nominate a block device as
the cache, not an already existing filesystem, and mount it as the afscache
filesystem type (so something like "mount -t afscache /dev/hdaX /var/cache/afs",
to make up an example). _This_ specifies the layout of the cache, and so
whatever other filesystems do is irrelevant.

> Can you say hack.

No need to. I can go directly to the block device through the BIO system, and
so can throw a heap of requests at the blockdev and deal with them as they
complete, in the order they come off the disc when scanning catalogues.

> When you scan a file from userspace the kernel will give you readahead, and
> with a well-working elevator any 'improvements' you obtain really should end
> up in the noise.

Since I can fire off several requests simultaneously, I effectively get a
readahead-like effect; and since I don't have to follow any ordering
constraints (my catalogues are unordered), I can deal with the blocks in
whatever order the elevator delivers them to me.
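
In outline it amounts to no more than this sort of thing (the parsing
helper is made up, and I've left out the BIO submission details):

/* Hypothetical outline: fire off reads for all the catalogue blocks at
 * once, then handle each block as its completion arrives, in whatever
 * order the elevator chose to service the requests. */
struct afs_catalogue_scan {
	atomic_t		outstanding;	/* reads still in flight */
	wait_queue_head_t	done;		/* woken when the last one lands */
};

static void afs_catalogue_block_done(struct afs_catalogue_scan *scan,
				     void *data)
{
	afs_parse_catalogue_block(data);	/* made-up parser; order is irrelevant */
	if (atomic_dec_and_test(&scan->outstanding))
		wake_up(&scan->done);
}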

> InterMezzo does the same thing; they even proposed a 'punch hole' syscall to
> allow a userspace daemon to 'invalidate' parts of a file so that the kernel
> will send the upcall to refetch the data from the server.

I don't need a hole-punching syscall or ioctl. Apart from the fact that the
filesystem is already in the kernel and doesn't require a syscall, the cache
filesystem has to discard a changed file in its entirety when it notices, or
is told of, a change.
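
So the reaction to a callback break is simply something like this
(hypothetical names again, not the patch itself):

/* Hypothetical: on a callback break the cached object is junked wholesale
 * and the next read repopulates it from the server.  There is no per-range
 * invalidation, hence no need for a hole-punching interface. */
static void afs_break_callback(struct afs_vnode *vnode)
{
	afs_cache_discard(vnode);	/* made-up cache op: drop the whole file */
	vnode->cache_valid = 0;		/* made-up flag: refetch on next read */
}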

> VM/VFS will handle appropriate readahead for you, you might just want to
> join the separate requests into one bigger request.

Agreed. That would be a reasonable way of doing it. The reason I thought of
doing it the way I suggested is that I could make the block size bigger in the
cache, and thus reduce the index-walking latency for adjacent pages (with 16KB
cache blocks, say, four adjacent 4KB pages would share a single index lookup).

> And one definite advantage, you actually provide AFS session semantics.

According to the AFS-3 Architectural Overview, "AFS does _not_ provide for
completely disconnected operation of file system clients" [their emphasis].

Furthermore, the overview also talks about "Chunked Access", under which files
are pulled over to the client and pushed back to the server in 64KB chunks,
thus allowing "AFS files of any size to be accessed from a client".

Note that 64KB is also a "default" that can be configured.

It also mentions that the read-entire-file notion was dropped, giving some of
the reasons I've mentioned.

> And my current development version of Coda has {cell,volume,vnode,unique}
> (128 bits), which is the same size as a UUID which was designed to have a
> very high probability of uniqueness. So if I ever consider adding another
> 'ident', I'll just switch to identifying each object with a UUID.
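
For reference, I take the identifier you describe to be something like
this (a hypothetical rendering, not actual Coda code):

/* Jan's {cell,volume,vnode,unique} identifier: four 32-bit fields,
 * 128 bits in all - the same size as a UUID. */
struct coda_style_fid {
	u32	cell;		/* assumes each cell gets a 32-bit ID - hence the question below */
	u32	volume;
	u32	vnode;
	u32	unique;
};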

Does this mean that every Coda cell is issued with a 4-byte ID number? Or does
there need to be an additional index in the cache?

> How about IPv6?

These were just examples I know fairly well to illustrate the problems.

> Or you could use a hash or a userspace daemon that can map a fs-specific
> handle to a local cache file.

You still have to store a hash somewhere, and if it's stored in a userspace
daemon's VM, then it'll probably end up being swapped out to disc, and it may
have to be regenerated from indices every time the daemon is restarted (or
else your cache has to be started afresh).

Thanks for your insights though.

Cheers,
David