Re: [PATCH V2 10/16] Squashfs: cache operations

From: Phillip Lougher
Date: Fri Oct 31 2008 - 00:43:58 EST


Jörn Engel wrote:
On Wed, 29 October 2008 01:49:56 +0000, Phillip Lougher wrote:
+/*
+ * Blocks in Squashfs are compressed. To avoid repeatedly decompressing
+ * recently accessed data Squashfs uses two small metadata and fragment caches.
+ *
+ * This file implements a generic cache implementation used for both caches,
+ * plus functions layered on top of the generic cache implementation to
+ * access the metadata and fragment caches.
+ */

I tend to agree with Andrew that a lot of this should be done by the
page cache instead. One of the problems seems to be that your blocksize
can exceed page size and there really isn't any infrastructure to deal
with such cases yet. Bufferheads deal with blocks smaller than a page,
not the other way around.


I thought about using the page cache, but the fact that block sizes can exceed the page size is only one of a number of reasons why I prefer the current solution; there are also simplicity and speed to consider.

There are three types of compressed block in Squashfs: datablocks, fragments, and metadata blocks. Of these, datablocks (by far the largest number of blocks) are decompressed and pushed into the page cache, and are not otherwise cached by Squashfs. This, obviously (?), is because they contain data for only one file, and so at the time of access there is a readily available address space to push the data into.

Metadata and fragment blocks are different in that, when accessed and decompressed (say for an inode or for a particular tail-end block), they will also contain metadata or tail-ends for other files. This data could be thrown away, but due to locality of reference it makes sense to temporarily cache it in case a near-future file access references it. It doesn't make much sense to cache it more than temporarily, though: much of the data will probably not be reused, and it already exists compressed in the buffer cache.

The squashfs cache is therefore designed to cache only the last couple of metadata and fragment blocks. It is also designed to be simple and extremely fast. The maximum size of the metadata cache is only 64 KiB.

Simplicity and speed are extremely important. The squashfs_metadata_read() wrapper around the cache is designed to step through the metadata a structure at a time (20 or so bytes), updating the read position in the metadata on each call, with more metadata cache blocks being read and decompressed as necessary. The common case, where the metadata is already in the cache (because we're stepping through it 20 or so bytes at a time), is designed to be extremely fast - a spinlock and array search only. I recently optimised the cache to use spinlocks rather than mutexes and reduced the lock holding time (necessary to move to spinlocks), and this resulted in a 20%+ speed improvement in reading squashfs filesystems.

Given the above, using an address space in the page cache would result in greater complexity, more memory overhead, and much slower operation. There are a number of reasons for this.

1. The amount and life-span of the data stored in the page cache is outside of Squashfs' control. As explained above it only makes sense to temporarily cache the last couple of metadata and fragment blocks. Using the page cache (if a 'large' address space is used) for these keeps more of them around for longer than necessary, and will potentially cause more worthy datablocks to be flushed on memory pressure.

2. The address space would cache uncompressed data, but Squashfs references this data by its compressed location within the filesystem. There is no one-to-one linear mapping from compressed location to decompressed location in the address space, so a lookup table still needs to be used to store that mapping. This lookup table (however implemented) is itself at least as complex as my current cache implementation.

3. Once the locations of the decompressed pages in the address space have been found, they'll need to be looked up in the page cache, and this has to be done for every 4K page. With the default fragment size of 128 KiB this means 32 separate lookups. Somewhat slower than one spinlock and array search per 128 KiB block in the squashfs cache implementation.

Comments, especially those of the form "you've got this completely wrong, and you can use the page cache like this, which will be simpler and faster than your current implementation", are welcome :) I'm not averse to using the page cache, but I can't see how it will be simpler or faster than the current implementation.

Phillip
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/