Re: Is any file system on Linux appropriate for very large directories?

Kai Henningsen (kai@khms.westfalen.de)
23 Jul 1996 09:16:00 +0200


torvalds@cs.helsinki.fi (Linus Torvalds) wrote on 19.07.96 in <Pine.LNX.3.91.960719121152.296A-100000@linux.cs.Helsinki.FI>:

> Oh, you can lock the whole directory for the duration of an operation, but
> then you run into performance problems due to that (anybody say "news
> serving"?) and you also get the deadlock issues (rename() is "fun" when
> you have two processes trying to move files between two different
> directories).

We *could* use the hash routines I wrote about. Shrinking a directory
isn't really supported, but then it isn't for ext2, either. The path to an
entry is *always* constant in length. In my implementation: compute the
hash; look up (hash mod bdirsize) in the bucket directory; fetch that
bucket; linearly search for the hash in the bucket; if found, use the
filepos to fetch the entry in the data file, otherwise it's not there.
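
To make that lookup concrete, here is a minimal C sketch of it. The types
and names (struct dirtable, filepos and so on) are mine, for illustration
only, not the actual routines; and real code would compare the name stored
at filepos before declaring a match, since different names can share a
hash.

#include <stdint.h>

#define BUCKET_BYTES 4096                      /* one 4 KB page per bucket */
#define PER_BUCKET   (BUCKET_BYTES / 8)        /* two longs per entry = 512 */

struct entry { uint32_t hash, filepos; };      /* hash + offset in data file */

struct bucket {
    unsigned depth, count;                     /* hash bits used; entries held */
    struct entry e[PER_BUCKET];                /* (header ignored in size math) */
};

struct dirtable {
    unsigned depth;                            /* bdirsize == 1 << depth */
    struct bucket **dir;                       /* the bucket directory */
};

/* Constant-length path: hash -> directory slot -> one bucket -> filepos.
 * The directory size is a power of two, so "mod bdirsize" is a mask. */
long lookup(struct dirtable *t, uint32_t hash)
{
    struct bucket *b = t->dir[hash & ((1u << t->depth) - 1)];
    unsigned i;

    for (i = 0; i < b->count; i++)             /* linear scan of one page */
        if (b->e[i].hash == hash)
            return b->e[i].filepos;            /* caller fetches the entry */
    return -1;                                 /* not there */
}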

Expanding the directory is fairly easy. You locate the correct bucket and
see if you can fit the new entry in. If not, you split that bucket in two.

This may force you to double the bucket directory, but that simply means
appending an identical copy to the end. And the overhead is low: I have
about 100 MB worth of message IDs indexed by a 16 KB bucket directory and
20 MB worth of buckets, which are 4 KB pages holding two longs per entry
(the hash and the data offset in the data file).
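
Continuing the sketch, insertion with bucket splitting and directory
doubling could look like the code below. I'm assuming the scheme is
essentially extendible hashing, which is what the split-and-double
behaviour describes; again all names are mine, and error handling is
omitted.

#include <stdlib.h>
#include <string.h>

static struct bucket *new_bucket(unsigned depth)
{
    struct bucket *b = calloc(1, sizeof(*b));
    b->depth = depth;
    return b;
}

/* Doubling really is just appending an identical copy of the directory. */
static void double_directory(struct dirtable *t)
{
    unsigned n = 1u << t->depth;

    t->dir = realloc(t->dir, 2 * n * sizeof(*t->dir));
    memcpy(t->dir + n, t->dir, n * sizeof(*t->dir));
    t->depth++;
}

/* Split one full bucket in two on the next-higher hash bit. */
static void split_bucket(struct dirtable *t, struct bucket *b)
{
    struct bucket *nb = new_bucket(b->depth + 1);
    uint32_t bit = 1u << b->depth;
    unsigned i, kept = 0, n = 1u << t->depth;

    for (i = 0; i < b->count; i++)             /* redistribute the entries */
        if (b->e[i].hash & bit)
            nb->e[nb->count++] = b->e[i];
        else
            b->e[kept++] = b->e[i];
    b->count = kept;
    b->depth++;

    for (i = 0; i < n; i++)                    /* repoint half of b's slots */
        if (t->dir[i] == b && (i & bit))
            t->dir[i] = nb;
}

int insert(struct dirtable *t, uint32_t hash, uint32_t filepos)
{
    struct bucket *b;

    /* Entries with identical hashes always land in the same bucket, so
     * more than PER_BUCKET identical hashes would loop here -- the rare
     * case discussed below. */
    while ((b = t->dir[hash & ((1u << t->depth) - 1)])->count == PER_BUCKET) {
        if (b->depth == t->depth)              /* no distinguishing bit left: */
            double_directory(t);               /* grow the directory first */
        split_bucket(t, b);                    /* then split on the next bit */
    }
    b->e[b->count].hash = hash;                /* room now: just append */
    b->e[b->count].filepos = filepos;
    b->count++;
    return 0;
}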

OK, you would get into trouble if you got so many entries with the exact
same hash that you couldn't fit them in one bucket, since entries with the
same hash _always_ have to go into the same bucket. Of course, you could
work around this with one of the more traditional collision handling
variants, but I believe this is sufficiently rare that you don't need to
care if the buckets are large enough. I have yet to see it happen. In my
implementation, it would take 512 *identical* hashes (a 4 KB bucket holds
4096/8 = 512 of the eight-byte entries). Of course, you need a sensible
hash function.
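
As an example of a sensible hash function for file names, something as
simple as FNV-1a would do; this is my illustration, not necessarily what
the implementation uses.

/* FNV-1a: one XOR and one multiply per byte; spreads short strings
 * like file names well. */
uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;                  /* FNV offset basis */

    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;                        /* FNV prime */
    }
    return h;
}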

And deleting an entry reduces to deleting it from its bucket.
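
In the terms of the sketch above, that is a swap-and-shrink inside a
single bucket; buckets are never merged, which matches the no-shrink
caveat.

/* Remove one entry from its bucket.  Order within a bucket doesn't
 * matter, so we just swap in the last entry. */
int delete_entry(struct dirtable *t, uint32_t hash, uint32_t filepos)
{
    struct bucket *b = t->dir[hash & ((1u << t->depth) - 1)];
    unsigned i;

    for (i = 0; i < b->count; i++)
        if (b->e[i].hash == hash && b->e[i].filepos == filepos) {
            b->e[i] = b->e[--b->count];
            return 0;
        }
    return -1;                                 /* no such entry */
}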

In all these cases, you can certainly avoid writing anything until you
have allocated the space. And splitting buckets does not happen very
often, if your buckets are large enough; and doubling the bucket directory
happens seldom enough that blocking for a while won't be a problem,
either.

Note also that you can repair the hash data by simply rebuilding it from
the actual entries. There's no need to try to reshuffle the structure in
fsck.
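
A repair pass could therefore be as simple as the loop below.
next_data_entry() is an assumed placeholder for reading records back,
since the data file's format isn't specified here, and t is assumed to
start out freshly initialized (one empty bucket).

/* Assumed iterator: yields each (name, filepos) pair in the data file. */
int next_data_entry(int fd, char *name, uint32_t *filepos);

void rebuild_index(struct dirtable *t, int datafd)
{
    char name[256];
    uint32_t filepos;

    while (next_data_entry(datafd, name, &filepos) == 0)
        insert(t, fnv1a(name), filepos);       /* the normal insert path */
}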

My current (not fs related) implementation is about 250 lines of Pascal :-)

Regards, Kai