Re: Is any file system on Linux appropriate for very large directories?

Linus Torvalds (torvalds@cs.helsinki.fi)
Fri, 19 Jul 1996 12:34:21 +0300 (EET DST)


On Thu, 18 Jul 1996, Stephen C. Tweedie wrote:
>
> > The problem with hashed directories (or any other kind of "smart"
> > directories) is not the lookup phase: that's pretty simple, and is
> > almost always speeded up by the hashing (or the hashing is done
> > badly).
>
> > The _real_ complexity comes in when you start adding/deleting
> > entries. You may get faster lookups, but you're almost sure to get
> > slower modifications, and what's worse - keeping track of the extra
> > state is making the whole thing more complex.
>
> That's not necessarily true. Certainly, insert/delete becomes more
> complex to code, but it can be made very substantially faster as a
> result.

Umm.. I'll agree. In theory.

However, the discussion is pretty much academic until somebody actually
does it, and even then I wouldn't trust the code to do the right thing
for the next year or two.

What makes insertion in particular so non-trivial is not so much the
actual operations themselves, but the fact that you no longer operate on
a local scale: one modification may touch a noticeable part of the
directory.

For example, doing a priority queue in memory is trivial. Trying to do
the same thing on a directory is complex as h*ll, because you don't ever
want the directory to be in an inconsistent state, yet at the same time
the IO operations force you to sleep..
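
Just to show what I mean by "trivial in memory", here's a bog-standard
textbook heap insert in C (nothing filesystem-specific, purely an
illustration).  Every one of those swaps is a cheap memory write; put
the same structure in a directory and each swap is potentially a
separate block that has to be read, modified and written back, with
readers able to see the half-finished state in between:

#include <stdio.h>

#define HEAP_MAX 256

static int heap[HEAP_MAX];
static int heap_size;

/* In-memory min-heap insert: walk the new element up towards the
 * root.  Each swap is a cheap store here; on disk each one could be
 * a separate block write, visible to readers before the next one. */
static void heap_insert(int val)
{
        int i;

        if (heap_size >= HEAP_MAX)
                return;

        i = heap_size++;
        heap[i] = val;
        while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
                int parent = (i - 1) / 2;
                int tmp = heap[parent];

                heap[parent] = heap[i];
                heap[i] = tmp;
                i = parent;
        }
}

int main(void)
{
        heap_insert(42);
        heap_insert(7);
        heap_insert(19);
        printf("min = %d\n", heap[0]);  /* prints 7 */
        return 0;
}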

Oh, you can lock the whole directory for the duration of an operation, but
then you run into performance problems due to that (anybody say "news
serving"?) and you also get the deadlock issues (rename() is "fun" when
you have two processes trying to move files between two different
directories).
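
To make the deadlock concrete, here's a minimal user-space sketch with
pthread mutexes standing in for per-directory locks (none of this is
real kernel code, and the names are made up).  Two threads move entries
in opposite directions, take the two locks in opposite order, and both
hang forever:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct dir {
        pthread_mutex_t lock;
        const char *name;
};

static struct dir A = { PTHREAD_MUTEX_INITIALIZER, "A" };
static struct dir B = { PTHREAD_MUTEX_INITIALIZER, "B" };

/* Move an entry from src to dst while holding both directory locks. */
static void cross_rename(struct dir *src, struct dir *dst)
{
        pthread_mutex_lock(&src->lock);
        sleep(1);                       /* stand-in for sleeping on IO */
        pthread_mutex_lock(&dst->lock); /* AB-BA: both threads wait here */
        printf("moved an entry from %s to %s\n", src->name, dst->name);
        pthread_mutex_unlock(&dst->lock);
        pthread_mutex_unlock(&src->lock);
}

static void *mover1(void *arg) { cross_rename(&A, &B); return NULL; }
static void *mover2(void *arg) { cross_rename(&B, &A); return NULL; }

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, mover1, NULL);
        pthread_create(&t2, NULL, mover2, NULL);
        pthread_join(t1, NULL);         /* never returns */
        pthread_join(t2, NULL);
        return 0;
}

The usual cure is a global lock ordering (say, always lock the directory
with the smaller inode number first), which is exactly the kind of extra
rule that makes rename() "fun".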

In contrast, a linked list is trivial: all operations are local. We
do a write lock on the directory for the duration of the operation, but that
write lock doesn't affect readers, so news is happy. And the actual directory
change can be done atomically (in fact, it's hard _not_ to), so rename() is
not a large problem (rename is still horrible, but orders of magnitude
easier than it would be with any "global state" directory structure).
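
For the record, this is roughly what the linear scheme looks like: a
simplified sketch in the style of the ext2 on-disk directory entries
(made-up names, no error handling beyond "didn't fit").  The point is
that an insert looks at one block and modifies one entry in it:

#include <string.h>

struct dirent_disk {
        unsigned int   inode;           /* 0 means "unused slot" */
        unsigned short rec_len;         /* bytes this entry owns */
        unsigned short name_len;
        char           name[255];
};

#define DIRENT_SIZE(len)  (8 + (((len) + 3) & ~3))

/* Insert "name" into one directory block: reuse a deleted slot or
 * split an entry that owns more space than it needs.  Purely local;
 * nothing outside this block is ever looked at or modified. */
int insert_entry(char *block, int blocksize,
                 const char *name, unsigned int ino)
{
        int need = DIRENT_SIZE(strlen(name));
        char *p = block;

        while (p < block + blocksize) {
                struct dirent_disk *de = (struct dirent_disk *) p;
                int used = de->inode ? DIRENT_SIZE(de->name_len) : 0;

                if (de->rec_len - used >= need) {
                        struct dirent_disk *nde =
                                (struct dirent_disk *) (p + used);

                        if (used) {             /* split the old entry */
                                nde->rec_len = de->rec_len - used;
                                de->rec_len  = used;
                        }
                        nde->inode    = ino;
                        nde->name_len = strlen(name);
                        memcpy(nde->name, name, nde->name_len);
                        return 0;
                }
                p += de->rec_len;
        }
        return -1;      /* caller appends a fresh block to the directory */
}

Lookup is the same walk over rec_len, which is of course why huge
directories get slow - but every operation stays inside the block it is
currently looking at.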

Also, "every problem is easy if you don't check for errors". Some of the
global state structures will need to split blocks, and a single split can
result in more splits lower down - in which case you may have succeeded
with the first split, only to notice that you're out of disk blocks for
the second one, and then you have to undo the operation.. (I'm told that
HFS does this?)
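
Stripped of all the actual tree code, the shape of the problem is an
operation that needs two allocations and has to unwind the first one
when the second fails (toy skeleton only, nothing here resembles any
real filesystem):

#include <stdio.h>

static int free_blocks = 1;             /* deliberately one short */

static int alloc_block(void)
{
        if (free_blocks == 0)
                return -1;
        free_blocks--;
        return 0;
}

static void release_block(void)
{
        free_blocks++;
}

/* A split that cascades one level: one new block for the leaf, one
 * for the parent.  If the second allocation fails after the first
 * split already happened, the first split has to be undone, or the
 * directory is left half-rewritten on disk. */
static int split_with_cascade(void)
{
        if (alloc_block() < 0)
                return -1;              /* nothing modified yet: easy */

        /* ...rewrite the leaf into two blocks here... */

        if (alloc_block() < 0) {
                /* ...merge the leaf back together again... */
                release_block();
                return -1;
        }

        /* ...rewrite the parent into two blocks here... */
        return 0;
}

int main(void)
{
        if (split_with_cascade() < 0)
                printf("out of blocks: cascaded split undone, directory intact\n");
        return 0;
}

With the linear scheme the equivalent failure is just "the new block
never got allocated", and there is nothing to undo.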

I'm not saying that a linked list (or array, rather) is technically
superior, but I _am_ saying that it may be the better choice despite the
limitations.

I'd be more than happy to see more advanced filesystems for Linux, but just
as an example: people have been talking about log-based filesystems for years
and years, and nothing has come out of it so far. The BSD4.4 LFS is badly
broken according to all accounts, and I haven't heard anybody say they'd fix
it.

The interest is obvious there, but right now all of the better filesystems
seem to have access latencies that can be counted in years or decades rather
than microseconds.

Linus