Re: Is any file system on Linux appropriate for very large

Mike Black (mblack@csihq.com)
Sat, 20 Jul 1996 09:44:58 -0400


>On Thu, 18 Jul 1996, Linus Torvalds wrote:
>> That's not necessarily true. Certainly, insert/delete becomes more
>> complex to code, but it can be made very substantially faster as a
>> result.
>
>Umm.. I'll agree. In theory.
>
>However, the discussion is pretty much academic until somebody actually
>does it, and even then I wouldn't trust the code to do the right thing
>for the next year or two.
>
>What makes especially insert so non-trivial is not so much the actual
>operations themselves, but the fact that you no longer operate on a local
>scale: one modification may modify a noticeable amount of the directory.
>
>For example, doing a priority queue in memory is trivial. Trying to do
>the same thing on a directory is complex as h*ll, because you don't ever
>want the directory to be in an inconsistent state, yet at the same time
>the IO operations force you to sleep..
>
>Oh, you can lock the whole directory for the duration of an operation, but
>then you run into performance problems due to that (anybody say "news
>serving"?) and you also get the deadlock issues (rename() is "fun" when
>you have two processes trying to move files between two different
>directories).
>
>Contrasted to that, a linked list is trivial: all operations are local. We
>do a write lock on the directory for the duration of the operation, but that
>write lock doesn't affect readers, so news is happy. And the actual directory
>change can be done atomically (in fact, it's hard _not_ to), so rename() is
>not a large problem (rename is still horrible, but orders of magnitude
>easier than it would be with any "global state" directory structure).
>
>Also, "every problem is easy if you don't check for errors". Some of the
>global state structures will need to split blocks, and a single split can
>result in more splits lower down - in which case you may have succeeded
>with the first split, only to notice that you're out of disk blocks for
>the second one, and then you have to undo the operation.. (I'm told that
>HFS does this?)
>
>I'm not saying that a linked list (or array, rather) is technically
>superior, but I _am_ saying that it may be the better choice despite the
>limitations.
>

I'll throw my 2 cents in here by bringing up the possibility of using skip
lists instead of B-trees or linked lists. I've done some benchmarking and
found skip lists to be immensely (an order of magnitude or so) faster
than B-trees. Skip lists are easy to write and operate on a local scale.
For those of you unfamiliar with them, a skip list is a probabilistic
linked list: inserts never require re-sorting, and reads are faster than
in a B-tree. I tested inserts against B-trees and found that skip lists
were 8 times faster. Skip lists really start to excel at larger numbers
of records.

A paper with more detail is at http://www.csihq.com/analysis/skiplist.html
-
/----------------------------------------------------------\
| Mike Black mblack@csihq.com 407-676-5118, x203 |
| Computer Science Innovations, Inc. 407-676-2355 FAX |
| Melbourne FL 32904-2314 http://www.csihq.com |
\----------------------------------------------------------/