Re: Alternate solutions (Was: Re: NFS still has caching problem)

Olaf Titz
Tue, 16 Jul 96 12:08 +0200

Path: not-for-mail
From: Olaf Titz <>
Subject: Re: Is any file system on Linux appropriate for very large
Message-ID: <>
Date: 16 Jul 1996 12:08:34 +0200
References: <>
Organization: private Linux site, southern Germany
Lines: 71

Drew Eckhardt <> wrote:
> 1. Any reasonable unix implementation (this includes Linux with the
> ext2fs) uses a name-to-inode translation cache, so sequential
> directory lookups within the kernel are usually avoided.

Unless you need to access many files in sequence at the user level.
This is what HPUX 10's braindamaged login does (user info a.k.a. the
shadow database is stored in one file per user, presumably a security
requirement; a big loss if you have 10,000 users on your system), and
it is also the access pattern on a news spool.

> 2. In some cases (for example, within a news reader) it may be
> desireable for implementation or user interface purposes (ie,
> having a directory layout which mirrors the newsgroup hierchy
> is quite intuitive, and opting for one or more directories of
> files facilitates incremental updates using standard tools like

The standard news spool layout is that article 1234 in comp.foo gets
stored in the file comp/foo/1234. However, there is no _requirement_
to do so; you could easily split it up like comp/foo/1/234 or
comp/foo/12/34 or whatever. This would require a change in perhaps
two or three places (assuming you have an integrated server system
like INN); it would however break news readers which read from the
spool directly instead of via NNTP.

INN does indeed suffer from very large spool directories. The access
pattern on a news spool consists of irregular random reads and fast
sequential creates; these are enough to break any inode cache. The
overview file (usually a file named .overview in each spool
directory) gets frequent read and write accesses, so its inode should
stay cached, but in practice it indeed helps to locate the overview
files in a separate directory tree.

This all boils down to the question "which data structure do you use
for directories?" Un*x filesystems employ a linear list, which is
guaranteed to perform badly for large directories no matter how big
your inode cache is. The Right (tm) data structure for a directory
would be a B* tree, as found in every textbook on the subject. Mac
HFS has this. Unfortunately, it is harder to implement.

Splitting the directory layout as described above (or as terminfo
does with its one-letter subdirectories) effectively turns the flat
list into a tree. This is not as efficient as getting it right in the
kernel, but it is often a workable solution.

> If you're going to do this, you need to be aware that the standard
> POSIX interface to directories (opendir()/readdir()/closedir())
> only provides for sequential access to an unsorted set of entries,
> and that lookups using this interface are going to be O(n).

You have random access to all file info using stat() etc.; the
readdir() interface is for listing a directory, which will always be
O(n) no matter what the underlying data structures are. The real
question is whether you need to list the directory at all.

> If the performance problem is access to the directory entries rather
> than opening files, you want to create a higher level abstraction than
> POSIX offers that uses some form of caching. In the fast, robust, and

What would you need this for? Either you want a list of all available
files (with or without additional info), in which case you have to
read it in anyway; or you want the stat info for particular files,
which you _could_ get via stat() every time _if_ it were efficient in
the underlying filesystem.

>> Just as long as the wheels keep on turning round
   I will live for the groove 'til the sun goes down <<