Re: Filesystem optimization..

Michael O'Reilly (michael@metal.iinet.net.au)
Wed, 07 Jan 1998 09:05:21 +0800

Next message: Albert Cranford: "2.1.78-fix for sysv as module"
Previous message: Kevin Lentin: "Re: PROPOSAL: /proc/dev"

In message <199801062215.WAA01963@dax.dcs.ed.ac.uk>, "Stephen C. Tweedie" writes:
> > Hmm. I think there's a few misconceptions here. The main problem with
> > such a large fileset (from what I've been able to measure) is not the
> > directory lookup, but the actual open of the file itself. The
> > directories mostly get cached, but the file inode set is way too big
> > to cache.
>
> Yes. The problem is that if you inline it, the total size --- inodes
> plus directories --- isn't going to change in the slightest, so you're
> just going to end up evicting a whole truckload of directory data from
> the cache. :( You can't have it both ways, unfortunately.

Yes and no. The ext2 filesystem must overprovision the inodes because
it can't add more later. Allocating them on demand would mean less
total space used.
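
A rough sketch of the space argument, for illustration only. The
128-byte inode is the classic ext2 on-disk size; the filesystem size,
the bytes-per-inode ratio, and the file count are assumed numbers, not
measurements from this thread.

```python
# Illustrative figures only; none of these come from the original message.
INODE_SIZE = 128  # bytes per on-disk inode in the classic ext2 layout


def preallocated_inode_bytes(fs_bytes, bytes_per_inode):
    """Space reserved for inode tables at mkfs time, whether used or not."""
    return (fs_bytes // bytes_per_inode) * INODE_SIZE


def on_demand_inode_bytes(file_count):
    """Space consumed if an inode only exists once its file is created."""
    return file_count * INODE_SIZE


fs_bytes = 4 * 1024**3   # assumed 4 GiB filesystem
ratio = 4096             # assumed one inode per 4 KiB of disk
files = 500_000          # assumed number of files actually stored

reserved = preallocated_inode_bytes(fs_bytes, ratio)
used = on_demand_inode_bytes(files)
print(f"preallocated: {reserved} bytes, on demand: {used} bytes")
```

The gap between the two figures is the overprovisioning; it shrinks as
the filesystem fills up with small files.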

The other issue is that we're going to get cache misses anyway. What
I'm trying to do is speed up a cache miss.

When doing a path lookup, the directory must already have been
read. Embedding the inode has the neat advantage that reading the
directory automatically pulls the inode in at a very low cost.

> > The terrible cache locality means that we need do physical
> > I/O to read the file inode a relatively huge percentage of the time.
>
> If you can't cache dir+inodes, you'll still need to do the seeks. If
> the directories are badly fragmented, you'll need to do even more of
> them. It's likely that the top 2 or 3 levels of the directory tree
> will remain fully cached, but if the lowest levels in the tree are not
> completely cached, and if they contain in the order of 100 entries
> each, then preallocating those directories to reduce fragmentation
> ought to be a huge performance gain. The seeks for inodes further up
> the tree shouldn't matter --- if we can cache all of those
> directories, then we can cache their inodes too. At the lowest level,
> each inode is only one seek, but a fragmented 100-entry directory
> could easily be five or ten!

Nod. And yes, the top level directories normally get cached. The
problem is that a cache miss on the lowest level directory does:
	seek to directory
	read
	seek to inode
	read
	seek to first block
	read/write

My thinking is that the middle seek is not needed if you can embed the
inode in the directory. Just cuts that seek out altogether.
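
The seek sequence above can be written down as a toy model. This is a
minimal sketch, not filesystem code: the function name is hypothetical
and the 10 ms average seek time is an assumed figure used only to put
a number on the difference.

```python
SEEK_MS = 10  # assumed average seek time, for illustration only


def seeks_on_cache_miss(inode_embedded):
    """List the seeks needed to open a file whose directory is not cached."""
    steps = ["seek to directory"]
    if not inode_embedded:
        # Separate inode table: the inode lives away from the directory block.
        steps.append("seek to inode")
    steps.append("seek to first block")
    return steps


classic = seeks_on_cache_miss(inode_embedded=False)
embedded = seeks_on_cache_miss(inode_embedded=True)
print(len(classic) * SEEK_MS, "ms vs", len(embedded) * SEEK_MS, "ms")
```

Under these assumptions the embedded layout saves one seek out of
three on every such miss; the reads themselves are unchanged.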

I can't think of many times the inode is read that the directory isn't
read beforehand... :)

> > The inode embedding in the directory doesn't affect the permissions
> > check at all.
>
> True, but that wasn't the point I was trying to make. The trouble is,
> we're just using multi-level directory hierarchies to fake tree lookup
> because the filesystem doesn't handle single huge directories itself
> very well. If we can get the filesystem to deal with the tree
> internally by implementing btree directories, then we get the same
> performance boost or better, but we no longer have a permission check
> and an inode lookup at every node in the tree. That has _got_ to
> speed things up enormously, as well as eliminating a lot of inode
> caching for branch nodes in the tree.

Indeed. That's a bit more ambitious tho. B-trees bring up a lot of
allocation issues that haven't (IMHO) been as well studied as the
ext2/ffs type allocation strategies.

I was trying to leverage the ext2 simplicity and speed, with a
relatively minor change (no changes to data block allocation etc.; the
only difference is that inodes are created on the fly rather than
allocated from a pre-built pool).

The b-tree issue is also fairly orthogonal to the current issue. Even
in a b-tree scheme, you've still got to decide where you put the
inodes. Do you embed them in the b-tree or in a separate inode block?

> It may well help, actually. If you fill a directory simply by
> creating a lot of files in it, then ext2fs will try to place the files
> in the same block group as the parent directory. It will allocate one
> directory block, then as files are created it will create as many file
> data blocks as it can, as sequentially as possible, until the
> directory gets extended --- at which point it will allocate another
> directory block after those files' allocations. This is in fact a
> sure fire way to get directory fragmentation, and would benefit
> greatly from the patch.

Ah-hah! I'll give the patch a whirl.
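
The allocation pattern Stephen describes can be mimicked with a toy
sequential allocator. This is only a sketch under assumed numbers (25
entries per directory block, 2 data blocks per file), and the
`prealloc` parameter is a hypothetical stand-in for directory
preallocation, not a description of how the actual patch works.

```python
def allocate(num_files, entries_per_dir_block, blocks_per_file, prealloc=1):
    """Toy sequential allocator; returns the block numbers of the directory."""
    next_block = 0
    dir_blocks = []
    reserved = []  # directory blocks set aside ahead of need
    for i in range(num_files):
        if i % entries_per_dir_block == 0:
            # Directory needs another block: reserve a contiguous run if
            # nothing is left over from an earlier reservation.
            if not reserved:
                reserved = list(range(next_block, next_block + prealloc))
                next_block += prealloc
            dir_blocks.append(reserved.pop(0))
        # File data lands right after whatever was allocated last.
        next_block += blocks_per_file
    return dir_blocks


def extents(blocks):
    """Count runs of contiguous block numbers."""
    return sum(1 for i, b in enumerate(blocks)
               if i == 0 or b != blocks[i - 1] + 1)


plain = allocate(100, entries_per_dir_block=25, blocks_per_file=2)
ahead = allocate(100, entries_per_dir_block=25, blocks_per_file=2, prealloc=4)
print(plain, extents(plain))
print(ahead, extents(ahead))
```

Without preallocation every directory block ends up separated by the
file data written in between; with it, the directory stays in a single
run.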

> Cheers,
> Stephen.
