Re: Big files in ext2fs (but not i_osync)

Matti Aarnio
Mon, 2 Mar 1998 14:37:34 +0200 (EET)

[ we really should trim down the Cc: lists ... ]

Ingo Molnar wrote:
> On Sun, 1 Mar 1998, Theodore Y. Ts'o wrote:
> > While we're on the subject of making changes to how the ext2 inode
> > stores direct block pointers, the far more efficient change to make is
> > to store extents (triples of starting logical block, starting physical
> > block, number of blocks) in the inode. This works since most files use
> > contiguously allocated blocks, so why not take advantage of that
> > encoding efficiency.
> doesnt this change the speed of bmap() from O(1) to O(log(N))? With the
> current scheme we exactly know where to look to get the index, with
> extents we have to binary-search every table? And, why do we have to store
> both logical and physical blocks, is there any nontrivial translation
> between them?

I have some practical experience with this -- I have code
that does exactly this...

However, before delving into that, let me remind you of
a few things about EXT2 blocks:

- The block size varies from 1k to the local system page size
(Intel: 4k, Alpha: 8k, as examples)
- With 1k blocks each indirection block can hold 256 block
addresses. With 4k blocks each can hold 1024.
- Triple indirection can address N^3+N^2+N+12 blocks, where
N is the per-block address count above. With large N, that
is O(N^3)

In general terms, if you are playing with HUGE files, you
are using as large blocks as you can (you would be a fool
not to do so..)

Now presume N=1024: you can address a gigablock (2^30 blocks)
of 4 kB each, that is, a 4 TB file. With N=2048 the triply-
indirected scheme can address more than 2^32 blocks, and
changing block addresses from a 32-bit unsigned value to
something else is such a piece of work that we could as well
call the resulting beast "EXT3" ..
(N=2048 -> 4 Gblocks @ 8kB -> 32 TB)

There is NO need to go to 4-way indirected blocks in EXT2.

Therefore the minimal (and IMO sufficient) method for allowing
larger than 4 GB files is to extend file-size storage bits.
(We need 13 new bits for 32 TB, and 10 new bits for 4 TB)

> > The hard part is deciding when an inode's blocks should be encoded as
> > extents, and when it should revert back to the old system. For files
> > which are created with a lot of seeks and jumping around, it may be much
> > more efficient to use the old scheme. Most of the time, though, we're
> > much better off using the extent-based encoding system.
> isnt there a problem with holes, plus we can not 'merge' extents? So this
> seems to be a mostly creation-time decision. Holes have to be encoded as
> single-block extents this way. (we do not know how fragmented a hole will
> be when it gets allocated some time later)

I thought about that for a few days at first -- shall I support HOLES?

There are two kinds of huge files that people have:
- Sparse files
- Dense files ( = contiguous, non-holed )

While playing with ideas (and code) I came to realize how many
problems it would cause to have an existing non-zero-size file
change its i-node coding from indirection to extents (or back).
I also very quickly found out what an awful swamp it is to
handle a hole into which a new block appears -- all extent data
above it needs to be moved up...

On a zero-size file, a chattr ioctl can be used to flip between
the extent-coding and classical-indirection schemes, but even
one block in the file makes that impossible to do reliably.
(Consider one process reading, and another doing the change on
the same i-node.. Reading/writing has no i-node semaphore
interlocks.)

Ergo, in my extent code the system allows extending the file
only at the end, and if you do a seek to the end of the
universe, the code fills all blocks between the current end
and the new end with 0-filled blocks.

To be able to code sufficient amounts of extents, I came up
with an indirection scheme there too, and while my current code
plays with altering the meaning of the i-node-contained extent
triplets from direct extents to indirect and doubly-indirect
ones, the code to handle all that is -- ugly. (And bloody long!)
(Also, as it currently stands, it is not race-free when
accesses and alterations overlap.)

Searching for an addressed block goes like this:

5 triplets in the i-node, the last two of which may be
direct or indirect ones. Pseudo-code:

find_again:
	datablk = in_range(targetblk, &ino->blkrangecache);
	if (datablk != 0)
		... found it!

	for (i = 0; i < 5; ++i) {
		datablk = in_range(targetblk, &ino->blktriplet[i]);
		if (datablk != 0) {
			... do deeper analysis, fill cache
			goto find_again;
		}
	}
	... block not found

With more than 5 extents, extent slots 3 and 4 change their
role from direct to indirect ones. The block-start addresses
in them then point to blocks containing extent triplets.

With a sufficient amount of fragmentation, slot 4 becomes
doubly indirected. (But by that time we are in deep
trouble anyway, and the classical scheme could have been
better overall.)

In all cases, it is trivial to search for the target block
within the extents of an extent block.
( The in_range() function yields zero when the target block
is not within the range, and the address of the block when
the block number is within the range. )

Now presume we have a beautiful contiguous file spanning
30 cylinder groups (or whatever they were called).

Each cylinder group begins with i-node storage and some bitmaps,
and thus the data cannot be stored contiguously; it must skip
over those i-node blocks and bitmaps.

Therefore the file has 30 extents, being constructed as
3 in the i-node, and 27 in singly indirect block.

Now search for any block within them. The first search
goes linearly over the i-node-contained ranges, and finds
that the block is within slot 3. (Count starts at 0.)
There are more than 5 extents, thus that range is actually
telling where to go to find an indirect extent block.

From that indirect block the search goes on: it picks the
number of extents in the block from the first u_int32_t value
in the block, and binary-searches for a range. Though it
could perhaps do a linear search as well -- the optimal case
means it will search through perhaps up to 30 extents, or
about 450 bytes from the beginning.

Having found the desired range, the routine fills in the cache
and jumps back to the cache lookup, and eventual block
addressing.

Now the idea with the cache is locality of reference, naturally.
If the next lookup hits the cache also, there is no need to
go to the difficult path of analyzing indirect range blocks.

There are certain difficulties in handling extension with
multiple active updaters, but those are manageable with i_sem
semaphore handling. File extension with this scheme MUST
be single-threaded! Reads and writes within the existing file
are free to proceed as they see fit.

Incidentally, the extent cache also works with the classical
indirect-block schemes. When looking up a block in the last
indirect block, it is trivial to scan the block addresses in
it to see if they are contiguous, and in the successful case,
to fill the cache so that it covers up to the end of the
contiguous range visible in the indirect block (starting from
the current block number).

Another way to do extent coding would be to store the extent
data in B-tree structures. There, range inserts/merges
would not be too difficult, but would require careful
interlocks.. (I have forgotten who suggested this one.)

> -- mingo

/Matti Aarnio <>
