Re: BIG files & file systems

From: Hans Reiser
Date: Fri Aug 02 2002 - 10:10:30 EST

There are a number of interfaces that need expansion in 2.5. Telldir
and seekdir would be much better if they took as their argument a
filesystem-specific opaque cookie (e.g. a filename). Using a byte offset
to reference a directory entry that was found by filename is an
implementation-specific artifact that obviously only works for
ufs/s5fs/ext2-style filesystems, and is just wrong.
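
To make the point about byte offsets concrete, here is a minimal sketch
in Python. It is a toy model, not a kernel or libc interface: SortedDir,
read_from_offset, read_after_name and list_all are hypothetical names,
and the "directory" is just a sorted list whose positions shift when an
entry is removed, as they might in a tree-based layout. It compares a
positional cursor with a name-based cookie when an entry is unlinked in
the middle of a listing.

```python
import bisect


class SortedDir:
    """Toy directory that keeps its names sorted (hypothetical model)."""

    def __init__(self, names):
        self.names = sorted(names)

    def read_from_offset(self, offset, count):
        # Positional cursor: only meaningful for the current layout.
        batch = self.names[offset:offset + count]
        return batch, offset + len(batch)

    def read_after_name(self, cookie, count):
        # Name cookie: resume strictly after the last name handed out.
        start = 0 if cookie is None else bisect.bisect_right(self.names, cookie)
        batch = self.names[start:start + count]
        return batch, (batch[-1] if batch else cookie)

    def unlink(self, name):
        self.names.remove(name)


def list_all(names, use_names):
    """List the directory two entries at a time, unlinking "a" after
    the first batch to simulate a concurrent delete."""
    d = SortedDir(names)
    seen = []
    cursor = None if use_names else 0
    first = True
    while True:
        if use_names:
            batch, cursor = d.read_after_name(cursor, 2)
        else:
            batch, cursor = d.read_from_offset(cursor, 2)
        if not batch:
            return seen
        seen.extend(batch)
        if first:
            d.unlink("a")  # entries after "a" now shift down by one
            first = False


names = ["a", "b", "c", "d", "e"]
print(list_all(names, use_names=False))  # positional cursor skips "c"
print(list_all(names, use_names=True))   # name cookie sees every entry
```

In this model the positional cursor silently misses "c" once the layout
shifts, while the name cookie does not care where the entries live.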

4 billion files is not enough to store the government's XML databases in.


Steve Lord wrote:

>On Fri, 2002-08-02 at 08:56, Jan Harkes wrote:
>>I was simply assuming that any filesystem that is using iget5 and
>>doesn't use the simpler iget helper has some reason why it cannot find
>>an inode given just the 32-bit ino_t.
>In XFS's case (remember, the iget5 code is based on XFS changes) it is
>more a matter of the code to read the inode sometimes needing to pass
>other info down to the read_inode part of the filesystem, so we want to
>do that internally. XFS can have 64 bit inode numbers, but you need more
>than 1 Tbyte in an fs to get that big (inode numbers are a disk
>address). We also have code which keeps them in the bottom 1 Tbyte
>which is turned on by default on Linux.
>>This is definitely true for Coda, we have 96-bit file identifiers.
>>Actually my development tree currently uses 128-bit, it is aware of
>>multiple administrative realms and distinguishes between objects with
>>FID 0x7f000001.0x1.0x1 in different administrative domains. There is a
>>hash-function that tries to map these large FIDs into the 32-bit ino_t
>>space with as few collisions as possible.
>>NFS has a >32-bit filehandle. ReiserFS might have unique inodes, but
>>seems to need access to the directory to find them. So I don't quickly
>>see how it would guarantee uniqueness. NTFS actually doesn't seem to use
>>iget5 yet, but it has multiple streams per object which would probably
>>end up using the same ino_t.
>>Userspace applications should have an option to ignore hardlinks.
>>Very large filesystems either don't care because there is plenty of
>>space, don't support them across boundaries that are not visible to the
>>application, or could be dealing with them automatically (COW
>>links). Besides, if I really have a trillion files, I don't want 'tar
>>and friends' to try to keep track of all those inode numbers (and device
>>numbers) in memory.
>>The other solution is that applications can actually use more of the
>>information from the inode to avoid confusion, like st_nlink and
>>st_mtime, which are useful when the filesystem is still mounted rw as
>>well. And to make it even better, st_uid, st_gid, st_size, st_blocks and
>>st_ctime, and a MD5/SHA checksum. Although this obviously would become
>>even worse for the trillion file backup case.
>If apps would have to change then I would vote for allowing larger
>inodes out of the kernel in an extended version of stat and getdents.
>I was going to say 64 bit versions, but if even 64 is not enough for
>you, it is getting a little hard to handle.
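
As an illustration of the hardlink bookkeeping discussed in the quoted
text, here is a rough Python sketch of how a userspace tool might limit
what it keeps in memory. It is an assumption about one reasonable
approach, not a description of what tar or any other tool actually
does; find_hardlink_groups is a hypothetical helper. It remembers a
(st_dev, st_ino) key only for regular files whose st_nlink is greater
than 1.

```python
import os
import stat


def find_hardlink_groups(root):
    """Return lists of paths under root that share one inode.

    Only regular files with st_nlink > 1 are remembered, so the table
    grows with the number of multiply-linked files, not with the total
    number of files visited.
    """
    seen = {}  # (st_dev, st_ino) -> list of paths
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # do not follow symlinks
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                seen.setdefault((st.st_dev, st.st_ino), []).append(path)
    # Keep only inodes that were reached by more than one path in this tree.
    return [sorted(paths) for paths in seen.values() if len(paths) > 1]
```

This still trusts st_ino to be unique per device. On a filesystem that
hashes a larger identifier into ino_t, two unrelated files could share a
key, which is where comparing further stat fields such as st_size and
st_mtime, as suggested above, would come in.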



This archive was generated by hypermail 2b29 : Wed Aug 07 2002 - 22:00:19 EST