Re: safe file systems

Andrej Presern (
Thu, 25 Sep 1997 10:40:56 +0200

Alan Cox wrote:
> > Do you think it would be possible to build a safe, slow file system?
> > By safe, I mean that I could hit reset in the middle of 50 parallel
> > un-tars and reboot the system and the file system comes up clean (no fsck,
> > but data loss)?
> I dont think you can do it without an fsck. I think you can do it with
> a minimal "always works" fsck done occasionally only.
> Assume you write your minixfs or ext2fs such that you do
> Create/Write
> if need be allocate inode, write it as blank
> allocate blocks
> write block allocation table update
> write blocks to inode
> sync-point
> write data to blocks
> write file extent info
> sync-point
> write file length field + inode data + mark inode not deleted
> add to directory if needed (same rules as the above)
> Delete
> Unlink directory entry
> sync-point
> Write inode as deleted
> sync-point
> Write blocks as free
> Then its slow to write but you know that the fail cases are simple -
> if its in a directory its valid, has an inode and data. If its not in a
> directory then it is free but might be missing from inode/data/extent maps.
> That means your fsck is really a free space hoover.

I'm not an ext2 fs expert but here's my 2 cents on automatically validating
disks: always have two bitmaps, both ending (this is important) with some
sort of a counter. The bitmap with the higher counter value is the 'currently
active one'. When the currently active bitmap has changed, increase the
counter and write the bitmap to the blocks where the second (currently
inactive) bitmap is located. As the counter is being written at the end,
you can always be sure which bitmap is the valid (current) one by checking
the counter value (if the write is interrupted before the counter is written,
the first bitmap stays the valid one).

You can apply this approach to almost anything else you have. You just need
two copies of 'whatever you have', ending with a counter. Write the one that
has the lower counter value, then write the counter. If the counter writes
successfully, this is now your current 'whatever you have', otherwise the
other one stays the current one.

The counter needs only accept 3 values (0, 1 (>0), 2 (>1), *wrap* 0 (>2))
so 2 bits suffice..

This approach presumes that only one process is doing the updates.

> > Has anyone thought about this very much? If so, is there a mailing list or
> > archive that I can browse?
> I have been - its important for embedded Linux boxes. Im stuffed right now
> because I have no real way of forcing that down to disk order of writes. In
> fact if my tests are right then IDE drives are themselves re-ordering my I/O
> requests sometimes.

..which gets into way of the previously described algorithm..

> The other problem is turning off an IDE drive during a write can create
> permanent bad blocks. (Take an old 40Mb drive and yank its power a few times)
> so an fsck or cleanup has to do some kind of remap around those.

I think that the propper way is that the permanent bad blocks should be
discovered and remapped at access time.