Re: Soft metadata updates paper w/code

Matthias Urlichs
24 Jul 1997 23:27:14 +0200

Ingo Molnar writes:
> What about the following modification to your scratch copy technique.
> Instead of copying to scratch buffer before write, what about keeping two
> metadata blocks around, modifying the 'kernel copy', and correctly
> propagating changes to the 'disk copy'.

You may not have one true disk copy.

Things do get messy, anyway. Say you make Change 1 to block A, which may
not appear on disk until after block B has been written. Then you make
Change 2 to the same block, which needs to be held until block C has been
written. For completeness, let's also assume block A is dirty anyway.

A rollback scheme is, in theory, simple -- if the system writes block A,
you undo changes 2 and 1 (in that order), then you actually write the
block, then redo the changes. If the system writes block B, then you drop
change 1 from the list. If the system writes block C, then you mark change 2
as not-dependent (and toss it along with change 1 when the disk finally
gets around to writing block B).
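
To make the bookkeeping concrete, here is a rough user-space sketch of the
rollback list in Python. None of this is from the paper or from any kernel
code: the names Change, Buffer, write_out and dependency_written are made up
for illustration, the "disk" is a plain dict, and a change is assumed to be
a simple byte-range overwrite.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Change:
    offset: int               # where in the block the change landed
    old: bytes                # bytes before the change, used to undo
    new: bytes                # bytes after the change, used to redo
    waits_for: Optional[str]  # block that must hit the disk first


@dataclass
class Buffer:
    name: str
    data: bytearray
    changes: list = field(default_factory=list)  # oldest change first

    def _put(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload

    def modify(self, offset, new, waits_for=None):
        old = bytes(self.data[offset:offset + len(new)])
        self._put(offset, new)
        self.changes.append(Change(offset, old, bytes(new), waits_for))

    def write_out(self, disk):
        # Everything from the oldest still-dependent change onward is held.
        first = next((i for i, c in enumerate(self.changes)
                      if c.waits_for is not None), len(self.changes))
        held = self.changes[first:]
        for c in reversed(held):        # undo, newest first
            self._put(c.offset, c.old)
        disk[self.name] = bytes(self.data)
        for c in held:                  # redo, oldest first
            self._put(c.offset, c.new)
        del self.changes[:first]        # these records are on disk now

    def dependency_written(self, name):
        for c in self.changes:
            if c.waits_for == name:
                c.waits_for = None      # mark as not-dependent
        # Drop records from the front only, so later undos stay in order.
        while self.changes and self.changes[0].waits_for is None:
            self.changes.pop(0)


disk = {}                          # hypothetical stand-in for the device
a = Buffer("A", bytearray(b"...."))
a.modify(0, b"1", waits_for="B")   # Change 1
a.modify(1, b"2", waits_for="C")   # Change 2
a.write_out(disk)                  # disk copy of A has neither change
a.dependency_written("C")          # change 2 stays listed behind change 1
a.dependency_written("B")          # both records go away
a.write_out(disk)                  # now both changes reach the disk
```

Records are only ever dropped from the front of the list, which is why
change 2 has to hang around until block B is written.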

An additional advantage of this scheme is that you may be able to correct
some problems on the fly if a write fails. For example, let B be an
indirect block and let A be the inode pointing to B. If you can't write B,
you look at the change record for A, see where B is mentioned, and replace
it with B' (a newly allocated block). Then you write the data from B there
(just drop the buffer for B', if any, and change the block number in the
buffer header of B to B') and mark B as kaput. Voila -- no (meta)data
corruption at all when a disk gets flaky, which is _great_. (Umm, question
for anybody who has actually read the paper -- did this idea occur to the
authors as well?) Assuming you don't get read errors, of course. ;-)
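
A sketch of that remapping step, again in Python and again with invented
names (PointerChange, Inode, BufferHead, FlakyDisk, write_with_remap). It
assumes the change record for A stores which pointer slot was set and to
which block number, and that the free list is just a Python list. It works
only because A, with its pointer to B, has not been allowed onto the disk
yet.

```python
from dataclasses import dataclass, field


@dataclass
class PointerChange:
    slot: int        # pointer slot in the inode that was changed
    old_block: int   # previous block number (0 = none)
    new_block: int   # block number the slot was set to


@dataclass
class Inode:
    pointers: list
    changes: list = field(default_factory=list)


@dataclass
class BufferHead:
    blocknr: int
    data: bytes


class FlakyDisk:
    """Hypothetical disk that refuses writes to a set of bad blocks."""

    def __init__(self, bad):
        self.bad = set(bad)
        self.blocks = {}

    def write(self, blocknr, data):
        if blocknr in self.bad:
            raise OSError("write error on block %d" % blocknr)
        self.blocks[blocknr] = data


def write_with_remap(disk, inode, buf, free_blocks, bad_blocks):
    while True:
        try:
            disk.write(buf.blocknr, buf.data)
            return
        except OSError:
            old = buf.blocknr
            new = free_blocks.pop()      # B'; IndexError if nothing is free
            for ch in inode.changes:     # find where B is mentioned
                if ch.new_block == old:
                    ch.new_block = new
                    if inode.pointers[ch.slot] == old:
                        inode.pointers[ch.slot] = new
            buf.blocknr = new            # renumber the buffer header
            bad_blocks.add(old)          # mark B as kaput


disk = FlakyDisk(bad={7})
inode = Inode(pointers=[7],
              changes=[PointerChange(slot=0, old_block=0, new_block=7)])
buf = BufferHead(blocknr=7, data=b"indirect")
bad = set()
write_with_remap(disk, inode, buf, free_blocks=[9], bad_blocks=bad)
```

A real buffer cache would also have to drop any stale buffer already hashed
under the new block number, which this sketch skips.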

With a "disk block vs. memory block" scheme you'd probably need to store
all the intermediate stages as well..?

The problems with all of this really lie with the disk interface. You need
to be absolutely 100% sure there are no reordered writes once the kernel
tells the disk controller to write a block. The kernel cannot queue a block
with pending changes from an interrupt, and the kernel cannot let the file
system code change a block while it happens to be queued. The latter
problem might cause some performance degradation.
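
The "no changes while queued" rule could look something like the following
Python sketch. QueuedBuffer and its methods are hypothetical, and a
user-space lock only stands in for whatever the kernel would really do. The
point is just that file system code has to back off (or wait) for as long
as the block sits in the request queue, which is where the performance cost
comes from.

```python
import threading


class QueuedBuffer:
    def __init__(self, data):
        self.data = bytearray(data)
        self._io = threading.Lock()   # held while the block is queued

    def submit(self):
        """Queue the block; returns the image the controller will write."""
        self._io.acquire()
        return bytes(self.data)

    def complete(self):
        """Called when the write has finished."""
        self._io.release()

    def try_modify(self, offset, payload):
        """Change the block unless it is queued; returns True on success."""
        if not self._io.acquire(blocking=False):
            return False              # queued: the caller must retry later
        try:
            self.data[offset:offset + len(payload)] = payload
            return True
        finally:
            self._io.release()


blk = QueuedBuffer(b"abcd")
image = blk.submit()                    # block is now in the queue
refused = blk.try_modify(0, b"X")       # file system code is turned away
blk.complete()                          # write finished
accepted = blk.try_modify(0, b"X")      # now the change goes through
```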

Frankly, I'd be happier with a system that doesn't crash and doesn't
lose power in the first place. ;-) Unfortunately, not everybody can
depend on that.