Re: Soft metadata updates paper w/code

Colin Plumb (colin@nyx.net)
Thu, 24 Jul 97 17:15:32 MDT


Ted Ts'o wrote:
> The report talks about undoing changes and then redoing them after the
> write succeeds, so I assume that during the duration of the write,
> access to that disk block is locked out. This could be a contention
> issue for heavily accessed directories (like /tmp) or block bitmaps.

Dumb question: what does the system do now? Does it risk having the
device DMA capture a block image halfway through a modification? I
know it's done that way for mmap(), because there's nothing else that
can be done, but if I'm shuffling free space in an ext2 directory
block, it could get pretty ugly at the halfway point.

> The alternative approach would be to copy the block to scratch space and
> modify the scratch copy, and then let the device driver write that out
> to disk.

A third alternative: undo in place, but if you need access while the
write is pending, *then* make a scratch copy.
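
A minimal sketch of that third alternative, in user-space C rather than
kernel code. Everything here is hypothetical (struct buf, buf_modify,
buf_start_write, buf_access, buf_end_write, the one-byte change records);
none of it is an existing kernel interface. It assumes readers only need
read access while the write is in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define BLK_SIZE 8      /* tiny block, for illustration only */
#define MAX_CHANGES 4

/* One not-yet-safe change: a single byte, with both values kept. */
struct change {
    size_t off;
    unsigned char old_val, new_val;
};

struct buf {
    unsigned char data[BLK_SIZE];    /* image the driver writes from */
    unsigned char *scratch;          /* current view, made only on demand */
    struct change undo[MAX_CHANGES]; /* oldest first */
    size_t nchanges;
    bool write_pending;
};

/* Record a change that must be rolled back before the block is written. */
static void buf_modify(struct buf *b, size_t off, unsigned char val)
{
    assert(!b->write_pending && off < BLK_SIZE && b->nchanges < MAX_CHANGES);
    b->undo[b->nchanges++] = (struct change){ off, b->data[off], val };
    b->data[off] = val;
}

/* Undo in place (newest first), then hand b->data to the driver. */
static void buf_start_write(struct buf *b)
{
    for (size_t i = b->nchanges; i-- > 0; )
        b->data[b->undo[i].off] = b->undo[i].old_val;
    b->write_pending = true;
}

/* Reapply the changes (oldest first) to an image. */
static void redo_all(const struct buf *b, unsigned char *img)
{
    for (size_t i = 0; i < b->nchanges; i++)
        img[b->undo[i].off] = b->undo[i].new_val;
}

/* Reader's entry point: copies only if a write is actually in flight. */
static const unsigned char *buf_access(struct buf *b)
{
    if (!b->write_pending)
        return b->data;
    if (!b->scratch) {
        b->scratch = malloc(BLK_SIZE);
        if (!b->scratch)
            return NULL;        /* caller would have to wait instead */
        memcpy(b->scratch, b->data, BLK_SIZE);
        redo_all(b, b->scratch);
    }
    return b->scratch;
}

/* Write finished: restore the current contents and drop the copy. */
static void buf_end_write(struct buf *b)
{
    redo_all(b, b->data);
    free(b->scratch);
    b->scratch = NULL;
    b->write_pending = false;
}
```

The copy is paid for only by blocks that are actually touched while their
write is pending; an idle block is undone and redone in place.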

And a fourth: have the file system be clever enough in its
parse_metadata(bh) function to walk the redo list.
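
As a rough illustration of the fourth option, a reader can leave the
rolled-back image alone and overlay the pending changes while it reads.
The names below (struct redo, struct buf, buf_peek) and the one-byte
change records are assumptions made for the sketch; a real
parse_metadata(bh) would work on fs-specific records.

```c
#include <assert.h>
#include <stddef.h>

#define BLK_SIZE 8  /* tiny block, for illustration only */

/* One pending change, kept on a list ordered oldest first. */
struct redo {
    size_t off;
    unsigned char new_val;
    const struct redo *next;    /* next newer change, or NULL */
};

struct buf {
    unsigned char data[BLK_SIZE];   /* rolled-back image being written */
    const struct redo *redo;        /* pending changes, oldest first */
};

/* Read one byte as it will look once every pending change is redone. */
static unsigned char buf_peek(const struct buf *b, size_t off)
{
    unsigned char v;

    assert(off < BLK_SIZE);
    v = b->data[off];
    for (const struct redo *r = b->redo; r; r = r->next)
        if (r->off == off)
            v = r->new_val;     /* later entries override earlier ones */
    return v;
}
```

No copy is made and the image under DMA is never touched; the cost is a
list walk on every read of a block that has pending changes.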

> Either way, it requires pretty extensive changes and support all over
> the block device interface, and (perhaps) the generic filesystem layer.
> But the general approach is certainly worth keeping in mind. (Which is
> another way of saying I don't have time to rush out and implement it
> right now; maybe later.)

I think it belongs in the generic filesystem layer. The details of the
undo/redo list items obviously have to be fs-specific, but the basic
write-ordering operations can be generic, and so should be.

It's easy to give the file system the option of avoiding the copy in the
busy-buffer case, by having a special find_block_possibly_with_stuff_to_redo
call that doesn't block (or copy, whichever default you pick).

Simple ordered writes wouldn't be that hard, but it would be better to
queue a lot of blocks to the device driver and then let it pick which
ones to send out and walk the undo list to the appropriate point.
At this point, the locking gets "interesting".

You'd want to keep track, when making changes, of whether a block is
dirty at the beginning of its undo list (i.e. if we undid everything to
this block, would there be any point in writing it out?). Such blocks
should not be queued for writing until the situation changes, lest the
device driver engage in unproductive behaviour.
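
The bookkeeping could be as small as one flag per block. A sketch under
stated assumptions: struct blk and the blk_* functions are made up for
illustration, a write always rolls back every pending change, and no
changes arrive while the write is in flight.

```c
#include <assert.h>
#include <stdbool.h>

struct blk {
    bool base_dirty;    /* differs from disk even with every undo applied */
    unsigned pending;   /* changes that would be undone before a write */
};

/* A change that may go to disk at once (no unmet dependency). */
static void blk_change_safe(struct blk *b)
{
    b->base_dirty = true;
}

/* A change that must be rolled back until its dependency is on disk. */
static void blk_change_deferred(struct blk *b)
{
    b->pending++;
}

/* A dependency was satisfied: one deferred change becomes writable. */
static void blk_dependency_met(struct blk *b)
{
    assert(b->pending > 0);
    b->pending--;
    b->base_dirty = true;
}

/* The rolled-back image reached the disk. */
static void blk_write_done(struct blk *b)
{
    b->base_dirty = false;
}

/* Queue only if the rolled-back image would put something new on disk. */
static bool blk_should_queue(const struct blk *b)
{
    return b->base_dirty;
}
```

A block with only deferred changes has pending > 0 but base_dirty clear,
so it stays off the queue until blk_dependency_met() or blk_change_safe()
changes the situation.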

And as for Matthias' idea of recovering from flaky disks - it
could get a bit confusing to the executing programs to have their
files disappear from under them ("wups, sorry, I *can't* create that
inode!"), but there are possibilities.

-- 
	-Colin