Re: Soft metadata updates paper w/code

Ingo Molnar (mingo@pc7537.hil.siemens.at)
Fri, 25 Jul 1997 11:44:45 +0200 (MET DST)


On 24 Jul 1997, Matthias Urlichs wrote:

[...]
> An additional advantage with this scheme is that you may be able to correct
> some problems on-the-fly if a write fails. For example, let B be an
> indirect block and let A be the inode pointing to B. If you can't write B,
> you look at the change record for A, see where B is mentioned, replace this
> with B' (being a newly allocated block), write the data from B there (just
> drop the buffer for B', if any, and change the block number in the buffer
> header of B to B'), mark B as kaput, and voila -- no (meta)data corruption
> at all when a disk gets flaky, which is _great_. [...]

you can get the same by putting a block mapping layer between the
filesystem and the device. This has to be done carefully to be fast, but
it doesn't look impossible. [i plan to implement this for the RAID layer]
This has the advantage that it's filesystem-independent. [the disadvantage
is that there is a mapping ... although the latency of this should vanish
in the noise, and should be zero for the normal case]
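
[a minimal sketch of what such a mapping layer could look like, only to
make the idea concrete. Every name here (remap_table, remap_lookup,
remap_bad_block, the fixed-size table and the spare-block range) is
hypothetical and not taken from any existing driver. The lookup returns
immediately while the table is empty, which is the "zero cost in the
normal case" property mentioned above.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical remap table; sizes and names are illustrative only. */
#define REMAP_MAX 64
#define NO_BLOCK  UINT32_MAX

struct remap_entry {
    uint32_t from;  /* block number the filesystem asked for */
    uint32_t to;    /* spare block that now holds the data */
};

struct remap_table {
    size_t count;         /* 0 in the normal case: nothing remapped */
    uint32_t next_spare;  /* next unused block in the spare area */
    uint32_t spare_end;   /* one past the last spare block */
    struct remap_entry entries[REMAP_MAX];
};

/* Translate a filesystem block number to the block actually used. */
static uint32_t remap_lookup(const struct remap_table *t, uint32_t block)
{
    if (t->count == 0)  /* fast path: identity mapping */
        return block;
    for (size_t i = 0; i < t->count; i++)
        if (t->entries[i].from == block)
            return t->entries[i].to;
    return block;
}

/*
 * Called after a write to `block` failed. Picks a spare block and
 * records the mapping; the caller then rewrites the data there.
 * Returns NO_BLOCK when no spare block or table slot is left.
 */
static uint32_t remap_bad_block(struct remap_table *t, uint32_t block)
{
    assert(t->count <= REMAP_MAX);
    if (t->next_spare >= t->spare_end)
        return NO_BLOCK;
    /* The block may already be remapped (its spare went bad too). */
    for (size_t i = 0; i < t->count; i++) {
        if (t->entries[i].from == block) {
            t->entries[i].to = t->next_spare++;
            return t->entries[i].to;
        }
    }
    if (t->count == REMAP_MAX)
        return NO_BLOCK;
    t->entries[t->count].from = block;
    t->entries[t->count].to = t->next_spare++;
    return t->entries[t->count++].to;
}
```

[a linear scan is only reasonable because the table is expected to stay
tiny; a real layer would also have to store the table on the device
itself, which this sketch leaves out.]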

and you could optionally 'merge' a mapping and a filesystem (offline), ie.
shuffle the filesystem on umount, to get 1:1 mapping again?

Another possible advantage of letting this be done on the device level is
that some blocks are redundant (RAID-1). If you remap on the filesystem
level, you lose all the copies of a block; if you remap per-device, you
lose only one copy (and the effect is invisible to the filesystem). [but
this stuff is in no way common enough to be optimized to death like this ;)]

> With a "disk block vs. memory block" scheme you'd probably need to store
> all the intermediate stages as well..?

the 'change' is a small structure. We have the 'latest and greatest' copy
and the 'synchronized' (on-disk) copy. So the intermediate stages are all
present, but we only need the 'latest' copy for direct C type memory
access, and the on-disk copy to do efficient DMA.

the path from the on-disk copy to the latest copy is represented by
those 'modification structures'. They are only applied when a write has
finished. [and are applied according to dependencies]
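
[a rough sketch of the two copies plus change records, under my own
assumptions: the names (meta_block, change, record_change,
prepare_writeout, write_finished), the 16-byte payload limit and the
single `dep` pointer per change are all made up for illustration. It also
assumes nothing is recorded between prepare_writeout() and
write_finished() for the same block.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 1024

/* Hypothetical change record: a small edit to one metadata block. */
struct change {
    size_t offset, len;
    unsigned char bytes[16];
    const struct change *dep;  /* must be on disk first, or NULL */
    int applied;               /* copied into disk_image */
    int on_disk;               /* the write carrying it has finished */
    struct change *next;
};

struct meta_block {
    unsigned char latest[BLOCK_SIZE];      /* what C code reads and writes */
    unsigned char disk_image[BLOCK_SIZE];  /* what gets DMA'd to disk */
    struct change *changes;                /* oldest first */
};

/* Apply a change to the latest copy right away and queue the record. */
static void record_change(struct meta_block *b, struct change *c)
{
    struct change **p = &b->changes;

    assert(c->len <= sizeof(c->bytes) && c->offset + c->len <= BLOCK_SIZE);
    memcpy(b->latest + c->offset, c->bytes, c->len);
    c->next = NULL;
    while (*p)
        p = &(*p)->next;
    *p = c;
}

/*
 * Copy into disk_image every queued change whose dependency is already
 * on disk. Stops at the first blocked change so that edits to the same
 * bytes are never applied out of order. Returns the number applied.
 */
static int prepare_writeout(struct meta_block *b)
{
    int n = 0;

    for (struct change *c = b->changes; c; c = c->next) {
        if (c->applied)
            continue;
        if (c->dep && !c->dep->on_disk)
            break;
        memcpy(b->disk_image + c->offset, c->bytes, c->len);
        c->applied = 1;
        n++;
    }
    return n;
}

/* Called when the write of disk_image has completed. */
static void write_finished(struct meta_block *b)
{
    for (struct change *c = b->changes; c; c = c->next)
        if (c->applied)
            c->on_disk = 1;
}
```

[so the intermediate stages never need to be stored as full block copies;
they exist only as the list of pending change records.]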

> The problems with all of this really lie with the disk interface. You need
> to be absolutely 100% sure there are no reordered writes once the kernel
> tells the disk controller to write a block. The kernel cannot queue a block
> with pending changes from an interrupt, and the kernel cannot let the file
> system code change a block while it happens to be queued. The latter
> problem might cause some performance degradation.

write reordering is not a problem. I think we only apply changes and start
writeouts when they are safe, according to the on-disk copy. (ie. driven
by the 'IO finished' interrupt)
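
[to illustrate why reordering inside the device should not matter, here
is a small sketch, again with invented names (wblock, add_dependency,
try_submit, io_finished) and a plain flag standing in for a real request
queue. A block is only submitted once everything it depends on has been
reported complete, so two writes with an ordering constraint between them
are never in flight at the same time.]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPENDENTS 4

/* Hypothetical per-block write state. */
struct wblock {
    int unmet;       /* prerequisite blocks not yet on disk */
    int submitted;   /* handed to the (simulated) device */
    int on_disk;     /* completion has been reported */
    size_t ndependents;
    struct wblock *dependents[MAX_DEPENDENTS];
};

/* Stand-in for queueing a real write request. */
static void submit_write(struct wblock *b)
{
    b->submitted = 1;
}

/* Declare that `later` must not reach the disk before `earlier`. */
static void add_dependency(struct wblock *later, struct wblock *earlier)
{
    assert(earlier->ndependents < MAX_DEPENDENTS);
    earlier->dependents[earlier->ndependents++] = later;
    later->unmet++;
}

/* Start the write only if every prerequisite is already on disk. */
static void try_submit(struct wblock *b)
{
    if (!b->submitted && b->unmet == 0)
        submit_write(b);
}

/* What the 'IO finished' interrupt would do for block `b`. */
static void io_finished(struct wblock *b)
{
    assert(b->submitted && !b->on_disk);
    b->on_disk = 1;
    for (size_t i = 0; i < b->ndependents; i++) {
        struct wblock *d = b->dependents[i];

        d->unmet--;
        try_submit(d);  /* now possibly safe to write */
    }
}
```

[the device stays free to reorder whatever is in flight, because by
construction those writes are independent of each other.]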

> Frankly, I'd be more happy with a system that doesn't crash and doesn't
> lose power in the first place. ;-) Unfortunately, not everybody can
> depend on that.

well that's how Evolution works, things crash every now and then ;) We just
try to give it less chance to mess things up. And soft updates save the
system, independently of _what_ the cause of system interruption was,
power loss, or 2.1.46, or another cup of coffee in the keyboard or kids ;)

[it can only protect against interruption damage, other things are needed
to protect against other types of failures]

-- mingo