Re: Status of ReiserFS + Journalling

From: Neil Brown (neilb@cse.unsw.edu.au)
Date: Fri Oct 06 2000 - 04:36:22 EST


On Friday October 6, news-innominate.list.linux.kernel@innominate.de wrote:
> Neil Brown wrote:
> > Suppose, for stripe X the parity device is device 1 and we were
> > updating the block on device 0 at the time of system failure.
> > What had happened was that the new parity block was written out, but
> > the new data block wasn't.
> > Suppose further that when the system comes back, device 2 has failed.
> > We now cannot recover the data that was on stripe X, device 2. If we
> > tried, we would xor all the blocks from working devices together and I
> > hope that you can see that this would be the wrong answer. This poor,
> > innocent, block, which hasn't been modified for years, has just been
> > corrupted. Not good for PR.
>
> Now that I'm getting better at thinking about this I can see that a very
> simple journal will protect from this particular problem. A phase-tree
> style approach would likely do the job more efficiently, once again.
> Here's the ultimate simple approach: why not treat an entire stripe as
> one block? That way you never get 'innocent blocks' on your stripe.

yyyeeeeessssssss..... bbbbuuuuuttttttt...

There was a detail (one of many probably) that I skipped in my brief
description of raid5.
Every block on the raid5 array has a two dimensional address
   (discnumber, blocknumber)
This needs to be mapped into the linear address space expected by a
filesystem (unless you have a clever filesystem that understands two
dimensional addressing and copes with holes where the parity blocks
are).
Two extremes of ways to do this are:

  abc- afk-
  de-f bg-o
  g-hi c-lp
  -jkl -hmq
  mno- din-
  pq-r ej-r

where letters are logical block addresses, hyphens are parity blocks,
columns are drives, and rows are physical block numbers.
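
To make the first picture concrete, here is a rough sketch (my own
illustration, not the actual md code) of that mapping: cluster size 1,
with the parity disc rotating one step per row exactly as drawn.

  /* Map a logical block number to (disc, physical block) for the
   * "cluster size 1" layout drawn above.  The rotation direction and
   * data ordering are just what the picture shows; real raid5
   * implementations offer several layout variants. */
  #include <stdio.h>

  struct addr { int disc; int block; };

  static struct addr map_cluster1(int logical, int ndiscs)
  {
      struct addr a;
      int row    = logical / (ndiscs - 1);         /* ndiscs-1 data blocks per row */
      int idx    = logical % (ndiscs - 1);
      int parity = (ndiscs - 1) - (row % ndiscs);  /* parity disc for this row */

      a.block = row;
      a.disc  = (idx < parity) ? idx : idx + 1;    /* step over the parity disc */
      return a;
  }

  int main(void)
  {
      /* Reprint the left-hand picture: 18 data blocks 'a'..'r', 4 discs. */
      char grid[6][4];
      for (int r = 0; r < 6; r++)
          for (int d = 0; d < 4; d++)
              grid[r][d] = '-';
      for (int n = 0; n < 18; n++) {
          struct addr a = map_cluster1(n, 4);
          grid[a.block][a.disc] = 'a' + n;
      }
      for (int r = 0; r < 6; r++)
          printf("%.4s\n", grid[r]);
      return 0;
  }

Running it prints the left-hand picture back out, one row per physical
block number.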

What is typically done is to define a cluster size, then address
down the drive for one cluster, and then step across to the next drive
for the next cluster, so with a cluster size of 3 the above array
would be

  adg-
  beh-
  cfi-
  jm-p
  kn-q
  lo-r

(notice that the parity blocks come in clusters like the data blocks).
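As a sketch of that clustered addressing (again my own illustration,
not the md code), the mapping generalises like this:

  /* Same rotation as before, but the parity disc now changes once per
   * cluster of rows rather than once per row.  With cluster == 1 this
   * reduces to the earlier function; with cluster == 3 and 4 discs it
   * reproduces the picture above. */
  struct addr { int disc; int block; };

  static struct addr map_clustered(int logical, int ndiscs, int cluster)
  {
      struct addr a;
      int per_stripe = cluster * (ndiscs - 1);   /* data blocks per cluster-stripe */
      int stripe     = logical / per_stripe;
      int within     = logical % per_stripe;
      int clust      = within / cluster;         /* which data cluster in this stripe */
      int offset     = within % cluster;         /* position within that cluster */
      int parity     = (ndiscs - 1) - (stripe % ndiscs);

      a.disc  = (clust < parity) ? clust : clust + 1;  /* step over the parity disc */
      a.block = stripe * cluster + offset;
      return a;
  }

Dropping this into the previous little test program (with a cluster
size of 3) prints the clustered picture instead.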
There is a trade off when choosing cluster size.
A cluster size of 1 (as in the very first picture above) means that
any sequential access will probably use all drives, and so you should
see an appropriate speed-up for reads, and you might be able to avoid
reading old data for writes (as when you write a whole stripe you
don't need to read old data to calculate parity).
This is good if you have just a single thread accessing the array.
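
The parity arithmetic behind that last point, with made-up byte values
(one byte per block, purely an illustration):

  #include <stdio.h>

  int main(void)
  {
      unsigned char d0 = 0x11, d1 = 0x22, d2 = 0x33;

      /* Whole-stripe write: parity comes straight from the new data,
       * no reads needed. */
      unsigned char parity = d0 ^ d1 ^ d2;

      /* Partial write of d0 only: read the old data and old parity,
       * then fold the change in. */
      unsigned char new_d0     = 0x99;
      unsigned char rmw_parity = parity ^ d0 ^ new_d0;

      printf("read-modify-write parity 0x%02x == full recompute 0x%02x\n",
             rmw_parity, (unsigned char)(new_d0 ^ d1 ^ d2));
      return 0;
  }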

A large cluster size (e.g. 64k) means that most accesses will use only
one drive (for read) or two drives (for write - data + parity). This
means that multiple threads that access the array concurrently will not
always be tripping over each other (sometimes, but not always); that
tripping over is called 'head contention'.

There is a formula that I have seen, but cannot remember, which links
typical IO size and typical number of concurrent threads to the ideal
cluster size.

Issues of drive geometry come into this too. If you are going to read
any of a track, you may as well read all of it. So having a cluster
size that was a multiple of the track size would be good (if only
drives had constant-sized tracks!!).

Back to your idea. Having each stripe be one filesystem block means
either having large filesystem blocks (meaning lots of wastage) or
having a cluster size of 1.

Unfortunately, with Linux Software Raid, the minimum cluster size is one
page, and the maximum filesystem block size is one page, so we cannot
try this out on linux to see how it actually works.

My understanding of the way WAFL works is that it uses RAID4, so there are
no parity holes to worry about (RAID4 has all the parity on the one
drive) and WAFL knows about the 2-D structure.
It tries to lay out whole files (or large clusters of each file) on to
one disk each, but hopes to have enough files that need writing at one time
that it can write them all, one onto each disc, and thus keep all the
discs busy while writing, but still have reduced head contention when
reading.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


