Re: RFC - new raid superblock layout for md driver

From: Kenneth D. Merry (ken@kdm.org)
Date: Thu Nov 21 2002 - 19:08:23 EST


On Wed, Nov 20, 2002 at 10:03:26 +0000, Anton Altaparmakov wrote:
> Hi,
>
> On Wed, 20 Nov 2002, Neil Brown wrote:
> > I (and others) would like to define a new (version 1) format that
> > resolves the problems in the current (0.90.0) format.
> >
> > The code in 2.5.lastest has all the superblock handling factored out so
> > that defining a new format is very straight forward.
> >
> > I would like to propose a new layout, and to receive comment on it..
>
> If you are making a new layout anyway, I would suggest to actually add the
> complete information about each disk which is in the array into the raid
> superblock of each disk in the array. In that way if a disk blows up, you
> can just replace the disk use some to be written (?) utility to write the
> correct superblock to the new disk and add it to the array which then
> reconstructs the disk. Preferably all of this happens without ever
> rebooting given a hotplug ide/scsi controller. (-;
>
> >From a quick read of the layout it doesn't seem to be possible to do the
> above trivially (or certainly not without help of /etc/raidtab), but
> perhaps I missed something...
>
> Also, autoassembly would be greatly helped if the superblock contained the
> uuid for each of the devices contained in the array. It is then trivial to
> unplug all raid devices and move them around on the controller and it
> would still just work. Again I may be missing something and that is
> already possible to do trivially.

This is a good idea. Having all of the devices listed in the metadata on
each disk is very helpful. (See below for why.)

Here are some of my ideas about the features you'll want out of a new type
of metadata:

[ these you've already got ]

 - each array has a unique identifier (you've got this already)
 - each disk/partition/component has a unique identifier (you've got this
   already)
 - a monotonically increasing serial number that gets incremented every
   time you write out the metadata (you've got this, the 'events' field)

[ these are features I think would be good to have ]

 - Per-array state that lets you know whether you're doing a resync,
   reconstruction, verify, verify and fix, and so on. This is part of the
   state you'll need to do checkpointing -- picking up where you left off
   after a reboot during the middle of an operation.

 - Per-array block number that tells you how far along you are in a verify,
   resync, reconstruction, etc. If you reboot, you can, for example, pick
   a verify back up where you left off.

 - Enough per-disk state so you can determine, if you're doing a resync or
   reconstruction, which disk is the target of the operation. When I was
   doing a lot of work on md a while back, one of the things I ran into is
   that when you do a resync of a RAID-1, it always resyncs from the first
   to the second disk, even if the first disk is the one out of sync. (I
   changed this, with Adaptec metadata at least, so it would resync onto
   the correct disk.)

 - Each component knows about every other component in the array. (It
   knows by UUID, not just that there are N other devices in the array.)
   This is an important piece of information:
        - You can compose the array now, using each disk's set_uuid and the
          position of the device in the array, and by using the events
          field to filter out the older of two disks that claim the same
          position.

          The problem comes in more complicated scenarios. For example:
                - user pulls one disk out of a RAID-1 with a spare
                - md reconstructs onto the spare
                - user shuts down machine, pulls the (former) spare that is
                  now part of the machine, and replaces the disk that he
                  originally pulled.

          So now you've got a scenario where you have a disk that claims to
          be part of the array (same set_uuid), but its events field is a
          little behind. You could just resync the disk since it is out of
          date, but still claims to be part of the array. But you'd be
          back in the same position if the user pulls the disk again and
          puts the former spare back in -- you'd have to resync again.

          If each disk had a list of the uuids of every disk in the array,
          you could tell from the disk table on the "freshest" disk that
          the disk the user stuck back in isn't part of the array, despite
          the fact that it claims to be. (It was at one point, and then
          was removed.) You can then make the user add it back explicitly,
          instead of just resyncing onto it.

 - Possibly the ability to setup multilevel arrays within a given piece of
   metadata. As far as multilevel arrays go, there are two basic
   approaches to the metadata:
        - Integrated metadata defines all levels of the array in a single
          chunk of metadata. So, for example, by reading metadata off of
          sdb, you can figure out that it is a component of a RAID-1 array,
          and that that RAID-1 array is a component of a RAID-10.

          There are a couple of advantages to integrated metadata:
                - You can keep state that applies to the whole array
                  (clean/dirty, for example) in one place.
                - It helps in autoconfiguring an array, since you don't
                  have to go through multiple steps to find out all the
                  levels of an array. You just read the metadata from one
                  place on one disk, and you've got

          There are a couple of disadvantages to integrated metadata:
                - Possibly reduced/limited space for defining multiple
                  array levels or arrays with lots of disks. This is not a
                  problem, though, given sufficient metadata space.

                - Marginally more difficulty handling metadata updates,
                  depending on how you handle your multilevel arrays. If
                  you handle them like md currently does (separate block
                  devices for each level and component of the array), it'll
                  be pretty difficult to use integrated metadata.

        - Recursive metadata defines each level of the array separately.
          So, for example, you'd read the metadata from the end of a disk
          and determine it is part of a RAID-1 array. Then, you configure
          the RAID-1 array, and read the metadata from the end of that
          array, and determine it is part of a RAID-0 array. So then you
          configure the RAID-0 array, look at the end, fail to find
          metadata, and figure out that you've reached the top level of the
          array.

          This is almost how md currently does things, except that it
          really has no mechanism for autoconfiguring multilevel arrays.

          There are a couple of advantages to recursive metadata:
                - It is easier to handle metadata updates for multilevel
                  arrays, especially if the various levels of the array are
                  handled by different block devices, as md does.

                - You've potentially got more space for defining disks as
                  part of the array, since you're only defining one level
                  at a time.

          There are a couple of disadvantages to recursive metadata:
                - You have to have multiple copies of any state that
                  applies to the whole array (e.g. clean/dirty).

                - More windows of opportunity for incomplete metadata
                  writes. Since metadata is in multiple places, there are
                  more opportunities for scenarios where you'll have
                  metadata for one part of the array written out, but not
                  another part before you crash or a disk crashes...etc.

I know Neil has philosophical issues with autoconfiguration (or perhaps
in-kernel autoconfiguration), but it really is helpful, especially in
certain situations.

As for recursive versus integrated metadata, it would be nice if md could
handle autoconfiguration with either type of multilevel array. The reason
I say this is that Adaptec HostRAID adapters use integrated metadata.
So if you want to support multilevel arrays with md on HostRAID adapters,
you have to have support for multilevel arrays with integrated metadata.

When I did the first port of md to work on HostRAID, I pretty much had to
skip doing RAID-10 support because it wasn't structurally feasible to
autodetect and configure a multilevel array. (I ended up doing a full
rewrite of md that I was partially done with when I got laid off from
Adaptec.)

Anyway, if you want to see the Adaptec HostRAID support, which includes
metadata definitions:

http://people.freebsd.org/~ken/linux/md.html

The patches are against 2.4.18, but you should be able to get an idea of
what I'm talking about as far as integrated metadata goes.

This is all IMO, maybe it'll be helpful, maybe not, but hopefully it'll be
useful to consider these ideas.

Ken

-- 
Kenneth Merry
ken@kdm.org
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Nov 23 2002 - 22:00:38 EST