Re: RFC - new raid superblock layout for md driver

From: Steven Dake (sdake@mvista.com)
Date: Wed Nov 20 2002 - 19:35:59 EST


Lars Marowsky-Bree wrote:

>>The md driver in linux uses a 'superblock' written to all devices in a
>>RAID to record the current state and geometry of a RAID and to allow
>>the various parts to be re-assembled reliably.
>>
>>The current superblock layout is sub-optimal. It contains a lot of
>>redundancy and wastes space. In 4K it can only fit 27 component
>>devices. It has other limitations.
>>
>>
>
>Yes. (In particular, getting all the various counters to agree with each other
>is a pain ;-)
>
>Steven raises the valid point that multihost operation isn't currently
>possible; I just don't agree with his solution:
>
>- Activating a drive only on one host is already entirely possible.
> (can be done by device uuid in initrd for example)
>
>
This technique doesn't work if autostart is set (the partition type is
tagged as a RAID volume) or if the user is stupid and starts the wrong
uuid by accident. It also requires the user to keep track of which
uuids are used by which hosts, which is a pain. Trust me, users will
start the wrong RAID volume and have a hard time keeping track of the
right UUIDs to asssemble. The technique I use ensures that the RAID
volumes can all be set to autostart and only the correct volumes will be
started on the correct host.

>- Activating a RAID device from multiple hosts is still not possible.
> (Requires way more sophisticated locking support than we currently have)
>
>
The only application where having a RAID volume shareable between two
hosts is useful is for a clustering filesystem (GFS comes to mind).
 Since RAID is an important need for GFS (if a disk node fails, you
don't want ot loose the entire filesystem as you would on GFS) this
possibility may be worth exploring.

Since GFS isn't GPL at this point and OpenGFS needs alot of work, I've
not spent the time looking at it.

Neil have you thought about sharing an active volume between two hosts
and what sort of support would be needed in the superblock?

Thanks
-steve

>However, for none-RAID devices like multipathing I believe that activating a
>drive on multiple hosts should be possible; ie, for these it might not be
>necessary to scribble to the superblock every time.
>
>(The md patch for 2.4 I sent you already does that; it reconstructs the
>available paths fully dynamic on startup (by activating all paths present);
>however it still updates the superblock afterwards)
>
>Anyway, on to the format:
>
>
>
>>The code in 2.5.lastest has all the superblock handling factored out so
>>that defining a new format is very straight forward.
>>
>>I would like to propose a new layout, and to receive comment on it..
>>
>>My current design looks like:
>> /* constant array information - 128 bytes */
>> u32 md_magic
>> u32 major_version == 1
>> u32 feature_map /* bit map of extra features in superblock */
>> u32 set_uuid[4]
>> u32 ctime
>> u32 level
>> u32 layout
>> u64 size /* size of component devices, if they are all
>> * required to be the same (Raid 1/5 */
>> u32 chunksize
>> u32 raid_disks
>> char name[32]
>> u32 pad1[10];
>>
>>
>
>Looks good so far.
>
>
>
>> /* constant this-device information - 64 bytes */
>> u64 address of superblock in device
>> u32 number of this device in array /* constant over reconfigurations
>> */
>> u32 device_uuid[4]
>>
>>
>
>What is "address of superblock in device" ? Seems redundant, otherwise you
>would have been unable to read it, or am missing something?
>
>Special case here might be required for multipathing. (ie, device_uuid == 0)
>
>
>
>> u32 pad3[9]
>>
>> /* array state information - 64 bytes */
>> u32 utime
>>
>>
>
>Timestamps (also above, ctime) are always difficult. Time might not be set
>correctly at any given time, in particular during early bootup. This field
>should only be advisory.
>
>
>
>> u32 state /* clean, resync-in-progress */
>> u32 sb_csum
>> u64 events
>> u64 resync-position /* flag in state if this is valid)
>> u32 number of devices
>> u32 pad2[8]
>>
>> /* device state information, indexed by 'number of device in array'
>> 4 bytes per device */
>> for each device:
>> u16 position /* in raid array or 0xffff for a spare. */
>> u16 state flags /* error detected, in-sync */
>>
>>
>
>u16 != u32; your position flags don't match up. I'd like to be able to take
>the "position in the superblock" as a mapping here so it can be found in this
>list, or what is the proposed relationship between the two?
>
>
>
>>The interpretation of the 'name' field would be up to the user-space
>>tools and the system administrator.
>>I imagine having something like:
>> host:name
>>where if "host" isn't the current host name, auto-assembly is not
>>tried, and if "host" is the current host name then:
>>
>>
>
>Oh, well. You seem to sort of have Steven's idea here too ;-) In that case,
>I'd go with the idea of Steven. Make that field a uuid of the host.
>
>
>
>Sincerely,
> Lars Marowsky-Brée <lmb@suse.de>
>
>
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Nov 23 2002 - 22:00:34 EST