Re: "Enhanced" MD code avaible for review

From: Scott Long
Date: Wed Mar 17 2004 - 19:26:34 EST


Bartlomiej Zolnierkiewicz wrote:
> On Wednesday 17 of March 2004 22:18, Scott Long wrote:
> > Jeff Garzik wrote:
> > > Justin T. Gibbs wrote:
> > > > [ I tried sending this last night from my Adaptec email address and
> > > > have yet to see it on the list. Sorry if this is a dup for any of you. ]
> > >
> > > Included linux-kernel in the CC (and also bounced this post there).
> > >
> > > > For the past few months, Adaptec Inc. has been working to enhance MD.
> > >
> > > The FAQ from several corners is going to be "why not DM?", so I would
> > > humbly request that you (or Scott Long) re-post some of that rationale
> > > here...
>
> This is the #1 question, so... why not DM? 8)
>
> Regards,
> Bartlomiej



The primary feature of any RAID implementation is reliability, and
reliability is a surprisingly hard goal. Making sure that your data
stays available and trustworthy under real-world failure scenarios is
a lot harder than it sounds. This has been a significant focus of ours
with MD, and it is the primary reason we chose MD as the foundation of
our work.

Storage is the foundation of everything that you do with your
computer. It needs to work regardless of what happened to your
filesystem on the last crash, regardless of whether or not you have
the latest initrd tools, regardless of which RPMs you've kept up to
date, regardless of whether your userland works, regardless of what
libc you are using this week, and so on.

With DM, what happens when your initrd gets accidentally corrupted?
What happens when the kernel and userland pieces get out of sync?
Maybe you are booting off of a single drive and only using DM arrays
for secondary storage, but maybe you're not. If something goes wrong
with DM, how do you boot?

Second, our target here is to interoperate with hardware components
that run outside the scope of Linux. The HostRAID or DDF BIOS is
going to create an array using its own on-disk format. It has no
knowledge of DM config files, initrd, ramfs, etc. However, the end
user is still going to expect to be able to seamlessly install onto
that newly created array, maybe move that array to another system,
whatever, and have it all Just Work. Has anyone heard of a hardware
RAID card that requires you to run OS-specific commands in order to
access the arrays on it? Of course not. The point here is to make
software RAID just as easy for the end user.
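
To make that concrete, here is a rough userland-style sketch of what
"the array is described by the disks themselves" means in practice.
This is not Adaptec's actual code; the struct layout, the offset, and
the signature value below are placeholders rather than the real DDF
on-disk format. The point is only that the metadata lives at a
well-known location on each member disk, so the BIOS and the OS can
both find it by simply reading the disk, with no config files or
userland state involved.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE        512
#define FAKE_ANCHOR_MAGIC  0x11223344u  /* placeholder; the real signature comes from the DDF spec */

struct fake_metadata_anchor {           /* placeholder layout, not the real DDF header */
    uint32_t magic;
    uint32_t crc;
    uint8_t  array_guid[24];
};

/* Return 1 if the last sector of the device looks like BIOS-written RAID metadata. */
static int disk_has_bios_raid_metadata(const char *dev)
{
    unsigned char sector[SECTOR_SIZE];
    struct fake_metadata_anchor anchor;
    off_t end;
    int fd, found = 0;

    fd = open(dev, O_RDONLY);
    if (fd < 0)
        return 0;

    end = lseek(fd, 0, SEEK_END);       /* size of the block device in bytes */
    if (end >= SECTOR_SIZE &&
        pread(fd, sector, SECTOR_SIZE, end - SECTOR_SIZE) == SECTOR_SIZE) {
        memcpy(&anchor, sector, sizeof(anchor));
        found = (anchor.magic == FAKE_ANCHOR_MAGIC);
    }

    close(fd);
    return found;
}

In the MD approach described here, the equivalent probe lives in the
kernel (the DDF/ASR metadata support mentioned below), so the array
can be assembled at boot directly from what the BIOS wrote, whether or
not any userland tools are present yet.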

The third, and arguably most important, issue is the need for reliable
error recovery. In the DM model, error recovery is done in userland:
errors generated during I/O are kicked up to a userland app, which
then drives the recovery, spare-activation, and rebuild sequence.
That's fine, but what if something prevents the userland tool from
running? Maybe it was a daemon that went idle and got swapped out to
disk, and now you can't swap it back in because your I/O is failing.
Or maybe it needs to activate a helper module or read a config file,
but again it can't, because I/O is failing. What if it crashes? What
if its source code gets out of sync with the kernel interface? What if
you upgrade glibc and it stops working for some unknown reason?
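
For illustration only, here is a very rough sketch of the shape of the
in-kernel recovery path being argued for. None of these names are MD's
real data structures or functions; the point is simply that marking
the failed member, promoting a spare, and kicking off a rebuild can
all run from code and data that are already resident in kernel memory,
so the path keeps working even while I/O to the failing disk is
erroring out.

#define MAX_MEMBERS 8

struct member {
    int faulty;   /* member has failed */
    int spare;    /* member is a hot spare */
    int in_sync;  /* member holds up-to-date data */
};

struct array {
    struct member disks[MAX_MEMBERS];
    int nr_disks;
    int degraded;
};

/* Hypothetical stub: a real driver would wake a kernel rebuild thread here. */
static void start_rebuild(struct array *a, int target)
{
    (void)a;
    (void)target;
}

/* Called from the I/O completion path when a request to member 'failed' errors out. */
static void handle_member_error(struct array *a, int failed)
{
    int i;

    a->disks[failed].faulty = 1;
    a->degraded = 1;

    /* Promote the first healthy hot spare and start rebuilding onto it. */
    for (i = 0; i < a->nr_disks; i++) {
        if (a->disks[i].spare && !a->disks[i].faulty) {
            a->disks[i].spare = 0;
            a->disks[i].in_sync = 0;
            start_rebuild(a, i);
            break;
        }
    }
}

Contrast that with the userland model above, where each of those steps
depends on a process that may need to be paged in, exec a helper, or
read a file, all of which may themselves require the very I/O that is
failing.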

Some have suggested in the past that these userland tools be put into
a ramfs and locked into memory. If you do that, then they might as
well be part of the kernel anyway. They consume as much memory as, if
not more than, the equivalent code in the kernel (likely a lot more,
since you'd have to statically link them). And you still have the
downside of them possibly getting out of date with the kernel. So what
are the upsides?
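
As a concrete illustration of that "lock it into memory" workaround (a
minimal sketch, not any existing md/dm tool), a userland recovery
daemon would have to pin every page it might ever touch, because
paging anything back in from a failing array is exactly what it cannot
count on:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    /* Lock all current and future mappings (text, data, stack, heap,
     * shared libraries) so nothing gets swapped out. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return EXIT_FAILURE;
    }

    /* ... event loop waiting for array error notifications would go here ... */

    return EXIT_SUCCESS;
}

Every page locked this way, plus whatever the daemon is statically
linked against, stays resident for its whole lifetime, which is the
point above: the memory is spent either way.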

MD is not terribly heavyweight. As a monolithic module containing
DDF+ASR+R0+R1 support, it's about 65 KB in size. That's half the size
of your average SCSI driver these days, and no one is advocating
putting those into userland. It just doesn't make sense to sacrifice
reliability for the phantom goal of 'reducing kernel bloat'.

Scott
