Re: clustered MD

From: Neil Brown
Date: Wed Jun 10 2015 - 18:50:59 EST


On Wed, 10 Jun 2015 16:07:44 -0500
David Teigland <teigland@xxxxxxxxxx> wrote:

> On Thu, Jun 11, 2015 at 06:31:31AM +1000, Neil Brown wrote:
> > What is your interest in this? I'm always happy for open discussion and
> > varied input, but it would help to know to what extent you are a
> > stakeholder?
>
> Using the dlm correctly is non-trivial and should be reviewed.
> If the dlm is misused, some part of that may fall in my lap, if
> only in so far as having to debug problems to distinguish between dlm
> bugs and md-cluster bugs. This has been learned the hard way.
>
> I have yet to find time to look up the previous review discussion.
> I will be more than happy if I find the dlm usage has already been
> thoroughly reviewed.

The DLM usage is the part that I am least comfortable with and I would
certainly welcome review. There was a recent discussion of some issue that I
haven't had a chance to go over yet, but apart from that it has mostly just
been a few developers trying to figure out what we need and how that can be
implemented.

There are (as I recall) two main aspects of the DLM usage.
One is fairly idiomatic locking of the multiple write-intent bitmaps.
Each bitmap can be "active" or "idle". When "idle" all bits are clear.
When "active", one node will usually have an exclusive lock. If/when that node
dies, all other nodes must find out and at least one takes remedial action.
Once the remedial action is taken the bitmap becomes idle. In that state
a new node can claim it. When that happens all other nodes must find out so
they transition to the "watching an active bitmap" state.
This seems to fit well with the shared/exclusive reclaimable locks of DLM.
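
As a rough illustration of the life cycle described above, here is a toy
model in plain Python. It is only a sketch under stated assumptions: it
makes no DLM calls and is not the md-cluster code, and the names Bitmap,
BitmapState, claim, node_died and recover are all made up for this example.

```python
from enum import Enum


class BitmapState(Enum):
    IDLE = "idle"      # all bits clear, nobody owns the bitmap
    ACTIVE = "active"  # one node normally holds it exclusively


class Bitmap:
    """Toy model of one write-intent bitmap shared by a cluster (hypothetical)."""

    def __init__(self, nodes):
        self.nodes = set(nodes)      # live nodes in the cluster
        self.state = BitmapState.IDLE
        self.owner = None            # stands in for the exclusive lock holder
        self.watchers = set()        # nodes watching an active bitmap
        self.needs_recovery = False  # set when the owner dies

    def claim(self, node):
        """A live node claims an idle bitmap; all other nodes start watching."""
        if node not in self.nodes or self.state is not BitmapState.IDLE:
            return False
        self.state = BitmapState.ACTIVE
        self.owner = node
        self.watchers = self.nodes - {node}
        return True

    def node_died(self, node):
        """Drop a node; if it was the owner, the bitmap needs remedial action."""
        self.nodes.discard(node)
        self.watchers.discard(node)
        if node == self.owner:
            self.owner = None
            self.needs_recovery = True

    def recover(self, node):
        """One surviving watcher takes remedial action; the bitmap goes idle."""
        if not self.needs_recovery or node not in self.watchers:
            return False
        self.needs_recovery = False
        self.state = BitmapState.IDLE
        self.watchers = set()
        return True
```

A typical sequence would be claim("a"), node_died("a"), recover("b") and
then claim("c"), after which "b" is the only watcher. How the surviving
nodes actually learn of the death is left out of the model entirely.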

The other usage is to provide synchronous broadcast message passing between
nodes. When one node makes a configuration change it needs to tell all other
nodes and wait for them to acknowledge before the change (such as adding a
spare) is committed. There is a small collection of locks which represent
different states in a broadcast/acknowledge protocol.
This is the part I'm least confident of, but it seems to make sense and seems
to work.
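
The intent of the broadcast/acknowledge step can be sketched the same way.
Again this is a toy model in plain Python and not the real lock-based
protocol: Node, receive and broadcast_and_commit are hypothetical names, and
refusing to commit when an acknowledgement is missing is an assumption of
the sketch, not something the current code is claimed to do.

```python
class Node:
    """Toy cluster node (hypothetical; no real locks are involved)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.seen = []       # messages this node has acknowledged
        self.committed = []  # changes this node has committed as sender

    def receive(self, message):
        """Record the message; the return value stands in for the ack."""
        if not self.alive:
            return False
        self.seen.append(message)
        return True


def broadcast_and_commit(sender, peers, message):
    """Tell every peer, wait for all acks, and only then commit the change."""
    acks = [peer.receive(message) for peer in peers]
    if not all(acks):
        # Assumption of this sketch: a missing ack means nothing is committed.
        return False
    sender.committed.append(message)
    return True
```

With nodes a, b and c all alive, broadcast_and_commit(a, [b, c], "add-spare")
commits on a only after both b and c have recorded the message.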


Separately:

> Reading those messages again I see what you mean, they don't sound very
> nice, so sorry about that. I'll repeat the one positive note, which is
> that the brief things I've noticed make it look much better than the dm
> approach from several years ago.

Thanks :-)
In part this effort is a response to "clvm" - which is a completely adequate
solution for clustering when you just need volume management (growing and
shrinking and striping volumes) but doesn't extend very well to RAID.

I look forward to any review comments you find time for :-)

Thanks,
NeilBrown