Re: [PATCH 00/16] DRBD: a block device for HA clusters

From: david
Date: Sun May 03 2009 - 11:25:12 EST


On Sun, 3 May 2009, James Bottomley wrote:

Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters

On Sun, 2009-05-03 at 07:36 -0700, david@xxxxxxx wrote:
On Sun, 3 May 2009, James Bottomley wrote:

Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters

On Sat, 2009-05-02 at 22:40 -0700, david@xxxxxxx wrote:
On Sun, 3 May 2009, Willy Tarreau wrote:

On Sat, May 02, 2009 at 09:33:35AM +0200, Bart Van Assche wrote:
On Fri, May 1, 2009 at 10:59 AM, Andrew Morton
<akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
On Thu, 30 Apr 2009 13:26:36 +0200 Philipp Reisner <philipp.reisner@xxxxxxxxxx> wrote:

This is a repost of DRBD

Is it being used anywhere for anything? If so, where and what?

One popular application is to run iSCSI and HA software on top of DRBD
in order to build a highly available iSCSI storage target.

Confirmed, I have several customers who're doing exactly that.

I will also say that there are a lot of us out here who would have a use
for DRBD in our HA setups, but have held off implementing it specifically
because it's not yet in the upstream kernel.

Actually, that's not a particularly strong reason because we already
have an in-kernel replicator that has much of the functionality of drbd
that you could use. The main reason for wanting drbd in kernel is that
it has a *current* user base.

Both the in kernel md/nbd and drbd do sync and async replication with
primary side bitmaps. The main differences are:

* md/nbd can do 1 to N replication,
* drbd can do active/active replication (useful for cluster
filesystems)
* The chunk size of the md/nbd is tunable
* With the updated nbd-tools, current md/nbd can do point in time
rollback on transaction logged secondaries (a BCS requirement)
* drbd manages the mirror state explicitly, md/nbd needs a user
space helper

And probably a few others I forget.
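
To make the "primary side bitmap" and tunable chunk size mentioned above concrete, here is a rough sketch of the general idea: the primary records which chunks were written while the secondary was unreachable, and a later resync copies only those chunks. This is not md/nbd or drbd code. The `DirtyBitmap` class and its method names are invented for illustration, and it uses one byte per chunk in memory purely for readability.

```python
class DirtyBitmap:
    """Hypothetical primary-side dirty map: one flag per chunk."""

    def __init__(self, device_size, chunk_size):
        if device_size <= 0 or chunk_size <= 0:
            raise ValueError("sizes must be positive")
        self.chunk_size = chunk_size
        # Round up so a partial final chunk still gets a flag.
        nchunks = (device_size + chunk_size - 1) // chunk_size
        self.flags = bytearray(nchunks)

    def mark_write(self, offset, length):
        """Flag every chunk touched by a write of `length` bytes at `offset`."""
        if length <= 0:
            raise ValueError("length must be positive")
        first = offset // self.chunk_size
        last = (offset + length - 1) // self.chunk_size
        for chunk in range(first, last + 1):
            self.flags[chunk] = 1

    def dirty_chunks(self):
        """Chunks that must be copied to the secondary on resync."""
        return [i for i, flag in enumerate(self.flags) if flag]

    def clear(self, chunk):
        """Call once the secondary has acknowledged this chunk."""
        self.flags[chunk] = 0


bitmap = DirtyBitmap(device_size=1024, chunk_size=256)
bitmap.mark_write(offset=200, length=100)  # straddles chunks 0 and 1
print(bitmap.dirty_chunks())
```

The chunk size is the trade-off knob: a larger chunk means a smaller map but more data recopied for each small write.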

one very big one:

DRBD has better support for dealing with split brain situations and
recovering from them.

I don't really think so. The decision about which (or if a) node should
be killed lies with the HA harness outside of the province of the
replication.

One could argue that the symmetric active mode of drbd allows both nodes
to continue rather than having the harness make a kill decision about
one. However, if they both alter the same data, you get an
irreconcilable data corruption fault which, one can argue, is directly
counter to HA principles and so allowing drbd continuation is arguably
the wrong thing to do.

but the issue is that at the time the failure is taking place, neither
side _knows_ that the other side is running. In fact, they both think that
the other side is dead.

Resolving this is the job of the HA harness, as I said ... the usual
solution being either third node pings or confirmable switchover.

and none of those solutions are failsafe in a distributed environment. in
a local environment you can have a race to see which system powers off the
other first, ensuring that at most one is running, but you can't do that
reliably remotely.

with DRBD, when the two sides start talking again they will discover that
they are different and complain, loudly, to the sysadmin that they need
help

The object of HA is to prevent data becoming toast, not to point it out
to the sysadmin after the fact.

it needs to do both

with md/nbd you have the situation where both sides will try to resync to
the other side as soon as the packets can get through. this can end up
corrupting both sides if it's not caught fast enough

Actually, that's just your implementation: md/nbd does nothing to
re-establish the replication, it has to be done by the HA harness after
split brain resolution. What a correct harness would do is to compare
the HA event log and the intent logs to see if there had been activity
to both sides after loss of contact and, if there had, to flag the data
corruption problem and not resume replication.
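
The check described above can be sketched roughly as follows. This is only an illustration of the decision logic, not code from any real HA harness: `NodeLog`, `wrote_after` and `resync_decision` are hypothetical names, and the sketch assumes each node's intent log can be reduced to a list of write timestamps comparable against a loss-of-contact time taken from the HA event log.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class NodeLog:
    """Hypothetical summary of one node's intent log."""
    name: str
    write_times: List[float] = field(default_factory=list)


def wrote_after(log: NodeLog, contact_lost_at: float) -> bool:
    """True if the node recorded any write after contact was lost."""
    return any(t > contact_lost_at for t in log.write_times)


def resync_decision(a: NodeLog, b: NodeLog,
                    contact_lost_at: float) -> Tuple[bool, Optional[str]]:
    """Return (resume_replication, sync_source).

    If both sides wrote after loss of contact, the data has diverged:
    refuse to resume and leave the conflict for the administrator.
    """
    a_wrote = wrote_after(a, contact_lost_at)
    b_wrote = wrote_after(b, contact_lost_at)
    if a_wrote and b_wrote:
        return (False, None)      # split brain: flag it, do not resync
    if a_wrote:
        return (True, a.name)     # only a changed: a is the source
    if b_wrote:
        return (True, b.name)     # only b changed: b is the source
    return (True, None)           # neither changed: nothing to copy


node_a = NodeLog("a", [10.0, 25.0])
node_b = NodeLog("b", [12.0, 30.0])
print(resync_decision(node_a, node_b, contact_lost_at=20.0))
```

The sketch presumes the two logs share a trustworthy notion of when contact was lost, which is exactly the part that is hard to guarantee between remote sites.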

This corruption situation isn't unique to replication ... any time you
may potentially have allowed both sides to write to a data store, you
get it, that's why it's the job of the HA harness to sort out whether a
split brain happened and what to do about it *first*.

but you can have packets sitting in the network buffers waiting to get to
the remote machine; once the connection is reestablished, those packets
will go out. no remounting is needed, just restored connectivity. (this
isn't as bad as the system trying to re-sync to the temporarily
unavailable drive by itself, but it can still corrupt things)

a cluster spread across different locations has problems to face that a cluster within easy cabling distance does not.

DRBD has been extensively tested and built to survive in the harsher
environment. md/nbd is a reasonable approximation for the simple
environment of two servers in one datacenter, but that doesn't mean that
it handles the rest of the possible conditions.

David Lang