Re: [PATCH 00/16] DRBD: a block device for HA clusters

From: James Bottomley
Date: Tue May 05 2009 - 10:09:58 EST


On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > When you do asynchronous replication, how do you ensure that implicit
> > > write-after-write dependencies in the stream of writes you get from
> > > the file system above, are not violated on the secondary ?
> >
> > Are you telling me drbd doesn't currently do this?
> >
>
> No I am not. DRBD does exactly this!
> But I am wondering how that is achieved in the MD/NBD stack when running
> in async mode.

The explanation is below.

> The issue is covered since the early days in DRBD, (back in 2000).
> The issue, and the solution we have in DRBD is described in this paper:
>
> http://www.drbd.org/fileadmin/drbd/publications/drbd_paper_for_NLUUG_2001.pdf
>
> > The way nbd does it (in the updated tools) is to use DIRECT_IO and
> > fsync.
>
> Is that available in the existing tools ? -- Are the updated tools
> something that will be available in the future ?

It's in the existing tools.

> Are you telling me md/nbd (async) doesn't currently do this ?

I just described how it does this ... I don't quite see how that
translates into telling you it doesn't do this.

> > > There might be a disk scheduler on the secondary.
> >
> > There usually is a disk scheduler ... you just have to take the required
> > action to persuade it to preserve ordering ... a simplistic way of doing
> > this is to switch to the noop scheduler.
>
> The issue actually goes further down the stack. Not only might the
> in-kernel disk scheduler reorder something, the driver and finally the
> drive might do so as well.
>
> What we have in DRBD boils down to:
>
> * We obey all possible write after write dependencies in the stream of
> writes we get from the upper layers. And generate DRBD internal
> reorder barriers for the packet stream.
> * On the secondary node we impose these barriers onto the stream of writes
> submitted to the stack below us by either:
>
> - Let previously submitted write-IO drain before we submit write-IO after
> such a DRBD barrier. (We have had that since 2000 or so.)
>
> - Additionally issue a blkdev_issue_flush()
>
> - Use write requests with BIO_RW_BARRIER. This method has two advantages:
> We can continue to submit writes after the DRBD internal barrier
> immediately, and the number of requests with BIO_RW_BARRIER can be
> further reduced.
> See section 6 of
> http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
> for more details, and nice illustrations.

There's a slight error in there ... we don't use ordered tags for
barriers (yet). I don't think it will really matter because the main
domain of ordering problems is the scheduler, which REQ_BARRIER does
cope with; it just means the queue drains for a barrier.
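
To make the drain variant of the barrier scheme quoted above concrete,
here is a small Python model. It is only a sketch, not DRBD code: the
names Primary and apply_on_secondary are made up for illustration, and
it assumes a new epoch (barrier) starts whenever a write is submitted
after some earlier write has already completed, which is a conservative
reading of "write-after-write dependency". The reorder callback stands
in for whatever the scheduler, driver or drive might do within an epoch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Primary:
    """Hypothetical model: turns a write stream into writes plus barriers."""
    epoch: int = 0
    saw_completion: bool = False
    stream: List[Tuple[str, int]] = field(default_factory=list)

    def submit(self, block: int) -> None:
        # Assumption: a write issued after an earlier write completed may
        # depend on it, so it must not overtake it on the secondary.
        if self.saw_completion:
            self.epoch += 1
            self.saw_completion = False
            self.stream.append(("barrier", self.epoch))
        self.stream.append(("write", block))

    def complete(self) -> None:
        # Called when a write completion is reported to the upper layer.
        self.saw_completion = True


def apply_on_secondary(
    stream: List[Tuple[str, int]],
    reorder: Callable[[List[int]], List[int]],
) -> List[int]:
    """Return the order in which blocks reach the disk.

    Writes between two barriers may be permuted by `reorder`; a barrier
    drains everything submitted so far before later writes are submitted.
    """
    on_disk: List[int] = []
    pending: List[int] = []
    for kind, value in stream:
        if kind == "barrier":
            on_disk.extend(reorder(pending))  # drain the previous epoch
            pending = []
        else:
            pending.append(value)
    on_disk.extend(reorder(pending))  # drain the last epoch
    return on_disk


primary = Primary()
primary.submit(1)
primary.submit(2)      # 1 and 2 are in flight together: free to reorder
primary.complete()     # one of them completed ...
primary.submit(3)      # ... so 3 may depend on it: new epoch
primary.submit(4)

# Worst case for this model: the lower layers reverse every epoch.
order = apply_on_secondary(primary.stream, lambda w: list(reversed(w)))
print(order)  # [2, 1, 4, 3]
```

Even with each epoch reversed, blocks 3 and 4 never reach the disk
before 1 and 2, which is the property the barriers are there to keep.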

> Unfortunately only high end SAN devices seem to benefit from this
> method. For most in-machine-disk controllers this method does not
> achieve the highest throughput.
>
> Expressed in other words:
> We allow reordering on the secondary node to an extent so that we can
> guarantee that no implicit write-after-write dependencies are violated.
>
> Coming back to the idea of disabling the Linux IO scheduler. It might
> solve the issue for some devices, but it is not guaranteed to solve it.

I think you'll find the dio/fsync method above actually does solve all
of these issues (mainly because it enforces the semantics from top to
bottom in the stack). I agree one could use more elaborate semantics
like you do for drbd, but since the simple ones worked efficiently for
md/nbd, there didn't seem to be much point.
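
As a rough illustration of the fsync half of that method, here is a
Python sketch that applies writes strictly one after another and waits
for each to be stable before starting the next. It is not taken from
the nbd tools: the function name write_in_order and the 4096-byte block
size are assumptions, and O_DIRECT is deliberately left out because it
needs aligned buffers and file system support.

```python
import os
import tempfile


def write_in_order(path, blocks, block_size=4096):
    """Apply (block_number, data) pairs in the given order.

    Each write is followed by fsync(), so write N+1 is not even
    submitted until write N has been reported stable.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for number, data in blocks:
            if len(data) != block_size:
                raise ValueError("each block must be exactly block_size bytes")
            written = os.pwrite(fd, data, number * block_size)
            if written != block_size:
                raise OSError("short write")
            os.fsync(fd)  # barrier: flush before the next write is issued
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "disk.img")
    # Block 1 is written (and made stable) before block 0.
    write_in_order(image, [(1, b"b" * 4096), (0, b"a" * 4096)])
    print(os.path.getsize(image))
```

This trades throughput for simplicity: nothing below the caller gets a
chance to reorder two writes, because only one is ever outstanding.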

James

