Re: [PATCH 02/12] DRBD: activity_log

From: Andi Kleen
Date: Tue Mar 24 2009 - 08:28:28 EST

Next message: Alan Cox: "Re: tg3 driver fails to load firmware (2.6.29)"
Previous message: Andi Kleen: "Re: Linux 2.6.29"
Next in thread: Philipp Reisner: "Re: [PATCH 02/12] DRBD: activity_log"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Philipp Reisner <philipp.reisner@xxxxxxxxxx> writes:
> +
> +/* I do not believe that all storage medias can guarantee atomic
> + * 512 byte write operations. When the journal is read, only
> + * transactions with correct xor_sums are considered.
> + * sizeof() = 512 byte */
> +struct __attribute__((packed)) al_transaction {
> + u32 magic;
> + u32 tr_number;
> + /* u32 tr_generation; TODO */

It would be difficult to "TODO" this because adding that field here would
break the complete disk format, wouldn't it?

> + struct __attribute__((packed)) {
> + u32 pos;
> + u32 extent; } updates[1 + AL_EXTENTS_PT];
> + u32 xor_sum;
> +};
> + ok = bio_flagged(bio, BIO_UPTODATE) && md_io.error == 0;
> +
> + /* check for unsupported barrier op.
> + * would rather check on EOPNOTSUPP, but that is not reliable.
> + * don't try again for ANY return value != 0 */
> + if (unlikely(bio_barrier(bio) && !ok)) {

That's a good example for some code that shouldn't be in upstream. If
EOPNOTSUPP for barriers is really not reliable somewhere please just
fix that somewhere (if it's still true and not some ancient bug), not
add workarounds like this.

> +int drbd_md_sync_page_io(struct drbd_conf *mdev, struct drbd_backing_dev *bdev,
> + sector_t sector, int rw)
> +{
> + int hardsect, mask, ok;
> + int offset = 0;
> + struct page *iop = mdev->md_io_page;
> +
> + D_ASSERT(mutex_is_locked(&mdev->md_io_mutex));
> +
> + if (!bdev->md_bdev) {
> + if (DRBD_ratelimit(5*HZ, 5)) {

The kernel has standard functions for this, no need for own macros.

> + ERR("bdev->md_bdev==NULL\n");
> + dump_stack();
> + }

And a rate limited dump_stack seems weird anyways.

> + return 0;
> + }
> +
> + hardsect = drbd_get_hardsect(bdev->md_bdev);
> + if (hardsect == 0)
> + hardsect = MD_HARDSECT;
> +
> + /* in case hardsect != 512 [ s390 only? ] */
> + if (hardsect != MD_HARDSECT) {
> + if (!mdev->md_io_tmpp) {
> + struct page *page = alloc_page(GFP_NOIO);

At least the conventional wisdom is still that block devices should
use mempools, not alloc_page even with NOIO, otherwise they might
not write out in all lowmem situations. There's been some VM work
to address this, but so far nobody was sure that it is sufficient.

> + if (!page)
> + return 0;

So you get a IO error or what happens here on out of memory?

> +
> + drbd_WARN("Meta data's bdev hardsect = %d != %d\n",
> + hardsect, MD_HARDSECT);
> + drbd_WARN("Workaround engaged (has performace impact).\n");
> +
> + mdev->md_io_tmpp = page;
> + }
> +
> + mask = (hardsect / MD_HARDSECT) - 1;
> + D_ASSERT(mask == 1 || mask == 3 || mask == 7);
> + D_ASSERT(hardsect == (mask+1) * MD_HARDSECT);
> + offset = sector & mask;
> + sector = sector & ~mask;
> + iop = mdev->md_io_tmpp;
> +
> + if (rw == WRITE) {
> + void *p = page_address(mdev->md_io_page);
> + void *hp = page_address(mdev->md_io_tmpp);

What happens when the page is in highmem?

> + al_work.old_enr = al_ext->lc_number;
> + al_work.w.cb = w_al_write_transaction;
> + drbd_queue_work_front(&mdev->data.work, &al_work.w);
> + wait_for_completion(&al_work.event);
> +
> + mdev->al_writ_cnt++;
> +
> + spin_lock_irq(&mdev->al_lock);
> + lc_changed(mdev->act_log, al_ext);
> + spin_unlock_irq(&mdev->al_lock);
> + wake_up(&mdev->al_wait);

The wake_up outside the lock looks a little dangerous.

> +int
> +w_al_write_transaction(struct drbd_conf *mdev, struct drbd_work *w, int unused)
> +{
> + struct update_al_work *aw = (struct update_al_work *)w;
> + struct lc_element *updated = aw->al_ext;
> + const unsigned int new_enr = aw->enr;
> + const unsigned int evicted = aw->old_enr;
> +
> + struct al_transaction *buffer;
> + sector_t sector;
> + int i, n, mx;
> + unsigned int extent_nr;
> + u32 xor_sum = 0;
> +
> + if (!inc_local(mdev)) {
> + ERR("inc_local() failed in w_al_write_transaction\n");
> + complete(&((struct update_al_work *)w)->event);
> + return 1;
> + }
> + /* do we have to do a bitmap write, first?
> + * TODO reduce maximum latency:
> + * submit both bios, then wait for both,
> + * instead of doing two synchronous sector writes. */
> + if (mdev->state.conn < Connected && evicted != LC_FREE)
> + drbd_bm_write_sect(mdev, evicted/AL_EXT_PER_BM_SECT);
> +
> + mutex_lock(&mdev->md_io_mutex); /* protects md_io_buffer, al_tr_cycle, ... */

Doing checksumming inside a lock looks nasty.

Didn't read further. It's a lot of code. This was not a complete review,
just some quick comments.

-Andi
--
ak@xxxxxxxxxxxxxxx -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Alan Cox: "Re: tg3 driver fails to load firmware (2.6.29)"
Previous message: Andi Kleen: "Re: Linux 2.6.29"
Next in thread: Philipp Reisner: "Re: [PATCH 02/12] DRBD: activity_log"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]