Re: [PATCH 1/7] block: Add block_flush_device()

From: Chris Mason
Date: Tue Mar 31 2009 - 09:21:29 EST


On Mon, 2009-03-30 at 16:52 -0400, Mark Lord wrote:
> Jens Axboe wrote:
> > On Mon, Mar 30 2009, Linus Torvalds wrote:
> >>
> >> On Mon, 30 Mar 2009, Jens Axboe wrote:
> >>> Sorry, I just don't see much point to doing it this way instead. So now
> >>> the fs will have to check a queue bit after it has issued the flush, how
> >>> is that any better than having the 'error' returned directly?
> >> No.
> >>
> >> Now the fs SHOULD NEVER CHECK AT ALL.
> >>
> >> Either it did the ordering, or the FS cannot do anything about it.
> >>
> >> That's the point. EOPNOTSUPP is not a useful error message. You can't
> >> _do_ anything about it.
> >
> > My point is that some file systems may or may not have different paths
> > or optimizations depending on whether barriers are enabled and working
> > or not. Apparently that's just reiserfs and Chris says we can remove it,
> > so it is probably a moot point.
> ..
>
> XFS appears to have something along those lines.
> I believe it tries to disable the drive write caches
> if it discovers that it cannot do cache flushes.
>

If we get EOPNOTSUPP back from a submit_bh/submit_bio, the IO didn't
happen. So, all the filesystems have code to try again without the
barrier flag, and then stop doing barriers from then on.

I'm not saying this is a good or bad API, just explaining for this one
example how it is being used today ;)
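
For illustration, the fallback usually looks roughly like this. This is a
simplified sketch only, with made-up helper names (write_commit_block(),
sb_barriers_enabled(), and friends), not lifted from any particular
filesystem:

static int write_commit_block(struct super_block *sb, struct buffer_head *bh)
{
	int ret;

	/* Ask for a barrier write only while the fs still believes in them. */
	if (sb_barriers_enabled(sb))		/* hypothetical per-fs flag */
		ret = submit_barrier_write(bh);	/* hypothetical barrier submit */
	else
		ret = submit_plain_write(bh);	/* hypothetical plain submit */

	if (ret == -EOPNOTSUPP) {
		/*
		 * The barrier request was rejected up front, so no IO has
		 * happened yet.  Warn, stop asking for barriers, and redo
		 * the write as an ordinary one.
		 */
		printk(KERN_WARNING "%s: barriers not supported, disabling\n",
		       sb->s_id);
		sb_disable_barriers(sb);	/* hypothetical */
		ret = submit_plain_write(bh);
	}

	return ret;
}

The property that makes the resubmit safe is the one above: -EOPNOTSUPP
means the request never made it to the media in the first place.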

> I'll check next time my MythTV box boots up.
> It has a RAID0 under XFS, and the md raid0 code doesn't
> appear to pass the cache flushes to libata for raid0,
> so XFS complains and tries to turn off the write caches.
>
>
> And I have a script to damn well turn them back ON again
> after it does so. Stupid thing tries to override user policy again.
>

XFS does print a warning about not doing barriers any more, but the
write cache should still be on. Especially with MD in front of it, the
storage stack is pretty complex; a mounted filesystem would have a hard
time knowing where to start turning off write caches on each drive in
the stack.

You can test this pretty easily:

dd if=/dev/zero of=foo bs=4k count=10000 oflag=direct

If that runs faster than 1MB/s, the write cache is still on.
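
The rough math behind that number: with the cache off, each 4k O_DIRECT
write has to reach the platter before dd can issue the next one, so a
single rotating disk does roughly one write per revolution. At 7200 RPM
that's about 120 writes/sec x 4KB, or around 0.5MB/s; with the cache on,
the drive acks each write out of its RAM and the same run goes many
times faster. You can also just ask the drive with hdparm -W /dev/sdX
(and hdparm -W1 /dev/sdX to turn the cache back on).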

-chris

