Re: 2.6.12-rc2 + rc3: reaim with ext3 - system stalls.

From: Badari Pulavarty
Date: Thu May 05 2005 - 13:14:03 EST


Hi Andrew & Jan Kara,

I have an extremely simple patch to fix the problem.
Basically, move the journal_start_handle from writepages()
to ext3_writepages_get_block() (which gets called after locking
the page). What am I missing here ? The patch can't
be that simple :(

Please let me know.

BTW, I haven't done enough testing with the patch yet.

Thanks,
Badari

On Wed, 2005-05-04 at 09:02, Cliff White wrote:
> >
> > --=-LBuZsEP+un6vb/N1S6Yo
> > Content-Type: text/plain
> > Content-Transfer-Encoding: 7bit
> >
> > On Tue, 2005-05-03 at 07:43, Jan Kara wrote:
> > > Hello,
> > >
> > > > Started seeing some odd behaviour with recent kernels, haven't been able
> > to
> > > > run it down, could use some suggestions/help.
> > > >
> > > > Running re-aim7 with 2.6.12-rc2 and rc3, if I use xfs, jfs, or
> > > > reiserfs things work just fine.
> > > >
> > > > With ext3, the test stalls, such that:
> > > > CPU is 50% idle, 50% waiting IO (top)
> > > > vmstat shows one process blocked wio
> > > I've looked through your dumps and I spotted where is the problem -
> > > it's our well known and beloved lock inversion between PageLock and
> > > transaction start (giving CC to Badari who's the author of the patch
> > > that introduced it AFAIK).
> >
> > Yuck. It definitely not intentional.
> >
> > > The correct order is: first get PageLock and *then* start transaction.
> > > But in ext3_writeback_writepages() first ext3_journal_start() is called
> > > and then __mpage_writepages is called that tries to do LockPage and
> > > deadlock is there. Badari, could you please fix that (sadly I think that
> > > would not be easy)? Maybe we should back out those changes until it gets
> > > fixed...
> >
> > Hmm.. let me take a closer look. You are right, its not going to be
> > simple fix.
> >
> > Cliff, here is the patch to backout writepages() for ext3. Can you
> > verify that problems goes away with this patch ?
> >
> I can now verify that the problem is not there with Badari's no-writepages
> patch - all the runs last night completed, performance is about what it was
> for 2.6.12-rc1.
> cliffw

--- linux-2.6.12-rc3.org/fs/ext3/inode.c 2005-05-02 22:28:30.000000000 -0700
+++ linux-2.6.12-rc3/fs/ext3/inode.c 2005-05-05 00:09:59.000000000 -0700
@@ -869,6 +869,14 @@ get_block:
static int ext3_writepages_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create)
{
+ handle_t *handle = journal_current_handle();
+
+ if (!handle) {
+ handle = ext3_journal_start(inode,
+ ext3_writepage_trans_blocks(inode));
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ }
return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create);
}

@@ -1360,24 +1368,10 @@ ext3_writeback_writepages(struct address
handle_t *handle = NULL;
int err, ret = 0;

- if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
- return ret;
-
- handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- return ret;
- }
-
ret = __mpage_writepages(mapping, wbc, ext3_writepages_get_block,
ext3_writeback_writepage_helper);

- /*
- * Need to reaquire the handle since ext3_writepages_get_block()
- * can restart the handle
- */
handle = journal_current_handle();
-
err = ext3_journal_stop(handle);
if (!ret)
ret = err;