Re: 2.6.14-rc2-mm1 - ext3 wedging up

From: Dave Kleikamp
Date: Fri Sep 30 2005 - 19:27:58 EST


On Thu, 2005-09-29 at 00:38 +0200, Andrea Arcangeli wrote:
> On Fri, Sep 23, 2005 at 04:46:19PM -0500, Dave Kleikamp wrote:
> > On Fri, 2005-09-23 at 15:59 -0500, Dave Kleikamp wrote:
> > I'd guess that it's spinning in balance_dirty_pages.
> > /proc/<pid>/future_dirty is 25650 for fsx. It appears that
> > nr_reclaimable is not going to zero for some reason.

I tracked down my problem to a bug in jfs. jfs is explicitly setting
I_DIRTY in the i_state for a special inode that is preventing it from
being put on the s_dirty list. This must have been something I did a
long time ago when I was a newbie. I'm embarrassed I hadn't noticed it
until now.

I still had the problem even with your latest patch, but it's fixed with
this patch to jfs. I haven't yet tried the jfs patch with the earlier
versions of your patch to see if there is really a problem with them.

I don't have anything to say about the original problem reported on
ext3, since I only saw the problem on jfs.

> Even if nr_reclaimable isn't going to zero, eventually the loop should
> break out because pages_written must increase.
>
> So this make me think it might be the nr_unstable that destabilizes it,
> and whatever it is, it is a bug in mainline as well, except it was well
> hidden until now, because the dirty levels never approached zero during
> heavy write-IO like it can happen with this feature enabled.
>
> Basically whatever we account as "reclaimable" must be _written_out_ and
> accounted as well in the "pages_written" otherwise it'll just hang.
> If there's a problem, it shall be a longstanding one.

Yep. My bad.

> Can you try with this new patch that stops accounting "unstable" as
> "reclaimable". It should be possible to flush the dirty pages to disk so
> "nr_dirty" should be safe because they should always increase the
> "pages_written". I'm not sure if this fixes it, but this at least rule
> out the nfs from the equation (perhaps nfs will never be accounted as
> "pages_written" and that would be a possible explanation of the infinite
> loop).
>
> This new update also makes sure to never account rewrites (except for
> reiserfs where it's more difficult to change the code for this).
>
> I tried with fsx (no params) but I couldn't reproduce any problem yet,
> but I've no nfs workload involved in my test box.
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.14-rc1/per-task-predictive-write-throttling-4
>
> thanks for the help!

JFS: jfs should not be playing with i_state

jfs has been explicitly setting i_state |= I_DIRTY on a special inode.
This prevented it from being put on the s_dirty list. Very stupid.

Signed-off-by: Dave Kleikamp <shaggy@xxxxxxxxxxxxxx>

diff --git a/fs/jfs/jfs_dmap.c b/fs/jfs/jfs_dmap.c
--- a/fs/jfs/jfs_dmap.c
+++ b/fs/jfs/jfs_dmap.c
@@ -305,7 +305,6 @@ int dbSync(struct inode *ipbmap)
filemap_fdatawrite(ipbmap->i_mapping);
filemap_fdatawait(ipbmap->i_mapping);

- ipbmap->i_state |= I_DIRTY;
diWriteSpecial(ipbmap, 0);

return (0);
diff --git a/fs/jfs/jfs_imap.c b/fs/jfs/jfs_imap.c
--- a/fs/jfs/jfs_imap.c
+++ b/fs/jfs/jfs_imap.c
@@ -514,8 +514,6 @@ void diWriteSpecial(struct inode *ip, in
ino_t inum = ip->i_ino;
struct metapage *mp;

- ip->i_state &= ~I_DIRTY;
-
if (secondary)
address = addressPXD(&sbi->ait2) >> sbi->l2nbperpage;
else
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -2396,7 +2396,6 @@ static void txUpdateMap(struct tblock *
*/
if (tblk->xflag & COMMIT_CREATE) {
diUpdatePMap(ipimap, tblk->ino, FALSE, tblk);
- ipimap->i_state |= I_DIRTY;
/* update persistent block allocation map
* for the allocation of inode extent;
*/
@@ -2407,7 +2406,6 @@ static void txUpdateMap(struct tblock *
} else if (tblk->xflag & COMMIT_DELETE) {
ip = tblk->u.ip;
diUpdatePMap(ipimap, ip->i_ino, TRUE, tblk);
- ipimap->i_state |= I_DIRTY;
iput(ip);
}
}

--
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/