Re: writeout stalls in current -git

From: David Chinner
Date: Tue Nov 06 2007 - 21:14:15 EST


On Wed, Nov 07, 2007 at 10:31:14AM +1100, David Chinner wrote:
> On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> > On 11/6/07, David Chinner <dgc@xxxxxxx> wrote:
> > > Rather than vmstat, can you use something like iostat to show how busy your
> > > disks are? i.e. are we seeing RMW cycles in the raid5 or some such issue.
> >
> > Both "vmstat 10" and "iostat -x 10" output from this test:
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 2 0 0 3700592 0 85424 0 0 31 83 108 244 2 1 95 1
> > -> emerge reads something, don't knwo for sure what...
> > 1 0 0 3665352 0 87940 0 0 239 2 343 585 2 1 97 0
> ....
> >
> > The last 20% of the btrace look more or less completely like this, no
> > other programs do any IO...
> >
> > 253,0 3 104626 526.293450729 974 C WS 79344288 + 8 [0]
> > 253,0 3 104627 526.293455078 974 C WS 79344296 + 8 [0]
> > 253,0 1 36469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd]
> > 253,0 1 36470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd]
> ^^
> Apparently we are doing synchronous writes. That would explain why
> it is slow. We shouldn't be doing synchronous writes here. I'll see if
> I can reproduce this.
>
> <goes off and looks>
>
> Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
> look into this further and send a patch when I have something concrete.

Ok, so it's not synchronous writes that we are doing - we're just
submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely so we are writing the
same block multiple times i.e. once for each inode in the cluster we
have to write.

This must be a side effect of some other change as we haven't
changed anything in the reclaim code recently.....

/me scurries off to run some tests

Indeed it is. The patch below should fix the problem - the inode
clusters weren't getting set up properly when inodes were being
read in or allocated. This is a regression, introduced by this
mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64e070ae7c5342a0d56ec20ae9ef5cfb

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
fs/xfs/xfs_iget.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c 2007-11-02 13:44:46.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
icl = NULL;
if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
first_index, 1)) {
- if ((iq->i_ino & mask) == first_index)
+ if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
icl = iq->i_cluster;
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/