Re: ext2 write performance regression from 2.6.32

From: Feng Tang
Date: Thu Feb 17 2011 - 20:52:59 EST


Hi Jan,

On Thu, 17 Feb 2011 18:32:14 +0800
Jan Kara <jack@xxxxxxx> wrote:

> On Thu 17-02-11 14:08:46, Feng Tang wrote:
> > On Wed, 16 Feb 2011 22:35:30 +0800
> > Jan Kara <jack@xxxxxxx> wrote:
> > > On Wed 16-02-11 17:40:31, Feng Tang wrote:
> > > > Hi,
> > > >
> > > > I made out a debug patch which try to delay the pure FS metadata
> > > > writeback (maxim 30 seconds to match current writeback expire
> > > > time). It works for me on 2.6.32, and the dd performance is
> > > > restored.
> > > >
> > > > Please help to review it, thanks!
> > > >
> > > > btw, I've sent out the block dump info requested by Jan Kara,
> > > > but didn't see it on LKML, so attached them again.
> > > >
> > > > - Feng
> > > >
> > > > From c35548c7d0c3a334d24c26adab277ef62b9825db Mon Sep 17
> > > > 00:00:00 2001 From: Feng Tang <feng.tang@xxxxxxxxx>
> > > > Date: Wed, 16 Feb 2011 17:27:36 +0800
> > > > Subject: [PATCH] writeback: delay the file system metadata
> > > > writeback in 30 seconds
> > > >
> > > > Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
> > > > ---
> > > > fs/fs-writeback.c | 10 ++++++++++
> > > > 1 files changed, 10 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 9d5360c..418fd9e 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -635,6 +635,16 @@ static void writeback_inodes_wb(struct
> > > > bdi_writeback *wb, continue;
> > > > }
> > > >
> > > > + if ((wbc->sync_mode != WB_SYNC_ALL)
> > > > + && !inode->i_ino
> > > > + && !strcmp(inode->i_sb->s_id, "bdev"))
> > > > {
> > > > + if (inode->dirtied_when + 30 * HZ >
> > > > jiffies) {
> > > > + list_move(&inode->i_list,
> > > > &wb->b_dirty);
> > > > + continue;
> > > > + }
> > > > + }
> > > > +
> > > > +
> > > Doh, this is a crude hack! Nice for debugging but no way to get
> > > this into the kernel. We have to find a cleaner way to speedup the
> > > writeback...
> >
> > I just tested you 5 writeback patches, and they don't fix the
> > problem, the FS metadata are still periodically written back every
> > one or two seconds. Attached the block dump on 2.6.37 + your
> > patches.
> Yes, so I didn't expect the writeout of metadata will disappear, just
> the IO pattern should be better. So you didn't observe any change in
> throughput with my patches vs without them?

With your patches, it did bring some improvements, like from 7 MB/s to
7.7 MB/s, but not restore back to 10 MB/s. Kyle Liu, who is the original
reporter of this issue also see similar results for your patch.

Anyway, I really hope the IO-less patch either from you or Fengguang
can be merged, which will significantly improve the writeback.

>
> Looking at the block trace, we write about 8 MB of data before doing
> metadata writeback. That's about the best what I'd expect with current
> writeback settings so things worked as expected.

Yes, it is

>
> Hmm, but probably flash card is simple and does not have any
> Flash Translation Layer and so each write of one metadata block costs
> us as a rewrite of the whole erase block which may well be in a MB
> range? That would explain it. Raising MAX_WRITEBACK_PAGES would help
> here but given the throughput of the flash card, the fairness of
> writeback when there are more inodes to write would be already rather
> bad so that's not a good solution either.
>
> > And ye, I agree my patch is kinds of hacky, but its main point is
> > to delay the file system metadata writeback (no longer than current
> > writeback expiration limit: 30 seconds) to make the normal data
> > pages writeback as sequential as possible. Could we go on tuning it
> > in this direction?
> Apart from my aesthetical feelings, the patch brings also some
> technical difficulties. For example if lots of dirtying happens
> against device inode (metadata heavy workload), we need to push out
> dirty pages from the device inode to clean dirty memory. Otherwise
> the processes would just stall in balance_dirty_pages() for 30 s
> waiting for pages to get cleaned. Another issue is that under heavier
> load, the inode will get redirtied while you are writing metadata.
> Thus i_dirtied_when need not change. Finally, if the load is not so
> trivial like your single dd write, but there happens also some other
> activity in the filesystem (like syslog writing to the device once in
> a while or so), then your large dd write will be mixed with small
> writes to other files anyway and thus performance will degrade.
>
> That being said I don't see an easy solution to your problem. The
> fact that 2.6.30 didn't write metadata was more a bug than a feature
> but happened to work good for your flash card. A solution that comes
> to my mind is that we could have a "write chunk size" parameter in
> each BDI (it would make sense not only for flash devices but also for
> RAID) and writeback would be aware that written amount is always
> rounded up to "write chunk size". This ignores filesystem
> fragmentation but that might be dealt with. And if we are doing well
> with cleaning pages, we may skip writes that have small
> cleaned_pages / write_chunk_size ratio. But we'd have to be careful
> not to delay such "inprofitable" writes for too long. So it's not so
> easy to implement this.
>

Thanks for the detailed analysis, that patch can't handle heavy metadata
dirtier case. I'll try to do more check

Thanks,
Feng

> Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/