Re: [patch] Converting writeback linked lists to a tree based data structure

From: Fengguang Wu
Date: Wed Jan 16 2008 - 08:07:08 EST


On Wed, Jan 16, 2008 at 12:13:43AM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 18:55:38 +1100 David Chinner <dgc@xxxxxxx> wrote:
>
> > On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> > > On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <wfg@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> > > > > > Just a quick question, how does this interact/depend-uppon etc.. with
> > > > > > Fengguangs patches I still have in my mailbox? (Those from Dec 28th)
> > > > >
> > > > > They don't. They apply to a 2.6.24rc7 tree. This is a candidte for 2.6.25.
> > > > >
> > > > > This work was done before Fengguang's patches. I am trying to test
> > > > > Fengguang's for comparison but am having problems with getting mm1 to
> > > > > boot on my systems.
> > > >
> > > > Yeah, they are independent ones. The initial motivation is to fix the
> > > > bug "sluggish writeback on small+large files". Michael introduced
> > > > a new rbtree, and me introduced a new list(s_more_io_wait).
> > > >
> > > > Basically I think rbtree is an overkill to do time based ordering.
> > > > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > > > provides fair queuing between small/large files, and s_more_io_wait
> > > > provides waiting mechanism for blocked inodes.
> > > >
> > > > The time ordered rbtree may delay io for a blocked inode simply by
> > > > modifying its dirtied_when and reinsert it. But it would no longer be
> > > > that easy if it is to be ordered by location.
> > >
> > > What does the term "ordered by location" mean? Attemting to sort inodes by
> > > physical disk address? By using their i_ino as a key?
> > >
> > > That sounds optimistic.
> >
> > In XFS, inode number is an encoding of it's location on disk, so
> > ordering inode writeback by inode number *does* make sense.
>
> This code is mainly concerned with writing pagecache data, not inodes.

Now I tend to agree with you. It is not easy to sort writeback pages
by disk location. There are three main obstacles:
- there can be multiple filesystems on the same disk
- ext3/reiserfs/... make use of blockdev inode, which spans the whole
partition;
- xfs may store inode metadata/data separately;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/