Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

From: Wu Fengguang
Date: Tue Mar 24 2009 - 22:26:46 EST


Hi Ian,

On Tue, Mar 24, 2009 at 11:04:11PM +0800, Ian Kent wrote:
> Jeff Layton wrote:
> > On Tue, 24 Mar 2009 10:28:06 -0400
> > Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> >> On Tue, 24 Mar 2009 21:57:20 +0800
> >> Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> >>
> >>> Hi Jeff,
> >>>
> >>> On Mon, Mar 23, 2009 at 04:30:33PM -0400, Jeff Layton wrote:
> >>>> This may be a problem on other filesystems too, but the reproducer I
> >>>> have involves NFS.
> >>>>
> >>>> On NFS, the __mark_inode_dirty() call after writing back the inode is
> >>>> done in the rpc_release handler for COMMIT calls. This call is done
> >>>> asynchronously after the call completes.
> >>>>
> >>>> Because there's no real coordination between __mark_inode_dirty() and
> >>>> __sync_single_inode(), it's often the case that these two calls will
> >>>> race and __mark_inode_dirty() will get called while I_SYNC is still set.
> >>>> When this happens, __sync_single_inode() should detect that the inode
> >>>> was redirtied while we were flushing it and call redirty_tail() to put
> >>>> it back on the s_dirty list.
> >>>>
> >>>> When redirty_tail() puts it back on the list, it only resets the
> >>>> dirtied_when value if it's necessary to maintain the list order. Given
> >>>> the right situation (the right I/O patterns and a lot of luck), this
> >>>> could result in dirtied_when never getting updated on an inode that's
> >>>> constantly being redirtied while pdflush is writing it back.
> >>>>
> >>>> Since dirtied_when is based on jiffies, it's possible for it to persist
> >>>> across 2 sign-bit flips of jiffies. When that happens, the time_after()
> >>>> check in sync_sb_inodes no longer works correctly and writeouts by
> >>>> pdflush of this inode and any inodes after it on the list stop.
> >>>>
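
A side note to illustrate the wrap hazard just described: here is a
minimal userspace sketch of the time_after() arithmetic. This is not
kernel code, and the timestamps are made up:

        #include <stdio.h>

        /* time_after() as in include/linux/jiffies.h, minus typecheck() */
        #define time_after(a, b)  ((long)(b) - (long)(a) < 0)

        int main(void)
        {
                unsigned long dirtied_when = 0; /* stamped ages ago, never reset */
                unsigned long start;            /* when the sync pass began */

                /* sync_sb_inodes() roughly does:
                 *      if (time_after(inode->dirtied_when, start))
                 *              break;
                 * i.e. it stops at inodes that look dirtied after "start".
                 *
                 * Less than half the jiffies range since dirtied_when:
                 * the check behaves. */
                start = dirtied_when + (~0UL >> 2);
                printf("%d\n", time_after(dirtied_when, start)); /* 0 */

                /* Past half the range, the sign of the difference flips:
                 * the stale inode looks dirtied in the future, and
                 * writeout stops at it and everything queued behind it. */
                start = dirtied_when + (~0UL >> 1) + 2;
                printf("%d\n", time_after(dirtied_when, start)); /* 1 */
                return 0;
        }
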
> >>>> This patch fixes this by resetting the dirtied_when value on an inode
> >>>> when we're adding it back onto an empty s_dirty list. Since we generally
> >>>> write inodes from oldest to newest dirtied_when values, this has the
> >>>> effect of making it so that these inodes don't end up with dirtied_when
> >>>> values that are frozen.
> >>>>
> >>>> I've also taken the liberty of fixing up the comments a bit and changed
> >>>> the !time_after_eq() check in redirty_tail to be time_before(). That
> >>>> should be functionally equivalent but I think it's more readable.
> >>>>
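
(For the record, the equivalence can be checked against
include/linux/jiffies.h: modulo the typecheck() wrappers,

        time_after_eq(a, b)  is  ((long)(a) - (long)(b) >= 0)
        time_before(a, b)    is  time_after(b, a),
                             i.e. ((long)(a) - (long)(b) < 0)

so !time_after_eq(a, b) and time_before(a, b) test exactly the same
condition.)
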
> >>>> I wish this were just a theoretical problem, but we've had a customer
> >>>> hit a variant of it in an older kernel. Newer upstream kernels have a
> >>>> number of changes that make this problem less likely. As best I can tell
> >>>> though, there is nothing that really prevents it.
> >>>>
> >>>> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> >>>> ---
> >>>> fs/fs-writeback.c | 22 +++++++++++++++++-----
> >>>> 1 files changed, 17 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >>>> index e3fe991..bd2a7ff 100644
> >>>> --- a/fs/fs-writeback.c
> >>>> +++ b/fs/fs-writeback.c
> >>>> @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
> >>>>   * furthest end of its superblock's dirty-inode list.
> >>>>   *
> >>>>   * Before stamping the inode's ->dirtied_when, we check to see whether it is
> >>>> - * already the most-recently-dirtied inode on the s_dirty list. If that is
> >>>> - * the case then the inode must have been redirtied while it was being written
> >>>> - * out and we don't reset its dirtied_when.
> >>>> + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
> >>>> + * list. If that is the case then we don't need to restamp it to maintain the
> >>>> + * order of the list.
> >>>> + *
> >>>> + * If s_dirty is empty however, then we need to go ahead and update
> >>>> + * dirtied_when for the inode. Not doing so will mean that inodes that are
> >>>> + * constantly being redirtied can end up with "stuck" dirtied_when values if
> >>>> + * they happen to consistently be the first one to go back on the list.
> >>>> + *
> >>>> + * Since we're using jiffies values in that field, letting dirtied_when grow
> >>>> + * too old will be problematic if jiffies wraps. It may also be causing
> >>>> + * pdflush to flush the inode too often since it'll always look like it was
> >>>> + * dirtied a long time ago.
> >>>>   */
> >>>>  static void redirty_tail(struct inode *inode)
> >>>>  {
> >>>>          struct super_block *sb = inode->i_sb;
> >>>>
> >>>> -        if (!list_empty(&sb->s_dirty)) {
> >>>> +        if (list_empty(&sb->s_dirty)) {
> >>>> +                inode->dirtied_when = jiffies;
> >>>> +        } else {
> >>>>                  struct inode *tail_inode;
> >>>>
> >>>>                  tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
> >>>> -                if (!time_after_eq(inode->dirtied_when,
> >>>> +                if (time_before(inode->dirtied_when,
> >>>>                                  tail_inode->dirtied_when))
> >>>>                          inode->dirtied_when = jiffies;
> >>>>          }
> >>> I'm afraid your patch is equivalent to the following one, because
> >>> once the first inode's dirtied_when is set to jiffies, the
> >>> following ones will (mostly) also be updated in order to keep the
> >>> list ordered. A domino effect.
> >>>
> >>> Thanks,
> >>> Fengguang
> >>>
> >> Good point. One of our other engineers proposed a similar patch
> >> originally. I considered it but wasn't clear whether there could be a
> >> situation where unconditionally resetting dirtied_when would be a
> >> problem. Now that I think about it though, I think you're right...
> >>
> >> So maybe something like the patch below is the right thing to do? Or,
> >> maybe when we believe that the inode was fully cleaned and then
> >> redirtied, we'd just unconditionally stamp dirtied_when. Something like
> >> this maybe?
> >>
> >> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >> index bd2a7ff..596c96e 100644
> >> --- a/fs/fs-writeback.c
> >> +++ b/fs/fs-writeback.c
> >> @@ -364,7 +364,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
> >>                           * Someone redirtied the inode while were writing back
> >>                           * the pages.
> >>                           */
> >> -                        redirty_tail(inode);
> >> +                        inode->dirtied_when = jiffies;
> >> +                        list_move(&inode->i_list, &sb->s_dirty);
> >>                  } else if (atomic_read(&inode->i_count)) {
> >>                          /*
> >>                           * The inode is clean, inuse
> >
> > Hmm...though it is still possible that you could consistently race in
> > such a way that after writepages(), I_DIRTY is never set but the
> > PAGECACHE_TAG_DIRTY is still set on the mapping. And then we'd be back
> > to the same problem of a stuck dirtied_when value.
> >
> > So maybe someone can explain to me why we take such great pains to
> > preserve the dirtied_when value when we're putting the inode back on
> > the tail of s_dirty? Why not just unconditionally reset it?
>
> I think that redirty_tail() is the best place for this as it is a
> central location where dirtied_when can be updated. Then all we have to
> worry about is making sure it is called from all the locations needed.
>
> I'm not sure that removing the comment is a good idea (as the Wu
> Fengguang patch does), but it probably needs to be revised to explain
> why redirtying forces a rewrite of the list entry times.

The comment basically says "we do not want to reset dirtied_when for
inodes redirtied while being written out". I guess there are two intentions:

- to retry its writeback as soon as possible and avoid long 30s delays;
- to keep a faithful dirtied_when.

The first one is best effort anyway, so changing it should not create
new bugs. Nor should the second, given the very limited use of
dirtied_when.

However, we do have another cheap solution that retains both of the
original intentions. The main idea is to introduce a new s_more_io_wait
queue, and to convert the current redirty_tail() calls to either
requeue_io_wait() or some completely_dirty_inode() helper.
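
For instance, the redirtied-while-writing case quoted above could then
do the following instead of restamping dirtied_when (a sketch only, on
top of the patch below):

                        /*
                         * Someone redirtied the inode while we were
                         * writing back the pages: park it for a retry
                         * at the next s_io enqueue time.
                         */
                        requeue_io_wait(inode);
                } else if (atomic_read(&inode->i_count)) {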

Thanks,
Fengguang
---
(a not-up-to-date patch)
writeback: introduce super_block.s_more_io_wait

Introduce super_block.s_more_io_wait to park inodes that for some reason
cannot be synced immediately. They will be revisited at the next s_io
enqueue time (<= 5s).

The new data flow after this patchset:

s_dirty --> s_io --> s_more_io/s_more_io_wait --+
   ^                                            |
   |                                            |
   +--------------------------------------------+

- to fill s_io:
        s_more_io +
        s_dirty(expired) +
        s_more_io_wait
        ---> s_io
- to drain s_io:
        s_io -+--> clean inodes goto inode_in_use/inode_unused
              |
              +--> s_more_io
              |
              +--> s_more_io_wait

Obviously there are no ordering or starvation problems in the queues:
- s_dirty is now a strict FIFO queue
- inode.dirtied_when is only set when the inode is made dirty
- once expired, a dirty inode will stay in the s_*io* queues until made clean
- the dirty inodes in s_*io* will be revisited in order, hence small files
  won't be starved by big dirty files.

Cc: David Chinner <dgc@xxxxxxx>
Cc: Michael Rubin <mrubin@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Signed-off-by: Fengguang Wu <wfg@xxxxxxxxxxxxxxxx>
---
fs/fs-writeback.c | 19 +++++++++++++++----
fs/super.c | 1 +
include/linux/fs.h | 1 +
3 files changed, 17 insertions(+), 4 deletions(-)

--- linux-mm.orig/fs/fs-writeback.c
+++ linux-mm/fs/fs-writeback.c
@@ -172,6 +172,14 @@ static void requeue_io(struct inode *ino
         list_move(&inode->i_list, &inode->i_sb->s_more_io);
 }

+/*
+ * The inode should be retried after _sleeping_ for a while.
+ */
+static void requeue_io_wait(struct inode *inode)
+{
+        list_move(&inode->i_list, &inode->i_sb->s_more_io_wait);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
         /*
@@ -200,7 +208,8 @@ static void move_expired_inodes(struct l

 /*
  * Queue all expired dirty inodes for io, eldest first:
- * (entrance)  => s_dirty inodes
+ * (entrance)  => s_more_io_wait inodes
+ *             => s_dirty inodes
  *             => s_more_io inodes
  *             => remaining inodes in s_io => (dequeue for sync)
  */
@@ -209,13 +218,15 @@ static void queue_io(struct super_block
 {
         list_splice_init(&sb->s_more_io, &sb->s_io);
         move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+        list_splice_init(&sb->s_more_io_wait, &sb->s_io);
 }

 int sb_has_dirty_inodes(struct super_block *sb)
 {
-        return !list_empty(&sb->s_dirty) ||
-               !list_empty(&sb->s_io) ||
-               !list_empty(&sb->s_more_io);
+        return !list_empty(&sb->s_dirty) ||
+               !list_empty(&sb->s_io) ||
+               !list_empty(&sb->s_more_io) ||
+               !list_empty(&sb->s_more_io_wait);
 }
 EXPORT_SYMBOL(sb_has_dirty_inodes);

--- linux-mm.orig/fs/super.c
+++ linux-mm/fs/super.c
@@ -64,6 +64,7 @@ static struct super_block *alloc_super(s
                 INIT_LIST_HEAD(&s->s_dirty);
                 INIT_LIST_HEAD(&s->s_io);
                 INIT_LIST_HEAD(&s->s_more_io);
+                INIT_LIST_HEAD(&s->s_more_io_wait);
                 INIT_LIST_HEAD(&s->s_files);
                 INIT_LIST_HEAD(&s->s_instances);
                 INIT_HLIST_HEAD(&s->s_anon);
--- linux-mm.orig/include/linux/fs.h
+++ linux-mm/include/linux/fs.h
@@ -1012,6 +1012,7 @@ struct super_block {
         struct list_head s_dirty;           /* dirty inodes */
         struct list_head s_io;              /* parked for writeback */
         struct list_head s_more_io;         /* parked for more writeback */
+        struct list_head s_more_io_wait;    /* parked for sleep-then-retry */
         struct hlist_head s_anon;           /* anonymous dentries for (nfs) exporting */
         struct list_head s_files;