Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first

From: Jan Kara
Date: Tue Sep 29 2009 - 12:13:46 EST

Next message: Linus Torvalds: "Re: [PATCH] Remove pty_ops_bsd and pty_bsd_ioctl() as they're notused"
Previous message: Andy Isaacson: "Re: hard lockup, followed by ext4_lookup: deleted inode referenced: 524788"
In reply to: Andrew Morton: "Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first"
Next in thread: Chris Mason: "Re: [PATCH] bdi_sync_writeback should WB_SYNC_NONE first"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun 27-09-09 10:10:35, Andrew Morton wrote:
> On Sun, 27 Sep 2009 18:55:14 +0200 Jens Axboe <jens.axboe@xxxxxxxxxx> wrote:
>
> > > I wasn't referring to this patch actually. The code as it stands in
> > > Linus's tree right now attempts to write back up to 2^63 pages...
> >
> > I agree, it could make the fs sync take a looong time. This is not a new
> > issue, though.
>
> It _should_ be a new issue. The old code would estimate the number of
> dirty pages up-front and would then add a +50% fudge factor, so if we
> started the sync with 1GB dirty memory, we write back a max of 1.5GB.
>
> However that might have got broken.
>
> void sync_inodes_sb(struct super_block *sb, int wait)
> {
> 	struct writeback_control wbc = {
> 		.sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
> 		.range_start = 0,
> 		.range_end = LLONG_MAX,
> 	};
>
> 	if (!wait) {
> 		unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
> 		unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
>
> 		wbc.nr_to_write = nr_dirty + nr_unstable +
> 			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
> 	} else
> 		wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */
>
> 	sync_sb_inodes(sb, &wbc);
> }
>
> a) the +50% isn't there in 2.6.31
>
> b) the wait=true case appears to be vulnerable to livelock in 2.6.31.
>
> whodidthat
>
> 38f21977663126fef53f5585e7f1653d8ebe55c4 did that back in January.

  Yes, I even remember the (long) discussion:
http://lkml.org/lkml/2008/9/22/361
  The problem with using .nr_to_write in WB_SYNC_ALL mode is that
write_cache_pages() can return before it has actually written the pages we
wanted written (because someone dirtied other pages in the meantime).
So Nick removed the .nr_to_write limit to guarantee proper data
integrity semantics. I'm not sure what the status is of the livelock
fixes he proposed (added him to CC).
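
To illustrate the .nr_to_write problem described above, here is a minimal
userspace sketch. It is a toy model, not the kernel's write_cache_pages():
the names toy_writeback() and toy_missed_pages(), the 16-page array and the
index-order walk are all assumptions made for the example. It only shows how
a budget computed up front can be used up by pages dirtied later, so that the
pages we originally wanted on disk stay dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 16

/*
 * Toy model of a budgeted writeback pass (hypothetical helper, not a
 * kernel function). Walks the pages in index order, "writes" each dirty
 * page by clearing its flag, and stops once the budget is exhausted.
 * Returns the number of pages written.
 */
long toy_writeback(bool dirty[NPAGES], long nr_to_write)
{
	long written = 0;

	assert(nr_to_write >= 0);
	for (size_t i = 0; i < NPAGES && nr_to_write > 0; i++) {
		if (!dirty[i])
			continue;
		dirty[i] = false;
		written++;
		nr_to_write--;
	}
	return written;
}

/*
 * Scenario: pages 8..11 are dirty when the sync starts, so the budget
 * is 4. Before the pass runs, another writer dirties pages 0..3.
 * Returns how many of the originally dirty pages are still dirty.
 */
int toy_missed_pages(void)
{
	bool dirty[NPAGES] = { false };
	long budget = 0;
	int missed = 0;

	for (size_t i = 8; i < 12; i++) {
		dirty[i] = true;
		budget++;
	}
	for (size_t i = 0; i < 4; i++)
		dirty[i] = true;	/* dirtied "in the meantime" */

	toy_writeback(dirty, budget);

	for (size_t i = 8; i < 12; i++)
		if (dirty[i])
			missed++;
	return missed;
}
```

In this model the whole budget goes to pages 0..3, and none of pages 8..11
gets written, which is the data integrity hole that an unbounded
.nr_to_write avoids.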

  Maybe we could at least fix some cases of possible livelock by setting
.range_end to the current i_size when calling writepages(). That would solve
the common case where somebody is extending the file. For the case where
someone writes into a huge hole, we would at least provide a finite upper
bound, although in practice it might be no better than having no .range_end
at all.
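
A rough userspace sketch of the .range_end idea follows. Again this is a toy
model under stated assumptions, not kernel code: struct toy_file, toy_append(),
toy_sync() and toy_run() are hypothetical names, sizes are counted in pages,
range_end is treated as an exclusive page index (the kernel's .range_end is a
byte offset), and the "concurrent" writer is simulated by appending one dirty
page after every page written.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PAGES 1024

struct toy_file {
	bool dirty[TOY_MAX_PAGES];
	long size;	/* file size in pages, stands in for i_size */
};

/* Simulated writer: extends the file by one dirty page if there is room. */
void toy_append(struct toy_file *f)
{
	if (f->size < TOY_MAX_PAGES)
		f->dirty[f->size++] = true;
}

/*
 * Toy data-integrity pass: write every dirty page below range_end.
 * After each page written, the simulated writer appends another page,
 * so an unbounded pass keeps chasing the end of the file.
 * Returns the number of pages written.
 */
long toy_sync(struct toy_file *f, long range_end)
{
	long written = 0;

	for (long i = 0; i < f->size && i < range_end; i++) {
		if (!f->dirty[i])
			continue;
		f->dirty[i] = false;
		written++;
		toy_append(f);
	}
	return written;
}

/*
 * Start with 8 dirty pages, then sync either with range_end clamped to
 * the size sampled at the start of the sync, or with no effective bound.
 */
long toy_run(bool clamp)
{
	struct toy_file f = { .size = 0 };

	for (int i = 0; i < 8; i++)
		toy_append(&f);
	return toy_sync(&f, clamp ? f.size : TOY_MAX_PAGES);
}
```

With the clamp, the pass stops after the 8 pages that existed when the sync
started. Without it, the pass only ends because the toy file has a hard size
cap; a real appender would have no such cap. The huge-hole case is not
modelled here: the bound stays finite, but it may be far larger than the
amount of data that was dirty at the start.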

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
