Re: [patch 8/8] fs: add i_op->sync_inode

From: Christoph Hellwig
Date: Thu Jan 06 2011 - 15:45:22 EST


> > The problem is that currently we almost never do a pure blocking
> > ->write_inode. The way the sync code is written we always do a
> > non-blocking one first, then a blocking one. If you always do the
> > synchronous one we'll get a lot more overhead - the first previous
> > asynchronous one will write the inode (be it just into the log, or for
> > real), then we write back data, and then we'll have to write it again
> > because it has usually been redirtied again due to the data writeback in
> > the meantime.
>
> It doesn't matter, the integrity still has to be enforced in .sync_fs,
> because sync .write_inode may *never* get called, because of the fact
> that async .write_inode also clears the inode metadata dirty bits.
>
> So if .sync_fs has to enforce integrity anyway, then you don't ever need
> to do any actual waiting in your sync .write_inode. See?

I'm not talking about the actual waiting. Right now we have two
different use cases for ->write_inode:

1) sync_mode == WB_SYNC_NONE

This tells the filesystem to start an opportunistic writeout.

2) sync_mode == WB_SYNC_ALL

This tells the filesystem it needs to do a mandatory writeout.

Note that writeout is loosely defined. If a filesystem isn't
exportable or implements the commit_metadata operation it's indeed
enough to synchronize the state into internal fs data just enough for
->sync_fs.
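As a rough userspace sketch of the two use cases above (all names
hypothetical; the structs are minimal stand-ins for the kernel's
struct inode and struct writeback_control, not the real definitions):

```c
#include <assert.h>

/* Hypothetical stubs standing in for the kernel's writeback types. */
enum writeback_sync_modes { WB_SYNC_NONE, WB_SYNC_ALL };

struct writeback_control {
	enum writeback_sync_modes sync_mode;
};

struct inode {
	int on_disk;         /* 1 once metadata is durable (or logged) */
	int writeout_queued; /* 1 once an async writeout has been started */
};

/* Hypothetical ->write_inode dispatching on the two use cases:
 * WB_SYNC_ALL demands the inode reach disk (or fs-internal state
 * that ->sync_fs will flush); WB_SYNC_NONE merely starts an
 * opportunistic, non-blocking writeout. */
static int examplefs_write_inode(struct inode *inode,
				 struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_ALL) {
		/* mandatory writeout: make the metadata durable */
		inode->on_disk = 1;
		return 0;
	}
	/* opportunistic: queue writeout, do not block */
	inode->writeout_queued = 1;
	return 0;
}
```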

Or that's how it should be. As you pointed out the way the writeback
code treats the WB_SYNC_NONE writeouts makes this not work as expected.

There are various ways to fix this:

1) the one you advocate, that is treating all ->write_inode calls as
if they were WB_SYNC_ALL. This does fix the issue of incorrectly
updating the dirty state, but causes a lot of additional I/O -
the way the sync process is designed we basically always call
->write_inode with WB_SYNC_NONE first, and then with WB_SYNC_ALL.
2) keep the WB_SYNC_NONE calls, but never update dirty state for them.
This also fixes the i_dirty state updates, but allows filesystems
that keep internal dirty state to be smarter about avoiding I/O
3) remove the calls to ->write_inode with WB_SYNC_NONE. This might
work well for calls from the sync() callchain, but we rely on
inode background writeback from the flusher threads in lots of
places. Note that we really do not want to block the flusher
threads with blocking writes, which is another argument against
(1).
4) require ->write_inode to update the dirty state itself after
the inode is on disk or in a data structure caught by ->sync_fs.
This keeps optimal behaviour, but requires a lot of code changes.

If we want a quick fix, only (2) seems feasible to me, with the option
of implementing (4) and parts of (3) later on.
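Option (2) could be sketched like this (again a hypothetical userspace
mock-up with stub types, not the actual fs-writeback.c code): the VFS
still calls ->write_inode for both modes, but only clears the VFS dirty
state for WB_SYNC_ALL, so a later sync still sees the inode as dirty
and issues the mandatory writeout.

```c
#include <assert.h>

/* Hypothetical stubs for the relevant kernel flags and structures. */
#define I_DIRTY_SYNC 0x1

enum writeback_sync_modes { WB_SYNC_NONE, WB_SYNC_ALL };

struct inode {
	unsigned int i_state;
};

struct writeback_control {
	enum writeback_sync_modes sync_mode;
};

/* Placeholder for the filesystem's ->write_inode; details elided. */
static int fs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
	(void)inode;
	(void)wbc;
	return 0;
}

/* Sketch of option (2): invoke ->write_inode in both modes, but keep
 * I_DIRTY_SYNC set for WB_SYNC_NONE so the inode stays visibly dirty
 * until a WB_SYNC_ALL pass (or ->sync_fs) enforces integrity. */
static void writeback_inode_metadata(struct inode *inode,
				     struct writeback_control *wbc)
{
	fs_write_inode(inode, wbc);
	if (wbc->sync_mode == WB_SYNC_ALL)
		inode->i_state &= ~I_DIRTY_SYNC;
}
```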

> > We need to propagate the VFS dirty state into the fs-internal state,
> > e.g. for XFS start a transaction. The reason for that is that the VFS
> > simply writes timestamps into the inode and marks it dirty instead of
> > telling the filesystem about timestamp updates. For XFS in
> > 2.6.38+ timestamp updates and i_size updates are the only unlogged
> > metadata changes, and thus now the only thing going through
> > ->write_inode.
>
> Well then you have a bug, because a sync .write_inode *may never get
> called*. You may only get an async one, even in the case of fsync,
> because async writeback clears the vfs dirty bits.

Yes, the bug about updating the dirty state for WB_SYNC_NONE affects